Skip to content

{184305174} Fix delfiles to clear ufid-hash before deleting files#5956

Open
chands10 wants to merge 1 commit into
bloomberg:mainfrom
chands10:delfiles
Open

{184305174} Fix delfiles to clear ufid-hash before deleting files#5956
chands10 wants to merge 1 commit into
bloomberg:mainfrom
chands10:delfiles

Conversation

@chands10
Copy link
Copy Markdown
Contributor

This fixes a regression introduced in commit 5802570 where ufid clearing was moved from bdb_del_file() to schema changes only (bdb_close_only_sc). This caused the standalone 'delfiles' command to leave open file descriptors in ufid-hash, especially on replicants that opened handles during recovery.

Root Cause:

  • When replicants apply log records during recovery, they open DB handles via ufid-hash and mark them with DB_AM_RECOVER flag
  • These handles are kept open for performance
  • When 'delfiles' command runs WITHOUT a schema change, bdb_del_file() directly calls dbremove() without clearing ufid-hash entries
  • The OS cannot release disk space because handles remain open
  • Files show as "deleted" in lsof output but space is not reclaimed

Impact:

  • Production databases accumulate "deleted" files over time
  • Requires weekly database bounces to reclaim space

The Fix:

  • Restore ufid clearing to bdb_del_file() before calling dbremove()
  • Respects gbl_clear_ufid_on_db_close tunable (default=1)
  • Opens the file briefly, calls clear_ufid_hash(), then closes it
  • This clears any recovery handles (DB_AM_RECOVER) from ufid-hash
  • Non-fatal if clearing fails - logs warning and continues

Test:

  • New test delfiles_ufid_leak.test validates delfiles WITHOUT schema change
  • Unlike berkdb_file_leaks_rep.test which does TRUNCATE before delfiles, this test reproduces the production scenario
  • Verifies no "deleted" files remain open after delfiles

To help us review your pull request, please consider providing an overview of the following:

  • What is the type of the change (bug fix, feature, documentation and etc.) ?
  • What are the current behavior and expected behavior, if this is a bugfix ?
  • What are the steps required to reproduce the bug, if this is a bugfix ?
  • What is the current behavior and new behavior, if this is a feature change or enhancement ?
  • [Optional] Why is the new behavior better than the current behavior, if this is a feature change ?

@chands10
Copy link
Copy Markdown
Contributor Author

test AI

This fixes a regression introduced in commit 5802570 where ufid clearing
was moved from bdb_del_file() to schema changes only (bdb_close_only_sc).
This caused the standalone 'delfiles' command to leave open file descriptors
in ufid-hash, especially on replicants that opened handles during recovery.

Root Cause:
- When replicants apply log records during recovery, they open DB handles
  via ufid-hash and mark them with DB_AM_RECOVER flag
- These handles are kept open for performance
- When 'delfiles' command runs WITHOUT a schema change, bdb_del_file()
  directly calls dbremove() without clearing ufid-hash entries
- The OS cannot release disk space because handles remain open
- Files show as "deleted" in lsof output but space is not reclaimed

Impact:
- Production databases accumulate "deleted" files over time
- Requires weekly database bounces to reclaim space

The Fix:
- Restore ufid clearing to bdb_del_file() before calling dbremove()
- Respects gbl_clear_ufid_on_db_close tunable (default=1)
- Opens the file briefly, calls clear_ufid_hash(), then closes it
- This clears any recovery handles (DB_AM_RECOVER) from ufid-hash
- Non-fatal if clearing fails - logs warning and continues

Test:
- New test delfiles_ufid_leak.test validates delfiles WITHOUT schema change
- Unlike berkdb_file_leaks_rep.test which does TRUNCATE before delfiles,
  this test reproduces the production scenario
- Verifies no "deleted" files remain open after delfiles

Signed-off-by: Salil Chandra <schandra107@bloomberg.net>
Copy link
Copy Markdown

@roborivers roborivers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cbuild submission: Error ⚠.
Regression testing: Success ✓.

The first 10 failing tests are:
consumer_non_atomic_default_consumer_generated **quarantined**
reco-ddlk-sql [timeout] **quarantined**

@riverszhang89
Copy link
Copy Markdown
Contributor

I don't think this is correct. delfiles runs on master only and has no effects on replicants. Can you please turn on track_db_open 1 and see what holds an open reference?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants