prune_previous: avoid pruning immediate versions to better support concurrency#3028
Draft
jamesblackburn wants to merge 2 commits into
Draft
prune_previous: avoid pruning immediate versions to better support concurrency#3028jamesblackburn wants to merge 2 commits into
jamesblackburn wants to merge 2 commits into
Conversation
34b6df7 to
0e11fd4
Compare
b613b4b to
50ab49d
Compare
Collaborator
Author
|
Issue with as_of reads when a version is only held by snapshots: #3034 |
7bb7dd6 to
52a32be
Compare
52a32be to
43c576f
Compare
A write with prune_previous_version=True (or append/update/etc.) used to physically delete the previous versions' data as soon as it tombstoned them. A concurrent writer mid-append on one of those versions inherits its data segments; if the prune deleted them first, the appender committed a version referencing deleted data and became unreadable. Decouple tombstoning (immediate, unconditional - pruned versions vanish from readers right away, matching user intuition) from physical deletion (gated for concurrent-writer safety). The mechanism is a single-write fold: - Phase 1 (version_map::write_and_prune_previous): in one journal entry write the new TABLE_INDEX head, an individual TOMBSTONE for each retained previous version, and a single TOMBSTONE_ALL high-water mark. Retained = the anchor (latest undeleted previous version - the base an appender builds on, kept regardless of age) plus anything inside the PrunePreviousProtectionSecs window (default 600s). The TOMBSTONE_ALL is placed at floor-1 (floor = oldest retained), burying only versions strictly below it. The head stays the new TABLE_INDEX so the symbol ref's first key is an index key and the LATEST read fast-path is preserved. The anchor is computed from the loaded undeleted entry, not threaded from the caller, whose previous key can point at a tombstoned former head after a delete_version. - Phase 2 (local_versioned_engine::delete_unreferenced_pruned_indexes): reuses the phase-1 entry (no reload), splits the tombstoned indexes by the persisted line, physically deletes the buried block (except snapshotted keys), and protects the retained block. No version-map write. Skipped under delayed_deletes. The high-water line keeps prune/delete bounded: UNDELETED_ONLY loads stop at the line, so each prune reads only the retained region plus the live head, never total history. Aged-out retained versions are reclaimed by a later prune once the floor rises past them. The explicit admin prune_previous_versions() keeps master's aggressive behaviour (tombstone everything below latest + delete immediately, no protection window) - use the per-write flag when concurrent-writer safety is needed. Docs (VERSIONING.md, runtime_config.md, library.py/_store.py docstrings) updated to describe the two prune flavours and the single-write fold.
New tests: - test_prune_previous.py: protection-window and anchor behaviour, data-remains- in-storage-during-window, delete-all reclaims prune-retained versions, and snapshot safety. - test_concurrent_read_and_write.py::test_concurrent_read_write_eager_prune: a concurrent reader never sees NoDataFoundException while another process writes with prune_previous_version=True on the eager (non-delayed-delete) store. The read path does not retry on a missing key, so reader safety depends on the protection window; the test enables the 600s window before forking so the children inherit it (fails with window=0, passes with window=600). - test_version_map.cpp: PrunePreviousTombstonesAllPreExisting and the folded single-entry chain structure; version_map_model.hpp tombstone model. conftest.py adds the session-autouse _disable_prune_protection_window fixture (PrunePreviousProtectionSecs=0) so the existing suite exercises the anchor-only retention deterministically. Updated key-count / visibility assertions across the version-store tests to match the single-write fold: visible versions go back to 1 after a prune while storage-key counts reflect the retained anchor; the admin prune_previous_versions API prunes aggressively (latest only).
0a69d95 to
c15719e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix
prune_previousdata-loss race; bound prune/delete via TOMBSTONE_ALL high-water markProblem
Concurrent writers appending to the same base version could lose data: writer 1's
prune_previousphysically deleted the base version's data segments before writer 2had finished referencing them.
Fix — two-phase mark/sweep
version_map::write_and_prune_previous): write the new head plus anindividual
TOMBSTONEfor every live previous version in one journal entry. Prunedversions become present-but-invisible above the
TOMBSTONE_ALLline, so their datastays in storage and remains findable. Returns the loaded entry for reuse.
LocalVersionedEngine::delete_unreferenced_pruned_indexes): reuse that entryto collect the above-line tombstoned set, retain the anchor (just-superseded previous
head) plus anything inside the
PrunePreviousProtectionSecswindow, physically delete therest (with snapshot/dedup/append-inheritance protection), then advance the
TOMBSTONE_ALLline past the removed block.
The bounded
UNDELETED_ONLYload stops at the line, so prune/delete_allcost isproportional to retained data, not total history — a constant number of storage ops per
call. Under
delayed_deletesthe physical delete is skipped but the line still advances sothe chain stays bounded.
Also
delete_all_versionsreclaims prune-retained anchors viafinalize_tombstone_all_result(filtered to individually-tombstoned keys), preventing a leak when a whole symbol is deleted.
get_version_at_timenow finds snapshot-protected deleted versions when a live version alsoexists — prune leaves these behind more often now.
Behavioural change⚠️
prune_previousno longer physically removes the immediately-previous version straight away.It is tombstoned (invisible to
list_versions) but its data is retained as an anchor —reclaimed on a later prune once aged past
VersionStore.PrunePreviousProtectionSecs(default 600s; set to 0 to disable the window, anchor still applies). Steady-state on-disk
footprint is anchor + latest rather than just latest.
API
No signature changes to V1/V2 public APIs. Docstrings on all
prune_previous_versionsentrypoints updated to describe deferred deletion. New config:
VersionStore.PrunePreviousProtectionSecs.Tests
New
test_prune_previous.py(window retention, anchor retention, burst reclaim,append/dedup dangling-reference safety, delete-all reclaim + snapshot safety); updated C++
VersionMap/rapidcheck expectations; storage-op-count test now asserts constant per-prune
ops instead of a hardcoded number.