Fix prune_previous_version race condition causing data loss under concurrent writes#3028
Draft
jamesblackburn wants to merge 2 commits into
Draft
Fix prune_previous_version race condition causing data loss under concurrent writes#3028jamesblackburn wants to merge 2 commits into
jamesblackburn wants to merge 2 commits into
Conversation
34b6df7 to
0e11fd4
Compare
bcfb34f to
b613b4b
Compare
…current writes When two writers concurrently append to the same symbol using prune_previous_version=True, a race condition can corrupt the second writer's version: Writer A commits V1, prunes V0 (deleting its data segments), then Writer B commits V2 as an append on top of V0, referencing V0's now-deleted segments. Two complementary protections are added to write_and_prune_previous(): 1. Protection window: versions younger than PrunePreviousProtectionSecs (default 600 s) are ineligible for pruning, giving concurrent writers time to commit before their base version is deleted. 2. Anchor rule: the newest eligible version is always kept alive as an anchor for concurrent writers still in flight. Pruning only fires when there are >= 2 eligible candidates; the newest becomes the anchor and the second-newest becomes the tombstone boundary. With eager deletion this means >= 2 live versions survive after a pruning write. When delayed_deletes (background_deletion) is active the anchor rule is skipped: the background tool performs its own reference-check before physically removing data, so the extra live version is unnecessary. The protection window still applies in both modes. Also fixes a pre-existing bug where the tombstone_all_key was constructed from previous_key rather than effective_tombstone_key, placing the logical deletion boundary one version too high. Behaviour change: with eager deletion, >= 2 versions survive after a pruning write instead of 1. The explicit prune_previous_versions() admin method is unchanged and continues to prune unconditionally. New config: VersionStore.PrunePreviousProtectionSecs (default 600).
b613b4b to
50ab49d
Compare
…tests When PrunePreviousProtectionSecs=0 the cutoff was current_timestamp()-0 which equals current_timestamp(). Versions written in the same nanosecond could have creation_ts >= cutoff and thus be incorrectly excluded from pruning. Fix: when protection_secs==0, skip the timestamp check entirely so all versions are always eligible (the intended "no protection" semantics). Update hypothesis model (_prune_previous_versions) to reflect that with delayed_deletes=True the anchor itself is the tombstone boundary, so ALL NORMAL versions including sole candidates are tombstoned. Update integration tests that relied on the old (broken) timestamp guard behaviour.
Collaborator
Author
|
Issue with as_of reads when a version is only held by snapshots: #3034 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix
prune_previous_versionrace condition causing data loss under concurrent writesProblem
When two writers concurrently write to the same symbol using
prune_previous_version=True, a race condition can corrupt one of the resultingversions:
writes concurrently.
segments inline.
data segments.
A parallel read faces the same risk: a reader that has fetched a version's index
key but not yet fetched its data segments can lose those segments to a concurrent
prune mid-read.
The root cause is that
delete_unreferenced_pruned_indexeshas no knowledge of anyin-flight writer or reader that still holds a reference to the data being deleted.
Fix
write_and_prune_previous()now calls a newget_prune_previous_boundary()helperthat enforces two safety rules before tombstoning anything.
Rule 1 — Protection window.
Any version younger than
VersionStore.PrunePreviousProtectionSecs(default: 600 s)is ineligible for pruning. This gives concurrent writers and readers a guaranteed
window to complete before their base version's data can be removed.
Rule 2 — Anchor rule (eager deletion only).
Among eligible versions, the newest is always kept alive as an anchor. If there is
only one eligible candidate it becomes the anchor and nothing is pruned. The anchor
ensures there is always a safe base for any writer that has already started building
on it.
Interaction with
background_deletionWith
EnterpriseLibraryOptions(background_deletion=True), the anchor rule isskipped. Physical deletion is deferred to the background deletion tool, which
performs its own reference-check before removing data — making the anchor
unnecessary. The protection window still filters which versions are eligible for
tombstoning.
Behaviour change
The explicit
lib.prune_previous_versions(symbol)method is not affected — itremoves all non-snapshotted versions unconditionally.
Storage impact
Eager deletion. One extra version's worth of data at steady state. For
full-replace writes this is one additional data copy; it is physically deleted on the
next prune write when the anchor becomes the tombstone boundary, so the overhead is
transient. For appends, the anchor's data segments are already protected by the
latest version's append chain, so the incremental cost is one extra index key.
Background deletion. Unchanged from before this fix.
Configuration
VersionStore.PrunePreviousProtectionSecs600prune_previous_versions=True. Does not affect the explicitprune_previous_versions()method.Set via
set_config_int("VersionStore.PrunePreviousProtectionSecs", value)orARCTICDB_VersionStore_PrunePreviousProtectionSecs_int.With eager deletion, both the time window and the anchor rule apply. Increase
this value if write operations in your environment can remain in-flight longer than
the default.
With background deletion, the anchor rule is not applied — all versions older
than the threshold are eligible for tombstoning. The background tool's
reference-check provides the safety guarantee instead.
Tests
C++ (
test_version_map.cpp)PrunePreviousProtectsBaseVersionForConcurrentWriters— deterministic replicationof the race: two writers share V0 as base; asserts V0's data survives so both
resulting versions are readable.
FollowingVersionChainWithWriteAndPrunePrevious— updated for anchor rule withPrunePreviousProtectionSecs=0.Python (
test_prune_previous.py)test_prune_previous_preserves_recent_versions— versions within the protectionwindow are not pruned.
test_prune_previous_single_preexisting_version_not_pruned— the sole pre-existingversion is never pruned.
test_prune_previous_prunes_when_old_enough— withPrunePreviousProtectionSecs=0, exactly 2 versions survive (anchor + latest).All existing prune-related tests updated for the new minimum of 2 live versions under
eager deletion.