
Fix prune_previous_version race condition causing data loss under concurrent writes #3028

Draft
jamesblackburn wants to merge 2 commits into master from fix-prune-previous



@jamesblackburn jamesblackburn commented Apr 14, 2026

Fix prune_previous_version race condition causing data loss under concurrent writes

Problem

When two writers concurrently write to the same symbol using
prune_previous_version=True, a race condition can corrupt one of the resulting
versions:

  1. Writers A and B both read V0 as the current head and begin constructing their
    writes concurrently.
  2. Writer A commits V1 and prunes V0: it tombstones V0 and physically deletes V0's
    data segments inline.
  3. Writer B then commits its own version, whose append chain still references V0's
    now-deleted data segments.
  4. Writer B's version is unreadable because those segments no longer exist.

A parallel read faces the same risk: a reader that has fetched a version's index
key but not yet fetched its data segments can lose those segments to a concurrent
prune mid-read.

The root cause is that delete_unreferenced_pruned_indexes has no knowledge of any
in-flight writer or reader that still holds a reference to the data being deleted.
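
The interleaving above can be reproduced with a small in-memory model. This is only a sketch: `ToyStore`, its methods and the segment names are hypothetical and do not correspond to ArcticDB's storage classes. It also assumes Writer A performs a full replace and Writer B an append, which the description does not spell out.

```python
class ToyStore:
    """Hypothetical stand-in for a versioned store with eager pruning."""

    def __init__(self):
        self.segments = {}  # segment key -> payload
        self.versions = {}  # version id -> ordered segment keys (the "index")

    def commit(self, version_id, new_segments, base_keys=()):
        # Store the new segments and an index that may reuse a base version's keys.
        self.segments.update(new_segments)
        self.versions[version_id] = list(base_keys) + list(new_segments)

    def prune_eagerly(self, version_id):
        # Drop the version, then delete any segment no committed version references.
        # In-flight writers are invisible here, which is the root cause.
        pruned_keys = self.versions.pop(version_id)
        referenced = {k for keys in self.versions.values() for k in keys}
        for key in pruned_keys:
            if key not in referenced:
                del self.segments[key]

    def read(self, version_id):
        return [self.segments[k] for k in self.versions[version_id]]


def simulate_race():
    store = ToyStore()
    store.commit(0, {"seg0": [1, 2]})                 # V0 is the head
    base_seen_by_b = list(store.versions[0])          # Writer B reads V0's index
    store.commit(1, {"seg1": [9]})                    # Writer A commits V1 (replace)
    store.prune_eagerly(0)                            # ...and prunes V0 inline
    store.commit(2, {"seg2": [3]}, base_keys=base_seen_by_b)  # Writer B appends
    try:
        store.read(2)
    except KeyError:
        return True   # Writer B's version references a deleted segment
    return False


print(simulate_race())  # True
```

In this model Writer A's version stays readable; only the version built on the pruned base is lost.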


Fix

write_and_prune_previous() now calls a new get_prune_previous_boundary() helper
that enforces two safety rules before tombstoning anything.

Rule 1 — Protection window.
Any version younger than VersionStore.PrunePreviousProtectionSecs (default: 600 s)
is ineligible for pruning. This gives concurrent writers and readers a guaranteed
window to complete before their base version's data can be removed.

Rule 2 — Anchor rule (eager deletion only).
Among eligible versions, the newest is always kept alive as an anchor. If there is
only one eligible candidate it becomes the anchor and nothing is pruned. The anchor
ensures there is always a safe base for any writer that has already started building
on it.
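
As a rough illustration of how the two rules combine, the sketch below computes a tombstone boundary from a list of existing versions. The `Version` type, the `prune_boundary` function and its parameters are hypothetical; the actual `get_prune_previous_boundary()` signature is not shown in this description, and the strict "older than the cutoff" comparison is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Version:
    version_id: int
    creation_ts: float  # seconds since some epoch


def prune_boundary(
    existing: Iterable[Version],
    now: float,
    protection_secs: float = 600.0,
    apply_anchor_rule: bool = True,
) -> Optional[int]:
    """Return the highest version id that may be tombstoned, or None."""
    cutoff = now - protection_secs
    # Rule 1: only versions older than the protection window are eligible.
    eligible = sorted(
        (v for v in existing if v.creation_ts < cutoff),
        key=lambda v: v.version_id,
        reverse=True,
    )
    # Rule 2: keep the newest eligible version alive as the anchor.
    if apply_anchor_rule:
        eligible = eligible[1:]
    return eligible[0].version_id if eligible else None


versions = [Version(0, 0.0), Version(1, 100.0), Version(2, 900.0)]
print(prune_boundary(versions, now=1000.0))  # 0
```

Here V2 is 100 s old and therefore protected, V1 is the newest eligible version and becomes the anchor, and only V0 falls at or below the boundary.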


Interaction with background_deletion

With EnterpriseLibraryOptions(background_deletion=True), the anchor rule is
skipped. Physical deletion is deferred to the background deletion tool, which
performs its own reference-check before removing data — making the anchor
unnecessary. The protection window still filters which versions are eligible for
tombstoning.

| | Eager deletion (default) | Background deletion |
|---|---|---|
| Tombstone written | Inline | Inline |
| Data physically deleted | Inline | Deferred; background tool |
| Anchor rule | Applied | Skipped |
| Live versions at steady state (N ≥ 3 writes) | 2 (anchor + latest) | 1 (latest only) |

Behaviour change

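Number of live versions after each write:

| Scenario | Before | After (eager) | After (background) |
|---|---|---|---|
| 1st write | 1 | 1 | 1 |
| 2nd write with prune | 1 | 2 (sole candidate kept as anchor) | 1 |
| Nth write with prune (N ≥ 3) | 1 | 2 (anchor + latest) | 1 |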

The explicit lib.prune_previous_versions(symbol) method is not affected — it
removes all non-snapshotted versions unconditionally.
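
The live-version counts for pruning writes can be reproduced with a toy model. This sketch assumes every existing version is already older than the protection window (as with `PrunePreviousProtectionSecs=0`); `live_counts` is a hypothetical helper, not part of ArcticDB.

```python
def live_counts(n_writes: int, background_deletion: bool) -> list:
    """Return the number of live versions after each of n_writes pruning writes."""
    live = []      # ids of versions that have not been tombstoned
    counts = []
    for new_id in range(n_writes):
        candidates = sorted(live, reverse=True)  # newest first
        if not background_deletion:
            # Eager deletion: the newest candidate is kept as the anchor.
            candidates = candidates[1:]
        if candidates:
            boundary = candidates[0]
            # Tombstone everything at or below the boundary.
            live = [v for v in live if v > boundary]
        live.append(new_id)
        counts.append(len(live))
    return counts


print(live_counts(4, background_deletion=False))  # [1, 2, 2, 2]
print(live_counts(4, background_deletion=True))   # [1, 1, 1, 1]
```

With a non-zero protection window the counts would be higher until the older versions age out of the window.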


Storage impact

Eager deletion. One extra version's worth of data at steady state. For
full-replace writes this is one additional data copy; it is physically deleted on the
next prune write when the anchor becomes the tombstone boundary, so the overhead is
transient. For appends, the anchor's data segments are already protected by the
latest version's append chain, so the incremental cost is one extra index key.

Background deletion. Unchanged from before this fix.


Configuration

| Setting | Default | Description |
|---|---|---|
| VersionStore.PrunePreviousProtectionSecs | 600 | Versions younger than this (in seconds) are ineligible for pruning by writes with prune_previous_versions=True. Does not affect the explicit prune_previous_versions() method. |

Set via set_config_int("VersionStore.PrunePreviousProtectionSecs", value) or
ARCTICDB_VersionStore_PrunePreviousProtectionSecs_int.
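
A minimal sketch of setting the value through the environment from Python follows. The `config_env_name` helper is hypothetical and infers the naming pattern from the single variable name quoted above; the value 1800 is only illustrative, and it is an assumption that the variable must be set before ArcticDB is imported.

```python
import os


def config_env_name(key: str, type_suffix: str = "int") -> str:
    """Build the environment variable name for a dotted config key (inferred pattern)."""
    return "ARCTICDB_" + key.replace(".", "_") + "_" + type_suffix


env_name = config_env_name("VersionStore.PrunePreviousProtectionSecs")
# Assumption: set this before importing arcticdb so the value is picked up.
os.environ[env_name] = "1800"  # 30 minutes, for long-running writers

# In-process alternative quoted in this description (import path not shown here):
# set_config_int("VersionStore.PrunePreviousProtectionSecs", 1800)

print(env_name)  # ARCTICDB_VersionStore_PrunePreviousProtectionSecs_int
```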

With eager deletion, both the time window and the anchor rule apply. Increase
this value if write operations in your environment can remain in-flight longer than
the default.

With background deletion, the anchor rule is not applied — all versions older
than the threshold are eligible for tombstoning. The background tool's
reference-check provides the safety guarantee instead.


Tests

C++ (test_version_map.cpp)

  • PrunePreviousProtectsBaseVersionForConcurrentWriters — deterministic replication
    of the race: two writers share V0 as base; asserts V0's data survives so both
    resulting versions are readable.
  • FollowingVersionChainWithWriteAndPrunePrevious — updated for anchor rule with
    PrunePreviousProtectionSecs=0.

Python (test_prune_previous.py)

  • test_prune_previous_preserves_recent_versions — versions within the protection
    window are not pruned.
  • test_prune_previous_single_preexisting_version_not_pruned — the sole pre-existing
    version is never pruned.
  • test_prune_previous_prunes_when_old_enough — with
    PrunePreviousProtectionSecs=0, exactly 2 versions survive (anchor + latest).

All existing prune-related tests updated for the new minimum of 2 live versions under
eager deletion.

@jamesblackburn jamesblackburn added bug Something isn't working minor Feature change, should increase minor version labels Apr 14, 2026
@man-group man-group deleted a comment from github-actions Bot Apr 14, 2026
@jamesblackburn jamesblackburn changed the title Fix prune previous Fix prune_previous_version race condition causing data loss under concurrent writes Apr 14, 2026
@jamesblackburn jamesblackburn marked this pull request as draft April 14, 2026 08:11
@man-group man-group deleted a comment from claude Bot Apr 14, 2026
@jamesblackburn jamesblackburn force-pushed the fix-prune-previous branch 2 times, most recently from bcfb34f to b613b4b Compare April 14, 2026 19:27
…current writes

When two writers concurrently append to the same symbol using
prune_previous_version=True, a race condition can corrupt the second
writer's version: Writer A commits V1, prunes V0 (deleting its data
segments), then Writer B commits V2 as an append on top of V0, referencing
V0's now-deleted segments.

Two complementary protections are added to write_and_prune_previous():

1. Protection window: versions younger than PrunePreviousProtectionSecs
   (default 600 s) are ineligible for pruning, giving concurrent writers
   time to commit before their base version is deleted.

2. Anchor rule: the newest eligible version is always kept alive as an
   anchor for concurrent writers still in flight.  Pruning only fires when
   there are >= 2 eligible candidates; the newest becomes the anchor and
   the second-newest becomes the tombstone boundary.  With eager deletion
   this means >= 2 live versions survive after a pruning write.

When delayed_deletes (background_deletion) is active the anchor rule is
skipped: the background tool performs its own reference-check before
physically removing data, so the extra live version is unnecessary.  The
protection window still applies in both modes.

Also fixes a pre-existing bug where the tombstone_all_key was constructed
from previous_key rather than effective_tombstone_key, placing the logical
deletion boundary one version too high.

Behaviour change: with eager deletion, >= 2 versions survive after a
pruning write instead of 1.  The explicit prune_previous_versions() admin
method is unchanged and continues to prune unconditionally.

New config: VersionStore.PrunePreviousProtectionSecs (default 600).
…tests

When PrunePreviousProtectionSecs=0 the cutoff was current_timestamp()-0
which equals current_timestamp().  Versions written in the same nanosecond
could have creation_ts >= cutoff and thus be incorrectly excluded from
pruning.  Fix: when protection_secs==0, skip the timestamp check entirely
so all versions are always eligible (the intended "no protection" semantics).
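
The timestamp guard described above can be sketched as follows. Both functions are hypothetical Python stand-ins for the C++ check, using integer nanosecond timestamps as the message implies.

```python
NANOS_PER_SEC = 1_000_000_000


def is_eligible_before_fix(creation_ts_ns: int, now_ns: int, protection_secs: int) -> bool:
    # cutoff == now when protection_secs == 0, so a version created in the
    # same nanosecond has creation_ts >= cutoff and is wrongly excluded.
    cutoff_ns = now_ns - protection_secs * NANOS_PER_SEC
    return creation_ts_ns < cutoff_ns


def is_eligible_after_fix(creation_ts_ns: int, now_ns: int, protection_secs: int) -> bool:
    # protection_secs == 0 means "no protection": skip the timestamp check.
    if protection_secs == 0:
        return True
    cutoff_ns = now_ns - protection_secs * NANOS_PER_SEC
    return creation_ts_ns < cutoff_ns


now = 5_000 * NANOS_PER_SEC
print(is_eligible_before_fix(now, now, 0))  # False
print(is_eligible_after_fix(now, now, 0))   # True
```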

Update hypothesis model (_prune_previous_versions) to reflect that with
delayed_deletes=True the anchor itself is the tombstone boundary, so ALL
NORMAL versions including sole candidates are tombstoned.  Update
integration tests that relied on the old (broken) timestamp guard behaviour.
@jamesblackburn
Collaborator Author

Issue with as_of reads when a version is only held by snapshots: #3034
