Closed
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR addresses a checkpoint hang scenario in Tsavorite/Garnet by preventing orphaned semaphores in the state machine waiting list and by ensuring snapshot flush completion signaling is not skipped when exceptions occur.
Changes:
- Prevent multiple
lastVersionTransactionsDonesemaphores from being enqueued for the same version; add typed waiting-list entries and trace logging to improve diagnosability. - Update checkpoint SM tasks to tag semaphores with origin/type to make hangs easier to identify.
- Improve a cluster replication test by using
Task.WhenAll(...).Wait(timeout)and reporting task statuses on timeout.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| test/Garnet.test.cluster/ReplicationTests/ClusterReplicationBaseTests.cs | Improves checkpoint/replica-attach wait logic and timeout diagnostics in a replication cleanup test. |
| libs/storage/Tsavorite/cs/src/core/Index/Recovery/IndexCheckpoint.cs | Tags index checkpoint semaphores when added to the state machine waiting list; removes unused async completion helper. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineDriver.cs | Prevents duplicate semaphores per version; replaces raw semaphore waiting list with typed entries and adds trace logs during waits. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/SnapshotCheckpointSMTask.cs | Tags snapshot flush semaphore when enqueued. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/IncrementalSnapshotCheckpointSMTask.cs | Tags incremental snapshot flush semaphore when enqueued and simplifies phase handling. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/FoldOverSMTask.cs | Tags fold-over flush semaphore when enqueued. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/SemaphoreWaiterMonitor.cs | Introduces StateMachineSemaphoreType and a wrapper struct to track semaphore provenance in the waiting list. |
| libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | Wraps snapshot flush runner in try/catch and ensures the completion semaphore is released on exception; adjusts flush callback release placement. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
badrishc
reviewed
Apr 3, 2026
badrishc
reviewed
Apr 3, 2026
libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineDriver.cs
Outdated
Show resolved
Hide resolved
d0e05d5 to
c1a7fdd
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR attempts to fix checkpoint hanging due to uncaught exception during AsyncFlushPagesForSnapshot.