fix liveness issue when leader's disk stalls by tgross · Pull Request #683 · hashicorp/raft

tgross · 2026-05-19T17:30:43Z

If the leader's disk stalls, we can't replicate logs to followers. But the heartbeat RPCs will still be sent concurrently, and this resets the heartbeat timeout on the followers. This is intentional in order to avoid leadership flapping in the face of temporary disk stalls, but has the side effect of preventing recovery from a more persistent disk stall.

Have the leader set a timestamp at the start of each replication for each follower, and clear it on either success or failure. The heartbeater will check this timestamp if set, and if has been set for too long, delay the next heartbeat so that the leader can eventually step down.

Fixes: #666
Fixes: #503
Fixes: #612
Fixes: #614

If the leader's disk stalls, we can't replicate logs to followers. But the heartbeat RPCs will still be sent concurrently, and this resets the heartbeat timeout on the followers. This is intentional in order to avoid leadership flapping in the face of temporary disk stalls, but has the side effect of preventing recovery from a more persistent disk stall.

Have the leader set a timestamp at the start of each replication for each follower, and clear it on either success or failure. The heartbeater will check this timestamp if set, and if has been set for too long, delay the next heartbeat so that the leader can eventually step down. Fixes: #666 Fixes: #503 Fixes: #612 Fixes: #614

schmichael

Apologies if we already covered this and my fuzzy memory just forgot, but it might be nice to find some place to record this in the code itself:

Since the goal is to delay heartbeats in response to leader disk slowness, why is replication the place to do it? Won't that include a lot of other conditions as well such as follower disk stalls and snapshotting? Since replicateTo may send an entire snapshot, might this approach cause heartbeat failures during large snapshots?

Apologies if these are naive questions, I'm in this code extremely infrequently!

schmichael · 2026-05-27T00:10:22Z

+	lastReplicationStart time.Time
+	// lastReplicationStartLock protects 'lastReplicationStart'.
+	lastReplicationStartLock sync.RWMutex


Since the critical section is only a load or store, we could use a sync.Pointer instead. Your choice matches existing code (lastContact) though, so either way is fine.

schmichael · 2026-05-27T00:20:15Z

+
+		lastReplicationStart := s.getLastReplicationStart()
+		if !lastReplicationStart.IsZero() {
+			maxLastReplication := r.config().HeartbeatTimeout * 10


Let's always comment magic numbers (especially since I can't remember exactly why we picked this one)

Suggested change

maxLastReplication := r.config().HeartbeatTimeout * 10

// Replication timeout should be relatively long to avoid

// costly leadership elections due to temporary replication

// stalls.

maxLastReplication := r.config().HeartbeatTimeout * 10

maybe?

tgross changed the title ~~Liveness on leader disk stall~~ fix liveness issue when leader's disk stalls May 19, 2026

tgross added the bug label May 19, 2026

tgross force-pushed the liveness-on-leader-disk-stall branch from 46df219 to 8049248 Compare May 19, 2026 17:44

tgross marked this pull request as ready for review May 19, 2026 18:00

tgross requested review from a team as code owners May 19, 2026 18:00

tgross mentioned this pull request May 19, 2026

asynchronous heartbeat can cause a stale leader to hold lease indefinitely during disk IO stall #666

Open

tgross added 2 commits May 26, 2026 20:10

tgross removed the request for review from a team May 27, 2026 00:12

tgross force-pushed the liveness-on-leader-disk-stall branch from 8049248 to 5a6195c Compare May 27, 2026 00:13

clean up lints from old PR

7007c0a

schmichael reviewed May 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix liveness issue when leader's disk stalls#683

fix liveness issue when leader's disk stalls#683
tgross wants to merge 3 commits into
mainfrom
liveness-on-leader-disk-stall

tgross commented May 19, 2026

Uh oh!

schmichael left a comment

Uh oh!

schmichael May 27, 2026

Uh oh!

schmichael May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-			maxLastReplication := r.config().HeartbeatTimeout * 10
+			// Replication timeout should be relatively long to avoid
+			// costly leadership elections due to temporary replication
+			// stalls.
+			maxLastReplication := r.config().HeartbeatTimeout * 10

Conversation

tgross commented May 19, 2026

Uh oh!

schmichael left a comment

Choose a reason for hiding this comment

Uh oh!

schmichael May 27, 2026

Choose a reason for hiding this comment

Uh oh!

schmichael May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants