Skip to content

fix liveness issue when leader's disk stalls#683

Open
tgross wants to merge 3 commits into
mainfrom
liveness-on-leader-disk-stall
Open

fix liveness issue when leader's disk stalls#683
tgross wants to merge 3 commits into
mainfrom
liveness-on-leader-disk-stall

Conversation

@tgross
Copy link
Copy Markdown
Member

@tgross tgross commented May 19, 2026

If the leader's disk stalls, we can't replicate logs to followers. But the heartbeat RPCs will still be sent concurrently, and this resets the heartbeat timeout on the followers. This is intentional in order to avoid leadership flapping in the face of temporary disk stalls, but has the side effect of preventing recovery from a more persistent disk stall.

Have the leader set a timestamp at the start of each replication for each follower, and clear it on either success or failure. The heartbeater will check this timestamp if set, and if has been set for too long, delay the next heartbeat so that the leader can eventually step down.

Fixes: #666
Fixes: #503
Fixes: #612
Fixes: #614

@tgross tgross changed the title Liveness on leader disk stall fix liveness issue when leader's disk stalls May 19, 2026
@tgross tgross added the bug label May 19, 2026
@tgross tgross force-pushed the liveness-on-leader-disk-stall branch from 46df219 to 8049248 Compare May 19, 2026 17:44
@tgross tgross marked this pull request as ready for review May 19, 2026 18:00
@tgross tgross requested review from a team as code owners May 19, 2026 18:00
tgross added 2 commits May 26, 2026 20:10
If the leader's disk stalls, we can't replicate logs to followers. But the
heartbeat RPCs will still be sent concurrently, and this resets the heartbeat
timeout on the followers. This is intentional in order to avoid leadership
flapping in the face of temporary disk stalls, but has the side effect of
preventing recovery from a more persistent disk stall.
Have the leader set a timestamp at the start of each replication for each
follower, and clear it on either success or failure. The heartbeater will check
this timestamp if set, and if has been set for too long, delay the next
heartbeat so that the leader can eventually step down.

Fixes: #666
Fixes: #503
Fixes: #612
Fixes: #614
@tgross tgross removed the request for review from a team May 27, 2026 00:12
@tgross tgross force-pushed the liveness-on-leader-disk-stall branch from 8049248 to 5a6195c Compare May 27, 2026 00:13
Copy link
Copy Markdown
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies if we already covered this and my fuzzy memory just forgot, but it might be nice to find some place to record this in the code itself:

Since the goal is to delay heartbeats in response to leader disk slowness, why is replication the place to do it? Won't that include a lot of other conditions as well such as follower disk stalls and snapshotting? Since replicateTo may send an entire snapshot, might this approach cause heartbeat failures during large snapshots?

Apologies if these are naive questions, I'm in this code extremely infrequently!

Comment thread replication.go
Comment on lines +80 to +82
lastReplicationStart time.Time
// lastReplicationStartLock protects 'lastReplicationStart'.
lastReplicationStartLock sync.RWMutex
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the critical section is only a load or store, we could use a sync.Pointer instead. Your choice matches existing code (lastContact) though, so either way is fine.

Comment thread replication.go

lastReplicationStart := s.getLastReplicationStart()
if !lastReplicationStart.IsZero() {
maxLastReplication := r.config().HeartbeatTimeout * 10
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's always comment magic numbers (especially since I can't remember exactly why we picked this one)

Suggested change
maxLastReplication := r.config().HeartbeatTimeout * 10
// Replication timeout should be relatively long to avoid
// costly leadership elections due to temporary replication
// stalls.
maxLastReplication := r.config().HeartbeatTimeout * 10

maybe?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

2 participants