fix liveness issue when leader's disk stalls #683
If the leader's disk stalls, we can't replicate logs to followers. But the heartbeat RPCs will still be sent concurrently, and this resets the heartbeat timeout on the followers. This is intentional in order to avoid leadership flapping in the face of temporary disk stalls, but has the side effect of preventing recovery from a more persistent disk stall.
Have the leader set a timestamp at the start of each replication for each follower, and clear it on either success or failure. The heartbeater will check this timestamp if set, and if it has been set for too long, delay the next heartbeat so that the leader can eventually step down.
Fixes: #666
Fixes: #503
Fixes: #612
Fixes: #614
schmichael
left a comment
Apologies if we already covered this and my fuzzy memory just forgot, but it might be nice to find some place to record this in the code itself:
Since the goal is to delay heartbeats in response to leader disk slowness, why is replication the place to do it? Won't that include a lot of other conditions as well such as follower disk stalls and snapshotting? Since replicateTo may send an entire snapshot, might this approach cause heartbeat failures during large snapshots?
Apologies if these are naive questions, I'm in this code extremely infrequently!
lastReplicationStart time.Time
// lastReplicationStartLock protects 'lastReplicationStart'.
lastReplicationStartLock sync.RWMutex
Since the critical section is only a load or store, we could use an atomic.Pointer (from sync/atomic) instead. Your choice matches existing code (lastContact) though, so either way is fine.
lastReplicationStart := s.getLastReplicationStart()
if !lastReplicationStart.IsZero() {
	maxLastReplication := r.config().HeartbeatTimeout * 10
Let's always comment magic numbers (especially since I can't remember exactly why we picked this one)
- maxLastReplication := r.config().HeartbeatTimeout * 10
+ // Replication timeout should be relatively long to avoid
+ // costly leadership elections due to temporary replication
+ // stalls.
+ maxLastReplication := r.config().HeartbeatTimeout * 10
maybe?