fix(l1): mark unresponsive peers as disposable to prevent snapsync stalls #6497
azteca1998 wants to merge 3 commits into main
Conversation
Fixes snapsync failures where peer count stays constant and sync eventually fails with "Failed to receive block headers" after hours of operation.

Root cause: After PR #6458 introduced Kademlia k-buckets, peers that became unresponsive during sync weren't marked as disposable, so they remained in the routing table indefinitely. New peers went into replacement lists but were never promoted because dead peers weren't pruned.

Changes:
- Enhanced prune() to remove disposable contacts from both main and replacement lists, with automatic promotion of replacements
- Mark peers as disposable when they timeout during RLPx operations (block headers, block bodies, sync head requests)
- Added periodic pruning in the snap_sync main loop to ensure dead peers are regularly removed and replaced

Evidence from CI artifacts showed peer count stuck at 6 throughout a 3h35m sync before failure. This fix enables peer rotation so healthy peers from replacement lists can take over when active peers become unresponsive.
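A minimal sketch of the pruning-and-promotion behaviour described in this commit message, using simplified stand-in types (a `NodeId` alias instead of the real `H256`, and only the fields the sketch needs); the actual `prune()` in `peer_table.rs` may be structured differently:

```rust
use std::collections::HashSet;

// Simplified stand-ins for the real peer-table types; field names follow the
// snippets quoted in the reviews below, everything else is illustrative.
type NodeId = [u8; 32];

struct Contact {
    disposable: bool,
    // ... other peer state elided ...
}

struct KBucket {
    contacts: Vec<(NodeId, Contact)>,
    replacements: Vec<(NodeId, Contact)>,
}

/// Remove disposable contacts from both lists and promote surviving
/// replacements into the freed main slots.
fn prune_bucket(bucket: &mut KBucket, discarded: &mut HashSet<NodeId>) {
    let before = bucket.contacts.len();

    // Drop disposable contacts from the main list, remembering their ids.
    bucket.contacts.retain(|(id, c)| {
        if c.disposable {
            discarded.insert(*id);
            false
        } else {
            true
        }
    });
    let freed = before - bucket.contacts.len();

    // Drop disposable replacements too, so they can never be promoted later.
    bucket.replacements.retain(|(id, c)| {
        if c.disposable {
            discarded.insert(*id);
            false
        } else {
            true
        }
    });

    // Promote surviving replacements into the freed main slots.
    for _ in 0..freed {
        match bucket.replacements.pop() {
            Some(entry) => bucket.contacts.push(entry),
            None => break,
        }
    }
}
```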
The Kademlia k-bucket implementation only iterated over main bucket contacts, ignoring replacement entries. This caused peer starvation because dead contacts in the main list were never replaced by fresher peers from the replacement list. Fix iter_contacts() and do_get_contact_to_initiate() to also check replacement contacts, allowing the node to discover and connect to peers that were previously invisible to the peer selection logic.
KBucket::get_mut and get_contact only searched the main contact list, so any state mutation (set_disposable, ping tracking, find_node count, mark_knows_us) silently failed for contacts in the replacement list. Since iter_contacts and do_get_contact_to_initiate now return replacement contacts, this caused phantom contacts that were visible to selection but invisible to updates. Update get_contact to use get_any (main + replacements) and get_mut to search both lists, ensuring all contact state mutations work regardless of which list holds the contact.
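A companion sketch for the lookup fix described above, reusing the stand-in `KBucket`/`Contact` types from the previous sketch; `get_any_mut` is an illustrative name rather than the crate's exact API:

```rust
impl KBucket {
    /// Look up a contact in the main list first, then the replacement list,
    /// so state mutations (ping tracking, set_disposable, ...) reach a
    /// contact regardless of which list currently holds it.
    fn get_any_mut(&mut self, node_id: &NodeId) -> Option<&mut Contact> {
        self.contacts
            .iter_mut()
            .chain(self.replacements.iter_mut())
            .find(|(id, _)| id == node_id)
            .map(|(_, c)| c)
    }
}
```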
5ced7a1 to 4bdc22c
🤖 Kimi Code Review

Review Summary

The PR improves peer management by marking unresponsive peers as "disposable" and ensuring replacement peers are properly utilized. While the intent is correct, there are logic errors in the pruning implementation, performance concerns, and API semantic changes that need attention.

Critical Issues

1. Disposable replacements can be promoted to the main list

Suggestion: Check the disposable flag before promoting, or skip disposable replacements during promotion.

```rust
// In remove_and_promote or before calling it:
while let Some((id, contact)) = bucket.replacements.pop() {
    if !contact.disposable {
        bucket.contacts.push((id, contact));
        break;
    }
    self.discarded_contacts.insert(id);
}
```

2. O(N²) complexity in replacement pruning

Suggestion: Use a single `retain` pass:

```rust
bucket.replacements.retain(|(id, c)| {
    if c.disposable {
        self.discarded_contacts.insert(*id);
        false
    } else {
        true
    }
});
```

3. Semantic change to …

Suggestion: Either rename to …

Error Handling & Robustness

4. Silent error suppression

Suggestion: Log at debug level:

```rust
if let Err(e) = self.peer_table.set_disposable(peer_id) {
    debug!("Failed to mark peer {} as disposable: {}", peer_id, e);
}
```

5. Inconsistent error handling

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt
🤖 Claude Code Review

Now I have enough context for a thorough review.

PR #6497 — fix(p2p): Mark unresponsive peers as disposable to prevent snapsync stalls

Overview

This PR fixes a Kademlia k-bucket management bug introduced in #6458 that caused snap sync to stall after hours of operation. The root cause: timed-out or misbehaving peers were never removed from k-buckets, preventing replacement contacts from being promoted and starving the node of fresh peers. The fix is well-motivated and the diagnosis is solid.

Issues Found

1. …

2. Aggressive immediate disposal on a single block body failure

```rust
self.peer_table.record_failure(peer_id)?;
let _ = self.peer_table.set_disposable(peer_id);
```

This is the only location where both …

3.

```rust
loop {
    let _ = peers.peer_table.prune_table();
    ...
}
```

Each call is an async round-trip to the peer table actor and iterates over all 256 k-buckets. While individually cheap (O(K × 256)), calling it on every tight loop iteration during active sync is unnecessary — peers don't become disposable at that frequency. A simple iteration counter (e.g., every 100 iterations, or on a 30-second timer) would give the same benefit without the per-iteration actor overhead.

4.

```rust
let replacement_disposable: HashSet<H256> = bucket.replacements
    .iter()
    .filter(|(_, c)| c.disposable)
    .map(|(id, _)| *id)
    .collect();
bucket.replacements.retain(|(id, _)| !replacement_disposable.contains(id));
```

5. Misplaced comment in …

The new comment …

6. Inconsistent error propagation

```rust
self.peer_table.record_failure(peer_id)?;        // propagates
let _ = self.peer_table.set_disposable(peer_id); // silently dropped
```
Positive Notes
Summary

The core logic is sound and addresses the real root cause. The main concerns are the potential behavioural change from the …

Automated review by Claude (Anthropic) · sonnet · custom prompt
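The throttling idea from issue 3 above could look roughly like this; it is only a sketch, and the interval value, loop body, and `prune_table()` call site are placeholders rather than the PR's actual code:

```rust
use std::time::{Duration, Instant};

/// Run a pruning action at most once per `interval`, even from a hot loop.
fn maybe_prune(last_prune: &mut Instant, interval: Duration, prune: impl FnOnce()) {
    if last_prune.elapsed() >= interval {
        prune();
        *last_prune = Instant::now();
    }
}

fn main() {
    let interval = Duration::from_secs(30);
    let mut last_prune = Instant::now();

    loop {
        // In the PR's snap_sync loop this closure would wrap something like
        // `let _ = peers.peer_table.prune_table();` (fire-and-forget).
        maybe_prune(&mut last_prune, interval, || { /* prune_table() here */ });

        // ... rest of the sync iteration ...
        break; // placeholder so this sketch terminates
    }
}
```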
🤖 Codex Code Review
No EVM/consensus-specific issues in this diff; the concerns are confined to P2P peer lifecycle handling. I also did not see regression tests covering either of the cases above.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt
Greptile Summary

This PR fixes snapsync stalls caused by the Kademlia k-bucket implementation from #6458 by marking unresponsive peers as disposable during sync operations, extending …

Confidence Score: 5/5

Safe to merge; all findings are P2 style/design suggestions with no blocking correctness issues. The fix is logically sound and addresses the root cause described in the PR. The three remaining comments are P2: a minor O(n²) nit in a 10-element list, a note about unintended scope broadening of iter_contacts, and a design question about single-miss disposal aggressiveness. None of these block correctness or safety.

peer_table.rs — the iter_contacts scope expansion affects five query functions beyond the direct fix, worth a second read

| Filename | Overview |
|---|---|
| crates/networking/p2p/peer_table.rs | Extended prune() to also clean replacement lists and promote replacements on main-contact removal; get_contact/get_mut and iter_contacts now cover both main and replacement lists, broadening the scope of several query functions beyond the direct fix intent |
| crates/networking/p2p/peer_handler.rs | Adds set_disposable calls on timeout in ask_peer_head_number, on empty/invalid headers, and on block-body failures (alongside existing record_failure); aggressive single-miss disposal may shrink peer pool over long syncs |
| crates/networking/p2p/sync/snap_sync.rs | Adds a fire-and-forget prune_table() call at the top of the sync loop so dead peers are evicted and replacements promoted on each iteration |
Sequence Diagram

```mermaid
sequenceDiagram
participant SL as snap_sync loop
participant PH as PeerHandler
participant PT as PeerTableServer
participant KB as KBucket
SL->>PT: prune_table() [fire & forget]
activate PT
PT->>KB: remove disposable from main list
KB-->>PT: promote first replacement to main
PT->>KB: remove disposable from replacement list
PT->>PT: add removed node_ids to discarded_contacts
deactivate PT
SL->>PH: request_block_headers()
PH->>PH: get_random_peer()
alt Timeout / empty / invalid headers
PH->>PT: set_disposable(peer_id) [fire & forget]
PH-->>SL: Ok(None)
else Valid headers
PH-->>SL: Ok(Some(headers))
end
SL->>PH: request_block_bodies()
alt Failure / timeout
PH->>PT: record_failure(peer_id)
PH->>PT: set_disposable(peer_id) [fire & forget]
PH-->>SL: Ok(None)
else Valid bodies
PH->>PT: record_success(peer_id)
PH-->>SL: Ok(Some(bodies))
end
```
crates/networking/p2p/peer_table.rs

```diff
         None
     }

-    /// Iterate over all contacts across all buckets.
+    /// Iterate over all contacts across all buckets (main and replacement lists).
     fn iter_contacts(&self) -> impl Iterator<Item = (&H256, &Contact)> {
-        self.buckets
-            .iter()
-            .flat_map(|bucket| bucket.contacts.iter().map(|(id, c)| (id, c)))
+        self.buckets.iter().flat_map(|bucket| {
+            bucket
+                .contacts
+                .iter()
+                .chain(bucket.replacements.iter())
+                .map(|(id, c)| (id, c))
+        })
```
iter_contacts expansion silently broadens multiple consumers
iter_contacts is used by five functions beyond the set_disposable path that directly motivates this fix: do_get_contact_for_lookup, do_get_contact_for_enr_lookup, do_get_contact_to_revalidate, do_find_closest_nodes, and do_get_nodes_at_distances. Replacement contacts will now be revalidated, sent FindNode queries, and returned in neighbor advertisements. For do_find_closest_nodes this is benign (candidates are still capped at 16), but revalidating or sending FindNode to a replacement that was intentionally held back (because its bucket was full) may generate extra traffic and slightly change the discovery convergence behavior. This is worth a targeted comment acknowledging the broader scope.
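One way to act on this comment without changing the other five consumers, sketched against the bucket fields visible in the diff above; the split into two iterators is hypothetical and not what the PR does:

```rust
// Hypothetical split: keep the original main-list-only iterator for discovery
// paths, and add a wider one for state mutations such as set_disposable.
// `KBucket`, `Contact`, and `H256` are the types shown elsewhere in this PR.

fn iter_main_contacts(buckets: &[KBucket]) -> impl Iterator<Item = (&H256, &Contact)> {
    buckets
        .iter()
        .flat_map(|bucket| bucket.contacts.iter().map(|(id, c)| (id, c)))
}

fn iter_all_contacts(buckets: &[KBucket]) -> impl Iterator<Item = (&H256, &Contact)> {
    buckets.iter().flat_map(|bucket| {
        bucket
            .contacts
            .iter()
            .chain(bucket.replacements.iter())
            .map(|(id, c)| (id, c))
    })
}
```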
crates/networking/p2p/peer_table.rs

```diff
+            let replacement_disposable: Vec<H256> = bucket
+                .replacements
+                .iter()
+                .filter(|(_, c)| c.disposable)
+                .map(|(id, _)| *id)
+                .collect();
+
+            bucket
+                .replacements
+                .retain(|(id, _)| !replacement_disposable.contains(id));
```
O(n²) `Vec::contains` in `retain` loop

`replacement_disposable` is a `Vec`, so `replacement_disposable.contains(id)` inside `retain` is O(n) per element, making the full retain O(n²). For `MAX_REPLACEMENTS_PER_BUCKET = 10` this is negligible, but converting to a `FxHashSet` keeps this consistent with the codebase's other hash-set patterns and is a trivial change.

```rust
let replacement_disposable: FxHashSet<H256> = bucket
    .replacements
    .iter()
    .filter(|(_, c)| c.disposable)
    .map(|(id, _)| *id)
    .collect();
bucket
    .replacements
    .retain(|(id, _)| !replacement_disposable.contains(id));
```
crates/networking/p2p/peer_handler.rs

```diff
                 "[SYNCING] Didn't receive block bodies from peer, penalizing peer {peer_id}..."
             );
             self.peer_table.record_failure(peer_id)?;
+            let _ = self.peer_table.set_disposable(peer_id);
```
record_failure + set_disposable on every block-body miss
A single bad block-body response (empty body, wrong count, or timeout) now permanently disposes the peer via set_disposable, on top of the pre-existing record_failure score penalty. Since set_disposable is the stronger action—it prevents the peer from ever re-entering the main bucket—calling record_failure first is harmless but redundant. More importantly, transient failures (brief network hiccup, peer momentarily behind) will now permanently remove the peer from the routing table after a single miss, which may be too aggressive and could reduce the effective peer pool over a long sync. Consider requiring either multiple consecutive failures or a combined threshold before marking disposable, similar to how record_critical_failure is used for misbehaving peers.
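A rough illustration of the suggested threshold, assuming a per-contact consecutive-failure counter; the field, constant, and method names are hypothetical and not part of the PR:

```rust
// Hypothetical: dispose of a peer only after several consecutive block-body
// failures instead of on the first miss; a success resets the counter.
const MAX_CONSECUTIVE_FAILURES: u32 = 3;

struct Contact {
    disposable: bool,
    consecutive_failures: u32,
    // ... other peer state elided ...
}

impl Contact {
    fn record_body_failure(&mut self) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES {
            self.disposable = true;
        }
    }

    fn record_body_success(&mut self) {
        self.consecutive_failures = 0;
    }
}
```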
Superseded by #6511 (Kademlia v2) which includes all these fixes.
fix(p2p): Mark unresponsive peers as disposable to prevent snapsync stalls
Summary
Fixes snapsync failures caused by the Kademlia k-bucket implementation introduced in PR #6458. The issue manifested as "Node failed to snapsync" errors after 3h35m with "Failed to receive block headers" when the peer count remained stuck at 6 peers throughout the sync.
Closes #XXXX (if issue exists)
Related to #6458
Problem
After PR #6458 introduced proper Kademlia k-bucket routing tables, long-running snapsync operations would fail because:
- Peers that became unresponsive during sync were never marked as `disposable`, so they remained in the main bucket list indefinitely
- The `prune()` function was only called manually, not during sync loops
- The `prune()` function only checked main contacts, leaving disposable contacts in replacement lists

Evidence from CI Artifacts
Analysis of failed CI run artifacts (lighthouse-sepolia) showed:
"Peer handler error: Failed to receive block headers"Changes
1. Enhanced `prune()` to handle replacement lists (`peer_table.rs`)

2. Mark peers as disposable on timeout/error (`peer_handler.rs`)

- Added `peer_table.set_disposable(peer_id)` calls when:
  - `PeerConnectionError::Timeout` occurs in `request_sync_head()`

3. Added periodic pruning to sync loop (`snap_sync.rs`)

Test Plan

- `cargo check`
- `cargo fmt`

Expected Behavior After Fix
Metrics to Monitor
Breaking Changes
None - this is a bug fix that makes the Kademlia implementation work as intended.
Additional Context
The root cause was that Kademlia's replacement list feature (designed to hold candidate peers until main contacts fail) wasn't working because peers were never marked as failed during sync operations. This is a critical fix for production deployments running long sync operations.