
fix(l1): mark unresponsive peers as disposable to prevent snapsync stalls#6497

Closed
azteca1998 wants to merge 3 commits into main from fix/kademlia-snapsync-peer-pruning

Conversation

@azteca1998
Contributor

fix(p2p): Mark unresponsive peers as disposable to prevent snapsync stalls

Summary

Fixes snapsync failures caused by the Kademlia k-bucket implementation introduced in PR #6458. The issue manifested as "Node failed to snapsync" errors after 3h35m, with "Failed to receive block headers" reported while the peer count remained stuck at 6 peers throughout the sync.

Closes #XXXX (if issue exists)
Related to #6458

Problem

After PR #6458 introduced proper Kademlia k-bucket routing tables, long-running snapsync operations would fail because:

  1. Unresponsive peers weren't being pruned: Contacts that time out during RLPx operations were never marked as disposable, so they remained in the main bucket list indefinitely
  2. Replacement contacts couldn't be promoted: Newly discovered peers went into replacement lists (when buckets were full), but since dead peers weren't removed, replacements never got promoted (see the sketch after this list)
  3. No periodic pruning during sync: The prune() function was only called manually, not during sync loops
  4. Incomplete prune implementation: The prune() function only checked main contacts, leaving disposable contacts in replacement lists
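
For reference, a minimal sketch of the routing-table layout the list above refers to. Field names (contacts, replacements, disposable, discarded_contacts) follow the diff excerpts quoted later in this thread; the concrete types are assumptions for illustration only.

use std::collections::HashSet;

// Stand-in for the real H256 node-id type; assumption for this sketch.
type H256 = [u8; 32];

struct Contact {
    // Set when the peer times out or misbehaves during sync; prune() evicts it.
    disposable: bool,
    // ... ENR, endpoint and liveness bookkeeping elided
}

struct KBucket {
    // Main list: the contacts actually used for routing and peer selection.
    contacts: Vec<(H256, Contact)>,
    // Candidates waiting for a slot; only useful if dead main contacts are pruned.
    replacements: Vec<(H256, Contact)>,
}

struct PeerTable {
    // One bucket per XOR distance.
    buckets: Vec<KBucket>,
    // Node ids that were pruned and should not be re-admitted.
    discarded_contacts: HashSet<H256>,
}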

Evidence from CI Artifacts

Analysis of failed CI run artifacts (lighthouse-sepolia) showed:

  • Peer count stuck at 6 peers throughout entire 3h35m sync
  • Sync rate drops to 0 slots/s at 09:28:06 and never recovers
  • Final error: "Peer handler error: Failed to receive block headers"
  • Peers exist but become unresponsive, yet no new peers are discovered/connected

Changes

1. Enhanced prune() to handle replacement lists (peer_table.rs)

fn prune(&mut self) {
    for bucket in &mut self.buckets {
        // Remove from main list and promote replacements
        for node_id in main_disposable {
            bucket.remove_and_promote(&node_id);
            self.discarded_contacts.insert(node_id);
        }

        // Remove disposable contacts from replacement list
        bucket.replacements.retain(|(id, _)| !replacement_disposable.contains(id));
    }
}
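
The snippet above elides how main_disposable and replacement_disposable are collected. A sketch of that step, mirroring the diff excerpt quoted in the Greptile comments further down (the main-list collection is assumed to be symmetric):

for bucket in &mut self.buckets {
    // Node ids in the main list whose contact is flagged disposable.
    let main_disposable: Vec<H256> = bucket
        .contacts
        .iter()
        .filter(|(_, c)| c.disposable)
        .map(|(id, _)| *id)
        .collect();

    // Node ids in the replacement list whose contact is flagged disposable.
    let replacement_disposable: Vec<H256> = bucket
        .replacements
        .iter()
        .filter(|(_, c)| c.disposable)
        .map(|(id, _)| *id)
        .collect();

    // ... removal, promotion and replacement retention as shown above
}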

2. Mark peers as disposable on timeout/error (peer_handler.rs)

  • Added peer_table.set_disposable(peer_id) calls (sketched after this list) when:
    • PeerConnectionError::Timeout occurs in request_sync_head()
    • Empty/invalid headers received from peer
    • Block bodies request fails or times out
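
For illustration, a sketch of one such call site. Only the Timeout-marks-disposable behavior comes from this PR; the surrounding match shape is assumed, and request_sync_head / PeerConnectionError::Timeout are the names used in the description above.

match self.request_sync_head(peer_id).await {
    Ok(head) => Ok(Some(head)),
    Err(PeerConnectionError::Timeout) => {
        // Unresponsive peer: flag it so the next prune() evicts it and a
        // replacement contact can be promoted. Failing to flag is non-fatal,
        // so the error is intentionally dropped.
        let _ = self.peer_table.set_disposable(peer_id);
        Ok(None)
    }
    Err(other) => Err(other),
}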

3. Added periodic pruning to sync loop (snap_sync.rs)

loop {
    // Prune dead/unresponsive peers periodically to allow replacements to be promoted
    let _ = peers.peer_table.prune_table();

    // ... sync operations
}

Test Plan

  • Code compiles with cargo check
  • Passes cargo fmt
  • Run daily snapsync test on lighthouse-sepolia (should not timeout at 3h35m)
  • Run daily snapsync test on prysm-sepolia
  • Verify peer count increases during long sync operations
  • Verify dead peers are replaced with active peers from replacement lists

Expected Behavior After Fix

  1. When a peer times out during snapsync, it's marked as disposable
  2. Next prune cycle removes it from the bucket and promotes a replacement
  3. New peers can be discovered and connected throughout the sync
  4. Peer count should fluctuate as dead peers are replaced with healthy ones
  5. Long-running syncs should not stall due to unresponsive peers

Metrics to Monitor

  • Peer count during sync (should not stay constant for hours)
  • Sync rate (should not drop to 0 and stay there)
  • Peer churn (disposable peers removed, replacements promoted)
  • Time to complete snapsync (should be < 3h35m timeout)

Breaking Changes

None - this is a bug fix that makes the Kademlia implementation work as intended.

Additional Context

The root cause was that Kademlia's replacement list feature (designed to hold candidate peers until main contacts fail) wasn't working because peers were never marked as failed during sync operations. This is a critical fix for production deployments running long sync operations.

@azteca1998 azteca1998 changed the title from "fix(p2p): Mark unresponsive peers as disposable to prevent snapsync stalls" to "fix(l1): mark unresponsive peers as disposable to prevent snapsync stalls" on Apr 16, 2026
@github-actions github-actions Bot added the L1 Ethereum client label Apr 16, 2026
@github-actions

github-actions Bot commented Apr 16, 2026

Lines of code report

Total lines added: 26
Total lines removed: 0
Total lines changed: 26

Detailed view
+------------------------------------------------+-------+------+
| File                                           | Lines | Diff |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/peer_handler.rs   | 555   | +4   |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/peer_table.rs     | 1250  | +21  |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/snap_sync.rs | 1020  | +1   |
+------------------------------------------------+-------+------+

…talls

Fixes snapsync failures where peer count stays constant and sync
eventually fails with "Failed to receive block headers" after hours
of operation.

Root cause: After PR #6458 introduced Kademlia k-buckets, peers that
became unresponsive during sync weren't marked as disposable, so they
remained in the routing table indefinitely. New peers went into
replacement lists but were never promoted because dead peers weren't
pruned.

Changes:
- Enhanced prune() to remove disposable contacts from both main and
  replacement lists, with automatic promotion of replacements
- Mark peers as disposable when they timeout during RLPx operations
  (block headers, block bodies, sync head requests)
- Added periodic pruning in the snap_sync main loop to ensure dead
  peers are regularly removed and replaced

Evidence from CI artifacts showed peer count stuck at 6 throughout
3h35m sync before failure. This fix enables peer rotation so healthy
peers from replacement lists can take over when active peers become
unresponsive.

The Kademlia k-bucket implementation only iterated over main bucket
contacts, ignoring replacement entries. This caused peer starvation
because dead contacts in the main list were never replaced by fresher
peers from the replacement list.

Fix iter_contacts() and do_get_contact_to_initiate() to also check
replacement contacts, allowing the node to discover and connect to
peers that were previously invisible to the peer selection logic.

KBucket::get_mut and get_contact only searched the main contact list,
so any state mutation (set_disposable, ping tracking, find_node count,
mark_knows_us) silently failed for contacts in the replacement list.
Since iter_contacts and do_get_contact_to_initiate now return
replacement contacts, this caused phantom contacts that were visible
to selection but invisible to updates.

Update get_contact to use get_any (main + replacements) and get_mut
to search both lists, ensuring all contact state mutations work
regardless of which list holds the contact.
@azteca1998 azteca1998 force-pushed the fix/kademlia-snapsync-peer-pruning branch from 5ced7a1 to 4bdc22c on April 17, 2026 20:07
@azteca1998 azteca1998 marked this pull request as ready for review April 20, 2026 10:59
@azteca1998 azteca1998 requested a review from a team as a code owner April 20, 2026 10:59
@ethrex-project-sync ethrex-project-sync Bot moved this to In Review in ethrex_l1 Apr 20, 2026
@github-actions

🤖 Kimi Code Review

Review Summary

The PR improves peer management by marking unresponsive peers as "disposable" and ensuring replacement peers are properly utilized. While the intent is correct, there are logic errors in the pruning implementation, performance concerns, and API semantic changes that need attention.

Critical Issues

1. Disposable replacements can be promoted to main list (peer_table.rs:1036-1040)
In prune(), when removing disposable main contacts, remove_and_promote() may promote a replacement that is also marked disposable. The promoted peer remains in the main list despite being disposable.

Suggestion: Check the disposable flag before promoting, or skip disposable replacements during promotion.

// In remove_and_promote or before calling it:
while let Some((id, contact)) = bucket.replacements.pop() {
    if !contact.disposable {
        bucket.contacts.push((id, contact));
        break;
    }
    self.discarded_contacts.insert(id);
}

2. O(N²) complexity in replacement pruning (peer_table.rs:1047)
Using Vec::contains inside retain creates quadratic complexity: retain is O(N) and contains is O(M).

Suggestion: Use a HashSet<H256> for replacement_disposable or retain directly:

bucket.replacements.retain(|(id, c)| {
    if c.disposable {
        self.discarded_contacts.insert(*id);
        false
    } else {
        true
    }
});

3. Semantic change to get_mut (peer_table.rs:90-96)
get_mut now searches replacement lists, which may break invariants for code expecting only main contacts. Mutable access to replacements while they're in the cache could lead to inconsistent state.

Suggestion: Either rename to get_any_mut and audit all callers, or provide separate methods get_contact_mut and get_replacement_mut.
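
For illustration, a sketch of a lookup that covers both lists (the early-return shape matches what the Claude review below describes; the KBucket field types are assumed). Whether this lives on get_mut or a renamed get_any_mut is exactly the naming question raised above.

fn get_any_mut(&mut self, node_id: &H256) -> Option<&mut Contact> {
    // Search the main list first; returning early keeps the two mutable
    // borrows (contacts, replacements) on disjoint paths.
    if let Some((_, contact)) = self.contacts.iter_mut().find(|(id, _)| id == node_id) {
        return Some(contact);
    }
    // Fall back to the replacement list.
    self.replacements
        .iter_mut()
        .find(|(id, _)| id == node_id)
        .map(|(_, contact)| contact)
}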

Error Handling & Robustness

4. Silent error suppression (peer_handler.rs:93, 453, 461, 543)
let _ = self.peer_table.set_disposable(peer_id) ignores all errors. If the peer table is locked or the peer doesn't exist, this fails silently.

Suggestion: Log at debug level:

if let Err(e) = self.peer_table.set_disposable(peer_id) {
    debug!("Failed to mark peer {} as disposable: {}", peer_id, e);
}

5. Inconsistent error handling (peer_handler.rs:542-543)


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Claude Code Review



PR #6497 — fix(p2p): Mark unresponsive peers as disposable to prevent snapsync stalls

Overview

This PR fixes a Kademlia k-bucket management bug introduced in #6458 that caused snap sync to stall after hours of operation. The root cause: timed-out or misbehaving peers were never removed from k-buckets, preventing replacement contacts from being promoted and starving the node of fresh peers. The fix is well-motivated and the diagnosis is solid.


Issues Found

1. iter_contacts() now silently includes replacement contacts — potential side effects on other callers

In peer_table.rs, iter_contacts() is now extended to include .chain(bucket.replacements.iter()). While this makes sense for the use in prune() (which is now rewritten and no longer uses iter_contacts()), it changes behavior for all other callers of this method. If iter_contacts() is used in Kademlia lookup operations, peer reporting, or metrics collection, including replacement contacts could cause double-counting or incorrect routing behaviour. Before merging, all callers of iter_contacts() should be audited to confirm they handle replacements correctly (or intentionally want them).

2. Aggressive immediate disposal on a single block body failure (peer_handler.rs:540)

self.peer_table.record_failure(peer_id)?;
let _ = self.peer_table.set_disposable(peer_id);

This is the only location where both record_failure and set_disposable are called together. A single missed block bodies response results in both a score penalty and permanent k-bucket disposal. The other failure paths (headers timeout, empty headers) only call set_disposable. This inconsistency aside, permanently discarding a peer on one transient body request failure may be too aggressive if the peer was momentarily overloaded. Consider adding a retry count or checking the failure count before disposal, to distinguish transient hiccups from persistently unresponsive peers.
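
A minimal sketch of such a threshold, assuming a hypothetical per-peer failure counter kept by the handler; MAX_BODY_FAILURES and body_failures are illustrative names, not existing ethrex APIs.

const MAX_BODY_FAILURES: u32 = 3; // illustrative threshold, not a real config value

self.peer_table.record_failure(peer_id)?;
let failures = self.body_failures.entry(peer_id).or_insert(0); // hypothetical counter
*failures += 1;
if *failures >= MAX_BODY_FAILURES {
    // Only persistently unresponsive peers give up their bucket slot.
    let _ = self.peer_table.set_disposable(peer_id);
    self.body_failures.remove(&peer_id);
}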

3. prune_table() called unconditionally on every sync loop iteration (snap_sync.rs:133)

loop {
    let _ = peers.peer_table.prune_table();
    ...
}

Each call is an async round-trip to the peer table actor and iterates over all 256 k-buckets. While individually cheap (O(K × 256)), calling it on every tight loop iteration during active sync is unnecessary — peers don't become disposable at that frequency. A simple iteration counter (e.g., every 100 iterations, or on a 30-second timer) would give the same benefit without the per-iteration actor overhead.
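
A sketch of the timer-based variant; the loop shape and the prune_table() call mirror the PR's snippet, while the 30-second interval and Instant bookkeeping are illustrative.

use std::time::{Duration, Instant};

const PRUNE_INTERVAL: Duration = Duration::from_secs(30); // illustrative value
let mut last_prune = Instant::now();

loop {
    // Prune at most once per interval instead of on every iteration.
    if last_prune.elapsed() >= PRUNE_INTERVAL {
        let _ = peers.peer_table.prune_table();
        last_prune = Instant::now();
    }

    // ... sync operations
}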

4. Vec::contains for the replacement_disposable filter in prune() (peer_table.rs:1042)

bucket.replacements.retain(|(id, _)| !replacement_disposable.contains(id));

replacement_disposable is a Vec<H256>, making this O(n²) per bucket (though with max 10 replacements it is negligible in practice). Using a HashSet<H256> is more idiomatic and clearly signals the intent of set membership testing:

let replacement_disposable: HashSet<H256> = bucket.replacements
    .iter()
    .filter(|(_, c)| c.disposable)
    .map(|(id, _)| *id)
    .collect();

5. Misplaced comment in peer_handler.rs around line 462–465

The new comment // Timeout or invalid response - mark peer as disposable appears after the return Ok(None) for empty/invalid headers. The comment therefore actually documents the timeout branch, not the empty-headers branch above it. The comment should either be split or repositioned to avoid misleading future readers.

6. Inconsistent error propagation for set_disposable (peer_handler.rs)

self.peer_table.record_failure(peer_id)?;   // propagates
let _ = self.peer_table.set_disposable(peer_id);  // silently dropped

record_failure propagates errors via ?, but set_disposable errors are silently discarded with let _ =. Silently dropping is acceptable here (failure to mark disposal is non-fatal), but the inconsistency between adjacent lines is surprising. A brief comment explaining why the error is intentionally dropped would improve clarity.


Positive Notes

  • The rewrite of prune() is cleaner and more correct than the original. Operating per-bucket instead of collecting globally and then looking up by bucket_for() is both more efficient and avoids re-deriving bucket indices.
  • The early-return pattern in get_mut() to search replacements avoids borrow checker complexity with chained mutable iterators — good pragmatic choice.
  • Adding set_disposable on timeout in ask_peer_head_number is properly scoped: only Timeout errors trigger disposal (not transient connection errors), which is the right distinction.
  • The do_get_contact_to_initiate() change to try replacement contacts is a useful opportunistic improvement: replacements are valid peers that simply haven't had a promotion slot, and trying to connect to them directly can accelerate peer discovery. The already_tried_peers guard prevents thrashing.

Summary

The core logic is sound and addresses the real root cause. The main concerns are the potential behavioural change from the iter_contacts() scope expansion (which deserves an explicit audit), the overly aggressive disposal on a single body failure, and the per-iteration prune overhead in the sync loop. The rest are minor style/clarity issues.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions

🤖 Codex Code Review

  1. peer_table.rs:1023 prune() can promote a replacement that is already marked disposable. The code removes disposable main contacts first (line 1035), which calls remove_and_promote(), and only afterwards filters disposable replacements (lines 1041-1050). If the oldest replacement was already bad, it gets promoted into contacts and survives this prune pass. That reintroduces a peer we explicitly decided to discard. Filter/skip disposable replacements before promotion, or keep promoting until the promoted entry is non-disposable.

  2. peer_handler.rs:93, peer_handler.rs:453, peer_handler.rs:543, snap_sync.rs:133, peer_table.rs:1289, peer_table.rs:1310 set_disposable() does not actually remove a bad sync peer from future request selection. Header/body sync chooses from self.peers, but prune() only mutates bucket contacts and discarded_contacts; it never drops the live RLPx peer from self.peers. So a timed-out or invalid responder can still be returned by get_peer_connections() / get_random_peer() until the TCP connection dies separately, which defeats the intended “pruned and replaced” behavior. This likely needs either explicit remove_peer/disconnect logic or peer selection that also consults the contact’s disposable state.

No EVM/consensus-specific issues in this diff; the concerns are confined to P2P peer lifecycle handling. I also did not see regression tests covering either of the cases above.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@greptile-apps

greptile-apps Bot commented Apr 20, 2026

Greptile Summary

This PR fixes snapsync stalls caused by the Kademlia k-bucket implementation from #6458 by marking unresponsive peers as disposable during sync operations, extending prune() to also clean replacement lists and promote candidates, and adding a periodic prune_table() call to the snap sync loop. All three changes are needed for the fix to be complete and work together correctly.

Confidence Score: 5/5

Safe to merge; all findings are P2 style/design suggestions with no blocking correctness issues

The fix is logically sound and addresses the root cause described in the PR. The three remaining comments are P2: a minor O(n²) nit in a 10-element list, a note about unintended scope broadening of iter_contacts, and a design question about single-miss disposal aggressiveness. None of these block correctness or safety.

peer_table.rs — the iter_contacts scope expansion affects five query functions beyond the direct fix, worth a second read

Important Files Changed

  • crates/networking/p2p/peer_table.rs: Extended prune() to also clean replacement lists and promote replacements on main-contact removal; get_contact/get_mut and iter_contacts now cover both main and replacement lists, broadening the scope of several query functions beyond the direct fix intent
  • crates/networking/p2p/peer_handler.rs: Adds set_disposable calls on timeout in ask_peer_head_number, on empty/invalid headers, and on block-body failures (alongside existing record_failure); aggressive single-miss disposal may shrink the peer pool over long syncs
  • crates/networking/p2p/sync/snap_sync.rs: Adds a fire-and-forget prune_table() call at the top of the sync loop so dead peers are evicted and replacements promoted on each iteration

Sequence Diagram

sequenceDiagram
    participant SL as snap_sync loop
    participant PH as PeerHandler
    participant PT as PeerTableServer
    participant KB as KBucket

    SL->>PT: prune_table() [fire & forget]
    activate PT
    PT->>KB: remove disposable from main list
    KB-->>PT: promote first replacement to main
    PT->>KB: remove disposable from replacement list
    PT->>PT: add removed node_ids to discarded_contacts
    deactivate PT

    SL->>PH: request_block_headers()
    PH->>PH: get_random_peer()
    alt Timeout / empty / invalid headers
        PH->>PT: set_disposable(peer_id) [fire & forget]
        PH-->>SL: Ok(None)
    else Valid headers
        PH-->>SL: Ok(Some(headers))
    end

    SL->>PH: request_block_bodies()
    alt Failure / timeout
        PH->>PT: record_failure(peer_id)
        PH->>PT: set_disposable(peer_id) [fire & forget]
        PH-->>SL: Ok(None)
    else Valid bodies
        PH->>PT: record_success(peer_id)
        PH-->>SL: Ok(Some(bodies))
    end

Comment on lines 974 to +985
         None
     }
 
-    /// Iterate over all contacts across all buckets.
+    /// Iterate over all contacts across all buckets (main and replacement lists).
     fn iter_contacts(&self) -> impl Iterator<Item = (&H256, &Contact)> {
-        self.buckets
-            .iter()
-            .flat_map(|bucket| bucket.contacts.iter().map(|(id, c)| (id, c)))
+        self.buckets.iter().flat_map(|bucket| {
+            bucket
+                .contacts
+                .iter()
+                .chain(bucket.replacements.iter())
+                .map(|(id, c)| (id, c))
+        })


P2 iter_contacts expansion silently broadens multiple consumers

iter_contacts is used by five functions beyond the set_disposable path that directly motivates this fix: do_get_contact_for_lookup, do_get_contact_for_enr_lookup, do_get_contact_to_revalidate, do_find_closest_nodes, and do_get_nodes_at_distances. Replacement contacts will now be revalidated, sent FindNode queries, and returned in neighbor advertisements. For do_find_closest_nodes this is benign (candidates are still capped at 16), but revalidating or sending FindNode to a replacement that was intentionally held back (because its bucket was full) may generate extra traffic and slightly change the discovery convergence behavior. This is worth a targeted comment acknowledging the broader scope.


Comment on lines +1041 to +1050
let replacement_disposable: Vec<H256> = bucket
    .replacements
    .iter()
    .filter(|(_, c)| c.disposable)
    .map(|(id, _)| *id)
    .collect();

bucket
    .replacements
    .retain(|(id, _)| !replacement_disposable.contains(id));


P2 O(n²) Vec::contains in retain loop

replacement_disposable is a Vec, so replacement_disposable.contains(id) inside retain is O(n) per element, making the full retain O(n²). For MAX_REPLACEMENTS_PER_BUCKET = 10 this is negligible, but converting to a FxHashSet keeps this consistent with the codebase's other hash-set patterns and is a trivial change.

let replacement_disposable: FxHashSet<H256> = bucket
    .replacements
    .iter()
    .filter(|(_, c)| c.disposable)
    .map(|(id, _)| *id)
    .collect();

bucket
    .replacements
    .retain(|(id, _)| !replacement_disposable.contains(id));

Comment on lines 540 to +543
"[SYNCING] Didn't receive block bodies from peer, penalizing peer {peer_id}..."
);
self.peer_table.record_failure(peer_id)?;
let _ = self.peer_table.set_disposable(peer_id);


P2 record_failure + set_disposable on every block-body miss

A single bad block-body response (empty body, wrong count, or timeout) now permanently disposes the peer via set_disposable, on top of the pre-existing record_failure score penalty. Since set_disposable is the stronger action—it prevents the peer from ever re-entering the main bucket—calling record_failure first is harmless but redundant. More importantly, transient failures (brief network hiccup, peer momentarily behind) will now permanently remove the peer from the routing table after a single miss, which may be too aggressive and could reduce the effective peer pool over a long sync. Consider requiring either multiple consecutive failures or a combined threshold before marking disposable, similar to how record_critical_failure is used for misbehaving peers.


@azteca1998
Contributor Author

Superseded by #6511 (Kademlia v2) which includes all these fixes.

@azteca1998 azteca1998 closed this Apr 21, 2026
@github-project-automation github-project-automation Bot moved this from In Review to Done in ethrex_l1 Apr 21, 2026

Labels

L1 Ethereum client

Projects

Status: Done
