fix(l1): mark unresponsive peers as disposable to prevent snapsync stalls #6497
azteca1998 wants to merge 3 commits into main
Conversation
Fixes snapsync failures where peer count stays constant and sync eventually fails with "Failed to receive block headers" after hours of operation.

Root cause: After PR #6458 introduced Kademlia k-buckets, peers that became unresponsive during sync weren't marked as disposable, so they remained in the routing table indefinitely. New peers went into replacement lists but were never promoted because dead peers weren't pruned.

Changes:
- Enhanced prune() to remove disposable contacts from both main and replacement lists, with automatic promotion of replacements
- Mark peers as disposable when they timeout during RLPx operations (block headers, block bodies, sync head requests)
- Added periodic pruning in the snap_sync main loop to ensure dead peers are regularly removed and replaced

Evidence from CI artifacts showed peer count stuck at 6 throughout a 3h35m sync before failure. This fix enables peer rotation so healthy peers from replacement lists can take over when active peers become unresponsive.
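A minimal sketch of the pruning-and-promotion behaviour described in this commit message, using simplified stand-in types (a `NodeId` alias instead of the real `H256`, and only the fields the sketch needs); the actual `prune()` in `peer_table.rs` may be structured differently:

```rust
use std::collections::HashSet;

// Simplified stand-ins for the real peer-table types; field names follow the
// snippets quoted in the reviews below, everything else is illustrative.
type NodeId = [u8; 32];

struct Contact {
    disposable: bool,
    // ... other peer state elided ...
}

struct KBucket {
    contacts: Vec<(NodeId, Contact)>,
    replacements: Vec<(NodeId, Contact)>,
}

/// Remove disposable contacts from both lists and promote surviving
/// replacements into the freed main slots.
fn prune_bucket(bucket: &mut KBucket, discarded: &mut HashSet<NodeId>) {
    let before = bucket.contacts.len();

    // Drop disposable contacts from the main list, remembering their ids.
    bucket.contacts.retain(|(id, c)| {
        if c.disposable {
            discarded.insert(*id);
            false
        } else {
            true
        }
    });
    let freed = before - bucket.contacts.len();

    // Drop disposable replacements too, so they can never be promoted later.
    bucket.replacements.retain(|(id, c)| {
        if c.disposable {
            discarded.insert(*id);
            false
        } else {
            true
        }
    });

    // Promote surviving replacements into the freed main slots.
    for _ in 0..freed {
        match bucket.replacements.pop() {
            Some(entry) => bucket.contacts.push(entry),
            None => break,
        }
    }
}
```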
The Kademlia k-bucket implementation only iterated over main bucket contacts, ignoring replacement entries. This caused peer starvation because dead contacts in the main list were never replaced by fresher peers from the replacement list. Fix iter_contacts() and do_get_contact_to_initiate() to also check replacement contacts, allowing the node to discover and connect to peers that were previously invisible to the peer selection logic.
KBucket::get_mut and get_contact only searched the main contact list, so any state mutation (set_disposable, ping tracking, find_node count, mark_knows_us) silently failed for contacts in the replacement list. Since iter_contacts and do_get_contact_to_initiate now return replacement contacts, this caused phantom contacts that were visible to selection but invisible to updates. Update get_contact to use get_any (main + replacements) and get_mut to search both lists, ensuring all contact state mutations work regardless of which list holds the contact.
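A companion sketch for the lookup fix described above, reusing the stand-in `KBucket`/`Contact` types from the previous sketch; `get_any_mut` is an illustrative name rather than the crate's exact API:

```rust
impl KBucket {
    /// Look up a contact in the main list first, then the replacement list,
    /// so state mutations (ping tracking, set_disposable, ...) reach a
    /// contact regardless of which list currently holds it.
    fn get_any_mut(&mut self, node_id: &NodeId) -> Option<&mut Contact> {
        self.contacts
            .iter_mut()
            .chain(self.replacements.iter_mut())
            .find(|(id, _)| id == node_id)
            .map(|(_, c)| c)
    }
}
```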
5ced7a1 to 4bdc22c
🤖 Kimi Code Review

Review Summary

The PR improves peer management by marking unresponsive peers as "disposable" and ensuring replacement peers are properly utilized. While the intent is correct, there are logic errors in the pruning implementation, performance concerns, and API semantic changes that need attention.

Critical Issues

1. Disposable replacements can be promoted to the main list

Suggestion: Check the disposable flag before promoting, or skip disposable replacements during promotion.

```rust
// In remove_and_promote or before calling it:
while let Some((id, contact)) = bucket.replacements.pop() {
    if !contact.disposable {
        bucket.contacts.push((id, contact));
        break;
    }
    self.discarded_contacts.insert(id);
}
```

2. O(N²) complexity in replacement pruning

Suggestion: Use a single `retain` pass:

```rust
bucket.replacements.retain(|(id, c)| {
    if c.disposable {
        self.discarded_contacts.insert(*id);
        false
    } else {
        true
    }
});
```

3. Semantic change to …

Suggestion: Either rename to …

Error Handling & Robustness

4. Silent error suppression

Suggestion: Log at debug level:

```rust
if let Err(e) = self.peer_table.set_disposable(peer_id) {
    debug!("Failed to mark peer {} as disposable: {}", peer_id, e);
}
```

5. Inconsistent error handling

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt
🤖 Claude Code Review

Now I have enough context for a thorough review.

PR #6497 — fix(p2p): Mark unresponsive peers as disposable to prevent snapsync stalls

Overview

This PR fixes a Kademlia k-bucket management bug introduced in #6458 that caused snap sync to stall after hours of operation. The root cause: timed-out or misbehaving peers were never removed from k-buckets, preventing replacement contacts from being promoted and starving the node of fresh peers. The fix is well-motivated and the diagnosis is solid.

Issues Found

1. …

2. Aggressive immediate disposal on a single block body failure

```rust
self.peer_table.record_failure(peer_id)?;
let _ = self.peer_table.set_disposable(peer_id);
```

This is the only location where both …

3.

```rust
loop {
    let _ = peers.peer_table.prune_table();
    ...
}
```

Each call is an async round-trip to the peer table actor and iterates over all 256 k-buckets. While individually cheap (O(K × 256)), calling it on every tight loop iteration during active sync is unnecessary — peers don't become disposable at that frequency. A simple iteration counter (e.g., every 100 iterations, or on a 30-second timer) would give the same benefit without the per-iteration actor overhead.

4.

```rust
let replacement_disposable: HashSet<H256> = bucket.replacements
    .iter()
    .filter(|(_, c)| c.disposable)
    .map(|(id, _)| *id)
    .collect();
bucket.replacements.retain(|(id, _)| !replacement_disposable.contains(id));
```

5. Misplaced comment in …

The new comment …

6. Inconsistent error propagation

```rust
self.peer_table.record_failure(peer_id)?;        // propagates
let _ = self.peer_table.set_disposable(peer_id); // silently dropped
```
Positive Notes
Summary

The core logic is sound and addresses the real root cause. The main concerns are the potential behavioural change from the …

Automated review by Claude (Anthropic) · sonnet · custom prompt
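The throttling idea from issue 3 above could look roughly like this; it is only a sketch, and the interval value, loop body, and `prune_table()` call site are placeholders rather than the PR's actual code:

```rust
use std::time::{Duration, Instant};

/// Run a pruning action at most once per `interval`, even from a hot loop.
fn maybe_prune(last_prune: &mut Instant, interval: Duration, prune: impl FnOnce()) {
    if last_prune.elapsed() >= interval {
        prune();
        *last_prune = Instant::now();
    }
}

fn main() {
    let interval = Duration::from_secs(30);
    let mut last_prune = Instant::now();

    loop {
        // In the PR's snap_sync loop this closure would wrap something like
        // `let _ = peers.peer_table.prune_table();` (fire-and-forget).
        maybe_prune(&mut last_prune, interval, || { /* prune_table() here */ });

        // ... rest of the sync iteration ...
        break; // placeholder so this sketch terminates
    }
}
```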
🤖 Codex Code Review
No EVM/consensus-specific issues in this diff; the concerns are confined to P2P peer lifecycle handling. I also did not see regression tests covering either of the cases above.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt
Greptile Summary

This PR fixes snapsync stalls caused by the Kademlia k-bucket implementation from #6458 by marking unresponsive peers as disposable during sync operations, extending …

Confidence Score: 5/5

Safe to merge; all findings are P2 style/design suggestions with no blocking correctness issues. The fix is logically sound and addresses the root cause described in the PR. The three remaining comments are P2: a minor O(n²) nit in a 10-element list, a note about unintended scope broadening of iter_contacts, and a design question about single-miss disposal aggressiveness. None of these block correctness or safety.

peer_table.rs — the iter_contacts scope expansion affects five query functions beyond the direct fix, worth a second read

| Filename | Overview |
|---|---|
| crates/networking/p2p/peer_table.rs | Extended prune() to also clean replacement lists and promote replacements on main-contact removal; get_contact/get_mut and iter_contacts now cover both main and replacement lists, broadening the scope of several query functions beyond the direct fix intent |
| crates/networking/p2p/peer_handler.rs | Adds set_disposable calls on timeout in ask_peer_head_number, on empty/invalid headers, and on block-body failures (alongside existing record_failure); aggressive single-miss disposal may shrink peer pool over long syncs |
| crates/networking/p2p/sync/snap_sync.rs | Adds a fire-and-forget prune_table() call at the top of the sync loop so dead peers are evicted and replacements promoted on each iteration |
Sequence Diagram

```mermaid
sequenceDiagram
participant SL as snap_sync loop
participant PH as PeerHandler
participant PT as PeerTableServer
participant KB as KBucket
SL->>PT: prune_table() [fire & forget]
activate PT
PT->>KB: remove disposable from main list
KB-->>PT: promote first replacement to main
PT->>KB: remove disposable from replacement list
PT->>PT: add removed node_ids to discarded_contacts
deactivate PT
SL->>PH: request_block_headers()
PH->>PH: get_random_peer()
alt Timeout / empty / invalid headers
PH->>PT: set_disposable(peer_id) [fire & forget]
PH-->>SL: Ok(None)
else Valid headers
PH-->>SL: Ok(Some(headers))
end
SL->>PH: request_block_bodies()
alt Failure / timeout
PH->>PT: record_failure(peer_id)
PH->>PT: set_disposable(peer_id) [fire & forget]
PH-->>SL: Ok(None)
else Valid bodies
PH->>PT: record_success(peer_id)
PH-->>SL: Ok(Some(bodies))
end
```
crates/networking/p2p/peer_table.rs

```diff
         None
     }

-    /// Iterate over all contacts across all buckets.
+    /// Iterate over all contacts across all buckets (main and replacement lists).
     fn iter_contacts(&self) -> impl Iterator<Item = (&H256, &Contact)> {
-        self.buckets
-            .iter()
-            .flat_map(|bucket| bucket.contacts.iter().map(|(id, c)| (id, c)))
+        self.buckets.iter().flat_map(|bucket| {
+            bucket
+                .contacts
+                .iter()
+                .chain(bucket.replacements.iter())
+                .map(|(id, c)| (id, c))
+        })
```
iter_contacts expansion silently broadens multiple consumers
iter_contacts is used by five functions beyond the set_disposable path that directly motivates this fix: do_get_contact_for_lookup, do_get_contact_for_enr_lookup, do_get_contact_to_revalidate, do_find_closest_nodes, and do_get_nodes_at_distances. Replacement contacts will now be revalidated, sent FindNode queries, and returned in neighbor advertisements. For do_find_closest_nodes this is benign (candidates are still capped at 16), but revalidating or sending FindNode to a replacement that was intentionally held back (because its bucket was full) may generate extra traffic and slightly change the discovery convergence behavior. This is worth a targeted comment acknowledging the broader scope.
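One way to act on this comment without changing the other five consumers, sketched against the bucket fields visible in the diff above; the split into two iterators is hypothetical and not what the PR does:

```rust
// Hypothetical split: keep the original main-list-only iterator for discovery
// paths, and add a wider one for state mutations such as set_disposable.
// `KBucket`, `Contact`, and `H256` are the types shown elsewhere in this PR.

fn iter_main_contacts(buckets: &[KBucket]) -> impl Iterator<Item = (&H256, &Contact)> {
    buckets
        .iter()
        .flat_map(|bucket| bucket.contacts.iter().map(|(id, c)| (id, c)))
}

fn iter_all_contacts(buckets: &[KBucket]) -> impl Iterator<Item = (&H256, &Contact)> {
    buckets.iter().flat_map(|bucket| {
        bucket
            .contacts
            .iter()
            .chain(bucket.replacements.iter())
            .map(|(id, c)| (id, c))
    })
}
```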
crates/networking/p2p/peer_table.rs

```diff
+            let replacement_disposable: Vec<H256> = bucket
+                .replacements
+                .iter()
+                .filter(|(_, c)| c.disposable)
+                .map(|(id, _)| *id)
+                .collect();
+
+            bucket
+                .replacements
+                .retain(|(id, _)| !replacement_disposable.contains(id));
```
O(n²) `Vec::contains` in `retain` loop

`replacement_disposable` is a `Vec`, so `replacement_disposable.contains(id)` inside `retain` is O(n) per element, making the full retain O(n²). For `MAX_REPLACEMENTS_PER_BUCKET = 10` this is negligible, but converting to a `FxHashSet` keeps this consistent with the codebase's other hash-set patterns and is a trivial change.

```rust
let replacement_disposable: FxHashSet<H256> = bucket
    .replacements
    .iter()
    .filter(|(_, c)| c.disposable)
    .map(|(id, _)| *id)
    .collect();
bucket
    .replacements
    .retain(|(id, _)| !replacement_disposable.contains(id));
```
crates/networking/p2p/peer_handler.rs

```diff
                 "[SYNCING] Didn't receive block bodies from peer, penalizing peer {peer_id}..."
             );
             self.peer_table.record_failure(peer_id)?;
+            let _ = self.peer_table.set_disposable(peer_id);
```
record_failure + set_disposable on every block-body miss
A single bad block-body response (empty body, wrong count, or timeout) now permanently disposes the peer via set_disposable, on top of the pre-existing record_failure score penalty. Since set_disposable is the stronger action—it prevents the peer from ever re-entering the main bucket—calling record_failure first is harmless but redundant. More importantly, transient failures (brief network hiccup, peer momentarily behind) will now permanently remove the peer from the routing table after a single miss, which may be too aggressive and could reduce the effective peer pool over a long sync. Consider requiring either multiple consecutive failures or a combined threshold before marking disposable, similar to how record_critical_failure is used for misbehaving peers.
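A rough illustration of the suggested threshold, assuming a per-contact consecutive-failure counter; the field, constant, and method names are hypothetical and not part of the PR:

```rust
// Hypothetical: dispose of a peer only after several consecutive block-body
// failures instead of on the first miss; a success resets the counter.
const MAX_CONSECUTIVE_FAILURES: u32 = 3;

struct Contact {
    disposable: bool,
    consecutive_failures: u32,
    // ... other peer state elided ...
}

impl Contact {
    fn record_body_failure(&mut self) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES {
            self.disposable = true;
        }
    }

    fn record_body_success(&mut self) {
        self.consecutive_failures = 0;
    }
}
```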
Superseded by #6511 (Kademlia v2) which includes all these fixes.
fix(p2p): Mark unresponsive peers as disposable to prevent snapsync stalls
Summary
Fixes snapsync failures caused by the Kademlia k-bucket implementation introduced in PR #6458. The issue manifested as "Node failed to snapsync" errors after 3h35m with "Failed to receive block headers" when the peer count remained stuck at 6 peers throughout the sync.
Closes #XXXX (if issue exists)
Related to #6458
Problem
After PR #6458 introduced proper Kademlia k-bucket routing tables, long-running snapsync operations would fail because:
- Peers that became unresponsive during sync were never marked as `disposable`, so they remained in the main bucket list indefinitely
- The `prune()` function was only called manually, not during sync loops
- The `prune()` function only checked main contacts, leaving disposable contacts in replacement lists

Evidence from CI Artifacts
Analysis of failed CI run artifacts (lighthouse-sepolia) showed:
"Peer handler error: Failed to receive block headers"Changes
1. Enhanced `prune()` to handle replacement lists (`peer_table.rs`)

2. Mark peers as disposable on timeout/error (`peer_handler.rs`)

- Added `peer_table.set_disposable(peer_id)` calls when:
  - `PeerConnectionError::Timeout` occurs in `request_sync_head()`

3. Added periodic pruning to sync loop (`snap_sync.rs`)

Test Plan

- `cargo check`
- `cargo fmt`

Expected Behavior After Fix
Metrics to Monitor
Breaking Changes
None - this is a bug fix that makes the Kademlia implementation work as intended.
Additional Context
The root cause was that Kademlia's replacement list feature (designed to hold candidate peers until main contacts fail) wasn't working because peers were never marked as failed during sync operations. This is a critical fix for production deployments running long sync operations.