feat(l1): kademlia k-bucket routing table v2#6511
Draft
azteca1998 wants to merge 7 commits into main
…talls

Fixes snap sync failures where the peer count stays constant and sync eventually fails with "Failed to receive block headers" after hours of operation.

Root cause: After PR #6458 introduced Kademlia k-buckets, peers that became unresponsive during sync weren't marked as disposable, so they remained in the routing table indefinitely. New peers went into replacement lists but were never promoted because dead peers weren't pruned.

Changes:
- Enhanced prune() to remove disposable contacts from both main and replacement lists, with automatic promotion of replacements
- Mark peers as disposable when they time out during RLPx operations (block headers, block bodies, sync head requests)
- Added periodic pruning in the snap_sync main loop to ensure dead peers are regularly removed and replaced

Evidence from CI artifacts showed the peer count stuck at 6 throughout a 3h35m sync before failure. This fix enables peer rotation so healthy peers from replacement lists can take over when active peers become unresponsive.
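The prune-then-promote behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Contact` and `KBucket` shapes, the `capacity` field, and the choice to promote from the front of the replacement list are all assumptions made for the example.

```rust
/// Hypothetical, simplified contact; the real type carries much more state.
#[derive(Debug, Clone, PartialEq)]
pub struct Contact {
    pub id: u64,
    pub disposable: bool,
}

/// Hypothetical bucket: a bounded main list plus a replacement list.
pub struct KBucket {
    pub capacity: usize,
    pub contacts: Vec<Contact>,
    pub replacements: Vec<Contact>,
}

impl KBucket {
    /// Drops disposable contacts from both lists, then fills the freed
    /// main-list slots from the replacement list.
    /// Returns how many replacements were promoted.
    pub fn prune(&mut self) -> usize {
        self.contacts.retain(|c| !c.disposable);
        self.replacements.retain(|c| !c.disposable);

        let free = self.capacity.saturating_sub(self.contacts.len());
        let take = free.min(self.replacements.len());
        // Assumption: promote from the front of the replacement list.
        self.contacts.extend(self.replacements.drain(..take));
        take
    }
}
```

Without the promotion step, a bucket whose main list is full of dead peers never admits the healthy peers waiting in its replacement list, which is the stall described above.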
The Kademlia k-bucket implementation only iterated over main bucket contacts, ignoring replacement entries. This caused peer starvation because dead contacts in the main list were never replaced by fresher peers from the replacement list. Fix iter_contacts() and do_get_contact_to_initiate() to also check replacement contacts, allowing the node to discover and connect to peers that were previously invisible to the peer selection logic.
KBucket::get_mut and get_contact only searched the main contact list, so any state mutation (set_disposable, ping tracking, find_node count, mark_knows_us) silently failed for contacts in the replacement list. Since iter_contacts and do_get_contact_to_initiate now return replacement contacts, this caused phantom contacts that were visible to selection but invisible to updates. Update get_contact to use get_any (main + replacements) and get_mut to search both lists, ensuring all contact state mutations work regardless of which list holds the contact.
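A rough sketch of lookups that cover both lists, so that anything visible to iteration is also reachable for mutation. The method names follow the commit message, but the signatures, the `u64` id, and the struct layout are assumptions for illustration only.

```rust
/// Hypothetical, simplified contact.
#[derive(Debug)]
pub struct Contact {
    pub id: u64,
    pub disposable: bool,
}

/// Hypothetical bucket with a main list and a replacement list.
#[derive(Default)]
pub struct KBucket {
    pub contacts: Vec<Contact>,
    pub replacements: Vec<Contact>,
}

impl KBucket {
    /// Iterates over main contacts first, then replacements.
    pub fn iter_contacts(&self) -> impl Iterator<Item = &Contact> + '_ {
        self.contacts.iter().chain(self.replacements.iter())
    }

    /// Read-only lookup across both lists.
    pub fn get_any(&self, id: u64) -> Option<&Contact> {
        self.iter_contacts().find(|c| c.id == id)
    }

    /// Mutable lookup across both lists, so state updates also reach
    /// contacts that currently sit in the replacement list.
    pub fn get_mut(&mut self, id: u64) -> Option<&mut Contact> {
        self.contacts
            .iter_mut()
            .chain(self.replacements.iter_mut())
            .find(|c| c.id == id)
    }

    /// Returns false when the contact is in neither list.
    pub fn set_disposable(&mut self, id: u64) -> bool {
        match self.get_mut(id) {
            Some(c) => {
                c.disposable = true;
                true
            }
            None => false,
        }
    }
}
```

Keeping the read path and the write path over the same pair of lists is what removes the "phantom contact" case: a contact returned by `iter_contacts` can always be found again by `get_mut`.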
…able

Add a separate IndexMap<H256, Node> connection pool (capacity 50K) for RLPx connection initiation, decoupled from the k-bucket routing table (which is limited to 256 × 16 = 4,096 contacts by Kademlia design).

All discovered contacts are inserted into both the k-buckets (for Kademlia protocol operations like FindNode/GetClosestNodes) and the connection pool (for peer connection initiation). This restores the large candidate pool that existed before the k-bucket migration while preserving correct Kademlia routing semantics.

The connection pool is:
- Populated on every contact discovery (discv4, discv5, insert_if_new)
- Cleaned during prune() when contacts are marked disposable
- Capped at 50K entries with oldest-first eviction
- Used with random selection and k-bucket state filtering
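The capped pool with oldest-first eviction might look roughly like the sketch below. The commit uses an `IndexMap<H256, Node>`; to keep the example dependency-free it swaps in a std `HashMap` plus a `VecDeque` for insertion order, and uses a `u64` key and a `String` address as stand-ins. All names here are hypothetical.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical bounded candidate pool with oldest-first eviction.
pub struct ConnectionPool {
    capacity: usize,
    /// Insertion order, oldest id at the front.
    order: VecDeque<u64>,
    /// id -> node address (placeholder for the real node record).
    nodes: HashMap<u64, String>,
}

impl ConnectionPool {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity,
            order: VecDeque::new(),
            nodes: HashMap::new(),
        }
    }

    /// Inserts a newly discovered node; evicts the oldest entry when full.
    /// Returns false if the id was already present or the capacity is zero.
    pub fn insert_if_new(&mut self, id: u64, addr: &str) -> bool {
        if self.capacity == 0 || self.nodes.contains_key(&id) {
            return false;
        }
        if self.nodes.len() >= self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.nodes.remove(&oldest);
            }
        }
        self.order.push_back(id);
        self.nodes.insert(id, addr.to_string());
        true
    }

    /// Removes an entry, e.g. when pruning a disposable contact.
    /// The order scan is linear, which is acceptable for a sketch.
    pub fn remove(&mut self, id: u64) -> bool {
        if self.nodes.remove(&id).is_some() {
            self.order.retain(|k| *k != id);
            true
        } else {
            false
        }
    }

    pub fn contains(&self, id: u64) -> bool {
        self.nodes.contains_key(&id)
    }

    pub fn len(&self) -> usize {
        self.nodes.len()
    }
}
```

An insertion-ordered map such as `IndexMap` provides both the ordering and the keyed lookup in one structure, which is presumably why the commit chose it; the two-container version above only illustrates the eviction policy.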
Matches the candidate pool size used by Reth and Nethermind.
- Replace O(n) collect-then-choose in do_get_contact_to_initiate with O(k) random index probing on the IndexMap (rand % len, scan forward). The old approach scanned all 10K pool entries, cloned eligible ones into a Vec, then randomly picked, blocking the peer_table actor and starving snap sync's get_best_peer calls.
- Replace collect-then-choose in do_get_contact_for_lookup with IteratorRandom::choose (single-pass reservoir sampling, zero alloc).
- Remove discarded_contacts permanent blacklist entirely. Contacts pruned from k-buckets now remain in the connection pool so they can be retried; the RLPx handshake rejects truly incompatible peers. Previously, a single timeout permanently blacklisted a contact from both the pool and re-discovery.
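The random-start probing in the first bullet can be sketched as below. This is an illustration under assumptions, not the PR's code: the pool is a plain slice rather than an IndexMap, the random number is supplied by the caller so the example needs no external crate, and both the wrap-around at the end of the pool and the `max_probes` bound are choices made for the sketch.

```rust
/// Hypothetical pool entry; `eligible` stands in for the k-bucket state filter.
pub struct Candidate {
    pub id: u64,
    pub eligible: bool,
}

/// Starts at `random % len` and scans forward (wrapping around) until an
/// eligible candidate is found or `max_probes` entries have been inspected.
/// Nothing is collected or cloned along the way.
pub fn pick_from_random_start(
    pool: &[Candidate],
    random: usize,
    max_probes: usize,
) -> Option<&Candidate> {
    if pool.is_empty() {
        return None;
    }
    let len = pool.len();
    let start = random % len;
    (0..len.min(max_probes))
        .map(|offset| &pool[(start + offset) % len])
        .find(|c| c.eligible)
}
```

The cost is proportional to the number of entries probed before a hit rather than to the pool size, at the price of a selection that is not perfectly uniform: an eligible entry that follows a long run of ineligible ones is picked more often.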
Summary
Re-introduces the Kademlia k-bucket routing table (reverted in #6505 for v10 release) with all fixes and performance improvements applied:
- rand() % len + forward scan, avoiding actor contention during snap sync
Performance issues addressed
- get_contact_to_initiate() call (every 100ms), blocking get_best_peer() calls from snap sync workers. Now O(k) with random start index.
- discarded_contacts permanently banned peers after a single timeout, shrinking the effective pool over time. Removed entirely; the RLPx handshake handles rejection.
Pending
Test plan