fix(l1): use SUPPORTED_SNAP_CAPABILITIES in snap client peer selection#6154
fix(l1): use SUPPORTED_SNAP_CAPABILITIES in snap client peer selection#6154pablodeymo wants to merge 39 commits intomainfrom
Conversation
…tory Reorganize state_healing.rs and storage_healing.rs into a shared sync/healing/ module structure with clearer naming conventions: - Create sync/healing/ directory with mod.rs, types.rs, state.rs, storage.rs - Rename MembatchEntryValue to HealingQueueEntry - Rename MembatchEntry to StorageHealingQueueEntry - Rename Membatch type to StorageHealingQueue - Rename children_not_in_storage_count to missing_children_count - Rename membatch variables to healing_queue throughout - Extract shared HealingQueueEntry and StateHealingQueue types to types.rs - Update sync.rs imports to use new healing module
Reorganize snap protocol code for better maintainability: - Split rlpx/snap.rs into rlpx/snap/ directory: - codec.rs: RLP encoding/decoding for snap messages - messages.rs: Snap protocol message types - mod.rs: Module re-exports - Split snap.rs into snap/ directory: - constants.rs: Snap sync constants and configuration - server.rs: Snap protocol server implementation - mod.rs: Module re-exports - Move snap server tests to dedicated tests/ directory - Update imports in p2p.rs, peer_handler.rs, and code_collector.rs
Document the phased approach for reorganizing snap sync code: - Phase 1: rlpx/snap module split - Phase 2: snap module split with server extraction - Phase 3: healing module unification
Split the large sync.rs (1631 lines) into focused modules: - sync/full.rs (~260 lines): Full sync implementation - sync_cycle_full(), add_blocks_in_batch(), add_blocks() - sync/snap_sync.rs (~1100 lines): Snap sync implementation - sync_cycle_snap(), snap_sync(), SnapBlockSyncState - store_block_bodies(), update_pivot(), block_is_stale() - validate_state_root(), validate_storage_root(), validate_bytecodes() - insert_accounts(), insert_storages() (both rocksdb and non-rocksdb) - sync.rs (~285 lines): Orchestration layer - Syncer struct with start_sync() and sync_cycle() - SyncMode, SyncError, AccountStorageRoots types - Re-exports for public API
…p/client.rs Move all snap protocol client-side request methods from peer_handler.rs to a dedicated snap/client.rs module: - request_account_range and request_account_range_worker - request_bytecodes - request_storage_ranges and request_storage_ranges_worker - request_state_trienodes - request_storage_trienodes Also moves related types: DumpError, RequestMetadata, SnapClientError, RequestStateTrieNodesError, RequestStorageTrieNodes. This reduces peer_handler.rs from 2,060 to 670 lines (~68% reduction), leaving it focused on ETH protocol methods (block headers/bodies). Added SnapClientError variant to SyncError for proper error handling. Updated plan_snap_sync.md to mark Phase 4 as complete.
…napError type Implement Phase 5 of snap sync refactoring plan - Error Handling. - Create snap/error.rs with unified SnapError enum covering all snap protocol errors - Update server functions (process_account_range_request, process_storage_ranges_request, process_byte_codes_request, process_trie_nodes_request) to return Result<T, SnapError> - Remove SnapClientError and RequestStateTrieNodesError, consolidate into SnapError - Keep RequestStorageTrieNodesError struct for request ID tracking in storage healing - Add From<SnapError> for PeerConnectionError to support error propagation in message handlers - Update sync module to use SyncError::Snap variant - Update healing modules (state.rs, storage.rs) to use new error types - Move DumpError struct to error.rs module - Update test return types to use SnapError - Mark Phase 5 as completed in plan document All phases of the snap sync refactoring are now complete.
Change missing_children_count from u64 to usize in HealingQueueEntry and node_missing_children function to match StorageHealingQueueEntry and be consistent with memory structure counting conventions.
Resolve conflicts from #5977 and #6018 merge to main: - Keep modular sync structure (sync.rs delegates to full.rs and snap_sync.rs) - Keep snap client code in snap/client.rs (removed from peer_handler.rs) - Add InsertingAccountRanges metric from #6018 to snap_sync.rs - Remove unused info import from peer_handler.rs
in the multisync monitoring script (docker_monitor.py). The sync completion logs already contain per-phase completion markers (e.g. "✓ BLOCK HEADERS complete: 25,693,009 headers in 0:29:00") but this data was not surfaced in the Slack messages or run summaries. This adds a parse_phase_timings() function that reads saved container logs and extracts timing, count, and duration for all 8 snap sync phases: Block Headers, Account Ranges, Account Insertion, Storage Ranges, Storage Insertion, State Healing, Storage Healing, and Bytecodes. The breakdown is appended to both the Slack notification (as a code block per network instance) and the text-based run log (run_history.log and per-run summary.txt). When a phase did not complete (e.g. on a failed run), it is simply omitted from the breakdown.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
proof conversion helpers (encodable_to_proof, proof_to_encodable) from server.rs to mod.rs since they are shared utilities used by both client and server.
peer selection, extract magic numbers into named constants (ACCOUNT_RANGE_CHUNK_COUNT, STORAGE_BATCH_SIZE, HASH_MAX), remove unused contents field from DumpError and use derive Debug, rename missing_children to pending_children in healing code, and wrap process_byte_codes_request in spawn_blocking for consistency with other server handlers.
🤖 Kimi Code ReviewReview SummaryThis PR changes the peer selection criteria for SNAP protocol requests from Issues Found1. Critical Bug: Incorrect Capability Matching (Lines 740, 1027, 1553)The changes replace
Current behavior: The code will only find peers that support SNAP capabilities, but these peers might not support ETH capabilities, making the SNAP protocol requests impossible to execute. Fix: Use both ETH and SNAP capabilities: // Create a combined capability set
let mut required_caps = SUPPORTED_ETH_CAPABILITIES.clone();
required_caps.extend(SUPPORTED_SNAP_CAPABILITIES);
let Some((peer_id, connection)) = self
.peer_table
.get_best_peer(&required_caps)
.await
// ... rest of the code2. Missing Capability Validation (All changed lines)There's no validation that the selected peer actually supports the required capabilities for the specific request type. Consider adding debug logging to verify peer capabilities before making requests. Additional Notes
Automated review by Kimi (Moonshot AI) · custom prompt |
🤖 Codex Code ReviewLooks good overall — this change makes the SNAP-specific request paths select SNAP-capable peers, which is the correct protocol selection and should prevent invalid requests to ETH-only peers. Findings
If you want, I can verify call sites for these SNAP sync routines to confirm they’re only invoked when SNAP is enabled. Automated review by OpenAI Codex · custom prompt |
🤖 Claude Code ReviewExcellent! Now I have a comprehensive understanding. Let me create the review. PR Review: Fix snap client peer selectionSummaryThis PR correctly fixes a bug where snap protocol operations were selecting peers based on ✅ Correctness & LogicLGTM - The fix is correct and well-targeted:
🔒 Security & Correctness ImpactLow risk, high value fix:
🎯 Ethereum/Snap Sync SpecificProtocol compliance improvement:
📝 Code QualityClean implementation:
Minor ObservationThe PR description mentions the capability definitions are: SUPPORTED_ETH_CAPABILITIES: [eth/68, eth/69]
SUPPORTED_SNAP_CAPABILITIES: [snap/1]These are defined in Verdict✅ Approve - This is a straightforward correctness fix that:
The code is ready to merge. Automated review by Claude (Anthropic) · custom prompt |
🤖 Kimi Code ReviewSecurity & Correctness IssuesCritical Issue: Capability mismatch in snap protocol requests The changes replace Looking at the context:
The fix should be: // Create a combined capability set for snap requests
let mut required_caps = SUPPORTED_ETH_CAPABILITIES.to_vec();
required_caps.extend(SUPPORTED_SNAP_CAPABILITIES);
self.peer_table.get_best_peer(&required_caps)Additional Concerns1. Missing capability validation
2. Error handling inconsistency
3. Code duplication
Automated review by Kimi (Moonshot AI) · custom prompt |
🤖 Codex Code ReviewNo correctness or security issues found in this diff. Notes
Open question
Automated review by OpenAI Codex · custom prompt |
🤖 Claude Code ReviewPerfect! PR Review: Use SUPPORTED_SNAP_CAPABILITIES for snap client peer selectionSummaryThis PR fixes a critical bug where snap sync operations were selecting peers based on ETH protocol capabilities instead of SNAP protocol capabilities. This is a well-identified and correctly implemented fix. ✅ Positive Findings
🔍 Code QualityStrengths:
No issues found with:
🎯 Ethereum-Specific Considerations
📊 Impact AssessmentBefore Fix:
After Fix:
✅ Final VerdictLGTM - Approved This is a textbook bug fix:
No changes requested. Automated review by Claude (Anthropic) · custom prompt |
Greptile OverviewGreptile Summary
Confidence Score: 5/5
|
| Filename | Overview |
|---|---|
| crates/networking/p2p/peer_handler.rs | Swaps snap client peer selection from SUPPORTED_ETH_CAPABILITIES to SUPPORTED_SNAP_CAPABILITIES in request_account_range/request_bytecodes/request_storage_ranges; adds corresponding import. No functional regressions found in the changed lines. |
Sequence Diagram
sequenceDiagram
participant Sync as Snap sync loop
participant PH as PeerHandler
participant PT as PeerTable
participant Peer as Selected peer
Sync->>PH: "Snap request_* call"
PH->>PT: "get_best_peer(SUPPORTED_SNAP_CAPABILITIES)"
alt No suitable peer
PT-->>PH: "None"
PH-->>Sync: "Sleep + retry"
else Suitable peer found
PT-->>PH: "(peer_id, PeerConnection)"
PH->>Peer: "GetAccountRange / GetByteCodes / GetStorageRanges"
Peer-->>PH: "AccountRange / ByteCodes / StorageRanges"
PH-->>Sync: "Record success/failure, process result"
end
…ation' into fix/snap-protocol-capabilities # Conflicts: # crates/networking/p2p/peer_handler.rs
…nto fix/snap-protocol-capabilities
Lines of code reportTotal lines added: Detailed view |
There was a problem hiding this comment.
Pull request overview
This PR fixes snap sync peer selection to require snap/1 capability (avoiding timeouts/peer-score hits when sending snap messages to non-snap peers), and includes a broader refactor that modularizes snap/sync code and adds snap server test coverage.
Changes:
- Use
SUPPORTED_SNAP_CAPABILITIESfor snap client peer selection (account ranges, bytecodes, storage ranges). - Split snap/sync implementation into dedicated modules (
snap/,sync/full.rs,sync/snap_sync.rs,sync/healing/*) and centralize snap constants/errors. - Add Hive-derived snap server tests to the
testcrate.
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/networking/p2p/p2p.rs | Makes snap module public (module visibility change as part of refactor). |
| crates/networking/p2p/peer_handler.rs | Removes snap client methods from PeerHandler, re-exports snap constants/types/errors for compatibility. |
| crates/networking/p2p/utils.rs | Adjusts snapshot dump error payload after DumpError change. |
| crates/networking/p2p/snap/mod.rs | Introduces snap module facade + re-exports (client/server/errors/constants). |
| crates/networking/p2p/snap/client.rs | Snap client request implementations now select peers via SUPPORTED_SNAP_CAPABILITIES. |
| crates/networking/p2p/snap/server.rs | Server-side snap request processing moved here (now consistently async via spawn_blocking). |
| crates/networking/p2p/snap/constants.rs | Centralizes snap/sync constants (limits, batch sizes, timeouts, etc.). |
| crates/networking/p2p/snap/error.rs | Introduces unified SnapError + DumpError definitions. |
| crates/networking/p2p/snap.rs | Removes old monolithic snap module implementation (deleted). |
| crates/networking/p2p/rlpx/snap/mod.rs | New RLPx snap module layout (re-exports messages + codec codes). |
| crates/networking/p2p/rlpx/snap/messages.rs | Defines snap protocol message structs. |
| crates/networking/p2p/rlpx/snap/codec.rs | Implements RLPxMessage encoding/decoding + message code constants. |
| crates/networking/p2p/rlpx/error.rs | Adds From<SnapError> mapping into PeerConnectionError. |
| crates/networking/p2p/rlpx/connection/server.rs | Updates bytecode request handling to await the now-async server handler. |
| crates/networking/p2p/sync.rs | Sync module now delegates to full, snap_sync, and healing submodules. |
| crates/networking/p2p/sync/full.rs | New full-sync implementation module. |
| crates/networking/p2p/sync/snap_sync.rs | New snap-sync implementation module (pivot selection, validation helpers, etc.). |
| crates/networking/p2p/sync/code_collector.rs | Moves buffer sizing constant to snap::constants. |
| crates/networking/p2p/sync/healing/mod.rs | Introduces healing module + re-exports. |
| crates/networking/p2p/sync/healing/types.rs | Adds shared healing queue entry types. |
| crates/networking/p2p/sync/healing/state.rs | Refactors state healing to use shared queue types + snap request helpers/constants. |
| crates/networking/p2p/sync/healing/storage.rs | Refactors storage healing similarly; switches to snap request helpers/constants/types. |
| test/tests/p2p/mod.rs | Registers new snap server tests module. |
| test/tests/p2p/snap_server_tests.rs | Adds Hive-derived AccountRange server tests and test state fixture. |
| test/Cargo.toml | Adds lazy_static dependency for tests. |
| Cargo.lock | Adds lazy_static to lockfile. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // We multiply the estimation by 0.9 in order to account for missing slots (~9% in tesnets) | ||
| let new_pivot_block_number = block_number | ||
| + ((current_unix_time().saturating_sub(block_timestamp) / SECONDS_PER_BLOCK) as f64 | ||
| * MISSING_SLOTS_PERCENTAGE) as u64; | ||
| debug!( |
There was a problem hiding this comment.
The comment says the pivot estimation is multiplied by 0.9, but the code multiplies by MISSING_SLOTS_PERCENTAGE (currently 0.8). Please reconcile the comment with the actual constant/value used so the pivot update behavior is clear and not accidentally mis-tuned.
| if validated.is_ok() { | ||
| info!("Succesfully validated tree, {state_root} found"); | ||
| } else { | ||
| error!("We have failed the validation of the state tree"); | ||
| std::process::exit(1); | ||
| } | ||
| validated.is_ok() |
There was a problem hiding this comment.
These validation helpers call std::process::exit(1) on failure. Even if currently invoked under debug_assert!, exiting the process from a library/module function makes failures hard to handle/test and can bypass upstream cleanup. Prefer returning an error/boolean and let the caller decide whether to panic/abort.
| && first_block_hash == current_head | ||
| && current_head != sync_head | ||
| { | ||
| // There is no path to the sync head this goes back until it find a common ancerstor |
There was a problem hiding this comment.
Typo in comment: “ancerstor” → “ancestor”.
| // There is no path to the sync head this goes back until it find a common ancerstor | |
| // There is no path to the sync head this goes back until it find a common ancestor |
| pub mod network; | ||
| pub mod peer_handler; | ||
| pub mod rlpx; | ||
| pub(crate) mod snap; | ||
| pub mod snap; | ||
| pub mod sync; | ||
| pub mod sync_manager; | ||
| pub mod tx_broadcaster; |
There was a problem hiding this comment.
The PR description/title suggest a small peer-capability selection fix, but this change set introduces a much broader refactor (new public snap module, new sync submodules, moved tests, error type changes). Please update the PR description/scope (or split into separate PRs) so reviewers can reason about risk and intent accurately.
| /// Assumed percentage of slots that are missing blocks. | ||
| /// | ||
| /// This is used to adjust timestamp-based pivot updates and to find "safe" | ||
| /// blocks in the chain that are unlikely to be re-orged. |
There was a problem hiding this comment.
MISSING_SLOTS_PERCENTAGE is documented as “percentage of slots that are missing blocks”, but it’s used as a multiplier when estimating the new pivot distance. With the current value 0.8, the math and naming/docs appear inconsistent (and snap_sync.rs comments mention 0.9). Please align the constant name/value and the pivot update logic/comment so it’s unambiguous what factor is being applied.
| /// Assumed percentage of slots that are missing blocks. | |
| /// | |
| /// This is used to adjust timestamp-based pivot updates and to find "safe" | |
| /// blocks in the chain that are unlikely to be re-orged. | |
| /// Multiplier applied when estimating the pivot distance to account for empty slots. | |
| /// | |
| /// This factor scales the timestamp-based pivot distance so we don't overestimate | |
| /// the number of blocks that should exist. A value of `0.8` means we assume that | |
| /// up to ~20% of slots may be missing blocks when computing a "safe" pivot that is | |
| /// unlikely to be re-orged. |
|
|
||
| // Reward peer | ||
| peers.peer_table.record_success(&peer_id).await?; | ||
| info!("Succesfully updated pivot"); |
There was a problem hiding this comment.
Typo in log message: “Succesfully” → “Successfully”.
| info!("Succesfully updated pivot"); | |
| info!("Successfully updated pivot"); |
| .expect("We should be able to create threads"); | ||
|
|
||
| if validated.is_ok() { | ||
| info!("Succesfully validated tree, {state_root} found"); |
There was a problem hiding this comment.
Typo in log message: “Succesfully” → “Successfully”.
| info!("Succesfully validated tree, {state_root} found"); | |
| info!("Successfully validated tree, {state_root} found"); |
| if is_valid.is_err() { | ||
| std::process::exit(1); | ||
| } |
There was a problem hiding this comment.
validate_storage_root calls std::process::exit(1) on failure. Prefer returning an error/boolean (and letting the caller decide) rather than terminating the process from within a helper.
| if !is_valid { | ||
| std::process::exit(1); | ||
| } |
There was a problem hiding this comment.
validate_bytecodes calls std::process::exit(1) when missing code is detected. It would be more testable and easier to integrate if this returned an error/boolean and allowed the caller to decide how to handle the failure.
|
Fixed in #5975 |
Already fixed on the refactor/snapsync-healing-unification branch, PR #6154 for main.
Already fixed on the refactor/snapsync-healing-unification branch, PR #6154 for main.
Motivation
All three snap client
get_best_peer()calls inpeer_handler.rsuseSUPPORTED_ETH_CAPABILITIESinstead ofSUPPORTED_SNAP_CAPABILITIES. This selects peers that supporteth/68oreth/69but may not supportsnap/1. The snap client then sends snap protocol messages (GetAccountRange,GetStorageRanges,GetByteCodes) to these peers, which will either ignore the message (timeout, wasted 15s) or disconnect (peer score hit).The healing modules (
healing/state.rs,healing/storage.rs) already useSUPPORTED_SNAP_CAPABILITIEScorrectly.Description
Replace
SUPPORTED_ETH_CAPABILITIESwithSUPPORTED_SNAP_CAPABILITIESin 3 snap client call sites + add import inpeer_handler.rs.Changes
SUPPORTED_ETH_CAPABILITIESSUPPORTED_ETH_CAPABILITIES, SUPPORTED_SNAP_CAPABILITIESrequest_account_rangeget_best_peer(&SUPPORTED_ETH_CAPABILITIES)get_best_peer(&SUPPORTED_SNAP_CAPABILITIES)request_byte_codesget_best_peer(&SUPPORTED_ETH_CAPABILITIES)get_best_peer(&SUPPORTED_SNAP_CAPABILITIES)request_storage_rangesget_best_peer(&SUPPORTED_ETH_CAPABILITIES)get_best_peer(&SUPPORTED_SNAP_CAPABILITIES)Not changed (correctly use ETH):
request_block_headers— ETH protocol operationget_peer_connections— ETH protocol operationget_random_peer— ETH protocol operationssync.rs:1075:update_pivot— usesGetBlockHeaders(ETH)Impact
In practice, most peers that support ETH also support snap (negotiated during the RLPx handshake). But some don't — this causes unnecessary timeouts and peer score degradation during snap sync.
How to Test
cargo clippy -p ethrex-p2p && cargo build -p ethrex-p2p && cargo test -p ethrex-p2pRelated Issues
request_storage_rangesfor correctness, readability, and performance #6140