Conversation
Greptile Overview

Greptile Summary

Added a comprehensive snap sync roadmap document outlining strategic improvements for the ethrex snap sync module across two phases: performance optimization (8 improvements, including parallel downloads, batching optimizations, and async I/O) and code quality enhancement (7 improvements, including documentation, testing, and refactoring). Key additions:
The document is well-structured, with clear sections, detailed technical proposals with code examples, and practical implementation guidance. This is a documentation-only change with no code modifications.

Confidence Score: 5/5
| Filename | Overview |
|---|---|
| docs/roadmaps/snap_sync_roadmap.md | Added comprehensive snap sync roadmap document with performance optimization and code quality improvement plans |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Dev as Developer
    participant Doc as Roadmap Document
    participant Team as Development Team
    participant Code as Snap Sync Codebase
    Dev->>Doc: Create roadmap document
    Note over Doc: Phase 1: Performance Optimization
    Note over Doc: Phase 2: Code Quality & Maintainability
    Doc->>Team: Provides strategic plan
    Team->>Doc: Review improvement areas
    Note over Doc,Code: 8 Performance Improvements<br/>(Parallel downloads, batching, I/O)
    Note over Doc,Code: 7 Code Quality Improvements<br/>(Context structs, documentation, testing)
    Doc->>Team: Timeline: ~16 weeks total
    Doc->>Team: Success metrics defined
    Doc->>Team: Risk assessment provided
    Team->>Code: Implements improvements iteratively
    Code->>Team: Achieves competitive sync performance
```
Pull request overview
This PR adds a comprehensive roadmap document for improving the snap sync module in ethrex. The document outlines a two-phase approach: Phase 1 focuses on performance optimization to reduce sync times by 50% or more, while Phase 2 focuses on code quality and maintainability improvements.
Changes:
- Added detailed technical roadmap for snap sync module improvements
- Documented current state, bottlenecks, and proposed optimizations
- Included implementation examples, timelines, and success metrics
> ```rust
> (DownloadingHeaders, HeadersComplete) => {
>     self.phase = DownloadingAccounts;
> ```
The assignment self.phase = DownloadingAccounts; uses an unqualified enum variant. It should be self.phase = SnapSyncPhase::DownloadingAccounts; for proper Rust syntax.
> ```diff
> -(DownloadingHeaders, HeadersComplete) => {
> -    self.phase = DownloadingAccounts;
> +(SnapSyncPhase::DownloadingHeaders, SnapSyncEvent::HeadersComplete) => {
> +    self.phase = SnapSyncPhase::DownloadingAccounts;
> ```
> ```rust
> struct AdaptivePeerConfig {
>     base_timeout: Duration,
>     peer_latencies: HashMap<H256, RollingAverage>,
>
>     fn timeout_for_peer(&self, peer_id: &H256) -> Duration {
>         self.peer_latencies
>             .get(peer_id)
>             .map(|avg| avg.mean() * 3.0) // 3x average latency
>             .unwrap_or(self.base_timeout)
>     }
> }
> ```
The struct definition has a syntax error. The fn timeout_for_peer should be in an impl block, not inside the struct definition. The struct should only contain fields, and methods should be in a separate impl AdaptivePeerConfig block.
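As a hedged sketch of the fix the comment describes (fields stay in the struct body, methods move to a separate `impl` block), using a `u64` peer id and a minimal `RollingAverage` stand-in in place of the roadmap's `H256` and rolling-average types:

```rust
use std::collections::HashMap;
use std::time::Duration;

// Simplified stand-in for the roadmap's rolling-latency tracker.
struct RollingAverage(Duration);

impl RollingAverage {
    fn mean(&self) -> Duration {
        self.0
    }
}

struct AdaptivePeerConfig {
    base_timeout: Duration,
    // `u64` stands in for the real peer-id type (`H256` in the roadmap).
    peer_latencies: HashMap<u64, RollingAverage>,
}

// Methods belong in an `impl` block, not inside the struct definition.
impl AdaptivePeerConfig {
    fn timeout_for_peer(&self, peer_id: &u64) -> Duration {
        self.peer_latencies
            .get(peer_id)
            .map(|avg| avg.mean().mul_f64(3.0)) // 3x average latency
            .unwrap_or(self.base_timeout)
    }
}
```

Note that `Duration` has no `* f64` operator, so the sketch uses `Duration::mul_f64` for the 3x scaling.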
> ```rust
> pub struct SnapSyncStateMachine {
>     phase: SnapSyncPhase,
>     progress: SnapSyncProgress,
>
>     pub fn transition(&mut self, event: SnapSyncEvent) -> Result<(), SyncError> {
>         match (self.phase, event) {
>             (DownloadingHeaders, HeadersComplete) => {
>                 self.phase = DownloadingAccounts;
>                 Ok(())
>             }
>             // ... other transitions
>             _ => Err(SyncError::InvalidStateTransition),
>         }
>     }
> }
> ```
The struct definition has a syntax error. The pub fn transition should be in an impl block, not inside the struct definition. The struct should only contain fields (phase and progress), and methods should be in a separate impl SnapSyncStateMachine block.
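A minimal compiling sketch of that separation (method in an `impl` block, fully qualified variants), with simplified `SnapSyncPhase`/`SnapSyncEvent`/`SyncError` definitions standing in for the real types and the `progress` field omitted:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum SnapSyncPhase {
    DownloadingHeaders,
    DownloadingAccounts,
}

enum SnapSyncEvent {
    HeadersComplete,
}

#[derive(Debug, PartialEq)]
enum SyncError {
    InvalidStateTransition,
}

struct SnapSyncStateMachine {
    phase: SnapSyncPhase,
}

// The struct body holds only fields; `transition` lives here instead.
impl SnapSyncStateMachine {
    fn transition(&mut self, event: SnapSyncEvent) -> Result<(), SyncError> {
        match (self.phase, event) {
            (SnapSyncPhase::DownloadingHeaders, SnapSyncEvent::HeadersComplete) => {
                self.phase = SnapSyncPhase::DownloadingAccounts;
                Ok(())
            }
            // Any other (phase, event) pair is an invalid transition.
            _ => Err(SyncError::InvalidStateTransition),
        }
    }
}
```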
> ```rust
> (DownloadingHeaders, HeadersComplete) => {
>     self.phase = DownloadingAccounts;
> ```
The match pattern uses unqualified enum variants DownloadingHeaders and HeadersComplete, but they should be qualified as SnapSyncPhase::DownloadingHeaders and SnapSyncEvent::HeadersComplete respectively (assuming HeadersComplete is a variant of SnapSyncEvent).
> ```diff
> -(DownloadingHeaders, HeadersComplete) => {
> -    self.phase = DownloadingAccounts;
> +(SnapSyncPhase::DownloadingHeaders, SnapSyncEvent::HeadersComplete) => {
> +    self.phase = SnapSyncPhase::DownloadingAccounts;
> ```
fedacking left a comment
Left some comments for this plan that should be addressed.
> ### 1.5 Memory-Bounded Structures
>
> **Current State:**
> - `accounts_by_root_hash` in `request_storage_ranges()` is unbounded
This structure doesn't grow to gigabytes on mainnet, due to the limited number of accounts with storage.
> | Bottleneck | Location | Impact | Priority |
> |------------|----------|--------|----------|
> | Sequential header download | `sync_cycle_snap()` | Blocks state download start | Critical |
> | Single-threaded account range processing | `request_account_range()` | Underutilizes peers | High |
The requests to peers are done in parallel; this is wrong.
> | Sequential header download | `sync_cycle_snap()` | Blocks state download start | Critical |
> | Single-threaded account range processing | `request_account_range()` | Underutilizes peers | High |
> | Inefficient trie node batching | `heal_state_trie()`, `heal_storage_trie()` | Excessive DB writes | High |
I'm not sure what counts as excessive DB writes here; writes are batched.
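For readers unfamiliar with the batching pattern being discussed, a generic sketch (illustrative only, not ethrex code): writes are buffered in memory and flushed to the backing store in one call once the batch fills up, so the number of store round-trips grows with batch count, not entry count.

```rust
use std::collections::HashMap;

// Illustrative write batcher: buffer key/value writes and flush them
// to the backing store in a single batched call when full.
struct WriteBatch {
    buf: HashMap<Vec<u8>, Vec<u8>>,
    max_entries: usize,
    flushed_batches: usize, // counts flushes, for demonstration
}

impl WriteBatch {
    fn new(max_entries: usize) -> Self {
        Self {
            buf: HashMap::new(),
            max_entries,
            flushed_batches: 0,
        }
    }

    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.buf.insert(key, value);
        if self.buf.len() >= self.max_entries {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if !self.buf.is_empty() {
            // In a real store this would be one batched DB write.
            self.buf.clear();
            self.flushed_batches += 1;
        }
    }
}
```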
> | Sequential header download | `sync_cycle_snap()` | Blocks state download start | Critical |
> | Single-threaded account range processing | `request_account_range()` | Underutilizes peers | High |
> | Inefficient trie node batching | `heal_state_trie()`, `heal_storage_trie()` | Excessive DB writes | High |
> | Busy-wait loops | Multiple locations | CPU waste | Medium |
These happen in scenarios where there aren't peers, but they should be reviewed.
> | Single-threaded account range processing | `request_account_range()` | Underutilizes peers | High |
> | Inefficient trie node batching | `heal_state_trie()`, `heal_storage_trie()` | Excessive DB writes | High |
> | Busy-wait loops | Multiple locations | CPU waste | Medium |
> | Unbounded memory structures | `accounts_by_root_hash` | Memory pressure | Medium |
Not an issue, as measured on mainnet.
> #### 1.3.2 Dynamic Batch Sizing
I don't see how it would improve healing that much; we should measure.
> ```rust
> // Current (snap_sync.rs ~line 452)
> loop {
> ```
I would review that these channels aren't doing anything else, but on principle it sounds like a good change.
> ```rust
> // Proposed
> tokio::fs::create_dir_all(dir).await?;
> tokio::task::spawn_blocking(move || {
> ```
They're already inside a `spawn_blocking`, but they should use `tokio::fs`.
> #### 1.7.2 Request Pipelining
> Increase in-flight requests for high-quality peers:
Good change, but could be excessive complexity.
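One hypothetical shape for request pipelining (all names here are illustrative, not from the ethrex codebase): derive a per-peer in-flight request cap from a quality score, so well-behaved peers get more concurrent requests.

```rust
// Illustrative sketch: per-peer in-flight request budget scaled by a
// quality score in [0.0, 1.0] (e.g. derived from latency/success rate).
struct PeerPipeline {
    base_in_flight: usize,
}

impl PeerPipeline {
    // High-quality peers get up to 4x the base in-flight budget;
    // out-of-range scores are clamped.
    fn max_in_flight(&self, quality: f64) -> usize {
        let factor = 1.0 + 3.0 * quality.clamp(0.0, 1.0);
        (self.base_in_flight as f64 * factor).round() as usize
    }
}
```

The reviewer's caution applies: even this small amount of per-peer bookkeeping adds complexity that needs to pay for itself in measured sync time.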
> ### 1.8 Parallel Storage Healing
This is not correct, storage healing is already parallelized.
PR #5975 Review Comments (fedacking) — Validated & Prioritized

26 comments validated against …

HIGH — Bugs / Correctness
1. BUG: Wrong peer capabilities in all snap client requests
2. BUG: Missing …
3. Fragile index-based …
4. …
5. Complex state machine in storage result handling

MEDIUM — Code Quality / Reliability / Memory
6. Snap client methods shouldn't be extension methods on PeerHandler
7. …
8. Use JoinSet instead of channels for workers
9. …
10. Disk write logic duplicated 3x
11. Complex tuple types should be named structs
12. State and storage healing should share more code
13. Add comment explaining …

LOW — Style / Minor
14. Use existing constants for magic numbers
15. …
16. Rename "missing children" to "pending children"
17. Create issue to test constant values under load
18. Already fixed: …
19. Acknowledged: Error handling at …

Recommended Order
ElFantasma left a comment
The roadmap is well-structured and comprehensive — the two-phase approach, dependency graph for the #6140 steps, and discarded-section transparency are all great.
The main issue is that the module structure table, file paths, and line numbers throughout the document are stale — they reference the pre-#5975 layout (snap/client.rs, snap/server.rs, snap/constants.rs, sync/healing/state.rs, etc.) which no longer exists after the refactor merged on Feb 6. Since this document will be the reference for ongoing work, it should reflect the current codebase. See inline comment on the module structure table for the updated layout.
snap_sync_roadmap.md:655 — The path crates/networking/p2p/snap/client.rs no longer exists after #5975. request_storage_ranges is now in sync.rs. The line numbers referenced throughout the #6140 steps section (and in sections 2.12, 2.13, 2.15, 2.17, and Appendix B) also need updating to match the current layout. Since this document is meant to guide the actual refactoring work, stale paths will cause confusion — worth a pass to update all file references.
snap_sync_roadmap.md:578 — nit: Section 2.9 says "All get_best_peer() calls in snap/client.rs" — but after #5975 these are in sync.rs. Same issue in sections 2.10 (snap/server.rs), 2.11 (snap/error.rs), and 2.14. Since these are marked as done/merged, the stale paths are less critical but still misleading if someone uses this doc as a reference.
Minor: the ASCII pipeline diagram (line 57) has inconsistent box widths (trailing spaces don't align), though this is cosmetic.
> ### Module Structure
>
> | File | Lines | Purpose |
This entire table references the pre-#5975 file layout. After the refactor merged, the snap sync module is organized as:
| File | Lines | Purpose |
|---|---|---|
| `sync.rs` | 1,658 | Main sync orchestration + account/storage range requests |
| `snap.rs` | 1,008 | Snap protocol (client + server + errors + constants) |
| `sync/storage_healing.rs` | 718 | Storage trie healing |
| `sync/state_healing.rs` | 471 | State trie healing |
| `sync/code_collector.rs` | 102 | Bytecode collection |
| **Total** | ~3,957 | |

The `snap/client.rs`, `snap/server.rs`, `snap/error.rs`, `snap/constants.rs` split no longer exists — everything is in a single `snap.rs`. Same for `sync/healing/` → `sync/state_healing.rs` and `sync/storage_healing.rs`.
> | 1 | Replace `panic!` with proper error return | 2.8 | Very low |
> | 2 | Replace `.expect()` with `?` operator | 2.8 | Very low |
> | 3 | Extract `ensure_snapshot_dir` helper (4 occurrences) | 2.4 | Very low |
This path (crates/networking/p2p/snap/client.rs) no longer exists after #5975. request_storage_ranges is now in sync.rs. The line numbers referenced throughout the #6140 steps section (and in sections 2.12, 2.13, 2.15, 2.17, and Appendix B) also need updating to match the current layout.
Since this document is meant to guide the actual refactoring work, stale paths will cause confusion — worth a pass to update all file references.
> ### 2.10 Add `spawn_blocking` to Bytecodes Handler — ✅ DONE
>
> **Status:** Merged in #5975. The function in `snap/server.rs:108-131` is `async fn` with `spawn_blocking`, matching the pattern of all other handlers.
nit: This says "All get_best_peer() calls in snap/client.rs" — but after #5975 these are in sync.rs. Same issue in sections 2.10 (snap/server.rs), 2.11 (snap/error.rs), and 2.14. Since these are marked as done/merged, the stale paths are less critical but still misleading if someone uses this doc as a reference.
Not an issue on mainnet as confirmed by fedacking — accounts_by_root_hash doesn't grow to gigabytes due to limited accounts with storage.
- Remove incorrect "Single-threaded account range processing" bottleneck (requests are already parallel between chunks)
- Fix "Inefficient trie node batching" description (writes are already batched; note that impact needs measurement)
- Clarify busy-wait loops only happen when no peers are available
- Fix 1.6 Async Disk I/O description (already inside spawn_blocking; change is just for directory operations with tokio::fs)
- Add complexity warning to 1.7 Adaptive Peer Timeouts
Adds a table with all 9 refactoring steps for request_storage_ranges, mapping each step to the roadmap section it addresses, with dependency graph and execution order.
Add 9 new sections (2.9-2.17) from fedacking's PR #5975 review:
- 2.9: Fix snap protocol capability bug (ETH vs SNAP)
- 2.10: Add spawn_blocking to bytecodes handler
- 2.11: Remove dead DumpError.contents field
- 2.12: Use JoinSet instead of channels for workers
- 2.13: Self-contained StorageTask with hashes instead of indices
- 2.14: Move snap client methods off PeerHandler
- 2.15: Guard write_set in account path
- 2.16: Healing code unification (generic trie healing)
- 2.17: Use existing constants for magic numbers

Also extend 2.1 description to include AccountStorageRoots SoA simplification and named structs for tuple types. Update timeline with new items.
…oadmap. Already fixed on the refactor/snapsync-healing-unification branch.
…oadmap. Already fixed on the refactor/snapsync-healing-unification branch.
Already fixed on the refactor/snapsync-healing-unification branch, PR #6154 for main.
…in roadmap. Already fixed on the refactor/snapsync-healing-unification branch.
…5975 instead of the now-merged refactor/snapsync-healing-unification branch.
Dismissing — my comments about stale file paths were incorrect. The paths (snap/client.rs, sync/healing/state.rs, etc.) are correct and still exist on main. #5975 did not reorganize the module structure as I claimed. Sorry for the noise.
Force-pushed from 1970015 to 97a8e7f (Compare)
> ### 3.1 Migrate Snap Sync to Spawned Actors
>
> **Current State:** `snap_sync.rs` orchestrates everything via sequential function calls. Workers are spawned with `tokio::spawn` and communicate via `mpsc::channel`. Peer acquisition uses `try_recv` + sleep busy-wait loops.
There are quite a few tasks in JoinSets that communicate via the tasks' return values.
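The pattern the comment refers to, collecting worker results from join handles rather than through a channel, looks roughly like this std-thread analogue of tokio's `JoinSet` (illustrative only; all names here are made up):

```rust
use std::thread;

// Illustrative: each worker returns its result through its join handle,
// so no mpsc channel is needed to collect outputs.
fn run_workers(n: i32) -> Vec<i32> {
    let handles: Vec<thread::JoinHandle<i32>> = (0..n)
        .map(|i| thread::spawn(move || i * i))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

With tokio, `JoinSet::join_next().await` plays the role of `join()` here, yielding each task's return value as it completes.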
- Add §1.18 observability tooling (PR #6470)
- Add §1.19 pivot update reliability (PR #6475, issue #6474)
- Add §1.20 big-account within-trie parallelization (issue #6477)
- Add §1.21 small-account batching (issue #6476)
- Add §1.22 decoded TrieLayerCache (PR #6348)
- Add §1.23 bloom filter for non-existent storage (PR #6288)
- Add §1.24 adaptive request sizing + bisection (PR #6181)
- Add §1.25 concurrent bytecode + storage (PR #6205)
- Add §1.26 phase completion markers (PR #6189)
- Add §2.18 StorageTrieTracker refactor (PR #6171)
- Update current-state bottleneck table with small-account and pivot-update findings
- Reprioritize timeline: pivot-update crash fix is now priority 0
- Add two risks (pivot crash masks perf work, DB corruption on every crash)
- Bump doc version to 1.3
2026-04-15 Update

Comprehensive refresh covering work since the last update (2026-04-06). All new items are timestamped inline.

New roadmap items (Phase 1)
New roadmap items (Phase 2)
Re-prioritized timeline

The pivot-update crash fix (§1.19) is now Priority 0 — every other perf win is masked by the ~20% crash rate and DB corruption on failure. Second-highest priority change: small-account batching (§1.21), identified as the largest remaining perf opportunity after #6410 (potentially 30-40% of the storage phase vs ~5-6% for the big-account optimization).

Updated bottleneck table

Added three rows to "Current Performance Bottlenecks":
New risks
PR/issue status changes since last update

All previously-tracked items remain OPEN in the same state — no merges or closes since 2026-04-06.

Not-yet-added items worth considering
Motivation
We want to improve performance and readability in the snap sync module. This roadmap documents the strategic plan in two phases: performance optimization and code quality improvements.
Description

Adds a comprehensive roadmap document (`docs/roadmaps/snap_sync_roadmap.md`) covering current state analysis, bottlenecks, and improvement items across two phases.

PR Tracking
Issue #6140 — Refactor `request_storage_ranges` (Steps)

9-step plan to refactor `request_storage_ranges` in `crates/networking/p2p/snap/client.rs`. Each step is one independently correct commit. Full details in Issue #6140.

- Replace `panic!` with proper error return
- Replace `.expect()` with `?` operator
- Extract `ensure_snapshot_dir` helper (4 occurrences)
- `TaskTracker` struct for task counting
- `StorageDownloadState`
- `.process_result()`
- `accounts_done` HashMap (inline removal)
- Replace (`try_recv` + sleep) with `tokio::select!`

Dependencies:
Execution order: 1 → 2 → 3 → 5 → 4 → 6 → 7 → 8 → 9
Merged
Discarded
Checklist
Update `STORE_SCHEMA_VERSION` (`crates/storage/lib.rs`) if the PR includes breaking changes to the `Store` requiring a re-sync.