feat(l1): add snap sync observability endpoints and tooling #6470
ElFantasma wants to merge 33 commits into main
Conversation
…or diagnostics

Add multi-layered observability for snap sync to diagnose intermittent pivot update failures on mainnet:
- Prometheus metrics: eligible peers, inflight requests, pivot age, phase tracking, pivot update/storage request/header resolution outcome counters
- Admin RPC: admin_peerScores (full peer table with scores, eligibility) and admin_syncStatus (phase, pivot, staleness, recent events)
- Log instrumentation: TRACE per-peer dialogue, DEBUG for staleness, pivot updates with peer pool state, error classification
- Docker monitor: polls new RPC endpoints, keeps rolling snapshot buffer, dumps on degradation detection with adaptive frequency
- REPL: admin.peerScores and admin.syncStatus commands
…n issues detected

The docker monitor bumps log level to TRACE via admin_setLogLevel RPC when degradation is detected, capturing detailed per-peer dialogue in container logs. Restores to DEBUG on recovery. Also sets default RUST_LOG in docker-compose to enable DEBUG-level sync events (staleness, pivot updates, error classification).
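The escalation call is just an admin JSON-RPC request. A minimal Python sketch (the endpoint URL, helper names, and level strings are assumptions; only `admin_setLogLevel` is named in the commit):

```python
import json
import urllib.request

RPC_URL = "http://localhost:8545"  # assumed node RPC endpoint

def rpc_payload(method: str, params: list) -> dict:
    """Build a JSON-RPC 2.0 request body."""
    return {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}

def set_log_level(level: str, url: str = RPC_URL) -> None:
    """Ask the node to change its log level (level strings assumed)."""
    body = json.dumps(rpc_payload("admin_setLogLevel", [level])).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=5).read()
```

The monitor would call `set_log_level("trace")` when degradation is detected and `set_log_level("debug")` once the node recovers.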
…shots

Read pivot age from current time and progress counters from METRICS atomics on each admin_syncStatus call, so the RPC always returns fresh values instead of stale data from the last phase transition.
When a JSON response contains an array of objects (like admin_peerScores peers list), render as an aligned table with headers instead of collapsing to a single line. Makes diagnostic endpoints readable in the REPL.
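The actual renderer lives in the REPL's Rust formatter; as a rough illustration of the alignment logic only, a Python sketch (function name and padding choices are illustrative, not the PR's code):

```python
def render_table(rows: list[dict]) -> str:
    """Render a list of JSON objects as an aligned text table with headers."""
    if not rows:
        return ""
    headers = list(rows[0].keys())
    cells = [[str(r.get(h, "")) for h in headers] for r in rows]
    # Each column is as wide as its widest cell (or its header).
    widths = [
        max(len(h), *(len(row[i]) for row in cells))
        for i, h in enumerate(headers)
    ]
    def fmt(values: list[str]) -> str:
        return "  ".join(v.ljust(w) for v, w in zip(values, widths))
    lines = [fmt(headers), fmt(["-" * w for w in widths])]
    lines.extend(fmt(row) for row in cells)
    return "\n".join(lines)
```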
…ect healing phase

- on_failure: do a final RPC poll and bump log level before dumping
- _dump_snapshots: support force=True to re-dump even if already dumped
- Detect healing phase as degradation trigger to get 5s polling during the high-risk window where pivot updates happen
- Always capture peer state at time of failure for post-mortem analysis
The method now detects both actual degradation (low eligible peers, staleness) and high-risk phases (healing), so the name should reflect the broader scope.
…radation

The monitor was trying to dump peer snapshots to the run directory during degradation events, but the directory didn't exist yet (created at run end by save_all_logs). Now set_run_id() creates the directory immediately. Also adds peer_top.py live viewer script.
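The fix amounts to creating the run directory eagerly and idempotently. A sketch (function and argument names are illustrative, based on the set_run_id() mentioned above):

```python
import os

def set_run_id(base_dir: str, run_id: str) -> str:
    """Create the run directory up front so snapshot dumps during
    degradation never hit a path that would otherwise only be created
    at run end."""
    run_dir = os.path.join(base_dir, run_id)
    os.makedirs(run_dir, exist_ok=True)  # idempotent, safe to call repeatedly
    return run_dir
```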
🤖 Kimi Code Review

The PR adds comprehensive sync diagnostics and metrics. Overall structure is good, but there are several issues to address:

Critical Issues
Security/Correctness
Code Quality
Minor Issues
Positive Observations
Recommendations

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt
🤖 Codex Code Review

Findings

No direct EVM opcode, gas-accounting, trie, RLP, or consensus-rule changes stood out here; the risk is mainly in sync observability and the new tooling paths.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt
Greptile Summary

This PR adds end-to-end snap-sync observability across four layers.

Confidence Score: 5/5. Safe to merge: all changes are purely additive observability; core sync logic is untouched. All three findings are P2 (style/quality suggestions): one performance nit in non-critical debug logging, one cosmetic key-name duplication in a diagnostic response, and one defensive-coding improvement in a monitoring tool. None affect correctness of sync, storage, or the RPC API. The metrics are feature-gated, the shared-Arc pattern is correctly implemented, and the new endpoints gracefully handle absent peer_handler/syncer.

crates/networking/p2p/sync/snap_sync.rs (update_pivot debug overhead), crates/networking/p2p/sync_manager.rs (duplicate phase_progress key)
| Filename | Overview |
|---|---|
| crates/networking/p2p/sync/snap_sync.rs | Phase tracking and pivot diagnostics wired in; read_peer_diagnostics() called on every update_pivot loop iteration for debug logging, adding actor roundtrip overhead in the hot path. |
| crates/networking/p2p/sync_manager.rs | get_sync_diagnostics() correctly shares Arc with Syncer; inserts "accounts_downloaded" key while snap_sync.rs separately inserts "account_ranges_downloaded" from the same METRICS atomic — both will appear in admin_syncStatus.phase_progress. |
| crates/blockchain/metrics/sync.rs | New MetricsSync struct; correct per-call registry pattern, clean gauge/counter API. |
| crates/networking/rpc/admin/peers.rs | peer_scores and sync_status handlers correctly guard against missing peer_handler/syncer; average_score uses integer division (expected for diagnostics). |
| tooling/sync/peer_top.py | TUI correctly polls both admin endpoints; truthy-but-malformed response (missing "summary"/"peers" keys) would raise uncaught KeyError and silently exit the tool. |
| tooling/sync/docker_monitor.py | DiagnosticsTracker correctly uses rolling buffers, degradation detection, and log level escalation; run_id guard and force-dump on failure are handled properly. |
| crates/networking/p2p/sync.rs | SyncDiagnostics struct and bounded deques for pivot/error history are clean; Arc shared correctly between Syncer and SyncManager. |
| crates/networking/rpc/rpc.rs | Two new admin routes correctly registered; no other changes. |
| tooling/repl/src/formatter.rs | Array-of-objects table rendering is additive and well-scoped; no logic issues. |
| tooling/repl/src/commands/admin.rs | New peerScores and syncStatus REPL commands wired correctly to the new RPC methods. |
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as RPC Client
    participant RPC as admin RPC (rpc.rs)
    participant SM as SyncManager
    participant Diag as Arc<RwLock<SyncDiagnostics>>
    participant Snap as snap_sync (Syncer)
    participant PT as PeerTable (actor)
    Note over Snap,Diag: During sync cycle
    Snap->>Diag: write() — current_phase, pivot_block, phase_progress
    Snap->>Diag: push_pivot_change() / push_error()
    Note over Client,PT: admin_syncStatus RPC
    Client->>RPC: admin_syncStatus
    RPC->>SM: get_sync_diagnostics()
    SM->>Diag: read().clone()
    SM->>SM: override pivot_age_seconds (live)
    SM->>SM: merge phase_progress from METRICS atomics
    SM-->>RPC: SyncDiagnostics
    RPC-->>Client: JSON response
    Note over Client,PT: admin_peerScores RPC
    Client->>RPC: admin_peerScores
    RPC->>PT: get_peer_diagnostics()
    PT-->>RPC: Vec<PeerDiagnostics>
    RPC->>RPC: compute summary (total, eligible, avg_score)
    RPC-->>Client: JSON {peers, summary}
```
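The summary step at the end of the admin_peerScores flow can be sketched as follows (field names are a guess from the diagram; note the integer-division average flagged in review):

```python
def summarize_peers(peers: list[dict]) -> dict:
    """Compute a {total, eligible, average_score} summary over the
    per-peer diagnostics list (integer division, diagnostics-grade only)."""
    total = len(peers)
    eligible = sum(1 for p in peers if p.get("eligible"))
    average = sum(p.get("score", 0) for p in peers) // total if total else 0
    return {"total": total, "eligible": eligible, "average_score": average}
```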
Reviews (1): Last reviewed commit: "fix(l1): create run directory early so s..."
```rust
let diag = peers.read_peer_diagnostics().await;
let eligible_count = diag.iter().filter(|p| p.eligible).count();
let total_count = diag.len();
debug!(
    eligible_peers = eligible_count,
    total_peers = total_count,
    selected_peer = %peer_id,
    peer_score = peer_score,
    consecutive_failures = consecutive_failures,
    total_failures = total_failures,
```
**Full peer diagnostics fetch inside hot retry loop**
read_peer_diagnostics() performs a full actor roundtrip plus an O(N_peers) pass over the peer table on every iteration of the update_pivot loop (up to MAX_TOTAL_FAILURES = 15 times, and called again on each healing-phase pivot refresh). The data is used only for a debug! log, so it adds unnecessary actor-call latency on every attempt. Consider moving this fetch outside the loop (capture once before entering) or gating it with a tracing::enabled!(Level::DEBUG) check.
```rust
if accounts_downloaded > 0 {
    diag.phase_progress
        .insert("accounts_downloaded".into(), accounts_downloaded);
```
**Duplicate phase-progress key alongside `"account_ranges_downloaded"`**

`snap_sync.rs` (line 356) writes `"account_ranges_downloaded"` to `phase_progress` from the same `METRICS.downloaded_account_tries` atomic. `get_sync_diagnostics()` then additionally inserts `"accounts_downloaded"` from the same source. After `admin_syncStatus` is called, consumers will see both keys populated with the same value, which is confusing and wastes response payload. Align on one canonical key (e.g. `"accounts_downloaded"`) and remove the other write.
```python
s = data["summary"]
peers = data["peers"]
```
**Unchecked dict access will crash the TUI on unexpected RPC response**
data being truthy only means the result field was present and non-null; it doesn't guarantee the "summary" or "peers" keys exist (e.g. if the node returns an error object, a future schema change, or a partial response). data["summary"] will raise KeyError, which propagates to the outer except Exception: cleanup() and exits the tool silently.
```suggestion
s = data.get("summary", {})
peers = data.get("peers", [])
if not s or peers is None:
    lines.append(f"{RED}Unexpected response from node{RESET}")
    return lines
```
🤖 Claude Code Review

Here is my review of PR #6470:
…ixes

- fix clippy redundant closure in formatter.rs
- fix unnecessary u64 casts in sync_manager.rs
- migrate MetricsSync to default prometheus registry (Pattern B) so new sync metrics register once at init instead of per-gather. Removes gather_metrics() — exported via gather_default_metrics() automatically
- fix admin_syncStatus reporting stale phase after recoverable error
- fix on_failure() poll bypass — add force param so final poll executes even when instance status is already "failed"
- fix peer_top.sh Python 3.12+ f-string syntax — use str.format()
- remove dead DEGRADATION_STALL_TIMEOUT constant
Review feedback addressed

Thanks for the thorough reviews. Here's how each finding was handled.

Fixed
Not addressing (with reasoning)
Rename hardcoded "healing" trigger to configurable WATCHED_PHASES set. Phases in this set get TRACE logging and fast polling — useful for investigating specific sync stages without editing the script. Default: healing (current investigation target). Override via --watched-phases flag, e.g. --watched-phases "healing,storage_insertion"
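The flag could be parsed along these lines (a sketch; only the --watched-phases name and the "healing" default come from the commit message):

```python
import argparse

def parse_watched_phases(argv: list[str]) -> set[str]:
    """Turn --watched-phases "healing,storage_insertion" into the set of
    phase names that get TRACE logging and fast polling."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--watched-phases",
        default="healing",  # current investigation target
        help="comma-separated sync phases to watch closely",
    )
    args = parser.parse_args(argv)
    return {p.strip() for p in args.watched_phases.split(",") if p.strip()}
```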
Incorporate progress metrics from PR #6468 (Tomi/Esteve) into the observability PR, with improvements:

- Add progress gauges: headers, accounts, storage, healing, bytecodes (downloaded/inserted/total) + stage + pivot_block
- Push from METRICS atomics via push_sync_prometheus_metrics() in network.rs, called each polling cycle and on phase completion
- Grafana dashboard with 7 rows: overview, peer health, headers, accounts, storage, healing, bytecodes — with progress gauges, rate panels (using Grafana rate() instead of app-computed rates), and ETA
- All metrics use default Prometheus registry (register at init)
- New peer-health row with eligible peers, pivot age, inflight requests, and pivot update outcomes — not present in the original PR

Supersedes #6468.
- Add ethrex_sync_phase_start_timestamp{phase} labeled gauge — set on
each phase transition, persists for completed phases
- Grafana computes elapsed as time() - timestamp, per phase
- Pivot Age now uses time() - ethrex_sync_pivot_timestamp (live)
- pivot_age_seconds also updated each push cycle for RPC/peer_top
- Added ETA panels for headers, accounts, bytecodes (remaining/rate)
- Added elapsed panels for all phases in their respective rows
- Overview row: replaced Throughput (N/A during sync) with Phase Elapsed
showing all active/completed phase timings
…ync, not only after
…ing_syncing covers it
…h function in network.rs covers them
…apsed per phase + full-width timeseries
- Add §1.18 observability tooling (PR #6470)
- Add §1.19 pivot update reliability (PR #6475, issue #6474)
- Add §1.20 big-account within-trie parallelization (issue #6477)
- Add §1.21 small-account batching (issue #6476)
- Add §1.22 decoded TrieLayerCache (PR #6348)
- Add §1.23 bloom filter for non-existent storage (PR #6288)
- Add §1.24 adaptive request sizing + bisection (PR #6181)
- Add §1.25 concurrent bytecode + storage (PR #6205)
- Add §1.26 phase completion markers (PR #6189)
- Add §2.18 StorageTrieTracker refactor (PR #6171)
- Update current-state bottleneck table with small-account and pivot-update findings
- Reprioritize timeline: pivot-update crash fix is now priority 0
- Add two risks (pivot crash masks perf work, DB corruption on every crash)
- Bump doc version to 1.3
```rust
.duration_since(std::time::UNIX_EPOCH)
.unwrap_or_default()
.as_secs(),
error_type: format!("{:?}", std::mem::discriminant(&error)),
```
Not sure if something like 'Discriminant(7)' is that useful
avilagaston9 left a comment
Solid observability PR — well-structured across all 5 layers. Left a few inline comments on things I noticed.
```rust
.zip(&col_widths)
.map(|(val, w)| {
    let truncated = if val.len() > *w {
        format!("{}…", &val[..*w - 1])
```
`val[..*w - 1]` slices by byte index, not character index. If `val` contains multi-byte UTF-8 (peer client versions sometimes include emoji or non-ASCII), this panics with `byte index is not a char boundary`.
Suggestion:

```rust
let truncated = if val.len() > *w {
    let end = val.char_indices()
        .take(*w - 1)
        .last()
        .map(|(i, c)| i + c.len_utf8())
        .unwrap_or(0);
    format!("{}…", &val[..end])
```

Or simpler: `val.chars().take(w - 1).collect::<String>()` (allocates, but this isn't a hot path).
```rust
Err(error) => {
    match error.is_recoverable() {
    let recoverable = error.is_recoverable();
    self.diagnostics.write().await.current_phase = "idle".to_string();
```
Two separate write-lock acquisitions here and on the `.push_error()` call below. If diagnostics are read between the two locks (e.g. by the RPC handler), the phase shows `idle` but the error event is missing. Consider combining into a single lock:
```rust
{
    let mut diag = self.diagnostics.write().await;
    diag.current_phase = "idle".to_string();
    diag.push_error(SyncErrorEvent { ... });
}
```

The `debug!` macro between them doesn't need the lock.
Motivation

Recent mainnet-9 multisync runs surfaced intermittent snap sync failures that are hard to diagnose from standard logs — in particular, pivot-update failures fired during the healing transition, where peer-selection bottlenecks lead to `process::exit(2)`. To investigate and eventually prevent these, we need tools to (a) inspect live sync state from outside the node, (b) capture detailed peer diagnostics when something degrades, and (c) post-mortem a failure with full peer-table context.

Description
End-to-end observability for snap sync, across five layers.
Node — sync metrics (new `crates/blockchain/metrics/sync.rs`)

Node — RPC endpoints:
- `admin_syncStatus`: current phase, pivot block, staleness info, phase progress. Resets to `idle` on both success and error paths.
- `admin_peerScores`: full peer table with scores, capabilities, supported block ranges, and per-capability eligibility. Computed live per query (not read from a possibly-stale snapshot)
- Phase tracking in `snap_sync.rs`

Grafana dashboard (`metrics/provisioning/grafana/dashboards/common_dashboards/snapsync_dashboard.json`)

Monitor (`tooling/sync/docker_monitor.py`):
- Watched phases (`--watched-phases`): sync phases that trigger TRACE logging and fast polling (5s). Empty by default — opt-in, not opt-out. Use `MULTISYNC_WATCHED_PHASES=healing` in the Makefile or `--watched-phases "healing"` directly.
- On degradation: log level bumped via `admin_setLogLevel`, snapshots dumped to disk; restored on recovery

Live TUI (`tooling/sync/peer_top.py`)

REPL (`tooling/repl/src/formatter.rs` + admin commands):
- Aligned table rendering for arrays of objects (e.g. `admin_peerScores`)

Note: This PR incorporates the snap sync progress metrics and Grafana dashboard from #6468 (Tomi/Esteve), with adaptations: metrics use the default Prometheus registry, throughput rates are computed in Grafana via rate() instead of in-app, and a new peer-health row was added. I closed #6468, as it was superseded.
Checklist

- `make multisync-loop-auto` on ethrex-mainnet-9 (mainnet, sepolia, hoodi in parallel)