feat(l1): add snap sync observability endpoints and tooling #6470
ElFantasma wants to merge 33 commits into main
Conversation
…or diagnostics

Add multi-layered observability for snap sync to diagnose intermittent pivot update failures on mainnet:
- Prometheus metrics: eligible peers, inflight requests, pivot age, phase tracking, pivot update/storage request/header resolution outcome counters
- Admin RPC: admin_peerScores (full peer table with scores, eligibility) and admin_syncStatus (phase, pivot, staleness, recent events)
- Log instrumentation: TRACE per-peer dialogue, DEBUG for staleness, pivot updates with peer pool state, error classification
- Docker monitor: polls new RPC endpoints, keeps rolling snapshot buffer, dumps on degradation detection with adaptive frequency
- REPL: admin.peerScores and admin.syncStatus commands
…n issues detected

The docker monitor bumps log level to TRACE via admin_setLogLevel RPC when degradation is detected, capturing detailed per-peer dialogue in container logs. Restores to DEBUG on recovery. Also sets default RUST_LOG in docker-compose to enable DEBUG-level sync events (staleness, pivot updates, error classification).
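The escalation call is just an admin JSON-RPC request. A minimal Python sketch (the endpoint URL, helper names, and level strings are assumptions; only `admin_setLogLevel` is named in the commit):

```python
import json
import urllib.request

RPC_URL = "http://localhost:8545"  # assumed node RPC endpoint

def rpc_payload(method: str, params: list) -> dict:
    """Build a JSON-RPC 2.0 request body."""
    return {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}

def set_log_level(level: str, url: str = RPC_URL) -> None:
    """Ask the node to change its log level (level strings assumed)."""
    body = json.dumps(rpc_payload("admin_setLogLevel", [level])).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=5).read()
```

The monitor would call `set_log_level("trace")` when degradation is detected and `set_log_level("debug")` once the node recovers.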
…shots

Read pivot age from current time and progress counters from METRICS atomics on each admin_syncStatus call, so the RPC always returns fresh values instead of stale data from the last phase transition.
When a JSON response contains an array of objects (like admin_peerScores peers list), render as an aligned table with headers instead of collapsing to a single line. Makes diagnostic endpoints readable in the REPL.
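The actual renderer lives in the REPL's Rust formatter; as a rough illustration of the alignment logic only, a Python sketch (function name and padding choices are illustrative, not the PR's code):

```python
def render_table(rows: list[dict]) -> str:
    """Render a list of JSON objects as an aligned text table with headers."""
    if not rows:
        return ""
    headers = list(rows[0].keys())
    cells = [[str(r.get(h, "")) for h in headers] for r in rows]
    # Each column is as wide as its widest cell (or its header).
    widths = [
        max(len(h), *(len(row[i]) for row in cells))
        for i, h in enumerate(headers)
    ]
    def fmt(values: list[str]) -> str:
        return "  ".join(v.ljust(w) for v, w in zip(values, widths))
    lines = [fmt(headers), fmt(["-" * w for w in widths])]
    lines.extend(fmt(row) for row in cells)
    return "\n".join(lines)
```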
…ect healing phase

- on_failure: do a final RPC poll and bump log level before dumping
- _dump_snapshots: support force=True to re-dump even if already dumped
- Detect healing phase as degradation trigger to get 5s polling during the high-risk window where pivot updates happen
- Always capture peer state at time of failure for post-mortem analysis
The method now detects both actual degradation (low eligible peers, staleness) and high-risk phases (healing), so the name should reflect the broader scope.
…radation

The monitor was trying to dump peer snapshots to the run directory during degradation events, but the directory didn't exist yet (created at run end by save_all_logs). Now set_run_id() creates the directory immediately. Also adds peer_top.py live viewer script.
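The fix amounts to creating the run directory eagerly and idempotently. A sketch (function and argument names are illustrative, based on the set_run_id() mentioned above):

```python
import os

def set_run_id(base_dir: str, run_id: str) -> str:
    """Create the run directory up front so snapshot dumps during
    degradation never hit a path that would otherwise only be created
    at run end."""
    run_dir = os.path.join(base_dir, run_id)
    os.makedirs(run_dir, exist_ok=True)  # idempotent, safe to call repeatedly
    return run_dir
```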
🤖 Kimi Code Review

The PR adds comprehensive sync diagnostics and metrics. Overall structure is good, but there are several issues to address:

Critical Issues
Security/Correctness
Code Quality
Minor Issues
Positive Observations
Recommendations

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt
🤖 Codex Code Review

Findings

No direct EVM opcode, gas-accounting, trie, RLP, or consensus-rule changes stood out here; the risk is mainly in sync observability and the new tooling paths.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt
Greptile Summary

This PR adds end-to-end snap-sync observability across four layers.

Confidence Score: 5/5. Safe to merge: all changes are purely additive observability; core sync logic is untouched. All three findings are P2 (style/quality suggestions): one performance nit in non-critical debug logging, one cosmetic key-name duplication in a diagnostic response, and one defensive-coding improvement in a monitoring tool. None affect correctness of sync, storage, or the RPC API. The metrics are feature-gated, the shared-Arc pattern is correctly implemented, and the new endpoints gracefully handle absent peer_handler/syncer.

crates/networking/p2p/sync/snap_sync.rs (update_pivot debug overhead), crates/networking/p2p/sync_manager.rs (duplicate phase_progress key)
| Filename | Overview |
|---|---|
| crates/networking/p2p/sync/snap_sync.rs | Phase tracking and pivot diagnostics wired in; read_peer_diagnostics() called on every update_pivot loop iteration for debug logging, adding actor roundtrip overhead in the hot path. |
| crates/networking/p2p/sync_manager.rs | get_sync_diagnostics() correctly shares Arc with Syncer; inserts "accounts_downloaded" key while snap_sync.rs separately inserts "account_ranges_downloaded" from the same METRICS atomic — both will appear in admin_syncStatus.phase_progress. |
| crates/blockchain/metrics/sync.rs | New MetricsSync struct; correct per-call registry pattern, clean gauge/counter API. |
| crates/networking/rpc/admin/peers.rs | peer_scores and sync_status handlers correctly guard against missing peer_handler/syncer; average_score uses integer division (expected for diagnostics). |
| tooling/sync/peer_top.py | TUI correctly polls both admin endpoints; truthy-but-malformed response (missing "summary"/"peers" keys) would raise uncaught KeyError and silently exit the tool. |
| tooling/sync/docker_monitor.py | DiagnosticsTracker correctly uses rolling buffers, degradation detection, and log level escalation; run_id guard and force-dump on failure are handled properly. |
| crates/networking/p2p/sync.rs | SyncDiagnostics struct and bounded deques for pivot/error history are clean; Arc shared correctly between Syncer and SyncManager. |
| crates/networking/rpc/rpc.rs | Two new admin routes correctly registered; no other changes. |
| tooling/repl/src/formatter.rs | Array-of-objects table rendering is additive and well-scoped; no logic issues. |
| tooling/repl/src/commands/admin.rs | New peerScores and syncStatus REPL commands wired correctly to the new RPC methods. |
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as RPC Client
    participant RPC as admin RPC (rpc.rs)
    participant SM as SyncManager
    participant Diag as Arc<RwLock<SyncDiagnostics>>
    participant Snap as snap_sync (Syncer)
    participant PT as PeerTable (actor)
    Note over Snap,Diag: During sync cycle
    Snap->>Diag: write() — current_phase, pivot_block, phase_progress
    Snap->>Diag: push_pivot_change() / push_error()
    Note over Client,PT: admin_syncStatus RPC
    Client->>RPC: admin_syncStatus
    RPC->>SM: get_sync_diagnostics()
    SM->>Diag: read().clone()
    SM->>SM: override pivot_age_seconds (live)
    SM->>SM: merge phase_progress from METRICS atomics
    SM-->>RPC: SyncDiagnostics
    RPC-->>Client: JSON response
    Note over Client,PT: admin_peerScores RPC
    Client->>RPC: admin_peerScores
    RPC->>PT: get_peer_diagnostics()
    PT-->>RPC: Vec<PeerDiagnostics>
    RPC->>RPC: compute summary (total, eligible, avg_score)
    RPC-->>Client: JSON {peers, summary}
```
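The summary step at the end of the admin_peerScores flow can be sketched as follows (field names are a guess from the diagram; note the integer-division average flagged in review):

```python
def summarize_peers(peers: list[dict]) -> dict:
    """Compute a {total, eligible, average_score} summary over the
    per-peer diagnostics list (integer division, diagnostics-grade only)."""
    total = len(peers)
    eligible = sum(1 for p in peers if p.get("eligible"))
    average = sum(p.get("score", 0) for p in peers) // total if total else 0
    return {"total": total, "eligible": eligible, "average_score": average}
```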
Reviews (1): Last reviewed commit: "fix(l1): create run directory early so s..."
```rust
let diag = peers.read_peer_diagnostics().await;
let eligible_count = diag.iter().filter(|p| p.eligible).count();
let total_count = diag.len();
debug!(
    eligible_peers = eligible_count,
    total_peers = total_count,
    selected_peer = %peer_id,
    peer_score = peer_score,
    consecutive_failures = consecutive_failures,
    total_failures = total_failures,
```
**Full peer diagnostics fetch inside hot retry loop**
read_peer_diagnostics() performs a full actor roundtrip plus an O(N_peers) pass over the peer table on every iteration of the update_pivot loop (up to MAX_TOTAL_FAILURES = 15 times, and called again on each healing-phase pivot refresh). The data is used only for a debug! log, so it adds unnecessary actor-call latency on every attempt. Consider moving this fetch outside the loop (capture once before entering) or gating it with a tracing::enabled!(Level::DEBUG) check.
```rust
if accounts_downloaded > 0 {
    diag.phase_progress
        .insert("accounts_downloaded".into(), accounts_downloaded);
```
**Duplicate phase-progress key alongside `"account_ranges_downloaded"`**

`snap_sync.rs` (line 356) writes `"account_ranges_downloaded"` to `phase_progress` from the same `METRICS.downloaded_account_tries` atomic. `get_sync_diagnostics()` then additionally inserts `"accounts_downloaded"` from the same source. After `admin_syncStatus` is called, consumers will see both keys populated with the same value, which is confusing and wastes response payload. Align on one canonical key (e.g. `"accounts_downloaded"`) and remove the other write.
```python
s = data["summary"]
peers = data["peers"]
```
**Unchecked dict access will crash the TUI on unexpected RPC response**
data being truthy only means the result field was present and non-null; it doesn't guarantee the "summary" or "peers" keys exist (e.g. if the node returns an error object, a future schema change, or a partial response). data["summary"] will raise KeyError, which propagates to the outer except Exception: cleanup() and exits the tool silently.
```suggestion
s = data.get("summary", {})
peers = data.get("peers", [])
if not s or peers is None:
    lines.append(f"{RED}Unexpected response from node{RESET}")
    return lines
```
🤖 Claude Code Review

Here is my review of PR #6470:
…ixes

- fix clippy redundant closure in formatter.rs
- fix unnecessary u64 casts in sync_manager.rs
- migrate MetricsSync to default prometheus registry (Pattern B) so new sync metrics register once at init instead of per-gather. Removes gather_metrics() — exported via gather_default_metrics() automatically
- fix admin_syncStatus reporting stale phase after recoverable error
- fix on_failure() poll bypass — add force param so final poll executes even when instance status is already "failed"
- fix peer_top.sh Python 3.12+ f-string syntax — use str.format()
- remove dead DEGRADATION_STALL_TIMEOUT constant
Review feedback addressed

Thanks for the thorough reviews. Here's how each finding was handled.

Fixed
Not addressing (with reasoning)
Rename hardcoded "healing" trigger to configurable WATCHED_PHASES set. Phases in this set get TRACE logging and fast polling — useful for investigating specific sync stages without editing the script. Default: healing (current investigation target). Override via --watched-phases flag, e.g. --watched-phases "healing,storage_insertion"
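The flag could be parsed along these lines (a sketch; only the --watched-phases name and the "healing" default come from the commit message):

```python
import argparse

def parse_watched_phases(argv: list[str]) -> set[str]:
    """Turn --watched-phases "healing,storage_insertion" into the set of
    phase names that get TRACE logging and fast polling."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--watched-phases",
        default="healing",  # current investigation target
        help="comma-separated sync phases to watch closely",
    )
    args = parser.parse_args(argv)
    return {p.strip() for p in args.watched_phases.split(",") if p.strip()}
```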
Incorporate progress metrics from PR #6468 (Tomi/Esteve) into the observability PR, with improvements:

- Add progress gauges: headers, accounts, storage, healing, bytecodes (downloaded/inserted/total) + stage + pivot_block
- Push from METRICS atomics via push_sync_prometheus_metrics() in network.rs, called each polling cycle and on phase completion
- Grafana dashboard with 7 rows: overview, peer health, headers, accounts, storage, healing, bytecodes — with progress gauges, rate panels (using Grafana rate() instead of app-computed rates), and ETA
- All metrics use default Prometheus registry (register at init)
- New peer-health row with eligible peers, pivot age, inflight requests, and pivot update outcomes — not present in the original PR

Supersedes #6468.
- Add ethrex_sync_phase_start_timestamp{phase} labeled gauge — set on
each phase transition, persists for completed phases
- Grafana computes elapsed as time() - timestamp, per phase
- Pivot Age now uses time() - ethrex_sync_pivot_timestamp (live)
- pivot_age_seconds also updated each push cycle for RPC/peer_top
- Added ETA panels for headers, accounts, bytecodes (remaining/rate)
- Added elapsed panels for all phases in their respective rows
- Overview row: replaced Throughput (N/A during sync) with Phase Elapsed
showing all active/completed phase timings
…ync, not only after
…ing_syncing covers it
…h function in network.rs covers them
…apsed per phase + full-width timeseries
- Add §1.18 observability tooling (PR #6470)
- Add §1.19 pivot update reliability (PR #6475, issue #6474)
- Add §1.20 big-account within-trie parallelization (issue #6477)
- Add §1.21 small-account batching (issue #6476)
- Add §1.22 decoded TrieLayerCache (PR #6348)
- Add §1.23 bloom filter for non-existent storage (PR #6288)
- Add §1.24 adaptive request sizing + bisection (PR #6181)
- Add §1.25 concurrent bytecode + storage (PR #6205)
- Add §1.26 phase completion markers (PR #6189)
- Add §2.18 StorageTrieTracker refactor (PR #6171)
- Update current-state bottleneck table with small-account and pivot-update findings
- Reprioritize timeline: pivot-update crash fix is now priority 0
- Add two risks (pivot crash masks perf work, DB corruption on every crash)
- Bump doc version to 1.3
```rust
.duration_since(std::time::UNIX_EPOCH)
.unwrap_or_default()
.as_secs(),
error_type: format!("{:?}", std::mem::discriminant(&error)),
```
Not sure if something like 'Discriminant(7)' is that useful
avilagaston9 left a comment
Solid observability PR — well-structured across all 5 layers. Left a few inline comments on things I noticed.
```rust
.zip(&col_widths)
.map(|(val, w)| {
    let truncated = if val.len() > *w {
        format!("{}…", &val[..*w - 1])
```
`val[..*w - 1]` slices by byte index, not character index. If `val` contains multi-byte UTF-8 (peer client versions sometimes include emoji or non-ASCII), this panics with `byte index is not a char boundary`.
Suggestion:

```rust
let truncated = if val.len() > *w {
    let end = val.char_indices()
        .take(*w - 1)
        .last()
        .map(|(i, c)| i + c.len_utf8())
        .unwrap_or(0);
    format!("{}…", &val[..end])
```

Or simpler: `val.chars().take(w - 1).collect::<String>()` (allocates, but this isn't a hot path).
```rust
Err(error) => {
    match error.is_recoverable() {
    let recoverable = error.is_recoverable();
    self.diagnostics.write().await.current_phase = "idle".to_string();
```
Two separate write-lock acquisitions here and on the `.push_error()` call below. If diagnostics are read between the two locks (e.g. by the RPC handler), the phase shows `idle` but the error event is missing. Consider combining into a single lock:
```rust
{
    let mut diag = self.diagnostics.write().await;
    diag.current_phase = "idle".to_string();
    diag.push_error(SyncErrorEvent { ... });
}
```

The `debug!` macro between them doesn't need the lock.
Motivation

Recent mainnet-9 multisync runs surfaced intermittent snap sync failures that are hard to diagnose from standard logs — in particular, pivot-update failures fired during the healing transition, where peer-selection bottlenecks lead to `process::exit(2)`. To investigate and eventually prevent these, we need tools to (a) inspect live sync state from outside the node, (b) capture detailed peer diagnostics when something degrades, and (c) post-mortem a failure with full peer-table context.

Description
End-to-end observability for snap sync, across five layers.
Node — sync metrics (new `crates/blockchain/metrics/sync.rs`)

Node — RPC endpoints:
- `admin_syncStatus`: current phase, pivot block, staleness info, phase progress. Resets to `idle` on both success and error paths.
- `admin_peerScores`: full peer table with scores, capabilities, supported block ranges, and per-capability eligibility. Computed live per query (not read from a possibly-stale snapshot)
- Phase tracking in `snap_sync.rs`

Grafana dashboard (`metrics/provisioning/grafana/dashboards/common_dashboards/snapsync_dashboard.json`)

Monitor (`tooling/sync/docker_monitor.py`):
- Watched phases (`--watched-phases`): sync phases that trigger TRACE logging and fast polling (5s). Empty by default — opt-in, not opt-out. Use `MULTISYNC_WATCHED_PHASES=healing` in the Makefile or `--watched-phases "healing"` directly.
- On degradation: log level bumped via `admin_setLogLevel`, snapshots dumped to disk; restored on recovery

Live TUI (`tooling/sync/peer_top.py`)

REPL (`tooling/repl/src/formatter.rs` + admin commands):
- Aligned table rendering for arrays of objects (e.g. `admin_peerScores`)

Note: This PR incorporates the snap sync progress metrics and Grafana dashboard from #6468 (Tomi/Esteve), with adaptations: metrics use the default Prometheus registry, throughput rates are computed in Grafana via rate() instead of in-app, and a new peer-health row was added. I closed #6468, as it was superseded.
Checklist

- `make multisync-loop-auto` on ethrex-mainnet-9 (mainnet, sepolia, hoodi in parallel)