
feat(l1): add snap sync observability endpoints and tooling#6470

Open
ElFantasma wants to merge 33 commits into main from feat/snap-sync-observability-pr

Conversation

@ElFantasma
Contributor

@ElFantasma ElFantasma commented Apr 13, 2026

Motivation

Recent mainnet-9 multisync runs surfaced intermittent snap sync failures that are hard to diagnose from standard logs — in particular, pivot-update failures fired during the healing transition where peer-selection bottlenecks lead to process::exit(2). To investigate and eventually prevent these, we need tools to (a) inspect live sync state from outside the node, (b) capture detailed peer diagnostics when something degrades, and (c) post-mortem a failure with full peer-table context.

Description

End-to-end observability for snap sync, across five layers.

Node — sync metrics (new crates/blockchain/metrics/sync.rs):

  • Per-phase progress gauges: headers, accounts, storage, healing, bytecodes (downloaded/inserted/total)
  • Peer health: eligible peers, snap peers, inflight requests
  • Pivot tracking: block number, timestamp, age
  • Phase timing: per-phase start timestamps for elapsed calculation
  • Outcome counters: pivot updates, storage requests, header resolution (by outcome label)
  • Registered into the default Prometheus registry at init — no per-request overhead
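The register-once pattern in the last bullet can be sketched generically. This is an illustrative Python stand-in for the Rust/prometheus code, not the actual implementation; all names here are made up:

```python
import threading

# Metrics are created and registered a single time at init, then only
# mutated afterwards, so a scrape just reads current values — no
# per-request registry rebuild.
class Gauge:
    def __init__(self, name):
        self.name = name
        self._value = 0
        self._lock = threading.Lock()

    def set(self, value):
        with self._lock:
            self._value = value

    def render(self):
        return f"{self.name} {self._value}"

REGISTRY = {}

def register_gauge(name):
    gauge = Gauge(name)
    REGISTRY[name] = gauge
    return gauge

# registered once at module init, like the default Prometheus registry
ELIGIBLE_PEERS = register_gauge("snap_sync_eligible_peers")

def gather():
    # scrape path: read-only, no re-registration work
    return "\n".join(g.render() for g in REGISTRY.values())
```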

Node — RPC endpoints:

  • admin_syncStatus: current phase, pivot block, staleness info, phase progress. Resets to idle on both success and error paths.
  • admin_peerScores: full peer table with scores, capabilities, supported block ranges, and per-capability eligibility. Computed live per query (not read from a possibly-stale snapshot)
  • Additional header-download-phase diagnostics in snap_sync.rs
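A minimal way to call these endpoints from a script. The method names come from this PR; the port and transport details are assumptions about a typical node setup:

```python
import json

def rpc_request(method, params=None, req_id=1):
    """Build a JSON-RPC 2.0 request body for the admin endpoints."""
    return json.dumps(
        {"jsonrpc": "2.0", "method": method, "params": params or [], "id": req_id}
    )

body = rpc_request("admin_syncStatus")
# e.g. (port is an assumption, adjust to your node's HTTP RPC port):
#   curl -s -X POST -H 'Content-Type: application/json' \
#        --data "$body" http://localhost:8545
```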

Grafana dashboard (metrics/provisioning/grafana/dashboards/common_dashboards/snapsync_dashboard.json):

  • [Live dashboard on ethrex-grafana](https://grafana.ethrex.xyz/d/ethrex-snapsync/ethrex-snap-sync)
  • 7 sections: Sync Overview, Peer Health, Headers, Accounts, Storage, Healing, Bytecodes
  • Each phase section: progress gauge with downloaded/total counts, rate gauge, ETA, elapsed time, full-width rate-over-time timeseries
  • Peer Health row: eligible peers, pivot age (live from timestamp), inflight requests, pivot update count
  • All rates computed in Grafana via `rate(metric[5m])` — no in-app rate computation
  • ETAs computed as `remaining / rate` with division-by-zero guards
  • Multi-instance support via `$instance` variable (mainnet:3701, hoodi:3702, sepolia:3703)
  • Docker compose updated to expose metrics ports and enable `--metrics` on all containers
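The ETA formula with its guard, in miniature. This is an illustrative Python sketch, not the actual PromQL used by the dashboard:

```python
def eta_seconds(remaining, rate_per_s):
    # remaining / rate with a division-by-zero guard; returns None when
    # there is no measurable rate (the dashboard hides the panel instead)
    return remaining / rate_per_s if rate_per_s > 0 else None
```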

Monitor (tooling/sync/docker_monitor.py):

  • Rolling snapshot buffer per instance, dumped to run directory on demand
  • Degradation detection: low eligible peers, high staleness ratio
  • Configurable watched phases (--watched-phases): sync phases that trigger TRACE logging and fast polling (5s). Empty by default — opt-in, not opt-out. Use MULTISYNC_WATCHED_PHASES=healing in the Makefile or --watched-phases "healing" directly.
  • On degradation: poll drops to 5s, log level bumps to TRACE via admin_setLogLevel, snapshots dumped to disk; restored on recovery
  • Force-dump on any failure for post-mortem (with final RPC poll to capture peer state at failure time)
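The degradation check described above can be sketched as follows. Field names and thresholds are illustrative, not the exact docker_monitor.py code:

```python
WATCHED_PHASES = set()  # opt-in, e.g. {"healing"} via --watched-phases

def degradation_reasons(status, min_eligible=3, max_stale_ratio=0.5):
    # A non-empty result triggers the degradation response:
    # 5s polling, TRACE via admin_setLogLevel, snapshot dump.
    reasons = []
    if status.get("eligible_peers", 0) < min_eligible:
        reasons.append("low_eligible_peers")
    if status.get("stale_ratio", 0.0) > max_stale_ratio:
        reasons.append("high_staleness")
    if status.get("phase") in WATCHED_PHASES:
        reasons.append("watched_phase")
    return reasons
```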

Live TUI (tooling/sync/peer_top.py):

  • Top-style live view of sync status + peer table sorted by score
  • Responsive layout — adapts to terminal width, trims client names when narrow
  • Score delta arrows (green ↑ / red ↓) to spot scoring changes at a glance
  • Polls `admin_syncStatus` and `admin_peerScores`
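The score-delta arrows reduce to a small comparison between polls. This is an illustrative sketch, not the actual peer_top.py code:

```python
GREEN, RED, RESET = "\x1b[32m", "\x1b[31m", "\x1b[0m"

def score_delta_arrow(prev, cur):
    # green up / red down arrow when a peer's score changed since last poll
    if prev is None or cur == prev:
        return " "
    return f"{GREEN}↑{RESET}" if cur > prev else f"{RED}↓{RESET}"
```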

REPL (tooling/repl/src/formatter.rs + admin commands):

  • Table rendering for arrays of objects (needed to read the peer table returned by admin_peerScores)
  • Admin commands wired to the new endpoints
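Rendering an array of objects as an aligned table, in miniature. This is a Python stand-in for the Rust formatter, illustrative only:

```python
def render_table(rows):
    # rows: list of dicts with a shared key set (like the peer table
    # returned by admin_peerScores); emits a header line then one
    # aligned row per object.
    if not rows:
        return ""
    headers = list(rows[0])
    widths = {h: max(len(h), *(len(str(r.get(h, ""))) for r in rows)) for h in headers}
    fmt = "  ".join(f"{{:<{widths[h]}}}" for h in headers)
    out = [fmt.format(*headers)]
    out += [fmt.format(*(str(r.get(h, "")) for h in headers)) for r in rows]
    return "\n".join(out)
```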

Note: This PR incorporates the snap sync progress metrics and Grafana dashboard from #6468 (Tomi/Esteve), with adaptations: metrics use the default Prometheus registry, throughput rates are computed in Grafana via rate() instead of in-app, and a new peer-health row was added. I closed #6468, as it was superseded.

Checklist

  • Tested with make multisync-loop-auto on ethrex-mainnet-9 (mainnet, sepolia, hoodi in parallel)
  • AI agent review feedback addressed (see comment)
  • Grafana dashboard tested with live data from 3 networks

…or diagnostics

Add multi-layered observability for snap sync to diagnose intermittent
pivot update failures on mainnet:

- Prometheus metrics: eligible peers, inflight requests, pivot age,
  phase tracking, pivot update/storage request/header resolution
  outcome counters
- Admin RPC: admin_peerScores (full peer table with scores, eligibility)
  and admin_syncStatus (phase, pivot, staleness, recent events)
- Log instrumentation: TRACE per-peer dialogue, DEBUG for staleness,
  pivot updates with peer pool state, error classification
- Docker monitor: polls new RPC endpoints, keeps rolling snapshot
  buffer, dumps on degradation detection with adaptive frequency
- REPL: admin.peerScores and admin.syncStatus commands

…n issues detected

The docker monitor bumps log level to TRACE via admin_setLogLevel RPC
when degradation is detected, capturing detailed per-peer dialogue in
container logs. Restores to DEBUG on recovery.

Also sets default RUST_LOG in docker-compose to enable DEBUG-level sync
events (staleness, pivot updates, error classification).

…shots

Read pivot age from current time and progress counters from METRICS
atomics on each admin_syncStatus call, so the RPC always returns
fresh values instead of stale data from the last phase transition.

When a JSON response contains an array of objects (like admin_peerScores
peers list), render as an aligned table with headers instead of collapsing
to a single line. Makes diagnostic endpoints readable in the REPL.

…ect healing phase

- on_failure: do a final RPC poll and bump log level before dumping
- _dump_snapshots: support force=True to re-dump even if already dumped
- Detect healing phase as degradation trigger to get 5s polling during
  the high-risk window where pivot updates happen
- Always capture peer state at time of failure for post-mortem analysis

The method now detects both actual degradation (low eligible peers,
staleness) and high-risk phases (healing), so the name should reflect
the broader scope.
…radation

The monitor was trying to dump peer snapshots to the run directory during
degradation events, but the directory didn't exist yet (created at run end
by save_all_logs). Now set_run_id() creates the directory immediately.

Also adds peer_top.py live viewer script.

@ElFantasma ElFantasma requested a review from a team as a code owner April 13, 2026 21:13
@github-actions github-actions bot added the L1 Ethereum client label Apr 13, 2026
@github-actions

🤖 Kimi Code Review

The PR adds comprehensive sync diagnostics and metrics. Overall structure is good, but there are several issues to address:

Critical Issues

  1. Inefficient metrics gathering (crates/blockchain/metrics/sync.rs:140-158)
    Creating a new Registry and re-registering all metrics on every gather_metrics() call is expensive and unnecessary. The Prometheus registry should be static or cached.

    // Current: Creates new registry every time
    pub fn gather_metrics(&self) -> Result<String, MetricsError> {
        let r = Registry::new();  // Don't do this
        // ... register all metrics ...
    }
  2. Missing import in Python (tooling/sync/docker_monitor.py)
    Optional is used in type hints but not imported from typing. Add:

    from typing import Optional

Security/Correctness

  1. Potential lock poisoning (crates/networking/p2p/sync/snap_sync.rs:337)
    While tokio::sync::RwLock doesn't poison on panic, holding the write lock while doing string allocations could briefly block readers. Consider using explicit scopes consistently (as done at line 345) to minimize lock duration.

  2. Integer casting (crates/networking/p2p/network.rs:730-732)
    Casting usize to i64 for metrics could overflow on 64-bit platforms with extreme peer counts (unlikely but worth noting). Use try_from or saturating cast.

Code Quality

  1. Feature flag consistency (crates/blockchain/metrics/mod.rs:17)
    The sync module is guarded by any(feature = "api", feature = "metrics") but all usage sites use only feature = "metrics". Consider aligning these or document why the API feature alone would need the sync metrics module.

  2. Error handling in UTF-8 conversion (crates/blockchain/metrics/sync.rs:158)
    While Prometheus text encoder should produce valid UTF-8, the ? operator converts FromUtf8Error to MetricsError. Ensure MetricsError has a variant for this (appears to via PrometheusErr).

Minor Issues

  1. Python type hints (tooling/sync/docker_monitor.py:61+)
    Uses dict[str, ...] and list[dict] which require Python 3.9+. If supporting older versions, use Dict/List from typing.

  2. Debug log formatting (crates/networking/p2p/sync/snap_sync.rs:843-849)
    The debug! macro call spans multiple lines with complex formatting. Consider extracting the pivot_age calculation to a variable before the log call for readability.

Positive Observations

  • Good use of Arc<RwLock<SyncDiagnostics>> for shared mutable state
  • Proper feature-gating of metrics code with #[cfg(feature = "metrics")]
  • saturating_sub used correctly for timestamp arithmetic to prevent underflow
  • PeerDiagnostics provides useful visibility without exposing sensitive peer data
  • The peer_top.py tool is well-structured with proper terminal handling

Recommendations

  1. Cache the Prometheus Registry in MetricsSync instead of creating it in gather_metrics
  2. Add from typing import Optional to docker_monitor.py if not present
  3. Consider using parking_lot::RwLock instead of tokio::sync::RwLock for SyncDiagnostics if the lock is held across very short critical sections only (not async), though the current choice is fine for async contexts

The RPC endpoints (admin_peerScores, admin_syncStatus) are appropriately placed in the admin namespace and don't expose sensitive information.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

github-actions bot commented Apr 13, 2026

Lines of code report

Total lines added: 604
Total lines removed: 0
Total lines changed: 604

Detailed view
+------------------------------------------------+-------+------+
| File                                           | Lines | Diff |
+------------------------------------------------+-------+------+
| ethrex/crates/blockchain/metrics/mod.rs        | 54    | +2   |
+------------------------------------------------+-------+------+
| ethrex/crates/blockchain/metrics/sync.rs       | 174   | +174 |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/metrics.rs        | 618   | +2   |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/network.rs        | 747   | +92  |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/peer_handler.rs   | 573   | +22  |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/peer_table.rs     | 1035  | +49  |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/snap/client.rs    | 1181  | +13  |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync.rs           | 309   | +65  |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/snap_sync.rs | 1126  | +107 |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync_manager.rs   | 190   | +45  |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/rpc/admin/peers.rs    | 182   | +31  |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/rpc/rpc.rs            | 1013  | +2   |
+------------------------------------------------+-------+------+

@github-actions

🤖 Codex Code Review

Findings

  1. admin_syncStatus can get stuck reporting an active phase after a recoverable sync failure. The success path resets current_phase to "idle", but the error path only records recent_errors and leaves the previous phase intact. That means a node that has already stopped syncing can still report "healing"/"bytecodes" indefinitely, which is misleading for operators and interacts badly with the new monitor. Reset the phase on all exits, or introduce an explicit retry/error state. (sync.rs)

  2. DiagnosticsTracker.on_failure() never performs the “final poll” it claims to do. It sets last_poll[name] = 0 and calls self.poll(inst), but poll() immediately returns for inst.status in ("success", "failed", "waiting"). In practice, failure snapshots will be dumped without the last peer/sync RPC state, which is the most useful moment to capture. (docker_monitor.py)

  3. tooling/sync/peer_top.sh is broken as committed. The embedded Python uses invalid f-strings with nested double quotes, for example f"Peers: {s["total_peers"]} ...", so it raises SyntaxError before producing any output. I verified the embedded snippet does not compile. (peer_top.sh)

No direct EVM opcode, gas-accounting, trie, RLP, or consensus-rule changes stood out here; the risk is mainly in sync observability and the new tooling paths.

I could not run cargo check in this sandbox because rustup tried to write under a read-only path. I did validate the touched Python files with py_compile.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@greptile-apps

greptile-apps bot commented Apr 13, 2026

Greptile Summary

This PR adds end-to-end snap-sync observability across four layers: a new MetricsSync Prometheus metrics module, two new admin RPC endpoints (admin_peerScores, admin_syncStatus), a DiagnosticsTracker-enhanced Docker monitor, and a live TUI (peer_top.py). The changes are purely additive; core sync logic is unmodified, and all metrics recording is gated behind #[cfg(feature = "metrics")].

Confidence Score: 5/5

Safe to merge — all changes are purely additive observability; core sync logic is untouched.

All three findings are P2 (style/quality suggestions): one performance nit in non-critical debug logging, one cosmetic key-name duplication in a diagnostic response, and one defensive-coding improvement in a monitoring tool. None affect correctness of sync, storage, or the RPC API. The metrics are feature-gated, the shared-Arc pattern is correctly implemented, and the new endpoints gracefully handle absent peer_handler/syncer.

crates/networking/p2p/sync/snap_sync.rs (update_pivot debug overhead), crates/networking/p2p/sync_manager.rs (duplicate phase_progress key)

Important Files Changed

Filename Overview
crates/networking/p2p/sync/snap_sync.rs Phase tracking and pivot diagnostics wired in; read_peer_diagnostics() called on every update_pivot loop iteration for debug logging, adding actor roundtrip overhead in the hot path.
crates/networking/p2p/sync_manager.rs get_sync_diagnostics() correctly shares Arc with Syncer; inserts "accounts_downloaded" key while snap_sync.rs separately inserts "account_ranges_downloaded" from the same METRICS atomic — both will appear in admin_syncStatus.phase_progress.
crates/blockchain/metrics/sync.rs New MetricsSync struct; correct per-call registry pattern, clean gauge/counter API.
crates/networking/rpc/admin/peers.rs peer_scores and sync_status handlers correctly guard against missing peer_handler/syncer; average_score uses integer division (expected for diagnostics).
tooling/sync/peer_top.py TUI correctly polls both admin endpoints; truthy-but-malformed response (missing "summary"/"peers" keys) would raise uncaught KeyError and silently exit the tool.
tooling/sync/docker_monitor.py DiagnosticsTracker correctly uses rolling buffers, degradation detection, and log level escalation; run_id guard and force-dump on failure are handled properly.
crates/networking/p2p/sync.rs SyncDiagnostics struct and bounded deques for pivot/error history are clean; Arc shared correctly between Syncer and SyncManager.
crates/networking/rpc/rpc.rs Two new admin routes correctly registered; no other changes.
tooling/repl/src/formatter.rs Array-of-objects table rendering is additive and well-scoped; no logic issues.
tooling/repl/src/commands/admin.rs New peerScores and syncStatus REPL commands wired correctly to the new RPC methods.

Sequence Diagram

sequenceDiagram
    participant Client as RPC Client
    participant RPC as admin RPC (rpc.rs)
    participant SM as SyncManager
    participant Diag as Arc<RwLock<SyncDiagnostics>>
    participant Snap as snap_sync (Syncer)
    participant PT as PeerTable (actor)

    Note over Snap,Diag: During sync cycle
    Snap->>Diag: write() — current_phase, pivot_block, phase_progress
    Snap->>Diag: push_pivot_change() / push_error()

    Note over Client,PT: admin_syncStatus RPC
    Client->>RPC: admin_syncStatus
    RPC->>SM: get_sync_diagnostics()
    SM->>Diag: read().clone()
    SM->>SM: override pivot_age_seconds (live)
    SM->>SM: merge phase_progress from METRICS atomics
    SM-->>RPC: SyncDiagnostics
    RPC-->>Client: JSON response

    Note over Client,PT: admin_peerScores RPC
    Client->>RPC: admin_peerScores
    RPC->>PT: get_peer_diagnostics()
    PT-->>RPC: Vec<PeerDiagnostics>
    RPC->>RPC: compute summary (total, eligible, avg_score)
    RPC-->>Client: JSON {peers, summary}

Reviews (1): Last reviewed commit: "fix(l1): create run directory early so s..."

Comment on lines +779 to +788
let diag = peers.read_peer_diagnostics().await;
let eligible_count = diag.iter().filter(|p| p.eligible).count();
let total_count = diag.len();
debug!(
eligible_peers = eligible_count,
total_peers = total_count,
selected_peer = %peer_id,
peer_score = peer_score,
consecutive_failures = consecutive_failures,
total_failures = total_failures,

P2 Full peer diagnostics fetch inside hot retry loop

read_peer_diagnostics() performs a full actor roundtrip plus an O(N_peers) pass over the peer table on every iteration of the update_pivot loop (up to MAX_TOTAL_FAILURES = 15 times, and called again on each healing-phase pivot refresh). The data is used only for a debug! log, so it adds unnecessary actor-call latency on every attempt. Consider moving this fetch outside the loop (capture once before entering) or gating it with a tracing::enabled!(Level::DEBUG) check.
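The suggested level gate has a direct analogue in Python's logging, shown here as an illustrative sketch (fetch_diag is a stand-in for the actor call, not a real API):

```python
import logging

log = logging.getLogger("sync")

def log_pivot_attempt(fetch_diag):
    # Pay for the expensive peer-table fetch only when DEBUG is enabled —
    # the Python analogue of gating on tracing::enabled!(Level::DEBUG).
    if log.isEnabledFor(logging.DEBUG):
        diag = fetch_diag()
        log.debug("eligible=%d total=%d",
                  sum(1 for p in diag if p["eligible"]), len(diag))
```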


Comment on lines +154 to +156
if accounts_downloaded > 0 {
diag.phase_progress
.insert("accounts_downloaded".into(), accounts_downloaded);

P2 Duplicate phase-progress key alongside "account_ranges_downloaded"

snap_sync.rs (line 356) writes "account_ranges_downloaded" to phase_progress from the same METRICS.downloaded_account_tries atomic. get_sync_diagnostics() then additionally inserts "accounts_downloaded" from the same source. After admin_syncStatus is called, consumers will see both keys populated with the same value, which is confusing and wastes response payload. Align on one canonical key (e.g. "accounts_downloaded") and remove the other write.


Comment thread tooling/sync/peer_top.py
Comment on lines +140 to +141
s = data["summary"]
peers = data["peers"]

P2 Unchecked dict access will crash the TUI on unexpected RPC response

data being truthy only means the result field was present and non-null; it doesn't guarantee the "summary" or "peers" keys exist (e.g. if the node returns an error object, a future schema change, or a partial response). data["summary"] will raise KeyError, which propagates to the outer except Exception: cleanup() and exits the tool silently.

Suggested change:

    - s = data["summary"]
    - peers = data["peers"]
    + s = data.get("summary", {})
    + peers = data.get("peers", [])
    + if not s or peers is None:
    +     lines.append(f"{RED}Unexpected response from node{RESET}")
    +     return lines

@github-actions

🤖 Claude Code Review

Here is my review of PR #6470:


Review: feat(l1): add snap sync observability endpoints and tooling

The motivation is solid — intermittent snap sync failures on mainnet are genuinely hard to diagnose post-hoc, and the combination of admin RPCs, auto-escalating log verbosity, a TUI, and Prometheus metrics gives operators a useful toolkit. The overall architecture is coherent. There are a few correctness and design issues worth addressing.


High severity

1. peer_top.sh: Python 3.12+ only syntax

The embedded Python in peer_top.sh uses f-string nested dict-key quotes:

print(f"Peers: {s["total_peers"]}  Eligible: {s["eligible_peers"]} ...")
print(f"{"Peer ID":>14} {"Score":>6} ...")
for p in ...:
    print(f"{p["score"]:>6} {p["inflight_requests"]:>5} ...")

Nested quotes inside f-string expressions (f"{s["key"]}") are only valid in Python 3.12+. On Python 3.10/3.11 (still in very widespread use, including most LTS distros), this raises SyntaxError before the script even runs. Fix by using the alternate quote style inside the f-string: f"{s['total_peers']}", or assign column header strings to variables before formatting.
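For reference, the pre-3.12-safe form with alternate quotes inside the expression (sample values, not the real RPC response):

```python
# single quotes inside a double-quoted f-string: valid on Python 3.8+
s = {"total_peers": 12, "eligible_peers": 7}
line = f"Peers: {s['total_peers']}  Eligible: {s['eligible_peers']}"
```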


2. Healing phase unconditionally triggers degradation (docker_monitor.py)

if phase == "healing":
    reasons.append("healing_phase")

This fires on every single poll during healing, which is a normal sync phase that can last hours on mainnet. Consequences:

  • The node stays in TRACE logging mode for the entire healing phase — enormous log volume
  • Poll frequency drops to 5s (6× overhead) for hours
  • Recovery is permanently blocked while in healing, since healthy_since only advances when reasons is empty

At minimum this should be combined with another condition (e.g., eligible_peers < threshold), or separated into a "increase verbosity" path that doesn't activate the full degradation response.


3. Progress stall detection is dead code (docker_monitor.py)

DEGRADATION_STALL_TIMEOUT = 60 is defined and never referenced. The stall check inside _check_alert_conditions:

if self._last_progress[name] is not None and self._last_progress[name] == progress_key:
    pass  # tracked by the outer loop via last_block

detects a stall but immediately passes without adding to reasons. The comment implies this is intentionally deferred to some other code, but no degradation is triggered by stalls anywhere. If this is an incomplete feature, remove the dead constant and the unreachable detection block to avoid confusion.


Medium severity

4. _dump_snapshots silently skips subsequent degradation events

dumped_for_run[name] is set to True after the first dump and never reset mid-run. Combined with the healing-phase issue above, the most diagnostic data (from late in the run when failures are actually occurring) is never captured. Subsequent calls to _dump_snapshots return immediately:

if not force and self.dumped_for_run.get(name):
    return

At minimum, log a message when a dump is skipped (print(f"[{name}] Skipping snapshot dump (already dumped for this run)")). Consider whether the intent is "one dump per run" (current behavior) or "dump on each distinct degradation event."


5. Fragile table section detection in format_object_box (formatter.rs)

if value.starts_with('\n') && value.contains("items)") {
    table_sections.push((key, value));
}

This couples format_object_box's parsing to the specific footer string of format_object_array_table. If the footer format changes, or if any scalar string value happens to match (e.g., a multi-line peer status message containing "items)"), the layout silently misbehaves. A typed discriminant would be more robust:

enum RowValue { Scalar(String), Table(String) }

This would let flatten_object return typed variants instead of relying on string inspection.


Low severity

6. gather_metrics() rebuilds the registry on every scrape

Each call to gather_metrics() constructs a fresh Registry and re-registers cloned metrics. The prometheus Clone for IntGauge/IntCounterVec shares the underlying Arc<AtomicI64>, so this actually does see live values — but the re-registration work is done on every scrape. This is consistent with how other metrics modules in this codebase work, so it's acceptable, but a brief comment explaining the "register fresh on each gather" pattern would help future contributors who expect the standard "register once at startup" approach.

7. Integer truncation for avg_score (admin/peers.rs)

let avg_score = if total > 0 {
    diagnostics.iter().map(|p| p.score).sum::<i64>() / total as i64
} else { 0 };

Integer division truncates rather than rounds. Fine for an observability endpoint, but worth noting that with scores clustered near a threshold this can mislead (e.g., 5 peers with scores [4,4,4,4,5] reports avg 4 rather than 4.2).
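The truncation is easy to demonstrate with the review's own example:

```python
scores = [4, 4, 4, 4, 5]
truncated = sum(scores) // len(scores)  # integer division, as in the endpoint
precise = sum(scores) / len(scores)     # float average
```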

8. Timestamp 0 fallback in record_success (peer_table.rs)

.duration_since(std::time::UNIX_EPOCH)
.unwrap_or_default()
.as_secs()

If the system clock is behind the epoch (misconfigured clock), unwrap_or_default() silently records last_response_time = Some(0). Downstream consumers (e.g., peer_top.py displaying this value) would show "1970-01-01" rather than "never responded." A sentinel like u64::MAX or keeping None on error would be less confusing.

9. PeerDiagnostics.ip exposure

The ip field is serialized into admin_peerScores responses. Admin endpoints are presumably firewalled, but a comment in the struct acknowledging the sensitivity would be appropriate — peer IP addresses can be used for targeted denial-of-service against specific network participants.

10. Arc<RwLock<SyncDiagnostics>> threading depth

The diagnostics arc is now passed through: sync_cycle_snap → snap_sync → request_account_range → update_pivot. This is four levels of parameter threading for what is essentially a write-only observability sink. Given that METRICS_SYNC is already a LazyLock global (accessed directly without threading), a OnceLock<Arc<RwLock<SyncDiagnostics>>> global initialized at Syncer::new() would be consistent with the existing pattern and would avoid polluting all these function signatures.


Nits

  • PivotChangeEvent.old_pivot_number is u64 (non-optional), so the first pivot change records old_pivot_number: 0. A Option<u64> would be more semantically accurate for the "no prior pivot" case.
  • SyncDiagnostics.phase_progress: HashMap<String, u64> — JSON object key order is non-deterministic. Fine for human-readable output, but worth noting if tests ever assert on the serialized form.
  • The peer_top.py TUI is well-executed — color-coded staleness, score delta arrows, and capability grouping are all useful at a glance.

Automated review by Claude (Anthropic) · sonnet · custom prompt

…ixes

- fix clippy redundant closure in formatter.rs
- fix unnecessary u64 casts in sync_manager.rs
- migrate MetricsSync to default prometheus registry (Pattern B) so new
  sync metrics register once at init instead of per-gather. Removes
  gather_metrics() — exported via gather_default_metrics() automatically
- fix admin_syncStatus reporting stale phase after recoverable error
- fix on_failure() poll bypass — add force param so final poll executes
  even when instance status is already "failed"
- fix peer_top.sh Python 3.12+ f-string syntax — use str.format()
- remove dead DEGRADATION_STALL_TIMEOUT constant
@ElFantasma
Contributor Author

ElFantasma commented Apr 14, 2026

Review feedback addressed

Thanks for the thorough reviews. Here's how each finding was handled.

Fixed

  • Clippy: redundant closure in formatter.rs:187 (raised by CI lint). Fix: `.map(|v| inline_value(v))` → `.map(inline_value)`
  • Clippy: unnecessary u64 casts in sync_manager.rs (raised by CI lint). Fix: removed 3 redundant `as u64` casts
  • peer_top.sh broken on Python < 3.12, nested f-string quotes (raised by Claude, Codex). Fix: removed peer_top.sh entirely; peer_top.py is the canonical tool
  • MetricsSync creates a new Registry on every gather_metrics() call (raised by Kimi). Fix: migrated to the default Prometheus registry — metrics register once at init via register_int_gauge! / register_int_counter_vec!, exported via gather_default_metrics(). Removed gather_metrics() entirely. The same per-call pattern exists in 6 other modules — migrating those is out of scope but this sets the example.
  • admin_syncStatus reports a stale phase after a recoverable sync error (raised by Codex). Fix: reset current_phase = "idle" on the error path in sync.rs
  • on_failure() never executes the "final poll" (raised by Codex). Fix: added a force: bool param to poll(); on_failure() now calls poll(inst, force=True) to bypass the status check
  • Dead DEGRADATION_STALL_TIMEOUT constant (raised by Claude). Fix: removed the constant and the stale comment reference
  • Healing phase unconditionally triggers degradation, TRACE + fast polling for hours (raised by Claude). Fix: made configurable via a WATCHED_PHASES set (empty by default — opt-in, not opt-out). Override via the --watched-phases "healing" CLI flag or the MULTISYNC_WATCHED_PHASES=healing Makefile variable. No TRACE logging unless explicitly requested.
  • Progress stall detection is dead code (raised by Claude). Fix: removed the dead constant and updated the comment to `# stall detection not yet implemented`

Not addressing (with reasoning)

  1. read_peer_diagnostics() in update_pivot loop adds actor roundtrip overhead (Greptile) — This is in a debug logging path that only fires during pivot updates (rare, seconds-scale operations). The overhead of one actor call is negligible relative to the 15s peer timeout per attempt.

  2. Duplicate phase_progress keys (accounts_downloaded vs account_ranges_downloaded) (Greptile) — These are intentionally distinct: account_ranges_downloaded counts accounts received from peers (network phase), while accounts_downloaded counts accounts committed to temp storage (insertion phase). They track different stages of the pipeline.

  3. peer_top.py uncaught KeyError on malformed response (Greptile) — This is a dev/debugging tool, not a production service. Crashing on malformed data is the correct behavior — it surfaces issues rather than hiding them.

  4. Lock poisoning / integer casting / feature flag alignment (Kimi) — tokio::sync::RwLock doesn't poison. usize → i64 for peer counts won't overflow. Feature flag follows the existing pattern in mod.rs.

  5. Python type hints use dict[str, ...] / list[dict] (Python 3.9+) (Kimi) — Our Docker images and development environments use Python 3.10+. Python 3.8 is EOL. Acceptable minimum.

ElFantasma and others added 7 commits April 14, 2026 08:45
Rename hardcoded "healing" trigger to configurable WATCHED_PHASES set.
Phases in this set get TRACE logging and fast polling — useful for
investigating specific sync stages without editing the script.

Default: healing (current investigation target).
Override via --watched-phases flag, e.g. --watched-phases "healing,storage_insertion"
Incorporate progress metrics from PR #6468 (Tomi/Esteve) into the
observability PR, with improvements:

- Add progress gauges: headers, accounts, storage, healing, bytecodes
  (downloaded/inserted/total) + stage + pivot_block
- Push from METRICS atomics via push_sync_prometheus_metrics() in
  network.rs, called each polling cycle and on phase completion
- Grafana dashboard with 7 rows: overview, peer health, headers,
  accounts, storage, healing, bytecodes — with progress gauges, rate
  panels (using Grafana rate() instead of app-computed rates), and ETA
- All metrics use default Prometheus registry (register at init)
- New peer-health row with eligible peers, pivot age, inflight requests,
  and pivot update outcomes — not present in the original PR

Supersedes #6468.
ElFantasma and others added 11 commits April 14, 2026 16:52
- Add ethrex_sync_phase_start_timestamp{phase} labeled gauge — set on
  each phase transition, persists for completed phases
- Grafana computes elapsed as time() - timestamp, per phase
- Pivot Age now uses time() - ethrex_sync_pivot_timestamp (live)
- pivot_age_seconds also updated each push cycle for RPC/peer_top
- Added ETA panels for headers, accounts, bytecodes (remaining/rate)
- Added elapsed panels for all phases in their respective rows
- Overview row: replaced Throughput (N/A during sync) with Phase Elapsed
  showing all active/completed phase timings
ElFantasma added a commit that referenced this pull request Apr 15, 2026
- Add §1.18 observability tooling (PR #6470)
- Add §1.19 pivot update reliability (PR #6475, issue #6474)
- Add §1.20 big-account within-trie parallelization (issue #6477)
- Add §1.21 small-account batching (issue #6476)
- Add §1.22 decoded TrieLayerCache (PR #6348)
- Add §1.23 bloom filter for non-existent storage (PR #6288)
- Add §1.24 adaptive request sizing + bisection (PR #6181)
- Add §1.25 concurrent bytecode + storage (PR #6205)
- Add §1.26 phase completion markers (PR #6189)
- Add §2.18 StorageTrieTracker refactor (PR #6171)
- Update current-state bottleneck table with small-account and pivot-update findings
- Reprioritize timeline: pivot-update crash fix is now priority 0
- Add two risks (pivot crash masks perf work, DB corruption on every crash)
- Bump doc version to 1.3
@ElFantasma ElFantasma mentioned this pull request Apr 15, 2026
1 task
.duration_since(std::time::UNIX_EPOCH)
.unwrap_or_default()
.as_secs(),
error_type: format!("{:?}", std::mem::discriminant(&error)),
Contributor
Not sure if something like 'Discriminant(7)' is that useful
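A variant-name label would read better in the diagnostics payload than the opaque Debug output of a discriminant — a minimal sketch, with a hypothetical error enum standing in for the real one:

```rust
// Hypothetical error enum; the real type lives in the sync code.
enum SyncError {
    Timeout,
    PeerDisconnected,
}

impl SyncError {
    // Stable, human-readable label for metrics/diagnostics payloads.
    fn error_type(&self) -> &'static str {
        match self {
            SyncError::Timeout => "Timeout",
            SyncError::PeerDisconnected => "PeerDisconnected",
        }
    }
}

fn main() {
    // "Timeout" is more useful in admin output than "Discriminant(0)".
    println!("{}", SyncError::Timeout.error_type());
}
```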

@avilagaston9 (Contributor) left a comment
Solid observability PR — well-structured across all 5 layers. Left a few inline comments on things I noticed.

.zip(&col_widths)
.map(|(val, w)| {
let truncated = if val.len() > *w {
format!("{}…", &val[..*w - 1])
Contributor
val[..*w - 1] slices by byte index, not character index. If val contains multi-byte UTF-8 (peer client versions sometimes include emoji or non-ASCII), this panics with byte index is not a char boundary.

Suggestion:

let truncated = if val.len() > *w {
    let end = val.char_indices()
        .take(*w - 1)
        .last()
        .map(|(i, c)| i + c.len_utf8())
        .unwrap_or(0);
    format!("{}…", &val[..end])

Or simpler: val.chars().take(w - 1).collect::<String>() (allocates, but this isn't a hot path).

Err(error) => {
match error.is_recoverable() {
let recoverable = error.is_recoverable();
self.diagnostics.write().await.current_phase = "idle".to_string();
Contributor
Two separate write-lock acquisitions here and on the .push_error() call below. If diagnostics are read between the two locks (e.g. by the RPC handler), the phase shows idle but the error event is missing. Consider combining into a single lock:

{
    let mut diag = self.diagnostics.write().await;
    diag.current_phase = "idle".to_string();
    diag.push_error(SyncErrorEvent { ... });
}

The debug! macro between them doesn't need the lock.


Labels: L1 Ethereum client

3 participants