
feat(l1): fullsync metrics #6466

Open
Arkenan wants to merge 7 commits into main from fullsync-metrics

Conversation


@Arkenan Arkenan commented Apr 11, 2026

Add metrics for full sync so that they can be seen in a Grafana dashboard.

@github-actions github-actions bot added the L1 Ethereum client label Apr 11, 2026
@github-actions

🤖 Kimi Code Review

This PR adds comprehensive Prometheus metrics for full sync operations. The implementation is generally well-structured and follows Rust best practices. Here are my observations:

Critical Issues

None found. This is a metrics-only change that doesn't modify consensus-critical logic.

Potential Issues

1. Integer Overflow Risk (Low severity, technical debt)

File: crates/blockchain/metrics/fullsync.rs
Lines: 229, 232, 238, 241

The cast_signed() method converts u64 block numbers to i64 for Prometheus IntGauge. While current Ethereum block numbers (~20M) are far below i64::MAX (~9e18), this is a latent overflow risk.

// Line 229
self.target_block.set(block.cast_signed());

Suggestion: Use saturating conversion or explicit bounds checking:

self.target_block.set(block.try_into().unwrap_or(i64::MAX));
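For context, the difference between the two conversions can be seen with a small std-only sketch (illustrative values, not from the PR): a plain bit-reinterpreting cast wraps huge values negative, while `try_into` detects the overflow so a saturating fallback stays meaningful.

```rust
fn main() {
    let big: u64 = u64::MAX;

    // A bit-reinterpreting cast (what `cast_signed` does) wraps negative.
    assert_eq!(big as i64, -1);

    // `try_into` detects the overflow, so the fallback saturates instead.
    let saturated: i64 = big.try_into().unwrap_or(i64::MAX);
    assert_eq!(saturated, i64::MAX);

    // Realistic block numbers (~20M) convert losslessly either way.
    let block: u64 = 20_000_000;
    assert_eq!(block as i64, 20_000_000_i64);

    println!("ok");
}
```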

2. Missing Block Hash Debug Info

File: crates/networking/p2p/sync/full.rs
Lines: 185-191

The new timing log replaces the previous log that included first/last block hashes. While cleaner, this removes valuable debugging context when execution fails.

Suggestion: Keep block hashes in the log or log them separately at debug level:

debug!(
    "Batch range: {:#?} to {:#?}",
    blocks.first().map(|b| b.hash()),
    blocks.last().map(|b| b.hash())
);

3. Magic Numbers for Sync Stages

File: crates/blockchain/metrics/fullsync.rs
Lines: 15-17, and usage in full.rs

Stage values (0=idle, 1=headers, etc.) are documented but not enforced via types.

Suggestion: Define an enum for type safety (conversion to i64 is trivial):

#[repr(i64)]
enum SyncStage { Idle = 0, DownloadingHeaders = 1, ... }
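A complete version of that sketch could look like the following; the variant names are inferred from the stage numbering discussed in this thread (0=idle, 1=headers, 2=bodies, 3=executing) and are illustrative, not taken from the PR:

```rust
// Illustrative only: variant names inferred from the 0..3 stage numbering
// documented in the metric help string, not from the actual PR code.
#[repr(i64)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum SyncStage {
    Idle = 0,
    DownloadingHeaders = 1,
    DownloadingBodies = 2,
    Executing = 3,
}

impl SyncStage {
    fn as_i64(self) -> i64 {
        self as i64
    }
}

fn main() {
    // A call site becomes self-documenting:
    // METRICS_FULLSYNC.set_stage(SyncStage::DownloadingHeaders.as_i64());
    assert_eq!(SyncStage::DownloadingHeaders.as_i64(), 1);
    assert_eq!(SyncStage::Idle.as_i64(), 0);
    assert_eq!(SyncStage::Executing.as_i64(), 3);
    println!("ok");
}
```

The `#[repr(i64)]` guarantees the discriminants match the documented gauge values, so the conversion to the Prometheus `IntGauge` stays trivial.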

Code Quality Suggestions

4. Macro Hygiene

File: crates/networking/p2p/sync/full.rs
Lines: 14-21

The fullsync_metrics! macro is clean, but consider moving it to ethrex-metrics crate if other modules will need similar conditional instrumentation.

5. Registry Allocation

File: crates/blockchain/metrics/fullsync.rs
Lines: 302-327

gather_metrics() creates a new Registry and clones all metrics on every scrape. While acceptable for low-frequency scrapes, this allocates unnecessarily.

Suggestion: Consider registering metrics once at startup and storing the registry in the struct, or document that this is intentionally ephemeral to avoid global registry state.

6. Timestamp Precision

File: crates/blockchain/metrics/fullsync.rs
Lines: 260-270

Using unwrap_or_default() on SystemTime::duration_since will return 0 for times before Unix epoch. Document this behavior or use unwrap_or_else with a more appropriate fallback.
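One possible shape for an explicit fallback, sketched with std only (the -1 sentinel is an illustrative choice, not what the PR does):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

fn main() {
    // unwrap_or_default() would silently report 0 for a pre-epoch clock;
    // an explicit sentinel (-1 here, an illustrative choice) makes the
    // "clock is broken" case visible on a dashboard instead of looking
    // like a legitimate 1970 timestamp.
    let unix_ms: i64 = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis() as i64)
        .unwrap_or(-1);

    assert!(unix_ms > 0, "system clock is before the Unix epoch");
    println!("timestamp_ms = {unix_ms}");
}
```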

Positive Observations

  1. Proper feature gating: All metrics code is correctly guarded with #[cfg(feature = "metrics")]
  2. Division by zero protection: Good use of .max(0.001) in rate calculations (line 296 in full.rs)
  3. Consistent naming: Metrics follow Prometheus conventions with fullsync_ prefix and _ms suffix for milliseconds
  4. Modern Rust: Uses LazyLock (Rust 1.80+) for static initialization instead of lazy_static
  5. Error handling: The gather_metrics method properly propagates errors rather than panicking

Summary

The PR is ready to merge after addressing the overflow risk (Item 1). The metrics provide valuable observability into sync performance bottlenecks (execution vs. merkleization vs. storage) which will help operators diagnose node performance issues.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt


github-actions bot commented Apr 11, 2026

Lines of code report

Total lines added: 363
Total lines removed: 0
Total lines changed: 363

Detailed view
+----------------------------------------------+-------+------+
| File                                         | Lines | Diff |
+----------------------------------------------+-------+------+
| ethrex/crates/blockchain/blockchain.rs       | 2509  | +22  |
+----------------------------------------------+-------+------+
| ethrex/crates/blockchain/metrics/api.rs      | 59    | +5   |
+----------------------------------------------+-------+------+
| ethrex/crates/blockchain/metrics/fullsync.rs | 280   | +280 |
+----------------------------------------------+-------+------+
| ethrex/crates/blockchain/metrics/mod.rs      | 54    | +2   |
+----------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/full.rs    | 302   | +54  |
+----------------------------------------------+-------+------+

@github-actions

🤖 Claude Code Review

Now I have enough information for a thorough review. Here it is:


Review: feat(l1): fullsync metrics (#6466)

Overall: A well-structured metrics addition that follows existing patterns in the codebase. There are two correctness issues worth addressing before merge.


Bugs

1. batch_total_ms is measured twice with inconsistent values

crates/networking/p2p/sync/full.rs:298-304

let execution_time: f64 = execution_start.elapsed().as_millis() as f64 / 1000.0;
let blocks_per_second = blocks_len as f64 / execution_time.max(0.001);

fullsync_metrics!(
    ...
    METRICS_FULLSYNC.set_batch_total_ms(execution_start.elapsed().as_millis() as f64);
    ...
);

execution_start.elapsed() is called a second time inside the fullsync_metrics! block. By the time it's called, the time elapsed includes the set_blocks_executed and set_blocks_per_second calls that precede it. The batch_total_ms gauge will always be slightly larger than the execution_time used for blocks_per_second, creating a subtle inconsistency in the dashboard.

Fix: capture elapsed once and reuse it:

let elapsed_ms = execution_start.elapsed().as_millis();
let execution_time: f64 = elapsed_ms as f64 / 1000.0;
let blocks_per_second = blocks_len as f64 / execution_time.max(0.001);

fullsync_metrics!(
    ...
    METRICS_FULLSYNC.set_batch_total_ms(elapsed_ms as f64);
    ...
);

2. transitions_ms is measured and logged but not exposed as a metric

crates/blockchain/blockchain.rs:2277

info!(
    "[FULLSYNC TIMING] Batch {}: exec={}ms, transitions={}ms, merkle={}ms, store={}ms, total={}ms",
    blocks_len, exec_ms, transitions_ms, merkle_ms, store_ms, interval.elapsed().as_millis()
);

The [FULLSYNC TIMING] log line reports four phase durations, but MetricsFullSync only has setters for three (set_batch_execution_ms, set_batch_merkle_ms, set_batch_store_ms). There is no batch_transitions_ms field or setter. A Grafana user comparing the log output against the dashboard will find one phase missing. Either add a batch_transitions_ms gauge and wire it up in the #[cfg(feature = "metrics")] block in blockchain.rs, or drop transitions_ms from the log line.


Concerns

3. batch_total_ms semantics are not additive with the exposed parts

batch_total_ms is set from add_blocks_in_batch and covers the full execution round-trip including store.forkchoice_update. The three "part" timings (execution_ms, merkle_ms, store_ms) are set from blockchain.rs and cover only the inner phases. A Grafana panel summing the parts will always show less than batch_total_ms, and the gap will be unexplained. A comment in the metric description (e.g. "includes forkchoice update overhead not captured in individual phase metrics") would help dashboard authors.

4. Lost block hash at the start of execution

The pre-existing log:

"Executing {} blocks for full sync. First block hash: {:#?} Last block hash: {:#?}"

is replaced by a timing log. Block hashes are still present in the post-execution [SYNCING] Executed & stored… log, but removing the pre-execution hash log makes it harder to identify which batch was being processed if the sync hangs mid-execution. Worth keeping as a debug log or restoring it alongside the timing log.


Minor

5. set_stage uses raw magic numbers

Call sites pass 0/1/2/3 directly:

METRICS_FULLSYNC.set_stage(1);  // what does 1 mean here?

The semantics are documented only in the metric help string. Named constants (STAGE_IDLE, STAGE_HEADERS, STAGE_BODIES, STAGE_EXECUTING) or a small enum would make call sites self-documenting and prevent passing out-of-range values.

6. Mixed fullsync_metrics! and bare #[cfg] at the variable declaration

full.rs:72-73:

#[cfg(feature = "metrics")]
let mut header_batch_start = Instant::now();

This is technically required (variables declared inside the macro block wouldn't be visible later), but it creates a visual inconsistency with the macro used everywhere else. A comment explaining why this particular site must use the attribute directly would help future readers.

7. Gauge for timing instead of Histogram

The batch timing metrics (batch_execution_ms, batch_total_ms, etc.) only expose the most recent batch's value. Over a long sync, Prometheus/Grafana cannot compute percentiles or spot outliers. For a one-shot operation like full sync this is pragmatic, but a Histogram with a few buckets (e.g. 100ms, 500ms, 1s, 5s, 10s) would give much richer observability. Not blocking, but worth considering if sync latency becomes a focus of investigation.
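What a Histogram with those buckets would record can be illustrated with plain Rust, no prometheus crate needed: cumulative counts per upper bound plus a +Inf bucket, which is exactly the `_bucket` series Prometheus exposes (sample timings below are hypothetical):

```rust
// Plain-Rust sketch of what a Prometheus Histogram with the suggested
// buckets would record: cumulative counts per upper bound, plus +Inf.
fn main() {
    let bounds = [100.0, 500.0, 1000.0, 5000.0, 10000.0]; // ms, suggested buckets
    let samples = [80.0, 450.0, 450.0, 4200.0, 12000.0]; // hypothetical batch timings

    // counts[i] = number of samples <= bounds[i]; the last slot is +Inf.
    let mut counts = vec![0u64; bounds.len() + 1];
    for s in samples {
        for (i, b) in bounds.iter().enumerate() {
            if s <= *b {
                counts[i] += 1;
            }
        }
        counts[bounds.len()] += 1; // every sample lands in the +Inf bucket
    }

    // Percentiles become answerable: 4 of the 5 samples finished within 5s.
    assert_eq!(counts, vec![1, 3, 3, 4, 4, 5]);
    println!("{counts:?}");
}
```

With a plain Gauge, only the last of those five samples survives a scrape; the cumulative bucket counts are what let Grafana's `histogram_quantile` spot the 12s outlier.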


What's good

  • The reset_cycle() method correctly resets only gauges and not IntCounters, respecting Prometheus's monotonicity invariant.
  • The fullsync_metrics! macro cleanly eliminates boilerplate #[cfg(feature = "metrics")] at call sites.
  • The .max(0.001) guard on elapsed time when computing rates (fixing a potential division-by-zero) is the correct approach.
  • The gather_metrics() pattern is consistent with the rest of the codebase.
  • The LazyLock global and the Default → new() delegation follow the existing METRICS_BLOCKS, METRICS_P2P pattern precisely.

Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions

🤖 Codex Code Review

Findings

  1. full.rs / fullsync.rs: fullsync_stage is set to 1/2/3, but only reset to 0 on the happy path at the very end. Any early Ok(()) after header retry exhaustion, or any ? during body download / execution / cleanup, leaves the exporter reporting an active sync stage indefinitely. reset_cycle() also does not clear stage, so the stale state persists until the next sync attempt. This is misleading enough to break alerting and dashboards; stage reset should happen on every exit path, ideally via a guard.

  2. full.rs / fullsync.rs: fullsync_body_failures is defined, exported, and never incremented. The two body-fetch failure paths are request_block_bodies(...).await? and .ok_or(SyncError::BodiesNotFound)?, both of which return immediately without touching the counter. That makes the new reliability metric blind to the main body-download failure mode.

  3. full.rs: headers_per_second is measured from a timer that is only reset after successful batches. If a header request fails and the code sleeps 5 seconds before retrying, the next successful batch rate includes that backoff time and reports an artificially low throughput. If this metric is meant to represent peer/header download rate rather than end-to-end sync progress, the timer should also be reset on failed attempts.
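The guard suggested in finding 1 is the classic RAII pattern: a value whose Drop impl resets the gauge runs on every exit path, including early `?` returns. A std-only sketch, using an AtomicI64 as a stand-in for the Prometheus gauge (all names hypothetical):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Stand-in for the fullsync_stage gauge; the guard pattern is the point.
static STAGE: AtomicI64 = AtomicI64::new(0);

// RAII guard: Drop runs on every exit path, including early returns via `?`.
struct StageGuard;
impl Drop for StageGuard {
    fn drop(&mut self) {
        STAGE.store(0, Ordering::Relaxed); // 0 = idle
    }
}

fn sync_cycle(fail_early: bool) -> Result<(), ()> {
    let _guard = StageGuard;
    STAGE.store(1, Ordering::Relaxed); // 1 = downloading headers
    if fail_early {
        return Err(()); // the guard still resets the stage on this path
    }
    Ok(())
}

fn main() {
    let _ = sync_cycle(true);
    // Even after the early error return, no stale "active" stage remains.
    assert_eq!(STAGE.load(Ordering::Relaxed), 0);
    println!("ok");
}
```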

No consensus, EVM, trie, or validation logic changed in this PR, so I did not find blockchain-correctness or security issues beyond these metrics-path regressions.

I could not run cargo check in this sandbox because dependency resolution is blocked by network/toolchain restrictions.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

Arkenan and others added 5 commits April 13, 2026 19:16
…6486)

## Summary
- The Hive consume-engine Amsterdam tests for EIP-7778 and EIP-8037 were
failing because ethrex's per-tx gas limit checks were incompatible with
Amsterdam's new gas accounting rules.
- **EIP-7778** uses pre-refund gas for block accounting, so cumulative
pre-refund gas can exceed the block gas limit even when a block builder
correctly included all transactions.
- **EIP-8037** introduces 2D gas accounting (`block_gas = max(regular,
state)`), meaning cumulative total gas (regular + state) can legally
exceed the block gas limit.
- The fix skips the per-tx cumulative gas check for Amsterdam and adds a
**post-execution** block-level overflow check using `max(sum_regular,
sum_state)` in all three execution paths (sequential, pipeline,
parallel).

## Local test results
- **200/201** EIP-7778 + EIP-8037 Hive consume-engine tests pass
- **105/105** EIP-7778 + EIP-8037 EF blockchain tests pass (4 + 101)
- The single remaining Hive failure
(`test_block_regular_gas_limit[exceed=True]`) expects
`TransactionException.GAS_ALLOWANCE_EXCEEDED` but we return
`BlockException.GAS_USED_OVERFLOW` — the block is correctly rejected,
just with a different error classification.

## Test plan
- [x] All EIP-7778 EF blockchain tests pass locally
- [x] All EIP-8037 EF blockchain tests pass locally
- [x] 200/201 Hive consume-engine Amsterdam tests pass locally
- [ ] Full CI Amsterdam Hive suite passes

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
@Arkenan Arkenan marked this pull request as ready for review April 17, 2026 09:05
@Arkenan Arkenan requested a review from a team as a code owner April 17, 2026 09:05
@ethrex-project-sync ethrex-project-sync bot moved this to In Review in ethrex_l1 Apr 17, 2026
@github-actions

🤖 Claude Code Review

Here is my review of PR #6466:


PR #6466: feat(l1): fullsync metrics

Overview

Adds a new MetricsFullSync struct with Prometheus gauges/counters tracking full sync phases (header download, body download, EVM execution, merkleization, storage), wires it into the sync cycle and block execution batch, and ships a Grafana dashboard JSON. The approach is consistent with the existing metrics pattern in the codebase.


Correctness Issues

transitions_ms measured but never exported

In crates/blockchain/blockchain.rs (around lines 2235–2238), transitions_ms is measured and included in the [FULLSYNC TIMING] log:

exec={}ms, transitions={}ms, merkle={}ms, store={}ms

However, MetricsFullSync has no batch_transitions_ms field and no corresponding setter. The metric is silently dropped — Grafana will show execution + merkle + store but the state-transition phase is invisible. Either add the field/setter and export it, or remove the measurement if it's intentional to omit it.

body_failures counter is declared but never incremented

MetricsFullSync has body_failures: IntCounter and inc_body_failures() is implemented (fullsync.rs:384), but there are no call sites in full.rs. When request_block_bodies returns None and the code returns SyncError::BodiesNotFound, the failure counter is not bumped. The "Failures per Minute" Grafana panel will always show zero body failures.

interval.elapsed() called twice for different purposes

In blockchain.rs (~line 2267 and ~line 2276):

info!("[FULLSYNC TIMING] ... total={}ms", interval.elapsed().as_millis());
// ...
let elapsed_seconds = interval.elapsed().as_secs_f64(); // second call

The second elapsed() measures a slightly later wall-clock time than the total in the log. More importantly, elapsed_seconds is used for throughput (gigagas/s) calculation — this was the original sole purpose of interval. The two values will diverge slightly. Consider capturing the elapsed value once into a variable before logging.


Design/Consistency Concerns

batch_total_ms measured at two different stack levels

batch_total_ms is set in full.rs:add_blocks_in_batch from execution_start.elapsed(), which wraps the entire call into blockchain.execute_blocks_batch (or equivalent). Meanwhile, the individual phase timings (exec_ms, transitions_ms, merkle_ms, store_ms) are measured inside that called function. The sum of the four internal phases will therefore not equal batch_total_ms — there is overhead between them. This is acceptable if the intent is "wall-clock outer vs inner phase breakdown," but the Grafana stacked bar chart ("Batch Time Breakdown") implies they should sum to the total, which they won't. A comment clarifying the discrepancy, or setting batch_total_ms as the sum of the four phases, would help.

Mixed conditional compilation styles

Some sites use #[cfg(feature = "metrics")] directly:

#[cfg(feature = "metrics")]
let mut header_batch_start = Instant::now();

Others use the fullsync_metrics! macro. The macro is a good DRY solution — it would be cleaner to use it consistently throughout full.rs, including the variable declarations.

set_stage() takes a raw i64

Callers use magic numbers (set_stage(0), set_stage(1), etc.). A local enum (even repr(i64)) would prevent passing an invalid stage value and make call sites self-documenting. The numeric mapping is documented in the metric description string, but a type would enforce it.


Minor Issues

Hardcoded internal instance in Grafana dashboard

fullsync_dashboard.json (line ~984):

"current": { "text": "ethrex-mainnet-8:3701", "value": "ethrex-mainnet-8:3701" }

This commits a specific internal infrastructure host:port into the public repo. The current value in a Grafana template variable is just the "last saved selection" — setting it to "All" or an empty string would be cleaner for a public dashboard.

blocks_executed metric name is ambiguous

The description says "Highest block number executed so far" — so it's a gauge of the current tip, not a count. The name blocks_executed reads like a total count. last_executed_block_number or highest_executed_block would be clearer, though renaming changes the Prometheus metric name and any existing dashboards.

[FULLSYNC TIMING] log always emitted

The structured timing log in blockchain.rs fires regardless of the metrics feature flag. This is probably intentional (timing logs are useful without Prometheus), but it's worth making explicit in a comment if so, since the tag [FULLSYNC TIMING] implies it's metrics-related.


What's Working Well

  • The fullsync_metrics! macro avoids repetitive #[cfg] attribute spam and keeps the call sites readable.
  • reset_cycle() correctly resets only gauges and not counters — counters are intentionally monotonic across cycles.
  • The gather_metrics pattern (create registry per call, register clones, gather) is consistent with the rest of the metrics codebase and works correctly because prometheus types are Arc-backed.
  • Stage transition points (set_stage) are placed logically around the actual phase transitions in sync_cycle_full.
  • The Grafana ETA calculations use clamp_min and ... > 0 guards to avoid division-by-zero and displaying ETAs when there's no data — this is solid PromQL hygiene.

Summary of Actionable Items

  1. Add batch_transitions_ms: Gauge to MetricsFullSync and export it, or remove the timing measurement if it's intentionally omitted.
  2. Add an inc_body_failures() call site when body fetching fails, including the BodiesNotFound error path in full.rs.
  3. Capture interval.elapsed() once before the log and reuse it for elapsed_seconds.
  4. Clear the hardcoded current instance value from the Grafana dashboard JSON.

Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions

🤖 Kimi Code Review

This PR adds comprehensive metrics for full sync operations. The implementation is generally sound, but I have several specific observations:

Issues Found

1. Unused body_failures Counter (Minor)

File: crates/networking/p2p/sync/full.rs

The body_failures counter is defined in MetricsFullSync but never incremented. You increment header_failures when header fetches fail (line 89), but there's no corresponding increment for body fetch failures.

Suggestion: Add fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures()); when peers.request_block_bodies() returns Ok(None) or when the body request errors.

2. Inconsistent Instant Types (Style)

File: crates/blockchain/blockchain.rs (lines 2193, 2194)

You mix tokio::time::Instant (imported at the top of the file as use tokio::time::Instant;) with std::time::Instant. While functionally similar, this is inconsistent. Since the existing interval variable uses tokio::time::Instant, consider using that consistently throughout.

Current:

let interval = std::time::Instant::now();  // Changed from tokio::time::Instant
let exec_start = std::time::Instant::now();

3. Potential Precision Loss in Cast (Minor)

File: crates/blockchain/metrics/fullsync.rs (lines 218-220)

The cast_signed() method (presumably converting u64 to i64 for Prometheus gauges) could theoretically overflow if Ethereum block numbers exceed i64::MAX (≈9×10¹⁸). While practically impossible for Ethereum mainnet, consider documenting this assumption or using saturating conversion.

4. Missing batch_total_ms Update in Blockchain (Minor)

File: crates/blockchain/blockchain.rs

You set batch_execution_ms, batch_merkle_ms, and batch_store_ms, but never set METRICS_FULLSYNC.set_batch_total_ms(). The interval.elapsed().as_millis() is calculated for the log message but not stored as a metric.

Suggestion: Add after line 2280:

METRICS_FULLSYNC.set_batch_total_ms(interval.elapsed().as_millis() as f64);

5. Registry Recreation Overhead (Performance)

File: crates/blockchain/metrics/fullsync.rs (lines 282-312)

The gather_metrics() method creates a new Registry and registers all metrics on every call. This follows the existing pattern in the codebase, but be aware this allocates on each metrics scrape. Consider registering once in LazyLock initialization if this becomes a bottleneck.

Positive Observations

  1. Good use of the fullsync_metrics! macro in full.rs to reduce #[cfg(feature = "metrics")] boilerplate throughout the sync logic.

  2. Proper counter/gauge distinction in reset_cycle(): You correctly reset gauges (like blocks_executed) but not counters (like headers_downloaded) which should accumulate across cycles.

  3. Correct timestamp handling: Using SystemTime for Unix epoch timestamps and Instant for durations is the right approach.

  4. Defensive division: Using .max(0.001) to avoid division by zero in rate calculations (lines 175, 297) is good practice.

Nitpicks

  • Line 2283 in blockchain.rs: The log message uses blocks_len as a pseudo-batch-ID, but this is just the count. Consider adding an actual batch sequence number if you need to correlate log lines across batches.
  • Line 136 in full.rs: header_batch_start is reset at the end of the metrics block, which correctly measures the inter-batch interval including processing overhead.

The Grafana dashboard JSON looks comprehensive and follows standard Prometheus conventions.

Overall: No critical bugs or security issues. The code is ready to merge after addressing the unused body_failures counter and potentially the missing batch_total_ms metric update.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Codex Code Review

  1. fullsync_body_failures is dead code right now. The counter is defined and exposed, but the body fetch path never increments it, even when request_block_bodies(...) exhausts retries and returns BodiesNotFound. That means the new reliability panel will report 0 body failures while full sync is actually failing on bodies. Increment it on the None -> BodiesNotFound path at minimum, and probably on recoverable body-fetch retries too if you want parity with header failures. (full.rs, fullsync.rs)

  2. execution_stage_start_timestamp is started before body download begins, but the stage machine still reports stage 2 = downloading_bodies until later. As a result, the dashboard’s “Execution Stage Elapsed” panel measures body download + execution, not execution only. Either move set_execution_stage_start_now() to the same point as set_stage(3), or rename the metric/panel to reflect that it covers the whole post-header phase. (full.rs, fullsync_dashboard.json)

  3. The batch timing breakdown is internally inconsistent. batch_execution_ms now stops before get_state_transitions(), while batch_total_ms is measured outside add_blocks() and also includes later work such as forkchoice_update(). The stacked “Body Download / Execution / Merkle / Storage” panel will therefore undercount the real batch total, which makes bottleneck analysis misleading. Either include transition extraction in execution, or export separate transitions/forkchoice timings so the dashboard sums to the same boundary as total. (blockchain.rs, full.rs, fullsync_dashboard.json)

No consensus, EVM, gas-accounting, or state-transition correctness issues stood out in the diff itself; the problems I found are observability/diagnostics. I wasn’t able to run cargo check in this sandbox because rustup tried to write outside the writable roots.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt


greptile-apps bot commented Apr 17, 2026

Greptile Summary

This PR adds Prometheus metrics and a Grafana dashboard for the full sync process, tracking stage transitions, header/body download rates, block execution progress, and per-phase batch timing (EVM execution, merkle, storage).

  • P1 — fullsync_body_failures always 0: inc_body_failures() is declared and exposed but never called; body download errors propagate via ? without updating the counter, making the metric misleading in the dashboard.
  • P2 — transitions_ms missing from metrics: state-transition extraction time is measured and logged but absent from both MetricsFullSync and the Grafana stacked-bar chart, so the bars will not sum to the total batch time.
  • P2 — unconditional timing overhead: timing Instants and [FULLSYNC TIMING] info! logs in blockchain.rs and full.rs are compiled and run regardless of the metrics feature, contrary to the pattern established by the metrics! macro.

Confidence Score: 4/5

Safe to merge after fixing the body_failures counter — the rest are P2 improvements.

One P1 defect: the body_failures Prometheus counter is declared, registered, and displayed in the Grafana dashboard, but inc_body_failures() is never called, so it will always report 0 regardless of actual failures. The remaining findings are P2 (log noise without metrics feature, missing transitions metric in chart, hardcoded dashboard instance).

crates/networking/p2p/sync/full.rs (body_failures never incremented), crates/blockchain/blockchain.rs (unconditional timing + missing transitions metric)

Important Files Changed

Filename Overview
crates/blockchain/metrics/fullsync.rs New Prometheus metrics struct for full sync; comprehensive coverage of stage, progress, rates, and batch timings, but missing a batch_transitions_ms gauge that is measured in the calling code.
crates/blockchain/blockchain.rs Adds per-phase timing instrumentation and metrics emission; timing variables and [FULLSYNC TIMING] log are unconditional (not gated by metrics feature), and transitions_ms is logged but never emitted as a Prometheus metric.
crates/networking/p2p/sync/full.rs Wires up sync-cycle metrics (stage transitions, header/body rates, batch timing); body_failures counter is exposed but never incremented on body download errors.
crates/blockchain/metrics/api.rs Registers METRICS_FULLSYNC in the Prometheus scrape endpoint; follows the same pattern as existing metrics modules.
crates/blockchain/metrics/mod.rs Adds fullsync submodule gated by api or metrics feature, consistent with existing modules.
metrics/provisioning/grafana/dashboards/common_dashboards/fullsync_dashboard.json New Grafana dashboard with stage, progress, header/body/execution panels; has a hardcoded production instance in the template default value.

Sequence Diagram

sequenceDiagram
    participant SC as sync_cycle_full
    participant AB as add_blocks_in_batch
    participant BC as execute_blocks_batch
    participant ME as METRICS_FULLSYNC

    SC->>ME: reset_cycle(), set_stage(1), inc_cycles_started()
    loop Header download batches
        SC->>ME: set_target_block, set_lowest_header, inc_headers_downloaded
    end
    SC->>ME: set_blocks_total(), set_stage(2), set_execution_stage_start_now()

    loop Execution batches
        SC->>ME: set_batch_body_download_ms, set_bodies_per_second, set_stage(3)
        SC->>AB: add_blocks_in_batch()
        AB->>BC: execute_blocks_batch()
        BC->>ME: set_batch_execution_ms (exec_ms)
        Note over BC,ME: transitions_ms measured but NOT set in metrics
        BC->>ME: set_batch_merkle_ms, set_batch_store_ms
        AB->>ME: set_blocks_executed, set_blocks_per_second, set_batch_total_ms, set_batch_size
    end

    SC->>ME: set_stage(0), inc_cycles_completed()

Comments Outside Diff (1)

  1. metrics/provisioning/grafana/dashboards/common_dashboards/fullsync_dashboard.json, line 980-987 (link)

    P2 Hardcoded production instance in dashboard template variable

    The current field of the instance template variable is hardcoded to "ethrex-mainnet-8:3701". Anyone importing this dashboard on a different environment will see that specific production target pre-selected until they manually change it.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: crates/networking/p2p/sync/full.rs
Line: 170-175

Comment:
**`body_failures` metric never incremented**

`inc_body_failures()` is defined, exposed via Prometheus, and shown in the Grafana dashboard, but it is never called anywhere in the PR. When `request_block_bodies` returns `None` or an error, the `?` propagates immediately and the counter stays at 0. Operators will see a perpetually-zero `fullsync_body_failures` gauge even during body download problems.

```suggestion
            let bodies = peers
                .request_block_bodies(header_batch)
                .await
                .inspect_err(|_| fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures()))
                ?
                .ok_or_else(|| {
                    fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures());
                    SyncError::BodiesNotFound
                })?;
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: crates/blockchain/blockchain.rs
Line: 2190-2293

Comment:
**Timing instrumentation always compiled regardless of `metrics` feature**

All four timing `Instant`s (`exec_start`, `transitions_start`, `merkle_start`, `store_start`) and the `[FULLSYNC TIMING]` `info!` log are unconditional — they run even when the `metrics` feature is disabled. The existing `metrics!` macro is specifically designed to gate this overhead. Similarly, `body_download_start` and its `[FULLSYNC TIMING]` info log in `full.rs` fire unconditionally. On a busy node this adds one `info!` line per batch and per body-download round, which can be verbose in non-metrics deployments.

Consider wrapping both the timing variables and the log under `metrics!(...)` / `fullsync_metrics!(...)`, or demoting the log to `debug!`.


---

Path: crates/blockchain/blockchain.rs
Line: 2235-2238

Comment:
**`transitions_ms` measured and logged but not exposed as a Prometheus metric**

`transitions_ms` (time spent in `vm.get_state_transitions()`) is included in the `[FULLSYNC TIMING]` log and clearly matters for profiling, but there is no corresponding `batch_transitions_ms` gauge in `MetricsFullSync`. The Grafana "Batch Time Breakdown" stacked-bar chart sums `body_download + execution + merkle + store`, so it will consistently under-report total batch time by the transitions duration. This gap will be invisible to operators using the dashboard.


---

Path: metrics/provisioning/grafana/dashboards/common_dashboards/fullsync_dashboard.json
Line: 980-987

Comment:
**Hardcoded production instance in dashboard template variable**


```suggestion
        "current": {},
```



Comment on lines 170 to 175 of `crates/networking/p2p/sync/full.rs`:

```rust
        while !headers.is_empty() {
            let header_batch = &headers[..min(MAX_BLOCK_BODIES_TO_REQUEST, headers.len())];
            let bodies = peers
                .request_block_bodies(header_batch)
                .await?
                .ok_or(SyncError::BodiesNotFound)?;
```
**P1** `body_failures` metric never incremented (full comment above).

Comment on lines 2190 to 2293 of `crates/blockchain/blockchain.rs`:

```
@@ -2232,9 +2235,13 @@ impl Blockchain {
            tokio::task::yield_now().await;
        }

        let exec_ms = exec_start.elapsed().as_millis();

        let transitions_start = std::time::Instant::now();
        let account_updates = vm
            .get_state_transitions()
            .map_err(|err| (ChainError::EvmError(err), None))?;
        let transitions_ms = transitions_start.elapsed().as_millis();

        let last_block = blocks
            .last()
@@ -2244,11 +2251,13 @@ impl Blockchain {
        let last_block_gas_limit = last_block.header.gas_limit;

        // Apply the account updates over all blocks and compute the new state root
        let merkle_start = std::time::Instant::now();
        let account_updates_list = self
            .storage
            .apply_account_updates_batch(first_block_header.parent_hash, &account_updates)
            .map_err(|e| (e.into(), None))?
            .ok_or((ChainError::ParentStateNotFound, None))?;
        let merkle_ms = merkle_start.elapsed().as_millis();

        let new_state_root = account_updates_list.state_trie_hash;
        let state_updates = account_updates_list.state_updates;
@@ -2258,6 +2267,7 @@ impl Blockchain {
        // Check state root matches the one in block header
        validate_state_root(&last_block.header, new_state_root).map_err(|e| (e, None))?;

        let store_start = std::time::Instant::now();
        let update_batch = UpdateBatch {
            account_updates: state_updates,
            storage_updates: accounts_updates,
@@ -2270,6 +2280,17 @@ impl Blockchain {
        self.storage
            .store_block_updates(update_batch)
            .map_err(|e| (e.into(), None))?;
        let store_ms = store_start.elapsed().as_millis();

        info!(
            "[FULLSYNC TIMING] Batch {}: exec={}ms, transitions={}ms, merkle={}ms, store={}ms, total={}ms",
            blocks_len,
            exec_ms,
            transitions_ms,
            merkle_ms,
            store_ms,
            interval.elapsed().as_millis()
        );
```
**P2** Timing instrumentation always compiled regardless of `metrics` feature (full comment above).


Comment on lines 2235 to 2238 of `crates/blockchain/blockchain.rs`:

```rust
            tokio::task::yield_now().await;
        }

        let exec_ms = exec_start.elapsed().as_millis();
```
**P2** `transitions_ms` measured and logged but not exposed as a Prometheus metric (full comment above).



Labels

L1 Ethereum client

Projects

Status: In Review


2 participants