
fix(l1): adding block by block exec as a fallback#6464

Merged
Arkenan merged 4 commits into main from fullsync-fallback
Apr 21, 2026
Conversation

@Arkenan
Collaborator

@Arkenan Arkenan commented Apr 10, 2026

We detected that full sync was consistently failing at hoodi block 443,055 due to batching reasons. This PR adds pipelined block-by-block execution as a fallback when a single batch fails. After this stability fix, hoodi full sync went past that block without issues. This should improve full sync stability in general.

@Arkenan Arkenan changed the title Adding block by block exec as a fallback fix(l1): adding block by block exec as a fallback Apr 10, 2026
@github-actions github-actions Bot added the L1 Ethereum client label Apr 10, 2026
@github-actions

🤖 Kimi Code Review

The PR implements a fallback mechanism for batch block execution failures, which is a pragmatic workaround for state cache corruption bugs. Overall the code is correct and well-structured.

Issues and Suggestions:

  1. Performance concern with unconditional cloning (line 286)

    match blockchain
        .add_blocks_in_batch(blocks.clone(), cancel_token)  // clones entire Vec

    If this is a large batch, cloning all blocks upfront is expensive even when batch execution succeeds (the common case). Since add_blocks_in_batch likely takes Vec<Block> by value, consider whether the API could accept &[Block] to avoid the clone, or document why the clone is necessary (workaround rarity justifies the cost).

  2. Missing cancellation support in fallback (lines 295, 308-331)
    run_blocks_pipeline doesn't accept the cancel_token, meaning if batch fails with InvalidBlock, the fallback execution cannot be cancelled. This is consistent with the original sync_head_found = true path, but consider adding cancellation support to the pipeline for consistency.

  3. Warning log context (line 293-294)

    warn!(
        "Batch execution failed with InvalidBlock ({err}), retrying batch with per-block pipeline execution"
    );

    Consider including block height/hash range in the log to help debugging:

    warn!(
        first = ?blocks.first().map(|b| b.header.number),
        last = ?blocks.last().map(|b| b.header.number),
        "Batch execution failed with InvalidBlock ({err}), falling back to pipeline execution"
    );

Positive aspects:

  • Correct error discrimination: The match on ChainError::InvalidBlock(ref err) (line 292) specifically targets the corruption case without retrying on other errors (like IO or consensus failures).
  • Proper error propagation: The run_blocks_pipeline extraction correctly preserves BatchBlockProcessingFailure metadata including last_valid_hash.
  • Thread safety: Correct use of spawn_blocking for CPU-intensive sequential execution.
  • Documentation: The updated comment (lines 270-272) clearly explains the fallback rationale.

Verdict: Approve with minor suggestions. The workaround is appropriately targeted and the refactoring improves maintainability.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Claude Code Review



Review: fix(l1): adding block by block exec as a fallback

Overall: The fix is pragmatic and correctly addresses a real production issue. The logic is sound. A few concerns are worth addressing, one of which affects every hot-path call.


Correctness

The control flow is correct:

  • blocks.clone() is passed to add_blocks_in_batch; the original blocks is moved into the fallback — no double-move, no use-after-move.
  • The ref err borrow in the match arm correctly avoids consuming the ChainError while still using it in the warn! call.
  • Error propagation after the fallback reaches the existing invalid_ancestor marking logic in add_blocks_in_batch (the outer function, line 214–237) unchanged.

Performance concern (noteworthy)

blocks.clone() is unconditional (line 287):

match blockchain
    .add_blocks_in_batch(blocks.clone(), cancel_token)  // clone on every call
    .await

This clones the entire batch on every successful execution — which is the common case. For large batches (each Block carries headers + full transaction bodies), this doubles the peak memory allocation on the hot path.

A less expensive alternative: have add_blocks_in_batch return the blocks on the Err path so they can be reused without a pre-emptive clone. Something like:

// Hypothetical: add_blocks_in_batch returns blocks on Err
Err((err, batch_failure)) => {
    match err {
        ChainError::InvalidBlock(_) => {
            warn!(...);
            run_blocks_pipeline(blockchain, batch_failure.blocks).await
        }
        ...
    }
}

If changing that signature is out of scope, a comment at the clone() call site explaining why it's needed (to preserve blocks for fallback) would help future readers avoid thinking it's accidental.
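The "return the blocks on the Err path" idea can be sketched in isolation. This is a minimal illustration with stand-in types, not the real ethrex `Block`, `ChainError`, or `add_blocks_in_batch` signature; `execute_batch`, `execute_one_by_one`, and the `valid_in_batch` flag are hypothetical names used only to simulate a batch-only failure.

```rust
// Stand-in types; the real ethrex `Block` and `ChainError` are richer.
#[derive(Debug, Clone, PartialEq)]
struct Block {
    number: u64,
    // Hypothetical flag that simulates a failure seen only in batch mode.
    valid_in_batch: bool,
}

#[derive(Debug, PartialEq)]
enum ExecError {
    InvalidBlock(u64),
}

/// Hypothetical batch executor: takes ownership of the blocks and hands
/// them back together with the error, so the caller never clones up front.
fn execute_batch(blocks: Vec<Block>) -> Result<(), (ExecError, Vec<Block>)> {
    let first_bad = blocks.iter().find(|b| !b.valid_in_batch).map(|b| b.number);
    match first_bad {
        Some(number) => Err((ExecError::InvalidBlock(number), blocks)),
        None => Ok(()),
    }
}

/// Hypothetical per-block executor; it accepts every block in this sketch.
fn execute_one_by_one(blocks: Vec<Block>) -> Result<usize, ExecError> {
    Ok(blocks.len())
}

/// Runs the batch first and reuses the returned blocks for the fallback.
fn add_blocks(blocks: Vec<Block>) -> Result<usize, ExecError> {
    let count = blocks.len();
    match execute_batch(blocks) {
        Ok(()) => Ok(count),
        Err((_err, blocks)) => execute_one_by_one(blocks),
    }
}
```

With this shape the happy path moves the `Vec` exactly once, and the fallback only touches the blocks again after a failure. Whether the real function can cheaply return its input depends on how it consumes the blocks internally, which this sketch does not model.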


Implicit assumption worth documenting

The fallback works correctly only if add_blocks_in_batch leaves no partial state in durable storage when it returns InvalidBlock. The PR description explains this is a cache corruption issue (not storage), which supports the assumption. It would be worth a short inline comment near the fallback call:

// Safe to retry: batch mode failure due to state cache corruption
// does not persist partial state to storage.
run_blocks_pipeline(blockchain, blocks).await

Without this, a future maintainer fixing an unrelated InvalidBlock code path in batch mode might unknowingly introduce a case where retrying sees dirty storage.


Silent retry on genuinely invalid blocks

When a block is truly invalid (not a cache false-positive), the flow is:

  1. Batch execution fails with InvalidBlock → warn + retry
  2. Pipeline execution also fails with InvalidBlock → returns error

This doubles execution cost for genuinely invalid blocks. That's acceptable, but the warning message says "retrying batch with per-block pipeline execution" without any hint that the block might actually be invalid. If the pipeline also fails, the caller's error handling (line 221–235) will correctly mark descendants as having an invalid ancestor, so no correctness issue — just slightly misleading logging in the double-failure case.


Minor: cancel_token not propagated to pipeline fallback

cancel_token is consumed by add_blocks_in_batch and never reaches run_blocks_pipeline. This means a cancellation request during the fallback path will be ignored until the pipeline finishes or errors. This is pre-existing behavior (the old sync_head_found branch also ignored cancellation), so not a regression — but worth a follow-up issue.
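A follow-up along these lines could check for cancellation before each block in the per-block loop. The sketch below is an assumption-laden illustration: it uses a plain `AtomicBool` as a stand-in for the real cancellation token, block numbers instead of real blocks, and a caller-supplied closure in place of the actual per-block execution call. `run_blocks_cancellable` and `PipelineError` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[derive(Debug, PartialEq)]
enum PipelineError {
    // Carries how many blocks were executed before the request was seen.
    Cancelled { executed: usize },
}

/// Hypothetical per-block loop that checks a shared cancellation flag
/// before executing each block. `execute` stands in for the real call.
fn run_blocks_cancellable<F>(
    blocks: &[u64],
    cancelled: &Arc<AtomicBool>,
    mut execute: F,
) -> Result<usize, PipelineError>
where
    F: FnMut(u64),
{
    let mut executed = 0;
    for &block in blocks {
        if cancelled.load(Ordering::Relaxed) {
            return Err(PipelineError::Cancelled { executed });
        }
        execute(block);
        executed += 1;
    }
    Ok(executed)
}
```

Checking once per block bounds the shutdown delay to the execution time of a single block rather than the whole retried batch.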


Summary

| Item | Severity | Action |
|---|---|---|
| Unconditional blocks.clone() on hot path | Medium | Optimize or document |
| Implicit assumption about clean storage on retry | Low | Add inline comment |
| Silent double-execution on genuinely invalid blocks | Low | Acceptable; minor log improvement optional |
| cancel_token not propagated to pipeline | Low | Pre-existing; track separately |

The fix is correct and the production evidence (Hoodi sync passing block 443,055) validates the approach. The clone cost is the most actionable item before merging.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions

github-actions Bot commented Apr 10, 2026

Lines of code report

Total lines added: 43
Total lines removed: 0
Total lines changed: 43

Detailed view
+-------------------------------------------+-------+------+
| File                                      | Lines | Diff |
+-------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/full.rs | 292   | +43  |
+-------------------------------------------+-------+------+

@github-actions

🤖 Codex Code Review

  1. full.rs:291 retries on every ChainError::InvalidBlock, but that enum also covers pre-execution failures like InvalidHeader, InvalidBody, size/blob/gas mismatches. Those cannot be fixed by switching execution mode, so an untrusted peer can force the node to execute the whole batch twice before you still reject it. That is a sync-time CPU amplification path. The fallback should be narrowed to the known batch-only false positives instead of all InvalidBlock variants.

  2. full.rs:298 and full.rs:304 drop cancellation handling on the retry path. add_blocks_in_batch() checks cancel_token per block, but run_blocks_pipeline() does not, so once a batch falls back the node can no longer stop promptly until the full sequential replay finishes. On a large sync batch this is a noticeable shutdown/restart regression.

No other correctness issues stood out in the diff itself. I could not run cargo check here because the sandbox blocks rustup temp-file creation in /home/runner/.rustup/tmp.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@Arkenan Arkenan marked this pull request as ready for review April 21, 2026 14:49
@Arkenan Arkenan requested a review from a team as a code owner April 21, 2026 14:49
@ethrex-project-sync ethrex-project-sync Bot moved this to In Review in ethrex_l1 Apr 21, 2026
@github-actions

🤖 Kimi Code Review

Overall Assessment: The PR implements a reasonable workaround for batch execution state corruption by falling back to single-block pipeline execution on specific post-execution errors. The logic is sound but has minor efficiency and observability considerations.

Issues and Suggestions:

1. Missing Import for warn! (Line 295)
The code uses warn! but no logging import is added in the diff. Ensure use tracing::warn; (or equivalent) is present in the file scope, or this will fail to compile.

2. Efficiency: Re-execution from Batch Start (Lines 288-306)
When falling back to run_blocks_pipeline, the entire blocks vector is re-processed from the first block, even if the batch failed at block N/2. This is safe but inefficient for large batches. Consider documenting this trade-off or optimizing to skip already-validated prefix blocks if the store allows idempotent adds.

3. DoS Vector: Double Validation Cost
is_post_execution_error matches errors that could indicate either batch corruption OR genuinely invalid blocks. Attackers feeding invalid blocks triggering StateRootMismatch will cause double execution (batch + pipeline) before rejection. Consider adding a metric or limiting retry attempts per peer.

4. Documentation for is_post_execution_error (Lines 316-327)
Add a doc comment explaining why these specific errors indicate potential batch-mode cache pollution rather than genuine invalidity:

/// Returns true for errors that arise from EVM execution...
/// Note: These errors may also indicate genuinely invalid blocks; 
/// retrying them trades CPU efficiency for correctness guarantees.

5. Cancel Token Propagation
run_blocks_pipeline does not accept the cancel_token, meaning the fallback path cannot be cancelled once started. While this matches the original sequential behavior (which also ignored the token), verify this is intentional for sync safety.

6. Clone Cost Assumption (Line 285-286)
The comment assumes clone cost is "~1-5ms". For large blocks (post-EIP-4844 with many blobs), this could be higher. Consider clarifying this is an average or measured on specific hardware.

7. Error Pattern Exhaustiveness
The matches! in is_post_execution_error expands to a match with a wildcard arm, so it will keep compiling if InvalidBlockError gains new variants and will return false for them. Nothing forces a review at compile time, so ensure the team is aware new post-execution errors must be added here.
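On the exhaustiveness point, it may help to compare the two forms directly. `matches!` expands to a `match` with a `_ => false` arm, so it accepts new variants silently; an explicit `match` without a wildcard is one way to get a compile error instead. The enum below is a three-variant stand-in, not the real `InvalidBlockError`, and both function names are hypothetical.

```rust
// Stand-in enum; the real `InvalidBlockError` has more variants and payloads.
#[derive(Debug)]
enum InvalidBlockError {
    StateRootMismatch,
    ReceiptsRootMismatch,
    InvalidHeader,
}

/// `matches!` carries an implicit `_ => false` arm, so a newly added
/// variant compiles silently and is classified as "not post-execution".
fn is_post_execution_lenient(err: &InvalidBlockError) -> bool {
    matches!(
        err,
        InvalidBlockError::StateRootMismatch | InvalidBlockError::ReceiptsRootMismatch
    )
}

/// An exhaustive `match` with no wildcard arm stops compiling when a
/// variant is added, which forces an explicit classification decision.
fn is_post_execution_strict(err: &InvalidBlockError) -> bool {
    match err {
        InvalidBlockError::StateRootMismatch | InvalidBlockError::ReceiptsRootMismatch => true,
        InvalidBlockError::InvalidHeader => false,
    }
}
```

The strict form costs one line per excluded variant, which may or may not be worth it for an enum that changes rarely.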

Nitpick:

  • Line 270: Update comment "block's state" → "blocks' state" (possessive plural).

Security Note:
The fallback preserves safety: even if batch mode corrupts state, the single-block pipeline (using fresh state per block) acts as a correctness oracle. No invalid state can be committed because the pipeline will fail on the same corrupted block.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Codex Code Review

Findings:

  1. The new fallback path is no longer cancellable, so a shutdown can still leave the node executing an entire retry batch. In full.rs the retry switches to run_blocks_pipeline, but that helper has no CancellationToken checks at all. The original batch path still checks cancellation on every block in blockchain.rs. For full sync this is a real regression in responsiveness and can pin a blocking worker on a large batch after shutdown has already been requested.

  2. BlobGasUsedMismatch is classified as a “post-execution” retry case, but it is actually raised during pre-execution validation. The new matcher in full.rs includes InvalidBlockError::BlobGasUsedMismatch, while the error is produced by verify_blob_gas_usage before vm.execute_block runs. That means malformed Cancun/Prague blocks will always be executed twice before rejection, which contradicts the helper’s comment and adds avoidable sync cost.

I did not run cargo check; the environment blocks rustup temp-file creation (/home/runner/.rustup/tmp is read-only). Other than the two points above, the fallback approach itself looks reasonable.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@greptile-apps

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR adds a pipeline-execution fallback in full sync: when add_blocks_in_batch fails with a post-execution error (state root / receipts / gas mismatch), the same block range is retried using add_block_pipeline one block at a time, which uses a fresh state per block and avoids cross-block cache pollution. The batch approach is safe to retry because add_blocks_in_batch is all-or-nothing — state is never committed until store_block_updates succeeds at the very end, so the fallback starts with a clean slate.

  • is_post_execution_error does not include InvalidTransaction, which can also be triggered by batch-mode state pollution (via EvmError::Transaction); that class of failures won't benefit from the fallback.
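The control flow summarized above can be sketched as a small synchronous function. This is a simplified model under stated assumptions: the enums are cut-down stand-ins for the real `ChainError` and `InvalidBlockError`, the batch and pipeline executors are passed in as closures, and the `Path` return value exists only to make the chosen route observable. It does not reproduce the async or `spawn_blocking` details of the real code.

```rust
// Cut-down stand-ins for the real error types.
#[derive(Debug, Clone, PartialEq)]
enum InvalidBlockError {
    StateRootMismatch,
    InvalidHeader,
}

#[derive(Debug, Clone, PartialEq)]
enum ChainError {
    InvalidBlock(InvalidBlockError),
    Other(String),
}

fn is_post_execution_error(err: &InvalidBlockError) -> bool {
    matches!(err, InvalidBlockError::StateRootMismatch)
}

// Hypothetical marker recording which execution path completed.
#[derive(Debug, PartialEq)]
enum Path {
    Batch,
    Pipeline,
}

fn add_blocks(
    sync_head_found: bool,
    batch: impl FnOnce() -> Result<(), ChainError>,
    pipeline: impl FnOnce() -> Result<(), ChainError>,
) -> Result<Path, ChainError> {
    if sync_head_found {
        return pipeline().map(|()| Path::Pipeline);
    }
    match batch() {
        Ok(()) => Ok(Path::Batch),
        // Only post-execution failures are retried block by block.
        Err(ChainError::InvalidBlock(ref err)) if is_post_execution_error(err) => {
            pipeline().map(|()| Path::Pipeline)
        }
        // Everything else is propagated without a retry.
        Err(other) => Err(other),
    }
}
```

The guard on the second arm is what keeps pre-execution failures such as an invalid header from being executed twice.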

Confidence Score: 4/5

Safe to merge; one minor gap in the fallback coverage but no regression risk

The change is well-reasoned and confirmed to work in production (hoodi sync past block 443,055). The all-or-nothing semantics of add_blocks_in_batch ensure the pipeline fallback always starts from a clean state. The only finding is that InvalidTransaction from EVM execution (another manifestation of the same cache-pollution bug) is not covered by is_post_execution_error, limiting the fix's effectiveness in that scenario without causing any regression.

crates/networking/p2p/sync/full.rs — specifically the is_post_execution_error filter

Important Files Changed

| Filename | Overview |
|---|---|
| crates/networking/p2p/sync/full.rs | Adds a pipeline-execution fallback when batch execution fails with a post-execution error; refactors sequential block execution into run_blocks_pipeline; is_post_execution_error misses InvalidTransaction which can also arise from EVM state corruption |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[add_blocks called] --> B{sync_head_found?}
    B -- yes --> C[run_blocks_pipeline]
    B -- no --> D[add_blocks_in_batch\nbatch execution]
    D --> E{Result?}
    E -- Ok --> F[✅ Done]
    E -- "Err(InvalidBlock + is_post_execution_error)" --> G["⚠️ warn + log failed block"]
    G --> C
    E -- "Err(other)" --> H[❌ Propagate error]
    C --> I{Pipeline result?}
    I -- Ok --> F
    I -- Err --> H

Reviews (1): Last reviewed commit: "Merge branch 'main' into fullsync-fallba..."

Comment on lines +327 to +337
fn is_post_execution_error(err: &InvalidBlockError) -> bool {
    matches!(
        err,
        InvalidBlockError::GasUsedMismatch(_, _)
            | InvalidBlockError::StateRootMismatch
            | InvalidBlockError::ReceiptsRootMismatch
            | InvalidBlockError::RequestsHashMismatch
            | InvalidBlockError::BlockAccessListHashMismatch
            | InvalidBlockError::BlobGasUsedMismatch
    )
}

P2 InvalidTransaction missing from post-execution error filter

InvalidTransaction can originate from EVM execution (via EvmError::Transaction → ChainError::InvalidBlock(InvalidBlockError::InvalidTransaction(...))). If batch-mode state-cache corruption causes an account's nonce or balance to appear wrong mid-batch, the block's transaction will fail with InvalidTransaction, but is_post_execution_error returns false for it — so the pipeline fallback never fires. The same root cause (cross-block shared state) that produces StateRootMismatch can also produce InvalidTransaction, yet only the former triggers the retry.

Suggested change
- fn is_post_execution_error(err: &InvalidBlockError) -> bool {
-     matches!(
-         err,
-         InvalidBlockError::GasUsedMismatch(_, _)
-             | InvalidBlockError::StateRootMismatch
-             | InvalidBlockError::ReceiptsRootMismatch
-             | InvalidBlockError::RequestsHashMismatch
-             | InvalidBlockError::BlockAccessListHashMismatch
-             | InvalidBlockError::BlobGasUsedMismatch
-     )
- }
+ fn is_post_execution_error(err: &InvalidBlockError) -> bool {
+     matches!(
+         err,
+         InvalidBlockError::GasUsedMismatch(_, _)
+             | InvalidBlockError::StateRootMismatch
+             | InvalidBlockError::ReceiptsRootMismatch
+             | InvalidBlockError::RequestsHashMismatch
+             | InvalidBlockError::BlockAccessListHashMismatch
+             | InvalidBlockError::BlobGasUsedMismatch
+             | InvalidBlockError::InvalidTransaction(_)
+     )
+ }

@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review



PR Review: fix(l1): adding block by block exec as a fallback (#6464)

Overview

This PR addresses a recurring full-sync failure at hoodi block 443,055 caused by cross-block state cache pollution in batch execution mode. The fix: if batch execution fails with a post-execution error (e.g., StateRootMismatch), fall back to per-block pipeline execution, which uses fresh VM state per block. The original sequential path is also extracted into run_blocks_pipeline to reduce duplication.


Correctness

Fallback atomicity — correct. The key safety property holds: add_blocks_in_batch only writes to storage via store_block_updates after all blocks succeed (blockchain.rs:2270). If it returns Err, no state has been committed, so the pipeline fallback can safely re-execute all blocks from a clean base. This is the load-bearing correctness invariant of the PR.

Match arm ownership — correct. The ref err / ref batch_failure pattern borrows from the Err tuple, and the Err(e) => Err(e) catch-all still receives an owned value when the guard fails. Rust handles this correctly.

is_post_execution_error gap — potential issue. InvalidBlockError::InvalidTransaction(String) is not included. However, this variant is the target of From<EvmError> for EvmError::Transaction(...) (blockchain/error.rs). If state cache pollution causes a transaction to fail (wrong balance or nonce due to leaked state from a prior block in the batch), the error surfaces as InvalidBlock(InvalidTransaction(...)) — and the fallback won't trigger. The initial bug manifested as StateRootMismatch, so this gap didn't matter here, but it could suppress useful retries in future corruption patterns. Worth explicitly noting in a comment or tracking as a follow-up.


Performance

blocks.clone() on every non-final batch (full.rs:293). This clones up to EXECUTE_BATCH_SIZE (1024) blocks every execution pass, not just when the fallback triggers. The in-code comment argues this is ~1-5ms vs ~29s for batch execution, which is plausible for hoodi's block density. But this is paid unconditionally even on the happy path. A minor concern for busier chains (mainnet-dense blocks with many transactions could make this more expensive). Not a blocker, but the assumption should survive mainnet testing.


Code Quality

Positive:

  • Extracting run_blocks_pipeline is clean — removes an exact code duplicate.
  • is_post_execution_error is well-named and well-documented. The comment explaining the exclusion rationale for pre-execution errors is exactly the right level of detail.
  • The warn! log on fallback is informative and includes the block number and hash.

Minor nits:

  • full.rs:309 — blocks.iter().find(|b| b.hash() == f.failed_block_hash) recomputes hashes for every block in the list. Only triggered on the error path so the cost is negligible, but batch_failure.failed_block_hash is already the hash — the only new information needed is the block number. Since BatchBlockProcessingFailure doesn't carry it, this is fine as-is.

  • full.rs:297 — the pattern Err((ChainError::InvalidBlock(ref err), ref batch_failure)) if is_post_execution_error(err) works correctly, but if InvalidBlockError gets new variants in the future, they'll silently fall through to Err(e) => Err(e) without a fallback retry. This is the desired default behavior, but is_post_execution_error should be kept in sync with new post-execution variants as the enum grows.


Test Coverage

No tests are added. The fix is validated empirically (hoodi sync past block 443,055). A unit test for is_post_execution_error would be valuable to protect against regression — it's a pure function that's easy to test exhaustively:

#[test]
fn post_execution_error_classification() {
    assert!(is_post_execution_error(&InvalidBlockError::StateRootMismatch));
    assert!(is_post_execution_error(&InvalidBlockError::ReceiptsRootMismatch));
    assert!(!is_post_execution_error(&InvalidBlockError::InvalidBody(...)));
    // etc.
}

This would also serve as living documentation of which errors are intentionally excluded.
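A compilable version of such a test might look like the following. It is a sketch against a stand-in enum with only a subset of variants; the payload types (`u64` for the gas values, `String` for the body error) are assumptions chosen to make the example self-contained and may not match the real `InvalidBlockError`.

```rust
// Stand-in enum with a subset of variants; payload types are assumptions.
#[derive(Debug)]
enum InvalidBlockError {
    GasUsedMismatch(u64, u64),
    StateRootMismatch,
    ReceiptsRootMismatch,
    InvalidBody(String),
}

fn is_post_execution_error(err: &InvalidBlockError) -> bool {
    matches!(
        err,
        InvalidBlockError::GasUsedMismatch(_, _)
            | InvalidBlockError::StateRootMismatch
            | InvalidBlockError::ReceiptsRootMismatch
    )
}

#[test]
fn post_execution_error_classification() {
    // Errors that can only be detected after executing the block.
    assert!(is_post_execution_error(&InvalidBlockError::StateRootMismatch));
    assert!(is_post_execution_error(&InvalidBlockError::ReceiptsRootMismatch));
    assert!(is_post_execution_error(&InvalidBlockError::GasUsedMismatch(21_000, 42_000)));
    // Pre-execution validation failure: must not trigger the fallback.
    assert!(!is_post_execution_error(&InvalidBlockError::InvalidBody(
        "bad body".to_string()
    )));
}
```

In the real crate the test would construct the actual variants, and each excluded variant would get its own negative assertion.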


Summary

The approach is sound. The fallback is safe (no partial storage writes to undo), correctly scoped to post-execution errors, and well-commented. The main items worth tracking:

  1. InvalidBlockError::InvalidTransaction may warrant inclusion in is_post_execution_error if state-pollution-induced tx failures become an observed failure mode.
  2. The unconditional blocks.clone() cost is acceptable now but should be re-evaluated under mainnet-level block density.
  3. A unit test for is_post_execution_error would be a low-effort quality improvement.

Automated review by Claude (Anthropic) · sonnet · custom prompt

    blocks: Vec<Block>,
) -> Result<(), (ChainError, Option<BatchBlockProcessingFailure>)> {
    tokio::task::spawn_blocking(move || {
        let mut last_valid_hash = H256::default();
Contributor


If the first block fails, the 'last good' value is zero which is unintuitive.
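One way to make the "nothing succeeded yet" case explicit is to track the last valid hash as an `Option`. The sketch below uses stand-in types only: `Hash` is a plain byte array rather than `H256`, `PipelineFailure` is a hypothetical struct rather than the real `BatchBlockProcessingFailure`, and each block is modeled as a `(number, hash, executes_ok)` tuple.

```rust
// Stand-in for H256.
type Hash = [u8; 32];

// Hypothetical failure record; not the real BatchBlockProcessingFailure.
#[derive(Debug, PartialEq)]
struct PipelineFailure {
    failed_block: u64,
    // `None` means no block in this run succeeded before the failure.
    last_valid_hash: Option<Hash>,
}

/// Executes blocks in order; each tuple is (number, hash, executes_ok).
fn run_blocks(blocks: &[(u64, Hash, bool)]) -> Result<(), PipelineFailure> {
    let mut last_valid_hash: Option<Hash> = None;
    for &(number, hash, executes_ok) in blocks {
        if !executes_ok {
            return Err(PipelineFailure {
                failed_block: number,
                last_valid_hash,
            });
        }
        last_valid_hash = Some(hash);
    }
    Ok(())
}
```

A caller that needs a concrete hash in the `None` case could substitute the parent hash of the first block, which would be a separate design decision from the one shown here.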

Comment on lines +341 to +344
async fn run_blocks_pipeline(
    blockchain: Arc<Blockchain>,
    blocks: Vec<Block>,
) -> Result<(), (ChainError, Option<BatchBlockProcessingFailure>)> {
Contributor


Should also use the cancel_token

Comment on lines +330 to +338
    matches!(
        err,
        InvalidBlockError::GasUsedMismatch(_, _)
            | InvalidBlockError::StateRootMismatch
            | InvalidBlockError::ReceiptsRootMismatch
            | InvalidBlockError::RequestsHashMismatch
            | InvalidBlockError::BlockAccessListHashMismatch
            | InvalidBlockError::BlobGasUsedMismatch
    )
Contributor


If the worry is invalid intermediate state, InvalidTransaction (which includes 'insufficient funds') could also be caused by bad state.

@Arkenan Arkenan enabled auto-merge April 21, 2026 15:54
@Arkenan Arkenan added this pull request to the merge queue Apr 21, 2026
Merged via the queue into main with commit 1a2ec50 Apr 21, 2026
57 checks passed
@Arkenan Arkenan deleted the fullsync-fallback branch April 21, 2026 16:37
@github-project-automation github-project-automation Bot moved this from In Review to Done in ethrex_l1 Apr 21, 2026

Labels

L1 Ethereum client

Projects

Status: Done

6 participants