feat(l1): replace trie layer cache with LRU cache for full sync#6492

Draft
azteca1998 wants to merge 3 commits into main from feat/lru-trie-cache-fullsync

Conversation

@azteca1998
Contributor

Summary

  • Adds FlatTrieCache: an LRU-based trie node cache (2M entries) for full sync batch mode, bypassing the diff-layer chain, bloom filter rebuild, and RCU overhead
  • Adds TrieCacheRef enum so TrieWrapper can dispatch to either cache transparently
  • In batch mode, apply_trie_updates takes a fast path: writes nodes directly to disk and populates the LRU, skipping layer accumulation entirely
  • add_blocks_in_batch enables the flat cache before execution and disables it after
  • Normal CL block processing is completely unchanged — trie_cache_ref() returns Layered when no batch cache is active
  • The bloom rebuild that takes ~25% of merkle spikes is eliminated in batch mode
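The dispatch described in the second bullet can be sketched as follows. This is a minimal, std-only sketch: a HashMap stands in for both the diff-layer chain and the LRU, and everything except the TrieCacheRef name itself is illustrative (per the review comments below, the real Flat variant holds an Arc<Mutex<FlatTrieCache>>).

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Sketch of the TrieCacheRef dispatch (assumption: HashMap stands in for
// both the layered cache and the LRU to keep the example std-only).
enum TrieCacheRef {
    Layered(HashMap<Vec<u8>, Vec<u8>>),
    Flat(Arc<Mutex<HashMap<Vec<u8>, Vec<u8>>>>),
}

impl TrieCacheRef {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        match self {
            // Normal CL processing: read through the layered cache.
            Self::Layered(layers) => layers.get(key).cloned(),
            // Batch mode: one flat lookup, no layers, no bloom filter.
            Self::Flat(cache) => cache.lock().ok()?.get(key).cloned(),
        }
    }
}
```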

Closes #6480

Test plan

  • Run full sync on hoodi with FULL_SYNC_BLOCK_LIMIT=50000 and verify completion
  • Compare merkle computation time with and without the LRU cache (check [METRICS] logs)
  • Verify normal CL block processing still uses layered cache (check after sync completes)
  • Monitor memory usage — LRU should stay bounded at ~300-600 MB
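The ~300-600 MB bound in the last bullet is consistent with 2M entries at roughly 150-300 bytes each. The per-entry size is an assumption, not stated in the PR; it would cover the path key, node value, and LRU bookkeeping.

```rust
// Back-of-envelope memory bound for the 2M-entry LRU (assumption:
// 150-300 bytes per entry; the actual per-entry size is not in the PR).
fn lru_bound_mib(entries: u64, bytes_per_entry: u64) -> u64 {
    entries * bytes_per_entry / (1024 * 1024)
}
// lru_bound_mib(2_000_000, 150) ~= 286 MiB
// lru_bound_mib(2_000_000, 300) ~= 572 MiB
```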

… mode (#6480)

During full sync (add_blocks_in_batch), the TrieLayerCache's diff-layer
chain, bloom filter, and RCU cloning are pure overhead since full sync
never reorgs. This commit adds a FlatTrieCache backed by an LRU that
bypasses all of that machinery:

- New FlatTrieCache in layering.rs: simple LRU keyed by trie node path,
  no layers, no bloom, no parent pointers.
- New TrieCacheRef enum: allows TrieWrapper to use either the layered
  cache (normal CL processing) or the flat LRU cache (batch mode)
  transparently.
- In batch mode, apply_trie_updates writes trie nodes directly to disk
  and populates the LRU, instead of accumulating diff layers and doing
  periodic commits with bloom rebuilds.
- add_blocks_in_batch enables the flat cache before execution and
  disables it after, so normal block processing is completely unchanged.
@azteca1998 changed the title from feat(storage): replace trie layer cache with LRU cache for full sync to feat(l1): replace trie layer cache with LRU cache for full sync on Apr 16, 2026
github-actions bot added the L1 Ethereum client label on Apr 16, 2026
@github-actions

github-actions Bot commented Apr 16, 2026

Lines of code report

Total lines added: 166
Total lines removed: 0
Total lines changed: 166

Detailed view
+----------------------------------------+-------+------+
| File                                   | Lines | Diff |
+----------------------------------------+-------+------+
| ethrex/crates/blockchain/blockchain.rs | 2503  | +16  |
| ethrex/crates/storage/layering.rs      | 286   | +85  |
| ethrex/crates/storage/store.rs         | 2657  | +65  |
+----------------------------------------+-------+------+

@github-actions

🤖 Kimi Code Review

Overall Assessment: The PR introduces a well-structured optimization for full sync batch processing. The abstraction using TrieCacheRef is clean, and the lifecycle management in add_blocks_in_batch correctly handles cleanup via the inner/outer function split.

Issues and Suggestions:

1. Magic numbers for trie key lengths (store.rs)

  • Lines 2939-2940: The literals 65 and 131 should be named constants.
    // Suggested:
    const ACCOUNT_LEAF_KEY_LEN: usize = 65;  // 64 nibbles (1 byte each) + 1 leaf flag
    const STORAGE_LEAF_KEY_LEN: usize = 131; // 64 + 1 + 1 + 64 + 1: account nibbles, leaf flag, separator, storage nibbles, leaf flag
    let is_leaf = key.len() == ACCOUNT_LEAF_KEY_LEN || key.len() == STORAGE_LEAF_KEY_LEN;
    let is_account = key.len() <= ACCOUNT_LEAF_KEY_LEN;

2. Early success signal before disk persistence (store.rs)

  • Line 2926: result_sender.send(Ok(())) notifies the caller before the disk write transaction commits.
    • While this mirrors the normal path's async behavior, in batch mode the write happens synchronously immediately after. If the disk commit fails (line 2960), the caller has already been told the update succeeded.
    • Recommendation: Add a comment explaining that during full sync, a disk write failure is fatal and aborts the entire batch, making this ordering acceptable.

3. Unused method (layering.rs)

  • Line 365: put_batch is defined but never used. Either remove it or use it in apply_trie_updates (line 2919-2922) to reduce lock acquisition overhead:
    // Instead of locking in a loop:
    cache.put_batch(new_layer.iter().map(|(p, v)| (p.as_ref().to_vec(), v.clone())).collect());

4. Poisoned lock handling (layering.rs)

  • Line 385: cache.lock().ok()?.get(key) silently converts a poisoned mutex to a cache miss.
    • This is safe (falls back to disk) but hides errors. Consider logging this at warn level, as a poisoned lock indicates a panic in another thread that might signal deeper issues.

5. Unnecessary fallback in Clone (layering.rs)

  • Line 304: Since `LruCache`'s capacity is already a `NonZeroUsize`, `NonZeroUsize::new(cap.into())` can never return `None`, so the `unwrap_or` fallback in the `Clone` impl is dead code; use `cap` directly.

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Codex Code Review

  1. High: the new batch path acknowledges trie visibility before the state is durable anywhere, then tears down the only cache that can serve it. Store::apply_updates() still returns after notify_rx.recv() at crates/storage/store.rs, but in the new fast path that notification is sent before write_tx.commit() at crates/storage/store.rs. Blockchain::add_blocks_in_batch() then disables the batch cache immediately after the call returns at crates/blockchain/blockchain.rs. That creates a correctness hole: if the background writer is still flushing, subsequent reads fall back to the layered cache, but batch mode never populated it; if the write fails, the worker only logs the error and the batch still reported success. I would keep the batch cache alive until the write is committed, or make batch-mode updates synchronous / acknowledged only after durability.

  2. High: FlatTrieCache changes trie reads from “state-root-aware” to “latest value by key”. TrieCacheRef::get() ignores state_root for the Flat variant at crates/storage/layering.rs, while many callers open tries for arbitrary historical roots, e.g. get_account_state_by_root() at crates/storage/store.rs, get_account_proof() at crates/storage/store.rs, and the RPC proof handler at crates/networking/rpc/eth/account.rs. While batch cache is enabled, those paths can return/prove the newest cached node instead of the node valid at the requested root. For blockchain code, that is a consensus-facing correctness bug. The flat cache needs root/version scoping, or it must be restricted to code paths that only ever read the current tip state.
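One way to add the root scoping this point asks for is to key the flat cache by (state_root, path) rather than by path alone, so a read against a historical root can never observe a newer node. A minimal sketch, with all names illustrative and a plain HashMap standing in for the real LRU:

```rust
use std::collections::HashMap;

// Sketch of root-scoped keying for the flat cache (assumption: nodes are
// keyed by (state_root, path); the real cache is an LRU, not a HashMap).
struct RootScopedCache {
    map: HashMap<([u8; 32], Vec<u8>), Vec<u8>>,
}

impl RootScopedCache {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }
    fn put(&mut self, root: [u8; 32], path: Vec<u8>, node: Vec<u8>) {
        self.map.insert((root, path), node);
    }
    fn get(&self, root: &[u8; 32], path: &[u8]) -> Option<&Vec<u8>> {
        // A node cached under one root is invisible under any other root.
        self.map.get(&(*root, path.to_vec()))
    }
}
```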

No other EVM opcode or gas-accounting issues stood out in this diff. I couldn’t run cargo check in this environment because the sandbox blocks rustup temp-file creation.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions

🤖 Claude Code Review

Here is the full review of PR #6492:


PR Review: feat(storage): replace trie layer cache with LRU cache for full sync

Overview

This PR introduces a FlatTrieCache (LRU-based) that replaces the TrieLayerCache (diff-layer chain) during full sync batch mode. The motivation is sound: during batch sync there are no reorgs, so the bloom filter, RCU cloning, and layer-chain overhead of TrieLayerCache are wasteful. The TrieCacheRef enum cleanly dispatches between the two strategies without changes to normal CL block processing. The structural approach is good.


Critical Issues

1. Cleanup invariant violated on lock error (blockchain.rs:2175-2177)

The comment says "Always disable the batch cache, even on error," but the code uses ?:

self.storage
    .disable_batch_trie_cache()
    .map_err(|e| (ChainError::StoreError(e), None))?;

result

If disable_batch_trie_cache fails (poisoned RwLock), ? returns early and:

  • result (the inner error/success) is silently dropped.
  • batch_trie_cache remains Some, so all subsequent normal CL block processing continues using the flat LRU instead of the layered cache.

This is the opposite of the stated invariant. Fix with:

let disable_result = self.storage
    .disable_batch_trie_cache()
    .map_err(|e| (ChainError::StoreError(e), None));
result.and(disable_result)

Or, since poisoning here is only possible after a panic elsewhere, simply log and ignore the disable error rather than propagating it — the important thing is batch_trie_cache is always cleared.


2. result_sender.send(Ok(())) fires before disk commit (store.rs:2918-2921)

In the fast path:

// Populate the LRU cache ...
result_sender.send(Ok(())).map_err(|_| StoreError::LockError)?;

// Write nodes to disk immediately
let mut write_tx = backend.begin_write()?;
// ... writes ...
if result.is_ok() { result = write_tx.commit(); }
let _ = fkv_ctl.send(FKVGeneratorControlMessage::Continue);
return result;

The caller is told "success" before write_tx.commit(). If commit() fails, the background thread returns an error (logged via inspect_err), but the block processing pipeline has already moved on believing the block was committed.

In the normal layered path this is tolerable because uncommitted nodes remain in the in-memory layer chain until the next flush. In the flat LRU path, LRU entries can be evicted without persistence, and after a crash the state is irrecoverably lost. Even without a crash: if the disk write fails mid-batch, subsequent blocks may be built on top of a state that is partially missing from disk.

Consider moving result_sender.send(Ok(())) to after write_tx.commit() succeeds in the flat path, accepting the tradeoff that block execution must wait slightly longer on disk I/O.
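The suggested reordering can be sketched as follows, where `commit` stands in for `write_tx.commit()` and the sender mirrors `result_sender`; both names and the error type are illustrative.

```rust
use std::sync::mpsc::Sender;

// Sketch: acknowledge only after durability (assumption: `commit` stands
// in for the disk commit; a failed commit is reported, never swallowed).
fn commit_then_ack(
    commit: impl FnOnce() -> Result<(), String>,
    result_sender: &Sender<Result<(), String>>,
) -> Result<(), String> {
    // Make the state durable first...
    let outcome = commit();
    // ...and only then signal the caller, so the pipeline never moves on
    // believing a failed block was committed.
    result_sender
        .send(outcome.clone())
        .map_err(|_| "send failed".to_string())?;
    outcome
}
```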


Important Issues

3. Poisoned mutex silently becomes a cache miss (layering.rs, TrieCacheRef::get)

Self::Flat(cache) => cache.lock().ok()?.get(key),

If the mutex is poisoned (a panic occurred while it was held), lock().ok()? converts the error to None — a cache miss. No log, no error propagation. Under a poisoned mutex every trie read in the batch silently skips the cache, causing all reads to fall through to disk. Add at least a warn!() or error!() log so this doesn't go completely unnoticed.
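A sketch of surfacing the poisoned lock instead of silently missing, with `eprintln!` standing in for `tracing::warn!` and the cache payload simplified to a `Vec<u8>`:

```rust
use std::sync::Mutex;

// Sketch: log a poisoned mutex before treating it as a cache miss
// (assumption: eprintln! stands in for tracing::warn!).
fn get_cached(cache: &Mutex<Vec<u8>>) -> Option<Vec<u8>> {
    match cache.lock() {
        Ok(guard) => Some(guard.clone()),
        Err(_poisoned) => {
            eprintln!("warn: FlatTrieCache mutex poisoned; treating as cache miss");
            None
        }
    }
}
```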


4. FlatTrieCache::put_batch is dead code (layering.rs:340-343)

apply_trie_updates uses individual .put() calls in a loop rather than put_batch. Either remove put_batch or use it in the fast path to make the intent clear.


5. Table-routing logic duplicated verbatim (store.rs:2935-2950 vs 3012-3027)

The is_leaf / is_account key-length detection and the four-way table dispatch appear twice, byte-for-byte identical. Extract to a helper:

fn trie_table_for_key(key: &[u8]) -> (&'static TableId, bool /* skip_if_past_last_written */)

This avoids silent divergence if the encoding ever changes.


6. Magic numbers 65 and 131 need constants or documentation (store.rs:2935)

The values are correct — each nibble is stored as one byte (Nibbles::into_vec() is unpacked), so an account leaf is 64 nibbles + 1 leaf flag = 65 bytes, and a storage leaf is 64 (account hash) + 1 (leaf flag) + 1 (separator) + 64 (storage key) + 1 (leaf flag) = 131 bytes — but this is non-obvious. Named constants:

const ACCOUNT_LEAF_PATH_LEN: usize = 65;
const STORAGE_LEAF_PATH_LEN: usize = 131;

would make both instances self-documenting and eliminate the duplicate // Before encoding... comment that currently only appears in the normal path.


Minor Issues

7. Unreachable unwrap_or in FlatTrieCache::clone (layering.rs:305)

NonZeroUsize::new(cap.into()).unwrap_or(NonZeroUsize::new(1).unwrap())

cap is NonZeroUsize, so cap.into():usize is always > 0 and NonZeroUsize::new(cap.into()) is always Some. The unwrap_or branch is dead. Use cap directly or .expect("cap is NonZero").


8. Surprising Clone semantics on FlatTrieCache (layering.rs:295-310)

The Clone impl creates an empty cache rather than copying entries. The in-code comment explains this, but it's worth also noting what requires this bound (presumably TrieLayerCache or some derive on an enclosing type). If no external caller actually clones a FlatTrieCache value — the hot path clones the Arc<Mutex<FlatTrieCache>> — consider removing the Clone impl entirely to prevent accidental misuse.


9. No unit tests for the new path

There are no tests for FlatTrieCache (put/get/eviction), TrieCacheRef dispatch, or the enable/disable_batch_trie_cache lifecycle. Given that correctness bugs here can silently corrupt chain state, even basic property tests (capacity eviction, get after put, empty-after-clear) would help catch regressions.
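The kinds of tests suggested here can be sketched against a toy capacity-bounded LRU. The stand-in below is illustrative only, since FlatTrieCache itself is not shown in the PR text; the point is the shape of the properties (capacity eviction, get after put, recency refresh).

```rust
use std::collections::VecDeque;

// Toy LRU stand-in for FlatTrieCache (assumption: the real type exposes a
// comparable put/get API); front of the deque is most recently used.
struct TinyLru {
    cap: usize,
    entries: VecDeque<(u64, u64)>,
}

impl TinyLru {
    fn new(cap: usize) -> Self {
        Self { cap, entries: VecDeque::new() }
    }
    fn put(&mut self, k: u64, v: u64) {
        self.entries.retain(|(ek, _)| *ek != k);
        self.entries.push_front((k, v));
        self.entries.truncate(self.cap); // evict least recently used
    }
    fn get(&mut self, k: u64) -> Option<u64> {
        let pos = self.entries.iter().position(|(ek, _)| *ek == k)?;
        let entry = self.entries.remove(pos)?;
        self.entries.push_front(entry); // refresh recency
        Some(entry.1)
    }
}
```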


Summary

The design concept is sound and the normal-path isolation (layered cache unchanged) is correctly implemented. The two critical issues — the cleanup invariant being falsifiable by a lock error, and signaling success before the disk commit in the flat path — need to be addressed before this lands, as both can lead to persistent data inconsistency during full sync.


Automated review by Claude (Anthropic) · sonnet · custom prompt

