
perf(l1): small-account batching in insert_storages to amortize dispatcher overhead #6476

@ElFantasma

Description


Motivation

During snap sync's storage phase (insert_storages), profiling from mainnet run 20260412_172457 revealed that the small-account tail accounts for ~80% of idle thread-seconds — a dominant parallelism loss that existing big-account parallelization work (#5482) would only marginally address.

This issue proposes batching small storage-trie tasks to amortize dispatcher overhead, which preliminary analysis suggests offers a larger realistic gain than monster-trie parallelization.

Profiling baseline

From insert_storages aggregate on that run:

wall            = 2347.6s  (39m 07s)
total_trie_cpu  = 18965.1s  (sum of trie CPU time across all tasks)
parallelism     = 8.1x      (out of 16 threads → 50% utilization)
wait_time       = 1619.2s   (dispatcher blocked on receiver.recv(), 69% of wall)
accounts        = 26,311,772
total_leaves    = 1,552,108,325
avg_trie        = 0ms       (most accounts <1ms)
max_trie        = 244.9s    (monster: 159.6M leaves — see below)

Thread utilization breakdown:

  • Available: 16 threads × 2347s wall = 37,554 thread-seconds
  • Used: total_trie_cpu = 18,965 thread-seconds
  • Idle: 18,589 thread-seconds (49%)

Where the idle time comes from

Decomposition of the 18,589 idle thread-seconds:

Large accounts (~20% of the loss)

Only 30 accounts are "large" (>5s trie build) out of 26.3M total:

| Account prefix | Leaves | Wall | Notes |
| --- | --- | --- | --- |
| 159e489… (monster) | 159.6M | 244.9s | 99% CPU-bound, likely Uniswap v3 |
| ab14d68… | 23.7M | 34.5s | |
| 1ff0800… | 17.7M | 32.7s | |
| | 16.3M | 23s | |
| | 15.7M | 22.4s | |
| ~25 more | 3–10M | 5–15s each | |

If the monster ran single-threaded while all 15 other threads idled: 244.9 × 15 = 3,674 thread-seconds — only ~20% of the 18,589 idle. Even fully parallelizing the monster (the subject of #5482) cannot reclaim the remaining 80%.

Small-account tail (~80% of the loss) — this issue's target

26.3M accounts with avg <1ms trie build time. Each requires the full dispatcher roundtrip:

  1. Dispatcher sends task (channel send)
  2. Worker runs trie build (<1ms)
  3. Worker sends result
  4. Dispatcher recvs, releases slot, sends next

The wait_time=1619s (69% of wall) is the dispatcher blocked on receiver.recv() — workers finishing <1ms each means slot-turnover rate is the bottleneck, not worker compute. The 16-slot pool spends most of its time empty because dispatcher bookkeeping dominates.

Proposal

Bundle N small accounts per worker job to amortize dispatch overhead. Large accounts continue to run as single-account tasks.

Sketch

// Threshold: small = expected trie build < 10ms or < 50K leaves
// (tunable; measure on a few runs to find the knee)
const SMALL_ACCOUNT_LEAVES_THRESHOLD: usize = 50_000;
const BATCH_SIZE: usize = 64;

// Dispatcher side:
let mut small_batch = Vec::with_capacity(BATCH_SIZE);
for (account_hash, storage_root, leaf_count) in accounts_iter {
    if leaf_count >= SMALL_ACCOUNT_LEAVES_THRESHOLD {
        // Large account — dispatch on its own (current behavior)
        flush_batch_if_any(&mut small_batch)?;
        dispatch_single(&account_hash, &storage_root)?;
    } else {
        small_batch.push((account_hash, storage_root));
        if small_batch.len() >= BATCH_SIZE {
            dispatch_batch(std::mem::take(&mut small_batch))?;
        }
    }
}
flush_batch_if_any(&mut small_batch)?;

// Worker side (TrieError stands in for the real error type, elided here):
fn process_batch(batch: Vec<(H256, H256)>) -> Result<Vec<TrieStats>, TrieError> {
    batch
        .into_iter()
        .map(|(addr, root)| trie_from_sorted_accounts_with_stats(addr, root))
        .collect()
}

Tuning parameters

  • SMALL_ACCOUNT_LEAVES_THRESHOLD — where small ends and large begins. The profiling run shows a clear bimodal distribution (<1ms median vs 5–245s tail), so the threshold is not highly sensitive. Suggest 50K leaves or 10ms historical trie time as a starting point.
  • BATCH_SIZE — too small = not enough amortization; too large = load imbalance (last bucket finishes after others). Suggest 32–128; benchmark.
  • Per-batch result aggregation — batch workers should still emit per-account stats so existing per-account profiling (TrieInsertStats) stays intact.

Expected gain

Hard to estimate precisely without a prototype. Rough upper bound: if batching reduces dispatcher overhead from dominant (1619s) to negligible, we could recover a large fraction of the 18,589 idle thread-seconds — ballpark 5–15 minutes off the 39-minute storage phase (a 13–38% reduction).

For comparison with the monster-trie work (#5482): the two fixes are additive. Fixing the monster unlocks one thread for 245s; fixing batch dispatch unlocks slot turnover for the entire long tail.

Architectural dependencies

None. Unlike #5482's trie_from_sorted_parallel approach, batching is a change inside the existing dispatcher loop; it doesn't touch the per-account workflow, so it can ship independently of the monster-trie parallelization work.

Validation plan

  1. Baseline — confirm the 18,589 thread-seconds idle number with OS-level CPU samples (e.g. pidstat -p <pid> -u 1 during a run). The current number is derived from parallelism=8.1x, which may slightly overstate OS CPU usage.
  2. Prototype — implement with BATCH_SIZE=64, THRESHOLD=50K leaves. Run on mainnet, compare wall + parallelism + wait_time.
  3. Sweep — vary BATCH_SIZE over {16, 32, 64, 128, 256}, plot wall vs batch size.
  4. Validate downstream — ensure per-account profiling (TrieInsertStats) still emits correctly; confirm the insert_storages debug assertions pass to validate that there is no regression in trie correctness.

Instrumentation gaps worth closing first (optional)

The profiling that motivates this issue has limitations that would sharpen the prototype's feedback loop:

  • No insert_accounts aggregate line — can't directly compare account vs storage phase utilization
  • No per-flush timing — flush_nodes_to_write is where real disk writes happen, and it currently has no instrumentation. Adding flush_time to TrieInsertStats would let us confirm whether we're I/O-bound or CPU-bound after batching
  • io_wait misnamed — currently refers to buffer-pool recv blocks, not disk I/O. Rename to buffer_wait
  • No OS-level CPU util samples — capture via pidstat in parallel during runs
  • No storage-ranges throughput aggregate — per-download logs exist but no "slots/s over N minutes" roll-up

These are tracked loosely in the backlog of #6470 (the observability PR). None are blockers for the batching prototype, but they would help validate its impact.
