
perf(l1): small-account batching in insert_storages to amortize dispatcher overhead #6476

@ElFantasma

Description


Motivation

During snap sync's storage phase (insert_storages), profiling from mainnet run 20260412_172457 revealed that the small-account tail accounts for ~80% of idle thread-seconds — a dominant parallelism loss that existing big-account parallelization work (#5482) would only marginally address.

This issue proposes batching small storage-trie tasks to amortize dispatcher overhead, which preliminary analysis suggests offers a larger realistic gain than monster-trie parallelization.

Profiling baseline

From insert_storages aggregate on that run:

wall            = 2347.6s  (39m 07s)
total_trie_cpu  = 18965.1s  (sum of trie CPU time across all tasks)
parallelism     = 8.1x      (out of 16 threads → 50% utilization)
wait_time       = 1619.2s   (dispatcher blocked on receiver.recv(), 69% of wall)
accounts        = 26,311,772
total_leaves    = 1,552,108,325
avg_trie        = 0ms       (most accounts <1ms)
max_trie        = 244.9s    (monster: 159.6M leaves — see below)

Thread utilization breakdown:

  • Available: 16 threads × 2347s wall = 37,554 thread-seconds
  • Used: total_trie_cpu = 18,965 thread-seconds
  • Idle: 18,589 thread-seconds (49%)

Where the idle time comes from

Decomposition of the 18,589 idle thread-seconds:

Large accounts (~20% of the loss)

Only 30 accounts are "large" (>5s trie build) out of 26.3M total:

| Account prefix | Leaves | Wall | Notes |
| --- | --- | --- | --- |
| 159e489… (monster) | 159.6M | 244.9s | 99% CPU-bound, likely Uniswap v3 |
| ab14d68… | 23.7M | 34.5s | |
| 1ff0800… | 17.7M | 32.7s | |
| | 16.3M | 23s | |
| | 15.7M | 22.4s | |
| ~25 more | 3–10M | 5–15s each | |

If the monster ran single-threaded while all 15 other threads idled: 244.9 × 15 = 3,674 thread-seconds — only ~20% of the 18,589 idle. Even fully parallelizing the monster (the subject of #5482) cannot reclaim the remaining 80%.

Small-account tail (~80% of the loss) — this issue's target

26.3M accounts with avg <1ms trie build time. Each requires the full dispatcher roundtrip:

  1. Dispatcher sends task (channel send)
  2. Worker runs trie build (<1ms)
  3. Worker sends result
  4. Dispatcher recvs, releases slot, sends next

The wait_time=1619s (69% of wall) is the dispatcher blocked on receiver.recv() — workers finishing <1ms each means slot-turnover rate is the bottleneck, not worker compute. The 16-slot pool spends most of its time empty because dispatcher bookkeeping dominates.

Proposal

Bundle N small accounts per worker job to amortize dispatch overhead. Large accounts continue to run as single-account tasks.

Sketch

// Threshold: small = expected trie build < 10ms or < 50K leaves
// (tunable; measure on a few runs to find the knee)
const SMALL_ACCOUNT_LEAVES_THRESHOLD: usize = 50_000;
const BATCH_SIZE: usize = 64;

// Dispatcher side:
let mut small_batch = Vec::with_capacity(BATCH_SIZE);
for (account_hash, storage_root, leaf_count) in accounts_iter {
    if leaf_count >= SMALL_ACCOUNT_LEAVES_THRESHOLD {
        // Large account — dispatch on its own (current behavior)
        flush_batch_if_any(&mut small_batch)?;
        dispatch_single(&account_hash, &storage_root)?;
    } else {
        small_batch.push((account_hash, storage_root));
        if small_batch.len() >= BATCH_SIZE {
            dispatch_batch(std::mem::take(&mut small_batch))?;
        }
    }
}
flush_batch_if_any(&mut small_batch)?;

// Worker side (TrieError stands in for the real error type, elided here):
fn process_batch(batch: Vec<(H256, H256)>) -> Result<Vec<TrieStats>, TrieError> {
    batch
        .into_iter()
        .map(|(addr, root)| trie_from_sorted_accounts_with_stats(addr, root))
        .collect()
}

Tuning parameters

  • SMALL_ACCOUNT_LEAVES_THRESHOLD — where small ends and large begins. The profiling run shows a clear bimodal distribution (<1ms median vs 5–245s tail), so the threshold is not highly sensitive. Suggest 50K leaves or 10ms historical trie time as a starting point.
  • BATCH_SIZE — too small = not enough amortization; too large = load imbalance (last bucket finishes after others). Suggest 32–128; benchmark.
  • Per-batch result aggregation — batch workers should still emit per-account stats so existing per-account profiling (TrieInsertStats) stays intact.

Expected gain

Hard to estimate precisely without a prototype. Rough upper bound: if batching reduces dispatcher overhead from dominant (1619s) to negligible, we could recover a large fraction of the 18,589 idle thread-seconds — ballpark 5–15 minutes off the 39-minute storage phase (a 13–38% reduction).

For comparison with the monster-trie work (#5482): the two fixes are additive. Fixing the monster unlocks one thread for 245s; fixing batch dispatch unlocks slot turnover for the entire long tail.

Architectural dependencies

None. Unlike #5482's trie_from_sorted_parallel approach, batching is a change inside the existing dispatcher loop; it doesn't touch the per-account workflow, so it can ship independently of the monster-trie parallelization work.

Validation plan

  1. Baseline — confirm the 18,589 thread-seconds idle number with OS-level CPU samples (e.g. pidstat -p <pid> -u 1 during a run). The current number is derived from parallelism=8.1x, which may slightly overstate OS CPU usage.
  2. Prototype — implement with BATCH_SIZE=64, THRESHOLD=50K leaves. Run on mainnet, compare wall + parallelism + wait_time.
  3. Sweep — vary BATCH_SIZE over {16, 32, 64, 128, 256}, plot wall vs batch size.
  4. Validate downstream — ensure per-account profiling (TrieInsertStats) still emits correctly; confirm the insert_storages debug assertions pass to validate that there is no regression in trie correctness.

Instrumentation gaps worth closing first (optional)

The profiling that motivates this issue has limitations that would sharpen the prototype's feedback loop:

  • No insert_accounts aggregate line — can't directly compare account vs storage phase utilization
  • No per-flush timing — flush_nodes_to_write is where real disk writes happen, and it currently has no instrumentation. Adding flush_time to TrieInsertStats would let us confirm whether we're I/O-bound or CPU-bound after batching
  • io_wait misnamed — currently refers to buffer-pool recv blocks, not disk I/O. Rename to buffer_wait
  • No OS-level CPU util samples — capture via pidstat in parallel during runs
  • No storage-ranges throughput aggregate — per-download logs exist but no "slots/s over N minutes" roll-up

These are tracked loosely in the backlog of #6470 (the observability PR). None are blockers for the batching prototype, but they would help validate its impact.
