Motivation
During snap sync's storage phase (insert_storages), profiling from mainnet run 20260412_172457 revealed that the small-account tail accounts for ~80% of idle thread-seconds — a dominant parallelism loss that existing big-account parallelization work (#5482) would only marginally address.
This issue proposes batching small storage-trie tasks to amortize dispatcher overhead, which preliminary analysis suggests has a higher realistic gain than monster-trie parallelization.
Profiling baseline
From insert_storages aggregate on that run:
- `wall` = 2347.6s (39m 07s)
- `total_trie_cpu` = 18965.1s (sum of trie CPU time across all tasks)
- `parallelism` = 8.1x (out of 16 threads → 50% utilization)
- `wait_time` = 1619.2s (dispatcher blocked on `receiver.recv()`, 69% of wall)
- `accounts` = 26,311,772
- `total_leaves` = 1,552,108,325
- `avg_trie` = 0ms (most accounts <1ms)
- `max_trie` = 244.9s (monster: 159.6M leaves — see below)
Thread utilization breakdown:
- Available: 16 threads × 2347s wall = 37,554 thread-seconds
- Used: `total_trie_cpu` = 18,965 thread-seconds
- Idle: 18,589 thread-seconds (49%)
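The breakdown above can be reproduced directly from the aggregate figures — a minimal sketch using the profiled numbers quoted in this issue (small rounding drift vs the report is expected):

```rust
// Reproduce the idle-thread-seconds arithmetic from the insert_storages
// aggregate. Inputs are the profiled figures quoted above, not live data.
fn idle_thread_seconds(threads: u32, wall_s: f64, cpu_s: f64) -> f64 {
    threads as f64 * wall_s - cpu_s
}

fn utilization(threads: u32, wall_s: f64, cpu_s: f64) -> f64 {
    cpu_s / (threads as f64 * wall_s)
}

fn main() {
    let (threads, wall, cpu) = (16u32, 2347.6, 18_965.1);
    // ~18.6K idle thread-seconds, ~50% utilization (report rounds to 18,589 / 49%)
    println!("idle = {:.0} thread-seconds", idle_thread_seconds(threads, wall, cpu));
    println!("utilization = {:.0}%", 100.0 * utilization(threads, wall, cpu));
}
```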
Where the idle time comes from
Decomposition of the 18,589 idle thread-seconds:
Large accounts (~20% of the loss)
Only 30 accounts are "large" (>5s trie build) out of 26.3M total:
| Account prefix | Leaves | Wall | Notes |
|---|---|---|---|
| `159e489…` (monster) | 159.6M | 244.9s | 99% CPU-bound, likely Uniswap v3 |
| `ab14d68…` | 23.7M | 34.5s | |
| `1ff0800…` | 17.7M | 32.7s | |
| — | 16.3M | 23s | |
| — | 15.7M | 22.4s | |
| ~25 more in 3–10M range | — | 5–15s each | |
If the monster ran single-threaded while all 15 other threads idled: 244.9 × 15 = 3,674 thread-seconds — only ~20% of the 18,589 idle. Even fully parallelizing the monster (the subject of #5482) cannot reclaim the remaining 80%.
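The ~20% bound can be checked mechanically from the same figures (a sketch; inputs are the report's numbers):

```rust
// Upper bound on what fully parallelizing the monster account could recover:
// while it runs single-threaded, at most 15 other threads sit idle.
fn monster_share_fraction(monster_wall_s: f64, idle_threads: f64, total_idle_s: f64) -> f64 {
    monster_wall_s * idle_threads / total_idle_s
}

fn main() {
    let share = monster_share_fraction(244.9, 15.0, 18_589.0);
    // 244.9 × 15 = 3,673.5 thread-seconds, ~20% of the 18,589 idle
    println!("monster share of idle time: {:.0}%", 100.0 * share);
}
```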
Small-account tail (~80% of the loss) — this issue's target
26.3M accounts with avg <1ms trie build time. Each requires the full dispatcher roundtrip:
- Dispatcher sends task (channel `send`)
- Worker runs trie build (<1ms)
- Worker sends result
- Dispatcher `recv`s, releases slot, sends next
The `wait_time` = 1619s (69% of wall) is the dispatcher blocked on `receiver.recv()` — with workers finishing in <1ms each, the slot-turnover rate is the bottleneck, not worker compute. The 16-slot pool spends most of its time empty because dispatcher bookkeeping dominates.
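A back-of-envelope check of that overhead, under the simplifying assumption that the recv blocking is spread uniformly across tasks (illustrative only, not a measurement):

```rust
// Average dispatcher wait attributable to each task, and what batching would
// amortize it to: 1619.2s of recv blocking spread over 26,311,772 accounts.
fn per_task_overhead_us(wait_s: f64, tasks: u64) -> f64 {
    wait_s * 1e6 / tasks as f64
}

fn main() {
    let overhead = per_task_overhead_us(1_619.2, 26_311_772); // ~62 µs
    println!("per-task dispatcher overhead: {overhead:.1} µs");
    println!("at a batch size of 64: {:.1} µs per account", overhead / 64.0);
}
```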
Proposal
Bundle N small accounts per worker job to amortize dispatch overhead. Large accounts continue to run as single-account tasks.
Sketch
```rust
// Threshold: small = expected trie build < 10ms or < 50K leaves
// (tunable; measure on a few runs to find the knee)
const SMALL_ACCOUNT_LEAVES_THRESHOLD: usize = 50_000;
const BATCH_SIZE: usize = 64;

// Dispatcher side:
let mut small_batch = Vec::with_capacity(BATCH_SIZE);
for (account_hash, storage_root, leaf_count) in accounts_iter {
    if leaf_count >= SMALL_ACCOUNT_LEAVES_THRESHOLD {
        // Large account — dispatch on its own (current behavior)
        flush_batch_if_any(&mut small_batch)?;
        dispatch_single(&account_hash, &storage_root)?;
    } else {
        small_batch.push((account_hash, storage_root));
        if small_batch.len() >= BATCH_SIZE {
            dispatch_batch(std::mem::take(&mut small_batch))?;
        }
    }
}
flush_batch_if_any(&mut small_batch)?;

// Worker side:
fn process_batch(batch: Vec<(H256, H256)>) -> Result<Vec<TrieStats>, _> {
    batch
        .into_iter()
        .map(|(addr, root)| trie_from_sorted_accounts_with_stats(addr, root))
        .collect()
}
```
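The sketch leaves `flush_batch_if_any` and the channel message type undefined. A self-contained version of that plumbing could look like the following — note that `Hash`, `Job`, and the `String` error are hypothetical stand-ins for illustration, not the codebase's real types:

```rust
use std::sync::mpsc::{channel, Sender};

// Hypothetical stand-in for H256; the real code would use the existing hash type.
type Hash = [u8; 32];

// One channel message: either a lone large account or a bundle of small ones.
enum Job {
    Single(Hash, Hash),       // (account_hash, storage_root)
    Batch(Vec<(Hash, Hash)>), // small accounts bundled together
}

// Send the accumulated small-account batch, if any, so a large single task
// does not overtake small work queued before it.
fn flush_batch_if_any(batch: &mut Vec<(Hash, Hash)>, tx: &Sender<Job>) -> Result<(), String> {
    if !batch.is_empty() {
        tx.send(Job::Batch(std::mem::take(batch)))
            .map_err(|e| e.to_string())?;
    }
    Ok(())
}

fn main() {
    let (tx, rx) = channel();
    let mut batch = vec![([0u8; 32], [1u8; 32])];
    flush_batch_if_any(&mut batch, &tx).unwrap();
    assert!(batch.is_empty());
    assert!(matches!(rx.try_recv(), Ok(Job::Batch(b)) if b.len() == 1));
    // Flushing an empty batch sends nothing.
    flush_batch_if_any(&mut batch, &tx).unwrap();
    assert!(rx.try_recv().is_err());
}
```

Taking the batch with `std::mem::take` keeps the dispatcher's `Vec` allocation alive for reuse while handing ownership of the contents to the worker.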
Tuning parameters
- `SMALL_ACCOUNT_LEAVES_THRESHOLD` — where small ends and large begins. The profiling run shows a clear bimodal distribution (<1ms median vs 5–245s tail), so the threshold is not highly sensitive. Suggest 50K leaves or 10ms historical trie time as a starting point.
- `BATCH_SIZE` — too small = not enough amortization; too large = load imbalance (the last batch finishes after the others). Suggest 32–128; benchmark.
- Per-batch result aggregation — batch workers should still emit per-account stats so existing per-account profiling (`TrieInsertStats`) stays intact.
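One way to reason about where the `BATCH_SIZE` knee sits is a toy amortization model: per-account cost is task time plus dispatch overhead divided by batch size. The ~62 µs overhead (wait_time / accounts) and the sub-millisecond task time below are assumptions derived from the profiling numbers, not measurements:

```rust
// Toy amortization model: per-account cost = task time + dispatch overhead / B.
// It ignores load imbalance, which is why very large B eventually loses.
fn per_account_cost_us(batch_size: u32, overhead_us: f64, task_us: f64) -> f64 {
    task_us + overhead_us / batch_size as f64
}

fn main() {
    for b in [1u32, 16, 64, 256] {
        // Assumed: ~62 µs dispatch overhead, ~700 µs typical small-account task
        println!("B = {b:>3}: {:.1} µs/account", per_account_cost_us(b, 62.0, 700.0));
    }
}
```

The curve flattens quickly — most of the amortization benefit is captured well before B = 256, which supports benchmarking in the 32–128 range rather than chasing very large batches.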
Expected gain
Hard to estimate precisely without a prototype. Rough upper bound: if batching reduces dispatcher overhead from dominant (1619s) to amortized-to-negligible, we could recover a large fraction of the 18,589 idle thread-seconds — ballpark 5–15 minutes off the 39-minute storage phase (30–40% reduction).
For comparison: fully parallelizing the monster (#5482) reclaims at most ~3,674 thread-seconds (~20% of the idle), while batching targets the remaining ~80%. These are additive: fixing the monster unlocks one thread for 245s; fixing batch dispatch unlocks slot-turnover for the entire long tail.
Architectural dependencies
None. Unlike #5482's `trie_from_sorted_parallel` approach, batching is a change inside the existing dispatcher loop — it doesn't touch the per-account workflow. This means it can ship independently of:
- `StorageActor` (batching logic carries over anyway)
- The `StorageTrieTracker` refactor (Snap Sync Refactor: download storage slots plan, #6170) — that operates at the download layer, not the insert layer
Validation plan
- Baseline — confirm the 18,589 idle thread-seconds number with OS-level CPU samples (e.g. `pidstat -p <pid> -u 1` during a run). The current number is derived from `parallelism` = 8.1x, which slightly overstates OS CPU usage.
- Prototype — implement with `BATCH_SIZE` = 64, `THRESHOLD` = 50K leaves. Run on mainnet; compare `wall`, `parallelism`, and `wait_time`.
- Sweep — vary `BATCH_SIZE` over {16, 32, 64, 128, 256}; plot `wall` vs batch size.
- Validate downstream — ensure per-account profiling (`TrieInsertStats`) still emits correctly; confirm the `insert_storages` debug assertions pass to validate no regression in trie correctness.
References
- `SNAP_SYNC_WORKSTREAMS.md` §Profiling Findings (internal doc)
- `crates/networking/p2p/sync/snap_sync.rs:1285-1429`, `crates/common/trie/trie_sorted.rs:189-237`
Instrumentation gaps worth closing first (optional)
The profiling that motivates this issue has limitations that would sharpen the prototype's feedback loop:
- No `insert_accounts` aggregate line — can't directly compare account vs storage phase utilization
- No per-flush timing — `flush_nodes_to_write` is where the real disk writes happen, and it currently has no instrumentation. Adding `flush_time` to `TrieInsertStats` would let us confirm whether we're I/O-bound or CPU-bound after batching
- `io_wait` is misnamed — it currently refers to buffer-pool recv blocks, not disk I/O. Rename it to `buffer_wait`
- No OS-level CPU utilization samples — capture via `pidstat` in parallel during runs
- No storage-ranges throughput aggregate — per-download logs exist, but no "slots/s over N minutes" roll-up

These are tracked loosely in the #6470 (observability PR) backlog. They are not blockers for the batching prototype but would help validate its impact.