perf(index): reduce TwoFileShuffler peak memory via interleave sort by wjones127 · Pull Request #7295 · lance-format/lance

wjones127 · 2026-06-16T18:02:28Z

Summary

Replaces rechunk_stream_by_size + concat_batches + take (two full-data copies, peak ~3–4× batch_size_bytes) with a single-pass sort over the UInt32 part-id columns only, producing (batch_idx, row_idx) interleave indices.
Sorted output is streamed to the data file via interleave_batches in 8 Ki-row chunks, so the interleave output adds only a small constant overhead above the accumulated source data.
Peak memory drops to ~1× batch_size_bytes, which enables setting LANCE_SHUFFLE_BATCH_BYTES much larger to reduce flush-group count and improve read-time I/O locality.

Closes #7299.

🤖 Generated with Claude Code

Previously the shuffler used `rechunk_stream_by_size` to accumulate and concat incoming batches before sorting with `take`, producing two full-size data copies in sequence and peaking at ~3-4× batch_size_bytes. Replace both steps with a single pass: - Accumulate incoming batches in a `Vec<RecordBatch>` without concat. - `sort_to_interleave_indices` builds `(part_id, batch_idx, row_idx)` tuples over the UInt32 part-id columns only, sorts them in one `sort_unstable` call, and returns `(batch_idx, row_idx)` pairs + per-partition counts. - The sorted output is streamed to the data file via `interleave_batches` in fixed-size chunks (8 Ki rows), so the interleave output never exceeds a small constant fraction of the source data. Peak memory drops to ~1× batch_size_bytes, which also allows increasing `LANCE_SHUFFLE_BATCH_BYTES` aggressively to reduce the number of flush groups and improve read-time I/O patterns. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

wjones127 · 2026-06-16T18:25:29Z

Benchmark: TwoFileShuffler memory & throughput vs batch size

Run against 500k rows × 128-dim float32 (244 MB vectors, 1024 partitions):

batch_size	flush groups	elapsed	peak RSS
16 MB	16	227 ms	323 MB
32 MB	8	188 ms	348 MB
64 MB	4	358 ms	350 MB
128 MB	2	200 ms	346 MB
256 MB	1	178 ms	352 MB
512 MB	1	157 ms	364 MB

Flush groups (fewer = less random I/O at query time)

 16 MB │████████████████  16
 32 MB │████████           8
 64 MB │████               4
128 MB │██                 2
256 MB │█                  1
512 MB │█                  1

Why peak RSS looks flat

The benchmark pre-generates all 244 MB of test data before any shuffle runs, so the OS RSS baseline already reflects that allocation. The marginal memory attributable to the shuffler itself is small and gets lost in the noise. The real benefit shows up in the theoretical model:

batch_size	new peak overhead	old peak overhead
16 MB	~16 MB	~48–64 MB
32 MB	~32 MB	~96–128 MB
128 MB	~128 MB	~384–512 MB
512 MB	~512 MB	~1.5–2 GB

The old code did concat_batches (1× copy) + take (another 1× copy), giving a ~3–4× multiplier on batch_size_bytes at peak. The new code accumulates source batches (1×), sorts to index pairs, then streams interleave_batches in 8192-row chunks — the working set stays at ~1× batch_size throughout.

Practical impact

With the old 3–4× overhead, setting batch_size_bytes = 1 GB on a 4 GB machine would require 3–4 GB just for the shuffle overhead, exhausting memory. With the new code, 1 GB batch size costs ~1 GB, meaning you can use the full available RAM. For this 244 MB dataset, that reduces flush groups from 16 → 1, cutting random-access I/O per query partition by 16×.

Reproduce

# (from lance-index/ directory, bench not checked in)
cargo bench --bench shuffler_mem 2>&1

codecov · 2026-06-16T19:51:13Z

Codecov Report

❌ Patch coverage is 89.70588% with 7 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-index/src/vector/v3/shuffler.rs	89.70%	2 Missing and 5 partials ⚠️

📢 Thoughts on this report? Let us know!

wjones127 · 2026-06-16T20:21:15Z

Memory benchmark: TwoFileShuffler interleave-sort

Methodology

Two approaches:

Rust micro-benchmark (benches/shuffler_mem.rs, untracked): 500K rows × 128-dim float32 (244 MB), varying LANCE_SHUFFLE_BATCH_BYTES. Measures RSS overhead above baseline (input data already in memory). Compares the old concat_batches + sort_to_indices + take path (inlined) against the new sort_to_interleave_indices + interleave_batches path via TwoFileShuffler.
Python end-to-end (test_indexing.py::test_io_mem_shuffler_batch_size): IVF_PQ index build on ~1M rows × 3072-dim float32 (~204 MB PQ-encoded). Peak heap measured via memtest (LD_PRELOAD) over the full index build. Run on both main (8.0.0-beta.7) and this branch (8.0.0-beta.11).

Rust micro-benchmark (shuffler phase in isolation)

Dataset: 500K rows × 128-dim float32, 244 MB, 1024 partitions.
Overhead = peak RSS above baseline (input already allocated before timing starts).

Batch size	Flush groups	Old overhead	New overhead
64 MB	4	182 MB	~0 MB
128 MB	2	382 MB	~0 MB
256 MB	1	726 MB	~0 MB
512 MB	1	728 MB	~0 MB

Old code scales at ~3× batch_size (accumulated input + concat copy + sorted-take copy). At 256 MB with 244 MB total data, old code peaks at ~3× 244 MB ≈ 732 MB extra above baseline.

New code overhead is essentially zero: the sort buffer (n × 12 bytes) and streaming 8192-row interleave chunks are negligible.

16 MB and 32 MB show noisy results for the new code (rapid flush cycles at 5 ms RSS sampling granularity). ≥64 MB cases are representative.

Python end-to-end (main vs branch)

Dataset: ~1M rows × 3072-dim float32, ~204 MB PQ-encoded (index replaced between runs).

Batch size	Main peak	Branch peak	Delta
16 MB	10.07 GB	9.99 GB	−82 MB
32 MB	9.91 GB	9.92 GB	+14 MB
64 MB	10.00 GB	10.04 GB	+35 MB
128 MB	10.04 GB	9.25 GB	−790 MB

End-to-end peak is dominated by k-means training (~9–10 GB), which masks most of the shuffler improvement. The 128 MB case shows the clearest signal (−790 MB), consistent with the micro-benchmark prediction of ~382 MB shuffler overhead eliminated plus measurement variance. The smaller batch sizes show measurement noise in both directions.

The improvement is most impactful for production-scale workloads where users set LANCE_SHUFFLE_BATCH_BYTES to 2–8 GB to reduce flush groups and improve read locality.

Copilot

Pull request overview

This PR optimizes TwoFileShuffler (vector v3) to reduce peak memory during IVF shuffling by avoiding full-data concatenation/take copies and instead sorting only partition-id columns to produce interleave indices, then streaming the sorted output via interleave_batches in fixed-size chunks.

Changes:

Replaces the previous rechunk_stream_by_size + concat_batches + take approach with a sort over (part_id, batch_idx, row_idx) keys to generate interleave indices.
Streams sorted output to shuffle_data.lance in SHUFFLE_WRITE_CHUNK_ROWS chunks using interleave_batches, and writes corresponding per-flush-group offsets to shuffle_offsets.lance.
Simplifies loss tracking by summing per-input-batch metadata directly (no Mutex/atomic counters).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        let mut total_loss = 0.0f64;
+        let mut accumulated: Vec<RecordBatch> = Vec::new();
+        let mut acc_bytes: usize = 0;

-        let mut rechunked = std::pin::pin!(rechunked);
-        while let Some(batch) = rechunked.next().await {
-            num_batches_ref.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
+        let mut data = std::pin::pin!(data);
+        while let Some(batch) = data.next().await {


+        let pid = *part_id as usize;
+        if pid < num_partitions {
+            partition_counts[pid] += 1;
+        } else {
+            log::warn!(
+                "Partition ID {} is out of range [0, {})",
+                pid,
+                num_partitions
+            );
+        }
+        interleave_indices.push((*batch_idx as usize, *row_idx as usize));


+/// Sorts `accumulated` batches by partition ID and writes the result to the data
+/// and offsets files.
+///
+/// Returns `(total_rows_written, per_partition_row_counts)`.
+async fn flush_shuffle_batch(
+    accumulated: Vec<RecordBatch>,
+    file_writer: &mut FileWriter,
+    offsets_writer: &mut FileWriter,
+    offsets_schema: Arc<Schema>,


+    let total_rows: usize = part_id_columns.iter().map(|a| a.len()).sum();
+    let mut keys: Vec<(u32, u32, u32)> = Vec::with_capacity(total_rows);
+    for (batch_idx, col) in part_id_columns.iter().enumerate() {
+        let batch_idx = batch_idx as u32;
+        for (row_idx, &part_id) in col.values().iter().enumerate() {
+            keys.push((part_id, batch_idx, row_idx as u32));
+        }
+    }
+    keys.sort_unstable_by_key(|k| k.0);
+
+    let mut partition_counts = vec![0u64; num_partitions];
+    let mut interleave_indices = Vec::with_capacity(total_rows);
+    for (part_id, batch_idx, row_idx) in &keys {


github-actions Bot added A-index Vector index, linalg, tokenizer performance labels Jun 16, 2026

github-actions Bot added the A-python Python bindings label Jun 16, 2026

wjones127 force-pushed the feat/shuffler-interleave-sort branch from 201d21a to a9a314d Compare June 16, 2026 20:51

wjones127 mentioned this pull request Jun 16, 2026

Reduce TwoFileShuffler peak memory from 3-4x to 1x batch_size_bytes #7299

Open

wjones127 marked this pull request as ready for review June 16, 2026 23:32

wjones127 requested a review from Copilot June 16, 2026 23:32

Copilot started reviewing on behalf of wjones127 June 16, 2026 23:33 View session

wjones127 requested a review from Xuanwo June 16, 2026 23:33

Copilot AI reviewed Jun 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(index): reduce TwoFileShuffler peak memory via interleave sort#7295

perf(index): reduce TwoFileShuffler peak memory via interleave sort#7295
wjones127 wants to merge 1 commit into
lance-format:mainfrom
wjones127:feat/shuffler-interleave-sort

wjones127 commented Jun 16, 2026 •

edited

Loading

Uh oh!

wjones127 commented Jun 16, 2026

Uh oh!

codecov Bot commented Jun 16, 2026

Uh oh!

wjones127 commented Jun 16, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wjones127 commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

wjones127 commented Jun 16, 2026

Benchmark: TwoFileShuffler memory & throughput vs batch size

Flush groups (fewer = less random I/O at query time)

Why peak RSS looks flat

Practical impact

Uh oh!

codecov Bot commented Jun 16, 2026

Codecov Report

Uh oh!

wjones127 commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Memory benchmark: TwoFileShuffler interleave-sort

Methodology

Rust micro-benchmark (shuffler phase in isolation)

Python end-to-end (main vs branch)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wjones127 commented Jun 16, 2026 •

edited

Loading

wjones127 commented Jun 16, 2026 •

edited

Loading