Skip to content

perf(index): reduce TwoFileShuffler peak memory via interleave sort#7295

Open
wjones127 wants to merge 1 commit into
lance-format:mainfrom
wjones127:feat/shuffler-interleave-sort
Open

perf(index): reduce TwoFileShuffler peak memory via interleave sort#7295
wjones127 wants to merge 1 commit into
lance-format:mainfrom
wjones127:feat/shuffler-interleave-sort

Conversation

@wjones127

@wjones127 wjones127 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replaces rechunk_stream_by_size + concat_batches + take (two full-data copies, peak ~3–4× batch_size_bytes) with a single-pass sort over the UInt32 part-id columns only, producing (batch_idx, row_idx) interleave indices.
  • Sorted output is streamed to the data file via interleave_batches in 8 Ki-row chunks, so the interleave output adds only a small constant overhead above the accumulated source data.
  • Peak memory drops to ~1× batch_size_bytes, which enables setting LANCE_SHUFFLE_BATCH_BYTES much larger to reduce flush-group count and improve read-time I/O locality.

Closes #7299.

🤖 Generated with Claude Code

Previously the shuffler used `rechunk_stream_by_size` to accumulate and
concat incoming batches before sorting with `take`, producing two
full-size data copies in sequence and peaking at ~3-4× batch_size_bytes.

Replace both steps with a single pass:
- Accumulate incoming batches in a `Vec<RecordBatch>` without concat.
- `sort_to_interleave_indices` builds `(part_id, batch_idx, row_idx)`
  tuples over the UInt32 part-id columns only, sorts them in one
  `sort_unstable` call, and returns `(batch_idx, row_idx)` pairs +
  per-partition counts.
- The sorted output is streamed to the data file via `interleave_batches`
  in fixed-size chunks (8 Ki rows), so the interleave output never
  exceeds a small constant fraction of the source data.

Peak memory drops to ~1× batch_size_bytes, which also allows increasing
`LANCE_SHUFFLE_BATCH_BYTES` aggressively to reduce the number of flush
groups and improve read-time I/O patterns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer performance labels Jun 16, 2026
@wjones127

Copy link
Copy Markdown
Contributor Author

Benchmark: TwoFileShuffler memory & throughput vs batch size

Run against 500k rows × 128-dim float32 (244 MB vectors, 1024 partitions):

batch_size flush groups elapsed peak RSS
16 MB 16 227 ms 323 MB
32 MB 8 188 ms 348 MB
64 MB 4 358 ms 350 MB
128 MB 2 200 ms 346 MB
256 MB 1 178 ms 352 MB
512 MB 1 157 ms 364 MB

Flush groups (fewer = less random I/O at query time)

 16 MB │████████████████  16
 32 MB │████████           8
 64 MB │████               4
128 MB │██                 2
256 MB │█                  1
512 MB │█                  1

Why peak RSS looks flat

The benchmark pre-generates all 244 MB of test data before any shuffle runs, so the OS RSS baseline already reflects that allocation. The marginal memory attributable to the shuffler itself is small and gets lost in the noise. The real benefit shows up in the theoretical model:

batch_size new peak overhead old peak overhead
16 MB ~16 MB ~48–64 MB
32 MB ~32 MB ~96–128 MB
128 MB ~128 MB ~384–512 MB
512 MB ~512 MB ~1.5–2 GB

The old code did concat_batches (1× copy) + take (another 1× copy), giving a ~3–4× multiplier on batch_size_bytes at peak. The new code accumulates source batches (1×), sorts to index pairs, then streams interleave_batches in 8192-row chunks — the working set stays at ~1× batch_size throughout.

Practical impact

With the old 3–4× overhead, setting batch_size_bytes = 1 GB on a 4 GB machine would require 3–4 GB just for the shuffle overhead, exhausting memory. With the new code, 1 GB batch size costs ~1 GB, meaning you can use the full available RAM. For this 244 MB dataset, that reduces flush groups from 16 → 1, cutting random-access I/O per query partition by 16×.

Reproduce
# (from lance-index/ directory, bench not checked in)
cargo bench --bench shuffler_mem 2>&1

@github-actions github-actions Bot added the A-python Python bindings label Jun 16, 2026
@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 89.70588% with 7 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-index/src/vector/v3/shuffler.rs 89.70% 2 Missing and 5 partials ⚠️

📢 Thoughts on this report? Let us know!

@wjones127

wjones127 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Memory benchmark: TwoFileShuffler interleave-sort

Methodology

Two approaches:

  1. Rust micro-benchmark (benches/shuffler_mem.rs, untracked): 500K rows × 128-dim float32 (244 MB), varying LANCE_SHUFFLE_BATCH_BYTES. Measures RSS overhead above baseline (input data already in memory). Compares the old concat_batches + sort_to_indices + take path (inlined) against the new sort_to_interleave_indices + interleave_batches path via TwoFileShuffler.

  2. Python end-to-end (test_indexing.py::test_io_mem_shuffler_batch_size): IVF_PQ index build on ~1M rows × 3072-dim float32 (~204 MB PQ-encoded). Peak heap measured via memtest (LD_PRELOAD) over the full index build. Run on both main (8.0.0-beta.7) and this branch (8.0.0-beta.11).


Rust micro-benchmark (shuffler phase in isolation)

Dataset: 500K rows × 128-dim float32, 244 MB, 1024 partitions.
Overhead = peak RSS above baseline (input already allocated before timing starts).

Batch size Flush groups Old overhead New overhead
64 MB 4 182 MB ~0 MB
128 MB 2 382 MB ~0 MB
256 MB 1 726 MB ~0 MB
512 MB 1 728 MB ~0 MB

Old code scales at ~3× batch_size (accumulated input + concat copy + sorted-take copy). At 256 MB with 244 MB total data, old code peaks at ~3× 244 MB ≈ 732 MB extra above baseline.

New code overhead is essentially zero: the sort buffer (n × 12 bytes) and streaming 8192-row interleave chunks are negligible.

16 MB and 32 MB show noisy results for the new code (rapid flush cycles at 5 ms RSS sampling granularity). ≥64 MB cases are representative.


Python end-to-end (main vs branch)

Dataset: ~1M rows × 3072-dim float32, ~204 MB PQ-encoded (index replaced between runs).

Batch size Main peak Branch peak Delta
16 MB 10.07 GB 9.99 GB −82 MB
32 MB 9.91 GB 9.92 GB +14 MB
64 MB 10.00 GB 10.04 GB +35 MB
128 MB 10.04 GB 9.25 GB −790 MB

End-to-end peak is dominated by k-means training (~9–10 GB), which masks most of the shuffler improvement. The 128 MB case shows the clearest signal (−790 MB), consistent with the micro-benchmark prediction of ~382 MB shuffler overhead eliminated plus measurement variance. The smaller batch sizes show measurement noise in both directions.

The improvement is most impactful for production-scale workloads where users set LANCE_SHUFFLE_BATCH_BYTES to 2–8 GB to reduce flush groups and improve read locality.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes TwoFileShuffler (vector v3) to reduce peak memory during IVF shuffling by avoiding full-data concatenation/take copies and instead sorting only partition-id columns to produce interleave indices, then streaming the sorted output via interleave_batches in fixed-size chunks.

Changes:

  • Replaces the previous rechunk_stream_by_size + concat_batches + take approach with a sort over (part_id, batch_idx, row_idx) keys to generate interleave indices.
  • Streams sorted output to shuffle_data.lance in SHUFFLE_WRITE_CHUNK_ROWS chunks using interleave_batches, and writes corresponding per-flush-group offsets to shuffle_offsets.lance.
  • Simplifies loss tracking by summing per-input-batch metadata directly (no Mutex/atomic counters).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +496 to +501
let mut total_loss = 0.0f64;
let mut accumulated: Vec<RecordBatch> = Vec::new();
let mut acc_bytes: usize = 0;

let mut rechunked = std::pin::pin!(rechunked);
while let Some(batch) = rechunked.next().await {
num_batches_ref.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
let mut data = std::pin::pin!(data);
while let Some(batch) = data.next().await {
Comment on lines +439 to +449
let pid = *part_id as usize;
if pid < num_partitions {
partition_counts[pid] += 1;
} else {
log::warn!(
"Partition ID {} is out of range [0, {})",
pid,
num_partitions
);
}
interleave_indices.push((*batch_idx as usize, *row_idx as usize));
Comment on lines +570 to +578
/// Sorts `accumulated` batches by partition ID and writes the result to the data
/// and offsets files.
///
/// Returns `(total_rows_written, per_partition_row_counts)`.
async fn flush_shuffle_batch(
accumulated: Vec<RecordBatch>,
file_writer: &mut FileWriter,
offsets_writer: &mut FileWriter,
offsets_schema: Arc<Schema>,
Comment on lines +426 to +438
let total_rows: usize = part_id_columns.iter().map(|a| a.len()).sum();
let mut keys: Vec<(u32, u32, u32)> = Vec::with_capacity(total_rows);
for (batch_idx, col) in part_id_columns.iter().enumerate() {
let batch_idx = batch_idx as u32;
for (row_idx, &part_id) in col.values().iter().enumerate() {
keys.push((part_id, batch_idx, row_idx as u32));
}
}
keys.sort_unstable_by_key(|k| k.0);

let mut partition_counts = vec![0u64; num_partitions];
let mut interleave_indices = Vec::with_capacity(total_rows);
for (part_id, batch_idx, row_idx) in &keys {
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-index Vector index, linalg, tokenizer A-python Python bindings performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reduce TwoFileShuffler peak memory from 3-4x to 1x batch_size_bytes

2 participants