The TwoFileShuffler (#6169) writes only two files (shuffle_data.lance and shuffle_offsets.lance) regardless of partition count: it accumulates batch_size_bytes, sorts that block by partition, appends it to shuffle_data.lance, and writes per-partition offsets to shuffle_offsets.lance.
Peak memory is currently 3-4x batch_size_bytes. Reduce it to ~1x by using zero-copy slicing of the input data and streaming the sorted data out, rather than materializing additional copies.
Draft PR: #7295
The
TwoFileShuffler(#6169) writes only two files (shuffle_data.lanceandshuffle_offsets.lance) regardless of partition count: it accumulatesbatch_size_bytes, sorts that block by partition, appends it toshuffle_data.lance, and writes per-partition offsets toshuffle_offsets.lance.Peak memory is currently 3-4x
batch_size_bytes. Reduce it to ~1x by using zero-copy slicing of the input data and streaming the sorted data out, rather than materializing additional copies.Draft PR: #7295