Reduce TwoFileShuffler peak memory from 3-4x to 1x batch_size_bytes #7299

@wjones127

Description

The TwoFileShuffler (#6169) writes only two files (shuffle_data.lance and shuffle_offsets.lance) regardless of partition count: it accumulates batch_size_bytes, sorts that block by partition, appends it to shuffle_data.lance, and writes per-partition offsets to shuffle_offsets.lance.
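The "sort that block by partition, then record per-partition offsets" step can be illustrated with a counting sort over row indices. This is only a minimal sketch under stated assumptions, not the shuffler's actual code: `sort_block_by_partition`, `part_ids`, and `num_partitions` are hypothetical names, and partition IDs are assumed to be dense `u32` values in `0..num_partitions`.

```rust
/// Hypothetical helper (not part of Lance): group the rows of one
/// accumulated block by partition.
/// Returns (order, offsets), where the rows of partition p are
/// order[offsets[p]..offsets[p + 1]].
fn sort_block_by_partition(part_ids: &[u32], num_partitions: usize) -> (Vec<usize>, Vec<usize>) {
    // Count the rows that fall into each partition.
    let mut offsets = vec![0usize; num_partitions + 1];
    for &p in part_ids {
        offsets[p as usize + 1] += 1;
    }
    // A prefix sum turns the counts into start offsets.
    for p in 0..num_partitions {
        offsets[p + 1] += offsets[p];
    }
    // Stable scatter of row indices into partition order.
    let mut cursor = offsets.clone();
    let mut order = vec![0usize; part_ids.len()];
    for (row, &p) in part_ids.iter().enumerate() {
        let slot = &mut cursor[p as usize];
        order[*slot] = row;
        *slot += 1;
    }
    (order, offsets)
}
```

In this sketch `order` corresponds to the row order appended to shuffle_data.lance and `offsets` to what would be recorded in shuffle_offsets.lance for the block.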

Peak memory is currently 3-4x batch_size_bytes. Reduce it to ~1x by using zero-copy slicing of the input data and streaming the sorted data out, rather than materializing additional copies.
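As a rough illustration of the idea (not the change in the draft PR), the sketch below streams rows to a writer in partition order while only borrowing slices of the input buffer, so no sorted copy of the block is ever built. The layout is an assumption made for the example: rows are stored back to back in a byte buffer with a `row_offsets` array, whereas the real shuffler operates on Arrow data. `stream_sorted` and its parameters are hypothetical names.

```rust
use std::io::{self, Write};

/// Hypothetical sketch: write rows to `out` in partition-sorted order.
/// `data` holds all rows back to back; row i is
/// data[row_offsets[i]..row_offsets[i + 1]].
/// `order` is the partition-sorted row order for the block.
/// Returns the number of bytes written.
fn stream_sorted<W: Write>(
    data: &[u8],
    row_offsets: &[usize],
    order: &[usize],
    out: &mut W,
) -> io::Result<u64> {
    let mut written = 0u64;
    for &row in order {
        // Borrow the row as a slice of the input; nothing is copied
        // into an intermediate sorted buffer.
        let slice = &data[row_offsets[row]..row_offsets[row + 1]];
        out.write_all(slice)?;
        written += slice.len() as u64;
    }
    Ok(written)
}
```

With this shape, the only extra allocation beyond the input block is the `order` index array, which is small relative to batch_size_bytes when rows are wide.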

Draft PR: #7295

Metadata

Labels

A-index (Vector index, linalg, tokenizer), performance, rust (Rust related tasks)