Example: spillable PageStore for the Parquet ArrowWriter #15
Closed
alamb wants to merge 1 commit into
Adds `parquet/examples/spill_page_store.rs`, a runnable demonstration of the pluggable `PageStore` API. It implements a spilling `TempFilePageStore` (one temp file per column chunk) and writes a wide, skewed Parquet file into a single row group: a few Int64 columns, some small (~20 byte) string columns, and a configurable number of large (~8 KiB) string columns. It reports peak `ArrowWriter::memory_size()` with and without spilling.

    --large-string-columns N   number of fat ~8 KiB string columns (default 10)
    --spill                    use the spilling TempFilePageStore vs the default

With the default 10 fat columns, the spilling store cuts peak writer memory from ~161 MiB (whole row group buffered) to ~21 MiB (in-flight encoder buffers only), and produces a byte-identical file to the in-memory path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author: Superseded by apache#10058, which targets apache/arrow-rs.
What is this?
A stacked, draft PR on top of apache#10020 ("Pluggable page spilling API for the Parquet ArrowWriter"), added to help review that PR by exercising its new public `PageStore` API from the outside, as an end user would.

It adds one file, `parquet/examples/spill_page_store.rs`, that implements a real spilling backend and demonstrates the memory win.

The example

Writes a wide, skewed Parquet file into a single row group:

- `Int64` columns (`--int-columns`, default 3)
- small (~20 byte) string columns (`--small-string-columns`, default 5)
- large (~8 KiB) string columns (`--large-string-columns`, default 10)

and reports peak `ArrowWriter::memory_size()` with and without a spilling page store.

The spilling backend is a ~30-line `TempFilePageStore` (one unlinked temp file per column chunk): `put` appends a page blob and returns an opaque `PageKey`, `take` seeks and reads it back, and `memory_size()` keeps its default of `0` because the bytes now live in the file, not on the heap.

Flags

- `--large-string-columns N`: number of fat ~8 KiB string columns (default 10)
- `--spill`: use the spilling `TempFilePageStore` instead of the default in-memory buffering
- `--small-string-columns`, `--int-columns`, `--rows`, `--batch-size`, `--output <path>`

Running
What it shows
On the defaults (10 fat columns, ~160 MiB row group):

| Mode | Peak `ArrowWriter::memory_size()` |
| --- | --- |
| default (in-memory buffering) | ~161 MiB |
| `TempFilePageStore` (`--spill`) | ~21 MiB |

That is, spilling bounds peak writer memory by the in-flight encoder buffers rather than the row group size. Widen the skew with `--large-string-columns 30` to make the gap bigger. Writing the same data to a real file with `--output` produces a byte-identical Parquet file in both modes, confirming the spilling path is a transparent drop-in.

Notes for review

- `tempfile` and `sysinfo` are already `dev-dependencies` of the `parquet` crate, so the example needs no new deps; it is gated on `required-features = ["arrow", "cli"]`.
- `TempFilePageStore` here is intentionally the same shape as the one in `parquet/tests/arrow_writer.rs`, so the example and the test corroborate each other.

🤖 Generated with Claude Code
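
For readers without the branch checked out, the `put` / `take` / `memory_size()` behaviour described for `TempFilePageStore` can be approximated with the standard library alone. This is a minimal sketch under stated assumptions, not the code in the PR: the `PageStore` trait and `PageKey` struct below are hypothetical stand-ins (the real signatures in apache#10020 may differ), and it uses `std::fs` with a best-effort unlink rather than the `tempfile` crate.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Hypothetical opaque handle for a spilled page (NOT the real `PageKey`).
#[derive(Debug, Clone, Copy)]
pub struct PageKey {
    offset: u64,
    len: usize,
}

/// Hypothetical stand-in for the pluggable trait; real signatures may differ.
pub trait PageStore {
    fn put(&mut self, page: &[u8]) -> io::Result<PageKey>;
    fn take(&mut self, key: PageKey) -> io::Result<Vec<u8>>;
    /// Heap bytes held by the store. Defaults to 0.
    fn memory_size(&self) -> usize {
        0
    }
}

/// Spills page blobs to a single file and tracks the append position.
pub struct TempFilePageStore {
    file: File,
    end: u64,
}

impl TempFilePageStore {
    pub fn new(path: &Path) -> io::Result<Self> {
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .truncate(true)
            .open(path)?;
        // Best-effort unlink: on Unix the open handle keeps the data readable
        // after the name is removed. Errors (e.g. on Windows) are ignored.
        let _ = fs::remove_file(path);
        Ok(Self { file, end: 0 })
    }
}

impl PageStore for TempFilePageStore {
    fn put(&mut self, page: &[u8]) -> io::Result<PageKey> {
        // Append at the tracked end of file and hand back where the blob lives.
        self.file.seek(SeekFrom::Start(self.end))?;
        self.file.write_all(page)?;
        let key = PageKey { offset: self.end, len: page.len() };
        self.end += page.len() as u64;
        Ok(key)
    }

    fn take(&mut self, key: PageKey) -> io::Result<Vec<u8>> {
        // Seek to the recorded offset and read exactly the stored length.
        let mut buf = vec![0u8; key.len];
        self.file.seek(SeekFrom::Start(key.offset))?;
        self.file.read_exact(&mut buf)?;
        Ok(buf)
    }
    // `memory_size` is deliberately not overridden: the bytes are in the file.
}

/// Round-trips two pages out of order and returns the total bytes read back.
pub fn round_trip_demo(path: &Path) -> io::Result<usize> {
    let mut store = TempFilePageStore::new(path)?;
    let small = store.put(b"small page")?;
    let large = store.put(&[7u8; 8192])?;
    let got_large = store.take(large)?;
    let got_small = store.take(small)?;
    Ok(got_large.len() + got_small.len())
}
```

Because `take` reads by offset and length, pages can be read back in any order, and the only heap cost is the buffer returned to the caller. That is the property that lets `memory_size()` stay at `0` while the writer's reported peak tracks only the in-flight encoder buffers.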