Example: spillable PageStore for the Parquet ArrowWriter #15

Closed
alamb wants to merge 1 commit into pydantic:parquet-page-spill from alamb:parquet-page-spill-example

alamb commented Jun 3, 2026

What is this?

A stacked, draft PR on top of apache#10020 ("Pluggable page spilling API for the Parquet ArrowWriter"), added to help review that PR by exercising its new public PageStore API from the outside — as an end user would.

It adds one file, parquet/examples/spill_page_store.rs, that implements a real spilling backend and demonstrates the memory win.

Base branch is pydantic:parquet-page-spill (the apache#10020 head), so this PR's diff is just the example. Review apache#10020 first.

The example

Writes a wide, skewed Parquet file into a single row group:

  • a few Int64 columns (--int-columns, default 3)
  • some small ~20-byte string columns (--small-string-columns, default 5)
  • a configurable pile of fat ~8 KiB string columns (--large-string-columns, default 10)

and reports peak ArrowWriter::memory_size() with and without a spilling page store.

The spilling backend is a ~30-line TempFilePageStore (one unlinked temp file per column chunk): put appends a page blob and returns an opaque PageKey, take seeks and reads it back, and memory_size() keeps its default of 0 because the bytes now live in the file, not on the heap.
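The put/take shape described above can be sketched with the standard library alone. This is a minimal illustration, not the code from the example file: the `PageStore` trait and `PageKey` struct below are hypothetical stand-ins modelled on the description (the real definitions live in apache#10020 and may differ), and instead of an unlinked temp file the sketch uses a named file in the system temp directory that is deleted on drop.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::PathBuf;

/// Hypothetical opaque handle to a stored page (assumption: the real
/// `PageKey` in apache#10020 may look different).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PageKey {
    offset: u64,
    len: usize,
}

/// Hypothetical trait shaped after the PR description, not the real API.
pub trait PageStore {
    /// Store one encoded page and return a key for retrieving it later.
    fn put(&mut self, page: &[u8]) -> io::Result<PageKey>;
    /// Read back the page identified by `key`.
    fn take(&mut self, key: PageKey) -> io::Result<Vec<u8>>;
    /// Heap bytes held by the store; defaults to 0.
    fn memory_size(&self) -> usize {
        0
    }
}

/// One spill file per column chunk; pages are appended back to back.
pub struct TempFilePageStore {
    file: File,
    path: PathBuf,
    end: u64, // offset at which the next page will be appended
}

impl TempFilePageStore {
    pub fn new(name: &str) -> io::Result<Self> {
        let path = std::env::temp_dir().join(format!("{name}-{}.pages", std::process::id()));
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .truncate(true)
            .open(&path)?;
        Ok(Self { file, path, end: 0 })
    }
}

impl PageStore for TempFilePageStore {
    fn put(&mut self, page: &[u8]) -> io::Result<PageKey> {
        // A previous `take` may have moved the cursor, so seek to the end first.
        self.file.seek(SeekFrom::Start(self.end))?;
        self.file.write_all(page)?;
        let key = PageKey { offset: self.end, len: page.len() };
        self.end += page.len() as u64;
        Ok(key)
    }

    fn take(&mut self, key: PageKey) -> io::Result<Vec<u8>> {
        let mut buf = vec![0u8; key.len];
        self.file.seek(SeekFrom::Start(key.offset))?;
        self.file.read_exact(&mut buf)?;
        Ok(buf)
    }
    // memory_size() keeps the default of 0: the page bytes live in the file.
}

impl Drop for TempFilePageStore {
    fn drop(&mut self) {
        // Best-effort cleanup; the sketch ignores removal errors.
        let _ = fs::remove_file(&self.path);
    }
}
```

With this sketch, a caller would `put` each completed page as it is produced and `take` the pages back in order when the row group is flushed, so only the keys stay on the heap in between.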

Flags

  • --large-string-columns N — number of fat ~8 KiB string columns (default 10)
  • --spill — use the spilling TempFilePageStore instead of the default in-memory buffering
  • also: --small-string-columns, --int-columns, --rows, --batch-size, --output <path>

Running

# Baseline: default in-memory page buffering
cargo run --release --features cli --example spill_page_store

# Spill completed pages to temp files
cargo run --release --features cli --example spill_page_store -- --spill

What it shows

On the defaults (10 fat columns, ~160 MiB row group):

| Page buffering | Peak ArrowWriter::memory_size() |
|---|---|
| in-memory (default) | ~161 MiB |
| TempFilePageStore (--spill) | ~21 MiB |

That is, spilling bounds peak writer memory by the in-flight encoder buffers rather than by the row group size. To widen the gap, increase the skew with --large-string-columns 30. Writing the same data to a real file with --output produces a byte-identical Parquet file in both modes, confirming that the spilling path is a transparent drop-in.
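One way to check the byte-identical claim yourself is to write one file per mode with --output and compare them. The helper below is a small sketch of that comparison; the function name `files_identical` and any paths passed to it are hypothetical, and it reads both files fully into memory, which is acceptable for a ~160 MiB row group but would want a streaming comparison for much larger outputs.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns Ok(true) when both files hold exactly the same bytes.
/// Hypothetical helper for illustration; it is not part of the PR.
pub fn files_identical(a: &Path, b: &Path) -> io::Result<bool> {
    // Cheap early exit: files of different length cannot be byte-identical.
    if fs::metadata(a)?.len() != fs::metadata(b)?.len() {
        return Ok(false);
    }
    // Load both files and compare their contents byte for byte.
    Ok(fs::read(a)? == fs::read(b)?)
}
```

For example, after one run with `--output` pointing at one path and a second run with `--spill --output` pointing at another, calling `files_identical` on the two paths should return `Ok(true)` if the description above holds.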

Notes for review

  • tempfile and sysinfo are already dev-dependencies of the parquet crate, so the example needs no new deps; it is gated on required-features = ["arrow", "cli"].
  • The TempFilePageStore here is intentionally the same shape as the one in parquet/tests/arrow_writer.rs, so the example and the test corroborate each other.

🤖 Generated with Claude Code

Adds `parquet/examples/spill_page_store.rs`, a runnable demonstration of the
pluggable `PageStore` API. It implements a spilling `TempFilePageStore` (one
temp file per column chunk) and writes a wide, skewed Parquet file — a few
Int64 columns, some small (~20 byte) string columns, and a configurable number
of large (~8 KiB) string columns — into a single row group, reporting peak
`ArrowWriter::memory_size()` with and without spilling.

  --large-string-columns N   number of fat ~8 KiB string columns (default 10)
  --spill                    use the spilling TempFilePageStore vs the default

On the default 10 fat columns the spilling store cuts peak writer memory from
~161 MiB (whole row group buffered) to ~21 MiB (in-flight encoder buffers
only), and produces a byte-identical file to the in-memory path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alamb commented Jun 3, 2026

Superseded by apache#10058, which targets apache/arrow-rs main so the PR bundles the apache#10020 content together with the example.

alamb closed this Jun 3, 2026