Add new Iceberg BatchedWriter that auto handles batches and is durabl…#1184
Merged
Conversation
```rust
use crate::{Error, Result};
use spool::Spool;

const DEFAULT_MAX_BATCH_SIZE: usize = 10_000;
```
Contributor
What unit is this?
glacon-rs uses the bytesize crate that provides some const constructors.
Author (Collaborator)
Just the number of records, not bytesize.
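Given the answer above, a doc comment would pin the unit down at the definition site. A minimal sketch, assuming the const stays where the diff shows it:

```rust
/// Maximum number of records (not bytes) buffered before a size-triggered
/// flush. Unit confirmed in the review thread: a record count, not a bytesize.
const DEFAULT_MAX_BATCH_SIZE: usize = 10_000;
```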
michaeldjeffrey approved these changes May 5, 2026
Add `BatchedWriter` to `helium_iceberg` with crash-recovery spooling

Summary
Introduces `BatchedWriter<T>` — a batching layer over `IcebergTable<T>` that accumulates records, spools them to disk, and commits them to Iceberg in larger snapshots. Designed for streaming-ingestion call sites that today either commit per record (snapshot churn) or hand-roll their own buffering.

Records are durably spooled in Arrow-IPC stream format on the way in, so a process abort between flushes is recoverable: on the next startup, the new task replays any leftover spool files for its table before accepting new traffic.
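The replay step carries the durability story, so it is worth seeing in miniature. A sketch of how "replay any leftover spool files for its table" could scan the spool directory, assuming only the `{namespace}__{table}__{uuid_v7}.arrow` naming scheme described under "How the spool works" below; `leftover_spool_files` is a hypothetical helper, not the PR's code:

```rust
use std::path::{Path, PathBuf};

// Hypothetical startup scan: collect leftover spool files for one table so
// they can be drained into Iceberg before the task accepts new records.
fn leftover_spool_files(
    spool_dir: &Path,
    namespace: &str,
    table: &str,
) -> std::io::Result<Vec<PathBuf>> {
    let prefix = format!("{namespace}__{table}__");
    let mut files = Vec::new();
    for entry in std::fs::read_dir(spool_dir)? {
        let path = entry?.path();
        if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
            if name.starts_with(&prefix) && name.ends_with(".arrow") {
                files.push(path);
            }
        }
    }
    files.sort(); // uuid_v7 is time-ordered, so lexical order approximates append order
    Ok(files)
}
```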
Public API
- `queue`/`queue_all` are ack'd: they don't return until the records have been pushed through the spool's `BufWriter` to the kernel page cache. After they return, the records survive a process abort (a call-site sketch follows this list).
- Batches flush at `max_batch_size`, when `batch_timeout` elapses, on explicit `flush()`, on `triggered::Listener` shutdown, or on channel close.
- Every flush is logged, e.g. `flushed batch to iceberg table="ns.tbl" reason="size" records=10000 duration_ms=842`. Reasons: `"size"`, `"timeout"`, `"manual"`, `"shutdown"`, `"channel_closed"`.
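To make that contract concrete, a hypothetical call site. `queue_all`, `flush`, and the `Error` type are named in this PR; the exact signatures, the `Record` stand-in, and the async shape are assumptions rather than the crate's confirmed API:

```rust
use helium_iceberg::{BatchedWriter, Error};

struct Record; // stand-in for the caller's row type

async fn ingest(writer: &BatchedWriter<Record>, rows: Vec<Record>) -> Result<(), Error> {
    // Ack'd enqueue: returns only once the rows have gone through the
    // spool's BufWriter into the kernel page cache, so they survive a
    // process abort (not a kernel crash; the fsync happens at flush).
    writer.queue_all(rows).await?;

    // Force a commit now rather than waiting for size/timeout; per the
    // logging scheme this flush would be reported with reason="manual".
    writer.flush().await?;
    Ok(())
}
```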
How the spool works

- One `{namespace}__{table}__{uuid_v7}.arrow` file per task, opened lazily on first append (so a clean shutdown with nothing buffered leaves the dir empty).
- Append path: `T → RecordBatch → StreamWriter::write → BufWriter::flush` (kernel page cache). Wrapped in `spawn_blocking` because `arrow-ipc` is sync-only (both paths are sketched in code after this list).
- Flush path: `StreamWriter::finish → File::sync_data` (the only fsync), then read all batches back via `StreamReader` → `IcebergTable::write_record_batches` → delete file.
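Mapped onto the public `arrow-ipc` API, the two paths look roughly like this. A compilable sketch of the lifecycle, not the PR's `Spool`: lazy open, error plumbing, the `spawn_blocking` wrapper, and the Iceberg commit itself are all omitted, and the `arrow-array`/`arrow-schema` imports are assumed to be available alongside `arrow-ipc`:

```rust
use std::fs::File;
use std::io::{BufReader, BufWriter, Write};
use std::path::Path;

use arrow_array::RecordBatch;
use arrow_ipc::reader::StreamReader;
use arrow_ipc::writer::StreamWriter;
use arrow_schema::{ArrowError, Schema};

// Open: one IPC stream per spool file, buffered writes underneath.
fn open(path: &Path, schema: &Schema) -> Result<StreamWriter<BufWriter<File>>, ArrowError> {
    StreamWriter::try_new(BufWriter::new(File::create(path)?), schema)
}

// Append path: batch -> StreamWriter::write -> BufWriter::flush, so the
// bytes sit in the kernel page cache by the time the ack goes out.
fn append(
    writer: &mut StreamWriter<BufWriter<File>>,
    batch: &RecordBatch,
) -> Result<(), ArrowError> {
    writer.write(batch)?;
    writer.get_mut().flush()?; // flush the BufWriter; this is not an fsync
    Ok(())
}

// Flush path: seal the stream, fsync once, then replay every batch for the
// Iceberg commit; on success the caller deletes the file.
fn drain(
    path: &Path,
    mut writer: StreamWriter<BufWriter<File>>,
) -> Result<Vec<RecordBatch>, ArrowError> {
    writer.finish()?; // write the end-of-stream marker
    let file = writer
        .into_inner()? // recover the BufWriter<File>
        .into_inner()
        .map_err(|e| e.into_error())?; // surface the underlying io::Error
    file.sync_data()?; // the only fsync in the pipeline
    let reader = StreamReader::try_new(BufReader::new(File::open(path)?), None)?;
    reader.collect() // all batches back, or the first decode error
}
```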
Files changed

- `helium_iceberg/src/batched_writer/mod.rs`: `BatchedWriter`, `BatchedWriterConfig`, `BatchedWriterTask`, `ManagedTask` impl, run loop, log helper.
- `helium_iceberg/src/batched_writer/spool.rs`: `Spool` — Arrow-IPC stream lifecycle, append, flush, replay.
- `helium_iceberg/src/iceberg_table.rs`: `write_data_files` now accepts `Vec<RecordBatch>` so one snapshot can absorb many spool batches; new `IcebergTable::write_record_batches` skips the arrow-json round-trip; `records_to_batch` promoted to `pub(crate)`.
- `helium_iceberg/src/lib.rs`
- `helium_iceberg/Cargo.toml`: adds `arrow-ipc`, `task-manager`, `triggered`; `tempfile` as dev-dep.
- `helium_iceberg/tests/batched_writer.rs`
Design decisions

- No `DataWriter<T>` and no `write_idempotent` passthrough: idempotency doesn't fit batching semantics, so we don't pretend to offer it. Callers who need it can keep using `IcebergTable` directly via `write_idempotent`.
- `ManagedTask` integration. Mirrors `FileSink` — fits the workspace shutdown story.
- `spool_dir` is required. No `Default` impl; `BatchedWriterConfig::new(spool_dir)` is the entry point (shape sketched below).
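Finally, the config shape those decisions imply. `spool_dir`, `max_batch_size`, `batch_timeout`, and `BatchedWriterConfig::new(spool_dir)` are all named in this PR; the field types, visibility, and the 30-second timeout default are assumptions for illustration:

```rust
use std::path::PathBuf;
use std::time::Duration;

const DEFAULT_MAX_BATCH_SIZE: usize = 10_000; // records, per the review thread

// "spool_dir is required, no Default impl": the constructor takes it, and
// everything else starts from a default.
pub struct BatchedWriterConfig {
    pub spool_dir: PathBuf,      // where the {ns}__{table}__{uuid}.arrow files live
    pub max_batch_size: usize,   // record count that triggers a size flush
    pub batch_timeout: Duration, // max age of a batch before a timeout flush
}

impl BatchedWriterConfig {
    // Deliberately the only entry point, so callers must choose a spool dir.
    pub fn new(spool_dir: impl Into<PathBuf>) -> Self {
        Self {
            spool_dir: spool_dir.into(),
            max_batch_size: DEFAULT_MAX_BATCH_SIZE,
            batch_timeout: Duration::from_secs(30), // assumed default, not from the PR
        }
    }
}
```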