Plan: Virtual Icechunk datasets by aldenks · Pull Request #621 · dynamical-org/reformatters

aldenks · 2026-05-25T03:08:39Z

Summary

This adds a comprehensive design document for virtual Icechunk datasets — a new dataset variant that stores only metadata (byte offsets) pointing to source GRIB files, enabling spatial/map-optimized chunking and very low-latency updates without materializing data.

Key changes

- Architecture design: Introduces base classes (DynamicalDataset, RegionJob) with specialized subclasses for materialized and virtual variants, enabling code reuse while keeping concerns separated.
- Virtual reference workflow: Documents the proven byte-range parsing and virtual chunk writing pattern from PR #511 ("Add GFS virtual icechunk dataset prototype"), reusing existing index parsers and GRIB element metadata.
- Operational model: Defines a polling watcher pattern where VirtualRegionJob.process() drives a generator-based poll_virtual_refs() that handles lazy dimension expansion, commit batching, and crash recovery.
- Provider-specific guidance: Details index formats and codec requirements for NOAA (.idx files), ECMWF (JSON Lines .index), and DWD (bz2-compressed GRIBs with chained codecs).
- Implementation roadmap: Five-PR plan starting with class renames, extracting base classes, implementing virtual base classes, then proving the pattern with concrete datasets.
- Open questions: Flags design decisions deferred to implementation (filtering already-ingested coordinates, DWD codec verification, polling configuration, variable expansion strategy).

Notable details

- Virtual datasets use the -spatial suffix (e.g., noaa-gefs-forecast-35-day-spatial) to indicate a spatial/map-optimized access pattern.
- Encoding rules are uniform across providers: GribberishCodec serializer, one GRIB message per chunk, no shards or compressors.
- Backfills reuse the same VirtualRegionJob code path with looser commit batching; operational updates use tight batching for <30s latency.
- All targeted source archives have anonymous read access, simplifying reader setup via dynamical_catalog.
- The prototype used 15.5KB on disk for what would be 350MB materialized, demonstrating storage efficiency.
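
As a quick check on the storage figure, the ratio works out as below. Decimal units (1 KB = 1000 bytes) are an assumption; the description does not say which convention the numbers use.

```python
virtual_bytes = 15.5 * 1000          # 15.5 KB, decimal units assumed
materialized_bytes = 350 * 1000**2   # 350 MB, decimal units assumed

ratio = materialized_bytes / virtual_bytes
print(f"virtual store is roughly {ratio:,.0f}x smaller")
```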

This is a planning document (no code changes) that establishes the design foundation for virtual dataset implementation across multiple PRs.

https://claude.ai/code/session_01Y4LZ8bF4utT1QkUMoYu5Ss

Design document covering architecture, workflow, and implementation sequence for creating virtual Icechunk datasets backed by remote GRIB archives. Key decisions: Materialized/Virtual class hierarchy, polling watcher with generator-based lifecycle, GribberishCodec for read-time GRIB decode, lazy dimension expansion, and commit batching. https://claude.ai/code/session_01Y4LZ8bF4utT1QkUMoYu5Ss

aldenks · 2026-05-25T03:38:40Z

+        if not dimension_expanded:
+            self.expand_dimensions(store)
+            dimension_expanded = True


this needs to actually check if a dimension expansion is needed. also analysis dataset will expand on every single update so a bool isn't right.

Expansion now checks per-batch whether new append-dim values are introduced, rather than using a one-shot boolean. This correctly handles analysis datasets where every batch adds new time steps. https://claude.ai/code/session_01Y4LZ8bF4utT1QkUMoYu5Ss
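
A minimal sketch of that per-batch check, using only the standard library. `new_append_dim_values` is a hypothetical helper name, not the method in the design document.

```python
from datetime import datetime


def new_append_dim_values(existing: list[datetime], batch: list[datetime]) -> list[datetime]:
    """Return the batch values not yet present along the append dimension, sorted."""
    known = set(existing)
    return sorted(v for v in set(batch) if v not in known)


existing = [datetime(2026, 5, 25, 0), datetime(2026, 5, 25, 6)]
batch = [datetime(2026, 5, 25, 6), datetime(2026, 5, 25, 12)]

missing = new_append_dim_values(existing, batch)
needs_expansion = bool(missing)  # True here: 12:00 is new
```

An analysis dataset would see a non-empty `missing` on nearly every batch, while a forecast dataset only sees one when a new init time first appears.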

aldenks · 2026-05-25T12:08:10Z

+- **Spatial/map-optimized chunking**: Chunks follow the native GRIB message shape (1 time step, full spatial grid), ideal for spatial queries and map rendering.
+- **All source variables**: No storage cost means we can include every variable in the source archive, not just a curated subset.
+- **Very low latency updates**: Target < 30s (60s acceptable). Writing virtual references is near-instant since we're only recording byte offsets, not transferring data.
+- **Complementary access patterns**: Users choose the materialized `-timeseries` dataset for time-series extraction or the `-spatial` dataset for map/spatial queries over the same underlying data.


Combine with first bullet about spatial optimized

aldenks · 2026-05-25T12:11:46Z

+#### What's specialized
+
+**MaterializedDynamicalDataset** adds:
+- CLI: `backfill_kubernetes`, `backfill_local`, `process_backfill_region_jobs`, `update`


Is backfill_* definitely specific to materialized/virtual and not general?

aldenks · 2026-05-25T12:13:03Z

+
+**MaterializedRegionJob** adds:
+- `download_file()`, `read_data()`, `apply_data_transformations()`
+- `process()`: shared memory buffer → download → read → transform → write shards → upload


Shouldn't both kinds have a process method they just implement it very differently?

aldenks · 2026-05-25T12:14:14Z

+- `tmp_store` for local shard staging
+
+**VirtualDynamicalDataset** adds:
+- CLI: `virtual_update`, `virtual_backfill_kubernetes`, `virtual_backfill_local`


Why can't these be named the same as the materialized methods but implemented differently? I think that would allow greater code reuse in DynamicalDataset.

Overall comment: look at the methods we have that are similar on the materialized and virtual variants and see if they are, or can be, just different implementations of the same ABC methods. That would make the CLI interface simpler and allow us to move more into DynamicalDataset rather than the materialized/virtual ABC subclasses.

aldenks · 2026-05-25T12:35:51Z

+- Virtual chunk container configuration (per-source S3/HTTP store config)
+- `authorize_virtual_chunk_access` credential setup


These might have to go in storage.py

aldenks · 2026-05-25T12:45:02Z

+Subclasses implement this generator. It `yield`s batches of virtual refs and controls the job lifecycle:
+
+```python
+# Subclass (e.g., GefsSpatialRegionJob)


Suggested change

-# Subclass (e.g., GefsSpatialRegionJob)
+# Subclass (e.g., NoaaGefsAnalysisMaterializedRegionJob)

aldenks · 2026-05-25T12:45:15Z

+        session.commit(...)
+```
+
+### poll_virtual_refs() generator


Suggested change

-### poll_virtual_refs() generator
+### process_virtual_refs() generator

aldenks · 2026-05-25T12:52:46Z

+### Commit batching
+
+Two thresholds, whichever triggers first:
+- `max_seconds_between_commits` (e.g., 10s for operational, 60s+ for backfill)
+- `max_files_per_commit` (e.g., 5 for operational, 50+ for backfill)


Note the design specifics are uncertain: is this actually 4 attrs, 2 each for update vs. backfill? How does the region job know which to use?
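
One possible shape, sketched only to make the question concrete: two fields per job, with the values chosen once when the job is built. `CommitThresholds`, `thresholds_for`, `should_commit` and the `is_backfill` flag are hypothetical names; the numbers are the example values from the hunk above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitThresholds:
    max_seconds_between_commits: float
    max_files_per_commit: int


# Example values from the hunk above; the two-preset layout is an assumption.
OPERATIONAL = CommitThresholds(max_seconds_between_commits=10, max_files_per_commit=5)
BACKFILL = CommitThresholds(max_seconds_between_commits=60, max_files_per_commit=50)


def thresholds_for(is_backfill: bool) -> CommitThresholds:
    """Pick the preset when the job is created, so process() needs no branching."""
    return BACKFILL if is_backfill else OPERATIONAL


def should_commit(t: CommitThresholds, pending_files: int, seconds_since_commit: float) -> bool:
    """Commit when either threshold is reached, whichever triggers first."""
    if pending_files == 0:
        return False
    return (
        pending_files >= t.max_files_per_commit
        or seconds_since_commit >= t.max_seconds_between_commits
    )
```

With this layout the region job carries 2 attrs, not 4, and whoever constructs the job decides which preset applies.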

aldenks · 2026-05-25T12:54:51Z

+- Committed refs are durable and visible to readers
+- On restart, the `filter_already_ingested` step detects what's already done
+- The watcher resumes polling for remaining files
+- Setting a ref that's already set is safe (idempotent)


But this won't be happening in most operations because of filtering

aldenks · 2026-05-25T12:56:48Z

+Backfills use the same `VirtualRegionJob` code path with:
+- Much looser commit batch limits (more files and longer intervals between commits)
+- Kubernetes indexed jobs for parallelism (same pattern as materialized backfills)
+- No polling — all files already exist, so the generator yields immediately
+- Worker distribution via the existing round-robin `get_worker_jobs()` mechanism


Clarify that this is just one code path; we don't actually need branches (one exception: to use different commit batch limits).
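
For reference, round-robin distribution across indexed workers can be as small as a strided slice. This is an illustrative sketch; `round_robin_jobs` is a hypothetical stand-in and not the existing `get_worker_jobs()` implementation.

```python
from collections.abc import Sequence


def round_robin_jobs(jobs: Sequence[str], worker_index: int, workers_total: int) -> list[str]:
    """Give worker i every workers_total-th job, starting at position i."""
    return list(jobs[worker_index::workers_total])


jobs = [f"region-{i}" for i in range(7)]
assignments = [round_robin_jobs(jobs, i, 3) for i in range(3)]
```

Every job lands on exactly one worker, and the same function serves backfill and operational runs without a branch.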

- Drop the Materialized/Virtual DynamicalDataset split: single base. Keep sibling MaterializedRegionJob/VirtualRegionJob under RegionJob to host their substantial per-variant code.
- Reframe the update process around indexed CronJobs (same scheduling pattern as materialized): worker 0 expands dims on main; workers fill chunks in parallel; no temp branch for operational updates. Avoids concurrent dim-expansion conflicts on init_time coord chunks.
- Promote filtering-already-ingested from open question to the core efficiency mechanism for steady-state updates (region jobs span shards that are mostly already populated).
- Walk through a concurrent-update scenario explicitly and explain why ConflictDetector accepts it. Call out the integration test PR #2 needs to verify icechunk 2.x rebase semantics.
- Document three options for per-variable serializer (encoding factory, metadata-only common config, inherit-and-replace) instead of picking one prematurely.
- Minimize __main__.py surface: source virtual chunk containers declared on the VirtualRegionJob class; store factory picks them up automatically.
- Address each unresolved review comment from Alden inline.
- Add appendices with concrete code patterns from PR #511, context from PR #510, and an inventory of existing infrastructure we reuse.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ

Worker-0 expansion on main exposed empty/NaN positions to consumers between expansion-commit and refs-fill-commit. Especially bad for analysis datasets (no good answer for "how far to expand the future into NaNs").

Replace with lazy expansion: each region job's process() expands the dim only when a batch introduces new append-dim values, atomic with the same commit as the corresponding refs.

In steady state only one region job per cron fire ever needs to expand (the one with the newest init / newest time chunk). Other jobs write to existing chunks only, so write sets are disjoint and ConflictDetector accepts concurrent commits without retry.

Cover the catchup edge case (multiple new append-dim values arriving together) with two options for PR #3 to pick: app-level retry in process() or route catchup through the backfill flow.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ

Previous draft routed multi-new-init scenarios through a "catchup edge case" with two divergent strategies (app-level retry vs backfill fallback). That's operational complexity for no good reason.

Replace with a single uniform model: every batch commit opens a fresh session, computes refs against current state (filter-already-ingested + lazy index lookup), expands the dim if needed, sets refs, commits. On ConflictDetector rejection, throw away the session and retry with a fresh one; the retry's "recompute against current state" step naturally picks up whatever the other pod committed and recomputes target indices.

Retries are cheap: byte ranges from parsed index files are already in hand, set_virtual_ref calls are microseconds, only the chunk-key indices need recomputation. Steady state sees ~0 conflicts; the rare multi-new-init scenario pays a few extra retries and converges.

Update PR #2 integration tests to verify this uniform model converges in both the disjoint-write and concurrent-expansion scenarios.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
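
The uniform retry model above can be sketched against an in-memory stand-in. Everything here is hypothetical: `FakeRepo`, `FakeSession` and `ConflictError` are not icechunk APIs, and the strict "any concurrent commit is a conflict" rule is a deliberate simplification of what ConflictDetector actually accepts.

```python
class ConflictError(Exception):
    """Raised when the repo moved on since the session was opened."""


class FakeSession:
    def __init__(self, repo: "FakeRepo", base_version: int) -> None:
        self.repo, self.base_version, self.staged = repo, base_version, {}

    def set_ref(self, key: str, ref: tuple[str, int, int]) -> None:
        self.staged[key] = ref

    def commit(self) -> None:
        if self.repo.version != self.base_version:
            raise ConflictError
        self.repo.refs.update(self.staged)
        self.repo.version += 1


class FakeRepo:
    def __init__(self) -> None:
        self.version, self.refs = 0, {}

    def open_session(self) -> FakeSession:
        return FakeSession(self, self.version)


def commit_with_retry(repo: FakeRepo, compute_refs, max_attempts: int = 5) -> int:
    """Open a fresh session per attempt and recompute refs against current state."""
    for attempt in range(1, max_attempts + 1):
        session = repo.open_session()
        for key, ref in compute_refs(repo).items():
            session.set_ref(key, ref)
        try:
            session.commit()
            return attempt
        except ConflictError:
            continue  # throw the session away and start over
    raise RuntimeError("did not converge")


calls = {"n": 0}


def compute_refs(repo: FakeRepo) -> dict[str, tuple[str, int, int]]:
    calls["n"] += 1
    if calls["n"] == 1:
        # Simulate another pod committing while this batch is being prepared.
        other = repo.open_session()
        other.set_ref("tmp/c/0/0/0", ("s3://bucket/a.grib2", 0, 100))
        other.commit()
    index = len(repo.refs)  # stand-in for recomputing the chunk-key index
    return {f"tmp/c/{index}/0/0": ("s3://bucket/b.grib2", 0, 200)}


repo = FakeRepo()
attempts = commit_with_retry(repo, compute_refs)
```

The first attempt loses to the simulated competing commit; the second recomputes against the updated state and succeeds.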

…check

Two corrections to the previous revision:

1. Filter-already-ingested runs ONCE at the top of process(), not per batch. By region partitioning, no other pod is processing this pod's work, so re-filtering is wasted effort.
2. The per-batch retry loop just opens a fresh session, re-checks whether expansion is still needed (it may not be: another pod may have already expanded enough to cover our value), recomputes chunk-key indices against current state, re-issues set_virtual_ref calls, commits.

The key property: expansion is idempotent across pods. Each pod expands the dim to cover everything in the template it depends on (including values it isn't ingesting but that must exist to keep sorted order). After any pod commits its expansion, the dim ends up in the same target state. Losers' retries see "expansion already done" and skip it.

Only path collisions are on dim-expansion metadata files. Chunk writes are always disjoint by construction.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
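
A small sketch of the idempotency property, with init times reduced to integer hours for brevity. `expand_append_dim` is a hypothetical helper, and "cover every template value up to the newest one needed" is this sketch's reading of the rule above.

```python
def expand_append_dim(current: list[int], template: list[int], needed: list[int]) -> list[int]:
    """Expand to cover every template value up to the newest needed value."""
    target_end = max(needed)
    return sorted(set(current) | {v for v in template if v <= target_end})


template = [0, 6, 12, 18, 24]  # all init hours the template defines
current = [0, 6]               # what the store holds today

# Pod A needs hour 12, pod B needs hour 18; apply in both orders.
a_then_b = expand_append_dim(expand_append_dim(current, template, [12]), template, [18])
b_then_a = expand_append_dim(expand_append_dim(current, template, [18]), template, [12])
```

Both orders end at `[0, 6, 12, 18]`, and re-applying either expansion changes nothing, which is why a losing pod's retry can skip the step.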

Architecture simplification:
- Drop the MaterializedRegionJob rename. RegionJob stays as the materialized base; VirtualRegionJob is a sibling, not a refactor. Smaller diff, zero existing-dataset churn.

Missing pieces added:
- Derived coordinates section. _expand_dimensions also runs the template config's derive_coordinates and updates valid_time, expected_forecast_length etc. for new positions.
- ingested_forecast_length progress tracking: bumped per batch commit; this is what _filter_already_ingested reads.
- Replica writes: per-batch retry covers all stores; commit replicas first then primary (same as existing pattern).
- Validation strategy: new manifest-check and progress-coord validators replace check_for_expected_shards and NaN scans. Sample-read validator as a follow-up.
- K8s resource sizing note in PR #3 (virtual pods need much less than materialized).

Complexity removed:
- Three encoding-per-var options → one recommendation (factory pattern) plus two-line alternatives-considered.
- Polling-within-fire dropped from process_virtual_refs.
- ClassVar tuples for commit thresholds → simple model fields set by get_jobs / operational_update_jobs.
- Storage configuration section trimmed from inline implementation to design statement + small example.

Loose ends tightened:
- is_backfill plumbing: parameter on get_jobs (defaults True), operational_update_jobs passes False.
- Region partitioning with shards=None: virtual encoding sets shards as the region-partition unit (icechunk ignores it; only iterating.dimension_slices reads it).
- update_template_with_results = no-op for virtual.
- Gribberish version pin: smoke-test against icechunk 2.0.3 before committing to a specific version.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
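
To illustrate how a progress coordinate could drive the filter, here is a minimal sketch. It assumes `ingested_forecast_length` can be read as "highest lead hour already committed per init time" and that files are ingested in lead order; `SourceFile` and this `filter_already_ingested` signature are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceFile:
    init_time: str
    lead_hours: int


def filter_already_ingested(
    files: list[SourceFile], ingested_forecast_length: dict[str, int]
) -> list[SourceFile]:
    """Keep only files beyond the recorded progress for their init time."""
    return [
        f for f in files
        if f.lead_hours > ingested_forecast_length.get(f.init_time, -1)
    ]


progress = {"2026-05-25T00": 6}  # leads 0..6 already committed for this init
candidates = [
    SourceFile("2026-05-25T00", 0),
    SourceFile("2026-05-25T00", 6),
    SourceFile("2026-05-25T00", 12),
    SourceFile("2026-05-25T06", 0),  # init time with no progress yet
]
todo = filter_already_ingested(candidates, progress)
```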

Latency model:
- Drop the 60s/cron-every-minute framing. Target is ≤5s end-to-end.
- Pod is scheduled to be up and polling before publication windows start (init publication times are predictable).
- process_virtual_refs polls with sleep(1) and yields each newly-observed file as a single-file batch.
- Operational region jobs default to max_files_per_commit=1 for per-file visibility. Backfills keep the looser thresholds.
- Latency breakdown: ~1s poll + ~500ms index download + ~1s commit (+ ~1-2s for rare rebase) = 2-3s typical, ≤5s worst case.

Restore MaterializedRegionJob + VirtualRegionJob siblings under RegionJob:
- Today's RegionJob.process() body and helpers move down to a new MaterializedRegionJob. Existing dataset region jobs swap base.
- VirtualRegionJob is a peer, not a subclass of RegionJob alone.
- Explicit materialized/virtual naming at every reference site.
- The extraction lands in PR #1 alongside the rest of the mechanical prep work.

Version specifier: gribberish pinned with ~= (matching the existing ~=2.0, ~=2026.0, ~=33.1 style in pyproject.toml).

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
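
The poll-and-yield behaviour in the latency model can be sketched as a generator. `poll_new_files` is a hypothetical name rather than the actual process_virtual_refs; the listing function and `sleep` are injected so the sketch runs without a real source archive.

```python
import time
from collections.abc import Callable, Iterable, Iterator


def poll_new_files(
    list_files: Callable[[], Iterable[str]],
    expected: Iterable[str],
    sleep: Callable[[float], None] = time.sleep,
    poll_seconds: float = 1.0,
    max_polls: int = 600,
) -> Iterator[list[str]]:
    """Yield each newly observed expected file as a single-file batch."""
    remaining = set(expected)
    for _ in range(max_polls):
        for name in sorted(remaining & set(list_files())):
            remaining.discard(name)
            yield [name]
        if not remaining:
            return
        sleep(poll_seconds)


# Simulated source: nothing published, then f000, then f003.
listings = iter([[], ["f000"], ["f000", "f003"]])
slept: list[float] = []

batches = list(
    poll_new_files(lambda: next(listings), ["f000", "f003"], sleep=slept.append)
)
```

Here `batches` ends up as two single-file batches and the generator stops as soon as every expected file has been seen.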

…p batch

max_files_per_commit can be large even for operational; it just caps the size of any single batch. max_seconds_between_commits is what drives per-file visibility latency (e.g. 1s for operational). When the source publishes a small burst of files together, process_virtual_refs can yield them in one batch and they commit together. When files trickle in, the time threshold fires per file.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ

File count per commit is bounded by region job size and we always want to commit everything pending; there's no scenario where we'd deliberately cap a batch. The only thing we actually want to control is how long to wait before committing what we have:
- Operational: ~1s for fast visibility
- Backfill: ~60s to reduce commit overhead at volume

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
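
A minimal sketch of time-only batching under this final model. `batch_by_time` is a hypothetical helper; the clock is injected so the behaviour can be checked deterministically, and the sketch only checks elapsed time when an item arrives (a real watcher would also need to flush while idle).

```python
import itertools
import time
from collections.abc import Callable, Iterable, Iterator


def batch_by_time(
    items: Iterable[str],
    max_seconds_between_commits: float,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[list[str]]:
    """Group items into batches, flushing once enough time has elapsed."""
    pending: list[str] = []
    last_commit = clock()
    for item in items:
        pending.append(item)
        if clock() - last_commit >= max_seconds_between_commits:
            yield pending
            pending = []
            last_commit = clock()
    if pending:
        yield pending  # always commit everything still pending


# Fake clock that advances one second per call.
ticks = itertools.count()
committed = list(
    batch_by_time(["a", "b", "c", "d", "e"], 2.0, clock=lambda: float(next(ticks)))
)
```

Switching between operational and backfill behaviour is then a single number: about 1 for operational, about 60 for backfill.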


aldenks changed the title ~~Plan: Virtual Icechunk datasets for spatial-optimized access~~ Plan: Virtual Icechunk datasets May 25, 2026


claude added 8 commits May 25, 2026 13:45

aldenks wants to merge 10 commits into main from claude/plan-virtual-icechunk-7eaub


2 participants
