Plan: Virtual Icechunk datasets#621
Conversation
Design document covering architecture, workflow, and implementation sequence for creating virtual Icechunk datasets backed by remote GRIB archives. Key decisions: Materialized/Virtual class hierarchy, polling watcher with generator-based lifecycle, GribberishCodec for read-time GRIB decode, lazy dimension expansion, and commit batching. https://claude.ai/code/session_01Y4LZ8bF4utT1QkUMoYu5Ss
| if not dimension_expanded: | ||
| self.expand_dimensions(store) | ||
| dimension_expanded = True |
There was a problem hiding this comment.
this needs to actually check if a dimension expansion is needed. also analysis dataset will expand on every single update so a bool isn't right.
Expansion now checks per-batch whether new append-dim values are introduced, rather than using a one-shot boolean. This correctly handles analysis datasets where every batch adds new time steps. https://claude.ai/code/session_01Y4LZ8bF4utT1QkUMoYu5Ss
| - **Spatial/map-optimized chunking**: Chunks follow the native GRIB message shape (1 time step, full spatial grid), ideal for spatial queries and map rendering. | ||
| - **All source variables**: No storage cost means we can include every variable in the source archive, not just a curated subset. | ||
| - **Very low latency updates**: Target < 30s (60s acceptable). Writing virtual references is near-instant since we're only recording byte offsets, not transferring data. | ||
| - **Complementary access patterns**: Users choose the materialized `-timeseries` dataset for time-series extraction or the `-spatial` dataset for map/spatial queries over the same underlying data. |
There was a problem hiding this comment.
Combine with first bullet about spatial optimized
| #### What's specialized | ||
|
|
||
| **MaterializedDynamicalDataset** adds: | ||
| - CLI: `backfill_kubernetes`, `backfill_local`, `process_backfill_region_jobs`, `update` |
There was a problem hiding this comment.
Is backfill_* definitely specific to materialized/virtual and not general?
|
|
||
| **MaterializedRegionJob** adds: | ||
| - `download_file()`, `read_data()`, `apply_data_transformations()` | ||
| - `process()`: shared memory buffer → download → read → transform → write shards → upload |
There was a problem hiding this comment.
Shouldn't both kinds have a process method they just implement it very differently?
| - `tmp_store` for local shard staging | ||
|
|
||
| **VirtualDynamicalDataset** adds: | ||
| - CLI: `virtual_update`, `virtual_backfill_kubernetes`, `virtual_backfill_local` |
There was a problem hiding this comment.
Why can't these be named the same as the materialized methods but implemented differently. I think that would allow greater code reuse in dynamicaldataset
Overall comment: look at the methods we have that are similar on the materialized and virtual variants and see if they are/can be just different implementations of the same abc methods making the cli interface simpler and allowing us to move more into DynamcialDataset vs materialized/virtual abc subclasses
| - Virtual chunk container configuration (per-source S3/HTTP store config) | ||
| - `authorize_virtual_chunk_access` credential setup |
There was a problem hiding this comment.
These might have to go in storage.py
| Subclasses implement this generator. It `yield`s batches of virtual refs and controls the job lifecycle: | ||
|
|
||
| ```python | ||
| # Subclass (e.g., GefsSpatialRegionJob) |
There was a problem hiding this comment.
| # Subclass (e.g., GefsSpatialRegionJob) | |
| # Subclass (e.g., NoaaGefsAnalysisMaterializedRegionJob) |
| session.commit(...) | ||
| ``` | ||
|
|
||
| ### poll_virtual_refs() generator |
There was a problem hiding this comment.
| ### poll_virtual_refs() generator | |
| ### process_virtual_refs() generator |
| ### Commit batching | ||
|
|
||
| Two thresholds, whichever triggers first: | ||
| - `max_seconds_between_commits` (e.g., 10s for operational, 60s+ for backfill) | ||
| - `max_files_per_commit` (e.g., 5 for operational, 50+ for backfill) |
There was a problem hiding this comment.
Note design specifics uncertain-- is this actually 4 attrs, 2 each for update vs backfill? How does the region job know which to use?
| - Committed refs are durable and visible to readers | ||
| - On restart, the `filter_already_ingested` step detects what's already done | ||
| - The watcher resumes polling for remaining files | ||
| - Setting a ref that's already set is safe (idempotent) |
There was a problem hiding this comment.
But this won't be happening in most operations because of filtering
| Backfills use the same `VirtualRegionJob` code path with: | ||
| - Much looser commit batch limits (more files and longer intervals between commits) | ||
| - Kubernetes indexed jobs for parallelism (same pattern as materialized backfills) | ||
| - No polling — all files already exist, so the generator yields immediately | ||
| - Worker distribution via the existing round-robin `get_worker_jobs()` mechanism |
There was a problem hiding this comment.
Clarify that this is just one code path, we don't actually need branches (one except: to use different commit batch limits)
- Drop the Materialized/Virtual DynamicalDataset split — single base. Keep sibling MaterializedRegionJob/VirtualRegionJob under RegionJob to host their substantial per-variant code. - Reframe the update process around indexed CronJobs (same scheduling pattern as materialized): worker 0 expands dims on main; workers fill chunks in parallel; no temp branch for operational updates. Avoids concurrent dim-expansion conflicts on init_time coord chunks. - Promote filtering-already-ingested from open question to the core efficiency mechanism for steady-state updates (region jobs span shards that are mostly already populated). - Walk through a concurrent-update scenario explicitly and explain why ConflictDetector accepts it. Call out the integration test PR #2 needs to verify icechunk 2.x rebase semantics. - Document three options for per-variable serializer (encoding factory, metadata-only common config, inherit-and-replace) instead of picking one prematurely. - Minimize __main__.py surface: source virtual chunk containers declared on the VirtualRegionJob class; store factory picks them up automatically. - Address each unresolved review comment from Alden inline. - Add appendices with concrete code patterns from PR #511, context from PR #510, and an inventory of existing infrastructure we reuse. https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
Worker-0 expansion on main exposed empty/NaN positions to consumers between expansion-commit and refs-fill-commit. Especially bad for analysis datasets (no good answer for "how far to expand the future into NaNs"). Replace with lazy expansion: each region job's process() expands the dim only when a batch introduces new append-dim values, atomic with the same commit as the corresponding refs. In steady state only one region job per cron fire ever needs to expand (the one with the newest init / newest time chunk). Other jobs write to existing chunks only, so write sets are disjoint and ConflictDetector accepts concurrent commits without retry. Cover the catchup edge case (multiple new append-dim values arriving together) with two options for PR #3 to pick: app-level retry in process() or route catchup through the backfill flow. https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
Previous draft routed multi-new-init scenarios through a "catchup edge case" with two divergent strategies (app-level retry vs backfill fallback). That's operational complexity for no good reason. Replace with a single uniform model: every batch commit opens a fresh session, computes refs against current state (filter-already-ingested + lazy index lookup), expands the dim if needed, sets refs, commits. On ConflictDetector rejection, throw away the session and retry with a fresh one — the retry's "recompute against current state" step naturally picks up whatever the other pod committed and recomputes target indices. Retries are cheap: byte ranges from parsed index files are already in hand, set_virtual_ref calls are microseconds, only the chunk-key indices need recomputation. Steady state sees ~0 conflicts; the rare multi-new-init scenario pays a few extra retries and converges. Update PR #2 integration tests to verify this uniform model converges in both the disjoint-write and concurrent-expansion scenarios. https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
…check Two corrections to the previous revision: 1. Filter-already-ingested runs ONCE at the top of process(), not per batch. By region partitioning, no other pod is processing this pod's work, so re-filtering is wasted effort. 2. The per-batch retry loop just opens a fresh session, re-checks whether expansion is still needed (it may not be — another pod may have already expanded enough to cover our value), recomputes chunk-key indices against current state, re-issues set_virtual_ref calls, commits. The key property: expansion is idempotent across pods. Each pod expands the dim to cover everything in the template it depends on (including values it isn't ingesting but that must exist to keep sorted order). After any pod commits its expansion, the dim ends up in the same target state. Losers' retries see "expansion already done" and skip it. Only path collisions are on dim-expansion metadata files. Chunk writes are always disjoint by construction. https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
Architecture simplification: - Drop the MaterializedRegionJob rename. RegionJob stays as the materialized base; VirtualRegionJob is a sibling, not a refactor. Smaller diff, zero existing-dataset churn. Missing pieces added: - Derived coordinates section. _expand_dimensions also runs the template config's derive_coordinates and updates valid_time, expected_forecast_length etc. for new positions. - ingested_forecast_length progress tracking — bumped per batch commit; this is what _filter_already_ingested reads. - Replica writes — per-batch retry covers all stores; commit replicas first then primary (same as existing pattern). - Validation strategy — new manifest-check and progress-coord validators replace check_for_expected_shards and NaN scans. Sample-read validator as a follow-up. - K8s resource sizing note in PR #3 (virtual pods need much less than materialized). Complexity removed: - Three encoding-per-var options → one recommendation (factory pattern) plus two-line alternatives-considered. - Polling-within-fire dropped from process_virtual_refs. - ClassVar tuples for commit thresholds → simple model fields set by get_jobs / operational_update_jobs. - Storage configuration section trimmed from inline implementation to design statement + small example. Loose ends tightened: - is_backfill plumbing: parameter on get_jobs (defaults True), operational_update_jobs passes False. - Region partitioning with shards=None: virtual encoding sets shards as the region-partition unit (icechunk ignores it; only iterating.dimension_slices reads it). - update_template_with_results = no-op for virtual. - Gribberish version pin: smoke-test against icechunk 2.0.3 before committing to a specific version. https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
Latency model: - Drop the 60s/cron-every-minute framing. Target is ≤5s end-to-end. - Pod is scheduled to be up and polling before publication windows start (init publication times are predictable). - process_virtual_refs polls with sleep(1) and yields each newly-observed file as a single-file batch. - Operational region jobs default to max_files_per_commit=1 for per-file visibility. Backfills keep the looser thresholds. - Latency breakdown: ~1s poll + ~500ms index download + ~1s commit (+ ~1-2s for rare rebase) = 2-3s typical, ≤5s worst case. Restore MaterializedRegionJob + VirtualRegionJob siblings under RegionJob: - Today's RegionJob.process() body and helpers move down to a new MaterializedRegionJob. Existing dataset region jobs swap base. - VirtualRegionJob is a peer, not a subclass of RegionJob alone. - Explicit materialized/virtual naming at every reference site. - The extraction lands in PR #1 alongside the rest of the mechanical prep work. Version specifier: gribberish pinned with ~= (matching the existing ~=2.0, ~=2026.0, ~=33.1 style in pyproject.toml). https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
…p batch max_files_per_commit can be large even for operational; it just caps the size of any single batch. max_seconds_between_commits is what drives per-file visibility latency (e.g. 1s for operational). When the source publishes a small burst of files together, process_virtual_refs can yield them in one batch and they commit together. When files trickle in, the time threshold fires per file. https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
File count per commit is bounded by region job size and we always want to commit everything pending — there's no scenario where we'd deliberately cap a batch. The only thing we actually want to control is how long to wait before committing what we have: - Operational: ~1s for fast visibility - Backfill: ~60s to reduce commit overhead at volume https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
Summary
This adds a comprehensive design document for virtual Icechunk datasets — a new dataset variant that stores only metadata (byte offsets) pointing to source GRIB files, enabling spatial/map-optimized chunking and very low-latency updates without materializing data.
Key changes
DynamicalDataset,RegionJob) with specialized subclasses for materialized and virtual variants, enabling code reuse while keeping concerns separatedVirtualRegionJob.process()drives a generator-basedpoll_virtual_refs()that handles lazy dimension expansion, commit batching, and crash recovery.idxfiles), ECMWF (JSON Lines.index), and DWD (bz2-compressed GRIBs with chained codecs)Notable details
-spatialsuffix (e.g.,noaa-gefs-forecast-35-day-spatial) to indicate spatial/map-optimized access patternGribberishCodecserializer, one GRIB message per chunk, no shards or compressorsVirtualRegionJobcode path with looser commit batching; operational updates use tight batching for <30s latencydynamical_catalogThis is a planning document (no code changes) that establishes the design foundation for virtual dataset implementation across multiple PRs.
https://claude.ai/code/session_01Y4LZ8bF4utT1QkUMoYu5Ss