Skip to content

Plan: Virtual Icechunk datasets#621

Draft
aldenks wants to merge 10 commits into
mainfrom
claude/plan-virtual-icechunk-7eaub
Draft

Plan: Virtual Icechunk datasets#621
aldenks wants to merge 10 commits into
mainfrom
claude/plan-virtual-icechunk-7eaub

Conversation

@aldenks
Copy link
Copy Markdown
Member

@aldenks aldenks commented May 25, 2026

Summary

This adds a comprehensive design document for virtual Icechunk datasets — a new dataset variant that stores only metadata (byte offsets) pointing to source GRIB files, enabling spatial/map-optimized chunking and very low-latency updates without materializing data.

Key changes

  • Architecture design: Introduces base classes (DynamicalDataset, RegionJob) with specialized subclasses for materialized and virtual variants, enabling code reuse while keeping concerns separated
  • Virtual reference workflow: Documents the proven byte-range parsing and virtual chunk writing pattern from PR Add GFS virtual icechunk dataset prototype #511, reusing existing index parsers and GRIB element metadata
  • Operational model: Defines a polling watcher pattern where VirtualRegionJob.process() drives a generator-based poll_virtual_refs() that handles lazy dimension expansion, commit batching, and crash recovery
  • Provider-specific guidance: Details index formats and codec requirements for NOAA (.idx files), ECMWF (JSON Lines .index), and DWD (bz2-compressed GRIBs with chained codecs)
  • Implementation roadmap: Five-PR plan starting with class renames, extracting base classes, implementing virtual base classes, then proving the pattern with concrete datasets
  • Open questions: Flags design decisions deferred to implementation (filtering already-ingested coordinates, DWD codec verification, polling configuration, variable expansion strategy)

Notable details

  • Virtual datasets use -spatial suffix (e.g., noaa-gefs-forecast-35-day-spatial) to indicate spatial/map-optimized access pattern
  • Encoding rules are uniform across providers: GribberishCodec serializer, one GRIB message per chunk, no shards or compressors
  • Backfills reuse the same VirtualRegionJob code path with looser commit batching; operational updates use tight batching for <30s latency
  • All targeted source archives have anonymous read access, simplifying reader setup via dynamical_catalog
  • Prototype achieved 15.5KB on disk for what would be 350MB materialized, demonstrating storage efficiency

This is a planning document (no code changes) that establishes the design foundation for virtual dataset implementation across multiple PRs.

https://claude.ai/code/session_01Y4LZ8bF4utT1QkUMoYu5Ss

Design document covering architecture, workflow, and implementation
sequence for creating virtual Icechunk datasets backed by remote GRIB
archives. Key decisions: Materialized/Virtual class hierarchy, polling
watcher with generator-based lifecycle, GribberishCodec for read-time
GRIB decode, lazy dimension expansion, and commit batching.

https://claude.ai/code/session_01Y4LZ8bF4utT1QkUMoYu5Ss
Comment thread docs/plans/virtual_icechunk_datasets.md Outdated
Comment on lines +186 to +188
if not dimension_expanded:
self.expand_dimensions(store)
dimension_expanded = True
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to actually check if a dimension expansion is needed. also analysis dataset will expand on every single update so a bool isn't right.

Expansion now checks per-batch whether new append-dim values are
introduced, rather than using a one-shot boolean. This correctly
handles analysis datasets where every batch adds new time steps.

https://claude.ai/code/session_01Y4LZ8bF4utT1QkUMoYu5Ss
@aldenks aldenks changed the title Plan: Virtual Icechunk datasets for spatial-optimized access Plan: Virtual Icechunk datasets May 25, 2026
Comment thread docs/plans/virtual_icechunk_datasets.md Outdated
- **Spatial/map-optimized chunking**: Chunks follow the native GRIB message shape (1 time step, full spatial grid), ideal for spatial queries and map rendering.
- **All source variables**: No storage cost means we can include every variable in the source archive, not just a curated subset.
- **Very low latency updates**: Target < 30s (60s acceptable). Writing virtual references is near-instant since we're only recording byte offsets, not transferring data.
- **Complementary access patterns**: Users choose the materialized `-timeseries` dataset for time-series extraction or the `-spatial` dataset for map/spatial queries over the same underlying data.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combine with first bullet about spatial optimized

Comment thread docs/plans/virtual_icechunk_datasets.md Outdated
#### What's specialized

**MaterializedDynamicalDataset** adds:
- CLI: `backfill_kubernetes`, `backfill_local`, `process_backfill_region_jobs`, `update`
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is backfill_* definitely specific to materialized/virtual and not general?

Comment thread docs/plans/virtual_icechunk_datasets.md Outdated

**MaterializedRegionJob** adds:
- `download_file()`, `read_data()`, `apply_data_transformations()`
- `process()`: shared memory buffer → download → read → transform → write shards → upload
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't both kinds have a process method they just implement it very differently?

Comment thread docs/plans/virtual_icechunk_datasets.md Outdated
- `tmp_store` for local shard staging

**VirtualDynamicalDataset** adds:
- CLI: `virtual_update`, `virtual_backfill_kubernetes`, `virtual_backfill_local`
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't these be named the same as the materialized methods but implemented differently. I think that would allow greater code reuse in dynamicaldataset

Overall comment: look at the methods we have that are similar on the materialized and virtual variants and see if they are/can be just different implementations of the same abc methods making the cli interface simpler and allowing us to move more into DynamcialDataset vs materialized/virtual abc subclasses

Comment thread docs/plans/virtual_icechunk_datasets.md Outdated
Comment on lines +67 to +68
- Virtual chunk container configuration (per-source S3/HTTP store config)
- `authorize_virtual_chunk_access` credential setup
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These might have to go in storage.py

Comment thread docs/plans/virtual_icechunk_datasets.md Outdated
Subclasses implement this generator. It `yield`s batches of virtual refs and controls the job lifecycle:

```python
# Subclass (e.g., GefsSpatialRegionJob)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Subclass (e.g., GefsSpatialRegionJob)
# Subclass (e.g., NoaaGefsAnalysisMaterializedRegionJob)

Comment thread docs/plans/virtual_icechunk_datasets.md Outdated
session.commit(...)
```

### poll_virtual_refs() generator
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### poll_virtual_refs() generator
### process_virtual_refs() generator

Comment thread docs/plans/virtual_icechunk_datasets.md Outdated
Comment on lines +247 to +251
### Commit batching

Two thresholds, whichever triggers first:
- `max_seconds_between_commits` (e.g., 10s for operational, 60s+ for backfill)
- `max_files_per_commit` (e.g., 5 for operational, 50+ for backfill)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note design specifics uncertain-- is this actually 4 attrs, 2 each for update vs backfill? How does the region job know which to use?

Comment thread docs/plans/virtual_icechunk_datasets.md Outdated
- Committed refs are durable and visible to readers
- On restart, the `filter_already_ingested` step detects what's already done
- The watcher resumes polling for remaining files
- Setting a ref that's already set is safe (idempotent)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this won't be happening in most operations because of filtering

Comment thread docs/plans/virtual_icechunk_datasets.md Outdated
Comment on lines +271 to +275
Backfills use the same `VirtualRegionJob` code path with:
- Much looser commit batch limits (more files and longer intervals between commits)
- Kubernetes indexed jobs for parallelism (same pattern as materialized backfills)
- No polling — all files already exist, so the generator yields immediately
- Worker distribution via the existing round-robin `get_worker_jobs()` mechanism
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify that this is just one code path, we don't actually need branches (one except: to use different commit batch limits)

claude added 8 commits May 25, 2026 13:45
- Drop the Materialized/Virtual DynamicalDataset split — single base.
  Keep sibling MaterializedRegionJob/VirtualRegionJob under RegionJob
  to host their substantial per-variant code.
- Reframe the update process around indexed CronJobs (same scheduling
  pattern as materialized): worker 0 expands dims on main; workers
  fill chunks in parallel; no temp branch for operational updates.
  Avoids concurrent dim-expansion conflicts on init_time coord chunks.
- Promote filtering-already-ingested from open question to the core
  efficiency mechanism for steady-state updates (region jobs span
  shards that are mostly already populated).
- Walk through a concurrent-update scenario explicitly and explain why
  ConflictDetector accepts it. Call out the integration test PR #2
  needs to verify icechunk 2.x rebase semantics.
- Document three options for per-variable serializer (encoding factory,
  metadata-only common config, inherit-and-replace) instead of picking
  one prematurely.
- Minimize __main__.py surface: source virtual chunk containers declared
  on the VirtualRegionJob class; store factory picks them up automatically.
- Address each unresolved review comment from Alden inline.
- Add appendices with concrete code patterns from PR #511, context from
  PR #510, and an inventory of existing infrastructure we reuse.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
Worker-0 expansion on main exposed empty/NaN positions to consumers
between expansion-commit and refs-fill-commit. Especially bad for
analysis datasets (no good answer for "how far to expand the future
into NaNs"). Replace with lazy expansion: each region job's process()
expands the dim only when a batch introduces new append-dim values,
atomic with the same commit as the corresponding refs.

In steady state only one region job per cron fire ever needs to
expand (the one with the newest init / newest time chunk). Other
jobs write to existing chunks only, so write sets are disjoint and
ConflictDetector accepts concurrent commits without retry.

Cover the catchup edge case (multiple new append-dim values arriving
together) with two options for PR #3 to pick: app-level retry in
process() or route catchup through the backfill flow.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
Previous draft routed multi-new-init scenarios through a "catchup
edge case" with two divergent strategies (app-level retry vs backfill
fallback). That's operational complexity for no good reason.

Replace with a single uniform model: every batch commit opens a fresh
session, computes refs against current state (filter-already-ingested
+ lazy index lookup), expands the dim if needed, sets refs, commits.
On ConflictDetector rejection, throw away the session and retry with
a fresh one — the retry's "recompute against current state" step
naturally picks up whatever the other pod committed and recomputes
target indices.

Retries are cheap: byte ranges from parsed index files are already
in hand, set_virtual_ref calls are microseconds, only the chunk-key
indices need recomputation. Steady state sees ~0 conflicts; the rare
multi-new-init scenario pays a few extra retries and converges.

Update PR #2 integration tests to verify this uniform model converges
in both the disjoint-write and concurrent-expansion scenarios.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
…check

Two corrections to the previous revision:

1. Filter-already-ingested runs ONCE at the top of process(), not per
   batch. By region partitioning, no other pod is processing this
   pod's work, so re-filtering is wasted effort.

2. The per-batch retry loop just opens a fresh session, re-checks
   whether expansion is still needed (it may not be — another pod
   may have already expanded enough to cover our value), recomputes
   chunk-key indices against current state, re-issues set_virtual_ref
   calls, commits.

The key property: expansion is idempotent across pods. Each pod
expands the dim to cover everything in the template it depends on
(including values it isn't ingesting but that must exist to keep
sorted order). After any pod commits its expansion, the dim ends
up in the same target state. Losers' retries see "expansion already
done" and skip it.

Only path collisions are on dim-expansion metadata files. Chunk
writes are always disjoint by construction.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
Architecture simplification:
- Drop the MaterializedRegionJob rename. RegionJob stays as the
  materialized base; VirtualRegionJob is a sibling, not a refactor.
  Smaller diff, zero existing-dataset churn.

Missing pieces added:
- Derived coordinates section. _expand_dimensions also runs the
  template config's derive_coordinates and updates valid_time,
  expected_forecast_length etc. for new positions.
- ingested_forecast_length progress tracking — bumped per batch
  commit; this is what _filter_already_ingested reads.
- Replica writes — per-batch retry covers all stores; commit
  replicas first then primary (same as existing pattern).
- Validation strategy — new manifest-check and progress-coord
  validators replace check_for_expected_shards and NaN scans.
  Sample-read validator as a follow-up.
- K8s resource sizing note in PR #3 (virtual pods need much less
  than materialized).

Complexity removed:
- Three encoding-per-var options → one recommendation (factory
  pattern) plus two-line alternatives-considered.
- Polling-within-fire dropped from process_virtual_refs.
- ClassVar tuples for commit thresholds → simple model fields set
  by get_jobs / operational_update_jobs.
- Storage configuration section trimmed from inline implementation
  to design statement + small example.

Loose ends tightened:
- is_backfill plumbing: parameter on get_jobs (defaults True),
  operational_update_jobs passes False.
- Region partitioning with shards=None: virtual encoding sets
  shards as the region-partition unit (icechunk ignores it; only
  iterating.dimension_slices reads it).
- update_template_with_results = no-op for virtual.
- Gribberish version pin: smoke-test against icechunk 2.0.3 before
  committing to a specific version.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
Latency model:
- Drop the 60s/cron-every-minute framing. Target is ≤5s end-to-end.
- Pod is scheduled to be up and polling before publication windows
  start (init publication times are predictable).
- process_virtual_refs polls with sleep(1) and yields each
  newly-observed file as a single-file batch.
- Operational region jobs default to max_files_per_commit=1 for
  per-file visibility. Backfills keep the looser thresholds.
- Latency breakdown: ~1s poll + ~500ms index download + ~1s commit
  (+ ~1-2s for rare rebase) = 2-3s typical, ≤5s worst case.

Restore MaterializedRegionJob + VirtualRegionJob siblings under
RegionJob:
- Today's RegionJob.process() body and helpers move down to a new
  MaterializedRegionJob. Existing dataset region jobs swap base.
- VirtualRegionJob is a peer, not a subclass of RegionJob alone.
- Explicit materialized/virtual naming at every reference site.
- The extraction lands in PR #1 alongside the rest of the
  mechanical prep work.

Version specifier: gribberish pinned with ~= (matching the existing
~=2.0, ~=2026.0, ~=33.1 style in pyproject.toml).

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
…p batch

max_files_per_commit can be large even for operational; it just caps
the size of any single batch. max_seconds_between_commits is what
drives per-file visibility latency (e.g. 1s for operational).

When the source publishes a small burst of files together,
process_virtual_refs can yield them in one batch and they commit
together. When files trickle in, the time threshold fires per file.

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
File count per commit is bounded by region job size and we always
want to commit everything pending — there's no scenario where we'd
deliberately cap a batch. The only thing we actually want to control
is how long to wait before committing what we have:
- Operational: ~1s for fast visibility
- Backfill: ~60s to reduce commit overhead at volume

https://claude.ai/code/session_01YbsupHKaGd11C8RaW6gQVQ
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants