feat: new AsyncGeoData protocol, RasterioReader bytes-path knobs, AsyncGeoTIFFReader #54
jejjohnson wants to merge 7 commits into
… on GeoDataBase

Adds a new ``AsyncGeoData`` abstract class to ``georeader/abstract_reader.py`` that mirrors ``GeoData`` with ``async`` read methods. Concrete async readers (e.g. the upcoming ``AsyncGeoTIFFReader``) satisfy this interface so user code can branch on sync-vs-async without isinstance checks.

While here, deduplicate the derived metadata properties that were previously copy-pasted across ``GeoData`` and would have been copy-pasted again on ``AsyncGeoData``:

- ``bounds``, ``res``, ``footprint`` move up to ``GeoDataBase`` (they only need ``transform``, ``crs``, ``shape``, all already on ``GeoDataBase``).
- ``GeoData`` and ``AsyncGeoData`` keep only the surface that genuinely differs per tier: sync vs async ``load`` / ``read_from_window``, plus the read-tier metadata stubs (``dtype``, ``dims``, ``fill_value_default``).

No behaviour change for existing ``GeoData`` consumers: ``GeoData.bounds`` etc. still resolve, just one inheritance level higher. ``GeoTensor`` is unaffected (no inheritance from these classes).

Tests: adds ``TestAsyncGeoData`` covering inherited defaults (``bounds``, ``res``, ``footprint``) and verifying ``load`` / ``read_from_window`` raise ``NotImplementedError`` on a bare subclass. Full suite: 780 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
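The layering described in this commit can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in, not the code in ``georeader/abstract_reader.py``: the class names carry a `Sketch` suffix, the transform is reduced to a plain tuple for a north-up raster, and ``crs`` / ``footprint`` are left out to keep the example dependency-free.

```python
import asyncio


class GeoDataBaseSketch:
    """Hypothetical base: holds only what the derived metadata needs."""

    def __init__(self, x_res, y_res, x_origin, y_origin, shape):
        # Simplified stand-in for an affine transform (north-up, no rotation).
        self.transform = (x_res, y_res, x_origin, y_origin)
        self.shape = shape  # (..., height, width)

    @property
    def res(self):
        x_res, y_res, _, _ = self.transform
        return (abs(x_res), abs(y_res))

    @property
    def bounds(self):
        x_res, y_res, x0, y0 = self.transform
        height, width = self.shape[-2:]
        x1 = x0 + width * x_res
        y1 = y0 + height * y_res
        # (min_x, min_y, max_x, max_y), whatever the sign of the resolution.
        return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


class GeoDataSketch(GeoDataBaseSketch):
    """Sync tier: only the read surface lives here."""

    def load(self):
        raise NotImplementedError


class AsyncGeoDataSketch(GeoDataBaseSketch):
    """Async tier: same metadata, awaitable read surface."""

    async def load(self):
        raise NotImplementedError


bare = AsyncGeoDataSketch(10.0, -10.0, 500000.0, 4000000.0, (3, 100, 200))
print(bare.bounds, bare.res)
```

Because both tiers inherit the derived properties from the same base, a bare async subclass gets ``bounds`` and ``res`` for free and only has to implement the coroutine methods.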
`RasterioReader` previously routed all reads through GDAL VSI (libcurl in C) with no seam to plug in an alternative byte transport. That's fine for the common case but offers no way to opt into fsspec for niche backends (FTP, SFTP, GitHub, MinIO with custom auth) or a user-supplied callback for custom HTTP clients / refreshable tokens.

Add three keyword-only constructor knobs: two translate into the rasterio `opener=` parameter, and the third is an escape hatch for arbitrary extra kwargs:

- `opener=callable`: passed straight to `rasterio.open(opener=...)`. The callable must accept `(path, mode="rb")`; rasterio 1.4 calls it as `opener(path)`, so the mode default is load-bearing.
- `fs=fsspec.AbstractFileSystem`: shortcut equivalent to `opener=fs.open`.
- `rio_open_kwargs=dict`: escape hatch for arbitrary additional kwargs forwarded to every `rasterio.open(...)` call.

`opener=` and `fs=` are mutually exclusive; passing both raises `ValueError` at construction.

Implementation:

- New private helper `_resolve_open_kwargs()` returns the kwargs dict to splat at every `rasterio.open(path, ...)` call site.
- Threaded through all 7 `rasterio.open(...)` call sites in the file.
- All 4 recursive `RasterioReader(...)` constructions (in `read_from_window`, `isel`, `__copy__`, `reader_overview`) forward the three knobs so they survive across spawned sub-readers.
- Bump `rasterio` floor from `>=1` to `>=1.4` (the version that introduced `opener=`).

Docs:

- New `docs/advanced/bytes_path_knobs.ipynb`: fully executable end-to-end demo against a local fixture, exercising all three paths plus the mutually-exclusive validation.
- Sidebar in `docs/read_S2_SAFE_from_bucket.ipynb` flagging that the knobs exist for cloud reads (pseudocode only; the executable demo is the new advanced notebook).
- Register the advanced notebook in `mkdocs.yml`.

Tests: adds `TestBytesPathKnobs` covering the default path (no knobs), opener callback round-trip, fsspec shortcut round-trip, mutually-exclusive validation, and kwargs surviving the recursive construction in `read_from_window`. Full suite: 785 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
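As a rough sketch of how the three knobs could collapse into one kwargs dict, the function below follows the rules stated in this commit message. It is not the actual `_resolve_open_kwargs()` helper: the name `resolve_open_kwargs`, the `FakeFileSystem` stand-in, and the choice that an explicit knob overrides an `opener` key inside `rio_open_kwargs` are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Optional


def resolve_open_kwargs(
    opener: Optional[Callable[..., Any]] = None,
    fs: Optional[Any] = None,
    rio_open_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the kwargs to splat into every rasterio.open(path, **kwargs) call."""
    if opener is not None and fs is not None:
        raise ValueError("Pass either opener= or fs=, not both.")
    kwargs = dict(rio_open_kwargs or {})  # copy so the caller's dict is never mutated
    if fs is not None:
        kwargs["opener"] = fs.open  # fsspec shortcut: same as opener=fs.open
    elif opener is not None:
        kwargs["opener"] = opener
    return kwargs


def local_opener(path, mode="rb"):
    # The mode default matters: per the commit message, rasterio calls opener(path).
    return open(path, mode)


class FakeFileSystem:
    """Stand-in for an fsspec filesystem; only .open is needed here."""

    def open(self, path, mode="rb"):
        return open(path, mode)


print(resolve_open_kwargs())                      # default path: nothing extra
print(resolve_open_kwargs(opener=local_opener))   # custom callback
print(resolve_open_kwargs(fs=FakeFileSystem()))   # fsspec-style shortcut
```

Keeping the resolution in one place is what lets every `rasterio.open(...)` call site and every recursive sub-reader construction forward the same behaviour without repeating the validation.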
…geotiff
`AsyncGeoTIFFReader` provides async, COG-only reads for high-concurrency
fan-out workloads (tile servers, async ML inference services). It is an
~80-LOC adapter on top of `async-geotiff` (DevSeed) — we don't re-implement
IFD walk, tile-fetch math, decompression dispatch, or request coalescing.
That all lives upstream and we depend on it.
The reader conforms to `AsyncGeoData` (added in the previous commit) so user
code typed against the protocol can swap sync ↔ async readers without
isinstance checks. Same metadata property names as `RasterioReader`
(`crs`, `transform`, `shape`, `dtype`, `bounds`, `fill_value_default`,
`dims`); same method names (`read_from_window`, `read_from_bounds`, `load`)
but each is a coroutine.
Construction is two-phase:
- `AsyncGeoTIFFReader(path, store=...)` — cheap, no I/O
- `await AsyncGeoTIFFReader.open(...)` — fetches the COG header (IFD chain)
After `open()`, sync metadata properties work instantly (just reads off the
already-fetched header). The first pixel-byte fetch happens on the first
`await reader.read_from_window(...)` / `load()`. The `_geotiff` handle is
kept alive between reads, so the header is parsed exactly once per reader
(unlike `RasterioReader`, which opens fresh per call for multi-process
safety). Trade-off: faster repeated reads, not pickleable across processes.
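The two-phase model above can be sketched with a toy class. `TwoPhaseReaderSketch` is hypothetical and does not use `async-geotiff`; the `asyncio.sleep(0)` calls stand in for network round-trips, and treating `open` as a classmethod is an assumption about the real API.

```python
import asyncio


class TwoPhaseReaderSketch:
    """Toy reader: cheap constructor, header fetched once in open()."""

    def __init__(self, path):
        self.path = path
        self._header = None      # phase 1: nothing fetched yet
        self.header_fetches = 0

    @classmethod
    async def open(cls, path):
        reader = cls(path)
        await asyncio.sleep(0)   # stands in for fetching the IFD chain
        reader._header = {"shape": (1, 256, 256), "dtype": "uint16"}
        reader.header_fetches += 1
        return reader

    @property
    def shape(self):
        if self._header is None:
            raise RuntimeError("reader is not open; await open() first")
        return self._header["shape"]  # sync: read off the cached header

    async def load(self):
        shape = self.shape       # raises RuntimeError if open() was skipped
        await asyncio.sleep(0)   # stands in for fetching pixel bytes
        return shape


async def demo():
    reader = await TwoPhaseReaderSketch.open("example.tif")
    await asyncio.gather(reader.load(), reader.load())
    return reader.shape, reader.header_fetches


shape, header_fetches = asyncio.run(demo())
print(shape, header_fetches)
```

The header counter stays at one no matter how many reads follow, which is the "parsed exactly once per reader" property; the cost is that the live handle cannot be pickled across processes.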
Anti-goals (raise `NotImplementedError`):
- `read_from_bounds(target_crs=...)` — async-geotiff explicitly disclaims
warp; users either post-warp via `georeader.read.read_reproject_like` or
fall back to `RasterioReader` with WarpedVRT.
- `read_from_bounds(target_resolution=...)` — same reasoning, no resample.
Dependencies:
- New optional `[async]` extra pulls in `async-geotiff>=0.5,<0.6` (and its
transitive `async-tiff` + `obspec` chain). Pinned to the 0.5.x line
because the upstream API is pre-1.0 and may shift between minors.
- Users still pick an `obstore` backend themselves (`S3Store` / `GCSStore` /
`AzureStore` / `LocalStore`); the right one depends on their cloud.
- Dev group adds `pytest-asyncio`, `obstore`, and `async-geotiff` so the
tests run in CI.
Tests: `pytest.importorskip("async_geotiff")` gates the whole module so
lean environments skip cleanly. Eight tests cover metadata-after-open,
RuntimeError-before-open, parity with `RasterioReader.read_from_window`
(numerical equality, not just shape), full-load parity, the warp/resample
NotImplementedError boundary, `asyncio.gather` concurrent fan-out across
16 windows, `async with` context manager, and `__repr__` status.
Full suite: 793 passed (8 new + 785 existing).
Note: `poetry.lock` is not regenerated by this commit — all new deps are
optional or dev-only, so the base resolution is unchanged. Run
`poetry lock --no-update` pre-merge to refresh the lockfile metadata.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…notebooks

Adds a full tutorial notebook for `AsyncGeoTIFFReader` and threads short "async alternative" sidebars into three existing notebooks so users coming from the sync path discover the async sibling at the relevant entry points.

New notebook `docs/advanced/async_geotiff_reader.ipynb`:

- Two HTML/CSS box-flow diagrams that render natively in both Jupyter and mkdocs without any extension: one for `RasterioReader` (three-path branch: GDAL VSI / opener / fsspec) and one for `AsyncGeoTIFFReader` (linear chain through async-geotiff → async-tiff → obspec → storage). Color-coded by layer responsibility (user code / our package / external dep / storage).
- A "which reader should I use" decision table.
- End-to-end demo against a local fixture (with a built overview pyramid) showing the two-phase laziness model, sync metadata properties, numerical parity with `RasterioReader`, the `overview_level` knob with side-by-side shape / resolution / byte-size comparisons, concurrent fan-out via `asyncio.gather`, and the `async with` context manager.
- A "what this reader does NOT do" section showing `NotImplementedError` on `target_crs=` / `target_resolution=`, followed by a mini-solution for the common case: load native, then warp as a post-step via `read.read_to_crs` / `read.read_reproject_like`.
- A tips/gotchas section covering the two-phase laziness, the explicit (not auto-picked) overview-level semantics, multi-process pickleability caveats, the `store=` requirement, the inverted mask convention, and format scope (TIFF/COG only).

Sidebars added to existing notebooks:

- `notebooks/read_from_tileserver.ipynb`: full executable cell sequence against Element 84's public `sentinel-cogs` S3 bucket. Opens a real Sentinel-2 L2A COG (10980x10980 uint16 with 4 overviews), then issues 16 concurrent window reads via `asyncio.gather`. Markdown is explicit that XYZ tiles and COG windows are different protocols, not a 1:1 swap on the same input.
- `docs/advanced/tiling_and_stitching.ipynb`: markdown sidebar with an `asyncio.gather` sketch for the per-tile read loop (model inference stays sync; only the reads parallelise).
- `docs/read_S2_SAFE_from_bucket.ipynb`: markdown sidebar with a `GCSStore` pseudocode block, cross-linking to the intro notebook.

`mkdocs.yml` is updated to register the new tutorial under "Tutorials → Advanced".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
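The "reads parallelise, inference stays sync" pattern from the tiling sidebar might look roughly like the following. This is a sketch under stated assumptions: `read_tile` and `run_model` are hypothetical placeholders, and `asyncio.sleep` stands in for the network latency of a real window read.

```python
import asyncio


async def read_tile(window):
    """Hypothetical async read: pretends to fetch one window's pixels."""
    await asyncio.sleep(0.01)  # stands in for network latency
    col_off, row_off, width, height = window
    return {"window": window, "pixels": width * height}


def run_model(tile):
    """Hypothetical sync inference step; deliberately not a coroutine."""
    return tile["pixels"] * 2


async def process(windows):
    # All reads are in flight at once; gather returns results in input order.
    tiles = await asyncio.gather(*(read_tile(w) for w in windows))
    # Inference runs afterwards, one tile at a time, on the event-loop thread.
    return [run_model(t) for t in tiles]


# A 4x4 grid of 64-pixel windows over a 256x256 raster: 16 concurrent reads.
windows = [(c, r, 64, 64) for r in range(0, 256, 64) for c in range(0, 256, 64)]
results = asyncio.run(process(windows))
print(len(results))
```

With real I/O the 16 reads overlap, so wall-clock time approaches the slowest single read rather than the sum of all of them.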
Pull request overview
Adds an async reader stack to georeader: introduces AsyncGeoData abstract class (with shared metadata properties hoisted to GeoDataBase), AsyncGeoTIFFReader as a thin adapter over developmentseed/async-geotiff, and adds opener= / fs= / rio_open_kwargs= bytes-path knobs to RasterioReader. Includes new tests, two new advanced notebooks, and sidebars in existing notebooks.
Changes:
- New `AsyncGeoData` protocol + refactor of derived metadata (`bounds`, `res`, `footprint`) up to `GeoDataBase`.
- New `AsyncGeoTIFFReader` (~80 LOC adapter) under new optional `[async]` extra; bumps `rasterio` floor to `>=1.4`.
- New `opener=` / `fs=` / `rio_open_kwargs=` keyword-only constructor knobs on `RasterioReader` threaded through all 7 `rasterio.open` sites and all 4 recursive constructions.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| georeader/abstract_reader.py | Adds AsyncGeoData; moves bounds/res/footprint to GeoDataBase. |
| georeader/async_geotiff_reader.py | New thin async COG reader satisfying AsyncGeoData. |
| georeader/rasterio_reader.py | Adds keyword-only opener/fs/rio_open_kwargs and threads them through all open sites and sub-reader constructions. |
| pyproject.toml | Bumps rasterio floor; adds [async] extra; adds pytest-asyncio and asyncio_mode=strict. |
| mkdocs.yml | Registers the two new advanced tutorial notebooks. |
| tests/test_abstract_reader.py | Adds TestAsyncGeoData covering inherited defaults + NotImplementedError. |
| tests/test_rasterio_reader.py | Adds TestBytesPathKnobs covering default/opener/fs/exclusivity/recursive forwarding. |
| tests/test_async_geotiff_reader.py | New test file (importorskip-gated) for AsyncGeoTIFFReader. |
| docs/advanced/bytes_path_knobs.ipynb | New executable tutorial for the three bytes-path knobs. |
| docs/advanced/async_geotiff_reader.ipynb | New executable tutorial for AsyncGeoTIFFReader. |
| docs/advanced/tiling_and_stitching.ipynb | Adds async fan-out sidebar; re-encoded unicode glyphs in author/citation. |
| docs/read_S2_SAFE_from_bucket.ipynb | Adds bytes-path-knobs + async sidebars; re-encoded unicode glyphs in author/citation. |
CI's `poetry install --with dev,tutorial` fails because pyproject.toml gained new entries (`async-geotiff` optional dep, `[async]` extra, `pytest-asyncio` + `obstore` + `async-geotiff` in dev) that are not reflected in poetry.lock. Regenerated with Poetry 2.4.1 (matching CI's `version: latest`). Lockfile format is preserved; only the new dep entries and their transitive closure are added. Existing pins are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR overview
This new reader is a great addition, and I'm very excited to use it. I think the following comments should improve the class and make it more aligned with the rest of the package.
AsyncGeoTIFFReader class
Remove read_from_bounds from AsyncGeoTIFFReader; it's not part of the RasterioReader API nor of the other readers. For reading from bounds there's the function in the read module instead (read.read_from_bounds, which should work with this reader as input regardless of the CRS passed, because it reprojects the bounds, not the data).
Maybe it's worth having the same repr as RasterioReader when the reader is open, for consistency? (Basically showing bounds, crs, etc.)
The boundless option in read_from_window should be honored. With boundless=False it should produce a smaller GeoTensor when the window falls on the edges of the raster (so if AsyncGeoTIFFReader always pads, an easy fix would be to intersect the window with the raster window before the read when boundless is False). This behavior should also be explicitly tested.
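The window intersection suggested in the last comment can be sketched in plain Python. This is illustrative only: `intersect_window` is a hypothetical helper, windows are `(col_off, row_off, width, height)` tuples rather than rasterio `Window` objects, and georeader's own `window_utils` may expose this differently.

```python
def intersect_window(window, raster_shape):
    """Clip a window to the raster extent.

    window: (col_off, row_off, width, height)
    raster_shape: (height, width)
    Returns the clipped window, or None if there is no overlap.
    """
    col_off, row_off, width, height = window
    raster_h, raster_w = raster_shape
    col_start = max(col_off, 0)
    row_start = max(row_off, 0)
    col_stop = min(col_off + width, raster_w)
    row_stop = min(row_off + height, raster_h)
    if col_stop <= col_start or row_stop <= row_start:
        return None  # window lies entirely outside the raster
    return (col_start, row_start, col_stop - col_start, row_stop - row_start)


# A 100x100 window hanging off the bottom-right corner of a 256x256 raster.
print(intersect_window((200, 200, 100, 100), (256, 256)))
```

With boundless=False the reader would read only the clipped window, so the result is smaller than requested instead of padded with the fill value.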
Test the read methods in the read module against an AsyncGeoTIFFReader reader
Could you test the following methods in the read module with an AsyncGeoTIFFReader object as input:
• read_from_bounds
• read_from_polygon
• read_from_center_coords
• read_from_tile
• read_reproject(_like)
These tests already exist for RasterioReader objects. I'd use the same configuration but with AsyncGeoTIFFReader. Instead of copy-pasting, please consider refactoring the tests to systematically test with the two readers.
notebooks and docs
Are you sure the AsyncGeoTIFFReader works with jp2 data? Can you add an explicit example to the read_S2_SAFE_from_bucket notebook? Otherwise I would remove that from this example. Also, the SAFE reader works only with RasterioReader as it is implemented now. I guess it could be configured to have an option for which reader to use... This tutorial is runnable, so this behavior can be tested (because jp2 files in GCP can be read anonymously).
The read-from-tile-server notebook actually demonstrates a completely different thing (reading from a tile server and stitching), so it has nothing to do with this reader. Please move the example to a separate notebook. There's an old example that reads S2 files from the Element 84 public Amazon bucket; maybe this example could be moved there (and that notebook updated)? It's in the notebooks/Sentinel-2 folder.
In the AsyncGeoTIFFReader notebook example, to create the COG fixture you can use from georeader.save import save_cog, which is less verbose than using rasterio primitives.
It would also be nice for that notebook to show that all read methods work with the reader.
…/block_windows parity

Restructure AsyncGeoTIFFReader to mirror RasterioReader's laziness pattern: read_from_window is now sync and returns a windowed AsyncGeoTIFFReader view (no I/O); load() is async and performs the actual fetch. This makes the reader work polymorphically with the entire read.* module (read_from_window, read_from_bounds, read_from_polygon, read_from_center_coords, read_from_tile, and read_reproject(_like)), the latter via the pre-load pattern (`await reader.load()` then pass the GeoTensor; the isinstance(data_in, GeoTensor) short-circuit at read.py:1605 skips internal sync materialisation).

New methods matching RasterioReader:

- overviews() / reader_overview(level): introspect and pin overview level
- block_windows(): tile-aligned iteration for fan-out reads aligned with the COG's internal tile grid

Other reader changes:

- window_focus attribute tracks the current view's window
- _raster_window property replaces three inline Window(0, 0, w, h) constructions (parallels RasterioReader.real_window)
- fill_value_default falls back to 0 when the COG has no nodata tag (matches RasterioReader's nodata-if-not-none-else-0 default)
- Boundless padding routed through window_utils.get_slice_pad + GeoTensor.pad() (same pattern as GeoTensor.read_from_window), replacing the previous bespoke np.full + offset-placement code
- Rich __repr__ when opened; aligned multi-line Affine formatting (also fixed in RasterioReader.__repr__)

Protocol:

- AsyncGeoData.read_from_window is sync now (returns view), aligning with GeoData.read_from_window. Only load() remains async.

Tests (suite: 926 passed, 0 skipped, 0 failed; was 793):

- AsyncGeoTIFFReader: 19 reader-specific tests covering overviews, reader_overview, block_windows, nested views, explicit nodata, focused bounds, read-before-open, boundless edge windows, concurrent fan-out
- New cog_with_nodata_and_overviews fixture for paths the default fixture doesn't reach
- test_read_dataarray.py parametrized across both readers via a reader_and_materialize fixture; covers all read.* functions including read_reproject(_like) via the pre-load pattern (no aread_* siblings needed)
- New polymorphic coverage: cross-CRS bounds, cross-CRS polygon, pad_add, return_only_data (sync-only by design)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
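A toy version of the view-then-load pattern, combined with tile-aligned `block_windows` iteration, might look like this. `ViewReaderSketch` is hypothetical: windows are absolute `(col_off, row_off, width, height)` tuples, the block size is fixed at construction, and `load()` returns only the shape it would have fetched.

```python
import asyncio


class ViewReaderSketch:
    """Toy reader: read_from_window is sync and lazy, load() is the only I/O."""

    def __init__(self, full_shape, window=None, block=64):
        self.full_shape = full_shape  # (height, width)
        height, width = full_shape
        self.window_focus = window or (0, 0, width, height)
        self.block = block

    def read_from_window(self, window):
        # No I/O: return a new view that remembers which window to fetch.
        return ViewReaderSketch(self.full_shape, window, self.block)

    def block_windows(self):
        # Yield windows aligned to the internal tile grid, clipped at the edges.
        height, width = self.full_shape
        for row in range(0, height, self.block):
            for col in range(0, width, self.block):
                yield (col, row,
                       min(self.block, width - col),
                       min(self.block, height - row))

    async def load(self):
        await asyncio.sleep(0)  # stands in for the actual byte fetch
        _, _, width, height = self.window_focus
        return (height, width)


async def demo():
    reader = ViewReaderSketch((256, 200))
    views = [reader.read_from_window(w) for w in reader.block_windows()]  # still no I/O
    return await asyncio.gather(*(view.load() for view in views))


shapes = asyncio.run(demo())
print(len(shapes), shapes[0], shapes[3])
```

Because building the views is synchronous, sync helpers can slice and window the reader freely and a single `await` at the end materialises the data.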
Notebook changes (responding to gonzmg88's review on PR spaceml-org#54):

- docs/advanced/async_geotiff_reader.ipynb, rebuilt from 29 → 39 cells:
  - Added a 60-second async/await primer + "when is async worth it?" section for users new to async Python
  - Replaced raw rasterio fixture construction with `save_cog` (with `BLOCKSIZE=64` so a 256×256 raster still gets overviews)
  - Documented the view+load pattern explicitly with a quick-reference table; cleaned stale references to the removed `reader.read_from_bounds(target_crs=...)` method
  - Added a `read.*` compatibility matrix with three categories (✅ direct / ⚠️ pre-load required / ❌ not supported) and three runnable demos covering the cases
  - Added a `block_windows` tile-aligned fan-out demo (the actual recommended pattern for tile servers)
  - Added a bandwidth-conscious tile-reads section showing how to compose `read_from_bounds` + `read_to_crs` to fetch only the tile-region instead of pre-loading the whole COG
  - Replaced the hand-rolled "print each property" cell with `print(reader)` (rich __repr__) + programmatic-access assertions
  - Dropped the misleading "~80 LOC adapter" claim from the diagram and the module docstring; described scope honestly
- docs/read_S2_SAFE_from_bucket.ipynb: replaced the JP2-implying pseudocode sidebar with an upfront limitation note ("AsyncTiffException: unexpected magic bytes"; async-geotiff is TIFF-only, JP2 is not supported) plus a real runnable example against the Element 84 L2A COG bucket
- notebooks/read_from_tileserver.ipynb: reverted; dropped the 4-cell AsyncGeoTIFFReader sidebar that didn't belong (notebook is about XYZ tile stitching, an unrelated protocol)
- notebooks/Sentinel-2/read_s2_safe_element84_cloud.ipynb: appended the Element 84 async fan-out demo in the right place (alongside the existing pystac + S2_SAFE_reader content)

Bug fix uncovered while writing the compatibility-matrix proof:

- georeader/read.py:1832-1835: `read.read_from_tile` had an inverted intersection check (`else: return` was returning None for *intersecting* tiles, falling through to the rest of the function only for non-intersecting tiles). The existing parametrized test passed by accident because of an `if chip_out is None: return` early-out. Swapped the control flow to match the docstring's promise; tightened the test to assert non-None for a center-of-raster tile so the regression can't recur silently.

With the fix, `read.read_from_tile` joins `read_reproject` / `read_reproject_like` / `read_to_crs` in the ⚠️ pre-load column for async readers (the function falls through to `read_reproject` when the reader has no native `read_from_tile` method).

Suite still: 926 passed, 0 skipped, 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
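The shape of the inverted check can be shown with a simplified reconstruction. This is not the code in `georeader/read.py`: the function names, the bounding-box tuples, and the `"chip"` return value are hypothetical and only mirror the control flow described in the commit message.

```python
def boxes_intersect(a, b):
    """Axis-aligned overlap test on (min_x, min_y, max_x, max_y) boxes."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def read_from_tile_buggy(raster_bounds, tile_bounds):
    if not boxes_intersect(raster_bounds, tile_bounds):
        pass          # non-intersecting tiles fell through to the read
    else:
        return None   # intersecting tiles returned early: the inverted branch
    return "chip"


def read_from_tile_fixed(raster_bounds, tile_bounds):
    if not boxes_intersect(raster_bounds, tile_bounds):
        return None   # nothing to read outside the raster
    return "chip"


raster = (0, 0, 100, 100)
center_tile = (40, 40, 60, 60)
far_tile = (500, 500, 600, 600)

print(read_from_tile_buggy(raster, center_tile), read_from_tile_fixed(raster, center_tile))
```

A test that returns early whenever the result is None cannot tell these two versions apart, which is why asserting non-None for a tile known to sit inside the raster closes the gap.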
Okay, so now I have a better understanding of the problem. Thanks for this notebook and for the async-for-newbies introduction! I still find it disappointing that the functions that reproject do not work without pre-loading. One potential solution would be to create async siblings of all
Proposal
Create an
User story
This PR adds that. After it lands, the user does:
Same `crs` / `transform` / `shape` / `read_from_window` / `load` surface as `RasterioReader`. Just `await`-able.

Motivation

Three pressures, none new:

- `RasterioReader` ships one path (GDAL VSI) and has no seam to plug in alternatives: fsspec for niche backends, `obstore` for parallel ranges.
- `RasterioReader` is sync-only; users currently roll their own.
- … `async-tiff`) skips per-call GDAL state, batches parallel range requests through `obstore`, and coalesces close-by ranges. Specialising for the dominant cloud-native format pays for itself once.

The full design is in research_journal_v2/notes/geotoolz/plans/georeader/. Critical decision: we do not reimplement COG plumbing. `developmentseed/async-geotiff` already ships IFD walk, tile-fetch math, decompression, and request coalescing; we depend on it via an ~80-LOC adapter.

What's included

Four commits, each independently reviewable:

- 7fde631: `AsyncGeoData` abstract class. Deduplicates derived-property defaults (`bounds`, `res`, `footprint`) up to `GeoDataBase`; `GeoData` and `AsyncGeoData` keep only what genuinely differs per tier. No behaviour change for existing callers.
- 06fa1a4: `opener=` / `fs=` / `rio_open_kwargs=` keyword-only knobs to `RasterioReader`. Threaded through all 7 `rasterio.open(...)` sites and all 4 recursive `RasterioReader(...)` constructions. Defaults reproduce today's GDAL VSI behaviour exactly. Bumps the `rasterio` floor to `>=1.4` (needed for `opener=`).
- 99da95b: `AsyncGeoTIFFReader`, an ~80-LOC adapter over `async-geotiff`. Conforms to `AsyncGeoData`. New optional `[async]` extra. `pytest-asyncio` added to the dev group.
- ea7ed4d: … `AsyncGeoTIFFReader`, async sidebars in three existing notebooks (`read_from_tileserver`, `tiling_and_stitching`, `read_S2_SAFE_from_bucket`), `mkdocs.yml` registration.

Highlights

- `crs`, `transform`, `shape`, `bounds`, `dtype`, `fill_value_default`, `dims`, `footprint(crs)` are all identical on `RasterioReader` and `AsyncGeoTIFFReader`. The only difference is `await`-ability on the read methods.
- `AsyncGeoTIFFReader` is two-phase lazy. `__init__` is free; `await open()` fetches just the COG header (cheap); pixel bytes are fetched on each `read_*` call. Same model as `RasterioReader` semantically; different in that the header is parsed once per reader (vs `RasterioReader`'s fresh open per `read()`, which is what keeps it pickleable across `multiprocessing` / `joblib` / Dask).
- `AsyncGeoTIFFReader` is TIFF/COG only. `read_from_bounds(target_crs=...)` raises `NotImplementedError` with a clear pointer at `georeader.read.read_to_crs` for the post-load warp pattern (demonstrated in the intro notebook). `async-geotiff` explicitly disclaims warp; we follow suit rather than pulling GDAL back into the async cone.
- … `RasterioReader` callers. Defaults reproduce GDAL VSI exactly. The three new knobs are purely additive.
- `async-geotiff` API surface verified. Before writing the adapter I spiked the actual `async-geotiff 0.5.0` API and adjusted the design (e.g. `geotiff.count` instead of `ifd.samples_per_pixel`, `GeoTIFF.dtype` already returns `np.dtype`, `store=` is required not optional).
- … `overview_level` walkthrough with byte-size comparison, real `asyncio.gather` fan-out, the `NotImplementedError` boundary, and a mini-solution for warp-after-load.

Testing plan

- `TestAsyncGeoData` (4 tests): inherited defaults from `GeoDataBase`, `NotImplementedError` on the abstract async methods.
- `TestBytesPathKnobs` (5 tests): default GDAL VSI unchanged, opener round-trip, fsspec round-trip, mutually-exclusive validation, kwargs survive recursive construction.
- `test_async_geotiff_reader.py` (8 tests, `pytest.importorskip`-gated): metadata-after-open, `RuntimeError`-before-open, numerical parity with `RasterioReader.read_from_window`, full-load parity, warp/resample `NotImplementedError` boundary, `asyncio.gather` fan-out, `async with`, `__repr__`.
- `docs/advanced/bytes_path_knobs.ipynb`: local fixture, all three paths
- `docs/advanced/async_geotiff_reader.ipynb`: local fixture with overviews
- `notebooks/read_from_tileserver.ipynb`: real public Sentinel-2 COG via Element 84's `sentinel-cogs` bucket, 16 concurrent reads

`poetry.lock` is not regenerated by these commits since all new deps are optional (`[async]` extra) or dev-only; the base resolution is unchanged. Run `poetry lock --no-update` before merging to refresh the lockfile metadata.

🤖 Generated with Claude Code