Skip to content

feat(waterdata): Migrate to httpx and add async parallel chunker#285

Draft
thodson-usgs wants to merge 1 commit into
DOI-USGS:mainfrom
thodson-usgs:httpx-migration
Draft

feat(waterdata): Migrate to httpx and add async parallel chunker#285
thodson-usgs wants to merge 1 commit into
DOI-USGS:mainfrom
thodson-usgs:httpx-migration

Conversation

@thodson-usgs
Copy link
Copy Markdown
Collaborator

@thodson-usgs thodson-usgs commented May 21, 2026

Summary

Stacked on top of the chunker-unified work (PR #283); rebase to main once that lands.

  • Replaces requests with httpx package-wide.
  • Adds an opt-in async parallel branch to the multi-value chunker via API_USGS_CONCURRENT (default 16 for the server-friendly sweet spot; set =1 for the legacy sequential path).
  • Integrates the async path with the ChunkPlan / ChunkedCall arch from feat(waterdata): Auto-chunk OGC requests over the URL byte limit #283 — sync drives ChunkedCall.resume() over one shared httpx.Client; parallel uses _fan_out_async to iterate the same plan via asyncio.gather + asyncio.Semaphore over one shared httpx.AsyncClient.

Benchmarked at 2.48× speedup on a 19,602-site / 6-state get_daily call at API_USGS_CONCURRENT=16 (2.78s vs 6.89s sequential, distinct date windows per trial so cache reuse can't bias the result).

Why httpx

httpx ships sync and async clients on a unified API, so the same request shape powers both the existing synchronous getters and the new parallel path. requests is unmaintained and has no async story — a thread-pool bolt-on would have been a one-off rather than a primitive reusable elsewhere.

API_USGS_CONCURRENT

Value Behavior
unset / blank parallel, cap = 16 (_CONCURRENCY_DEFAULT)
≥ 2 parallel, semaphore-capped at that value
1 serial (sync ChunkedCall.resume() path)
unbounded parallel, no per-call cap — caller owns the burst risk
0, negative, malformed ValueError at call time

Connection-pool sharing across all sub-requests of a single chunked call in both modes via the _chunked_session (sync) / _chunked_async_session (async) ContextVars — _walk_pages / _walk_pages_async / get_stats_data read them as fallbacks before opening a fresh client.

Parallel path safety contracts

The parallel branch preserves the same safety contracts the serial path provides:

  • Probe-first quota check. _fan_out_async issues the first sub-request alone, reads x-ratelimit-remaining from its response, and raises RequestExceedsQuota before dispatching the rest if the remaining plan can't fit the window. Matches ChunkedCall._check_quota_after_first.
  • Resumable interruptions. asyncio.gather runs with return_exceptions=True, so a sibling's transient failure (RateLimited / ServiceUnavailable) doesn't lose the completed work. The raised ChunkInterrupted.call is a ChunkedCall holding the sparse-indexed completed sub-requests; exc.call.resume() re-issues only the unfinished indices via the sync fetch_once path.
  • Event-loop detection. asyncio.run() raises inside an already-running loop (Jupyter / IPython kernels, async apps). The wrapper calls asyncio.get_running_loop() first and, when one is active, falls back to the serial path with a UserWarning instead of crashing.
  • Missing-fetch_async warning. If API_USGS_CONCURRENT requests parallel but the decorator wasn't wired with fetch_async=, the wrapper warns + runs serial rather than silently no-op'ing the env var.

Three httpx behavior diffs handled defensively

  • httpx.InvalidURL raised when a URL component > 64 KB (e.g. all California stream sites comma-joined in one query). Caught by _safe_request_bytes (treats "too big to construct" as "doesn't fit", so the planner's halving loop keeps shrinking) and again in ChunkPlan.__init__ so canonical-URL recovery can fall through to a worst-case sub-request URL.
  • httpx.Response.elapsed only populated on close (not by httpx.MockTransport / pytest-httpx). _safe_elapsed falls back to timedelta(0).
  • httpx.Response.url is a read-only property. _set_response_url rewrites it via the bound request, with a fallback path for Mock-shaped test responses.

Backwards-compat

  • BaseMetadata.header is now httpx.Headers instead of requests.structures.CaseInsensitiveDict. Case-insensitive lookups (md.header.get("x-ratelimit-remaining")) keep working; literal dict equality (md.header == {"k": "v"}) no longer holds because httpx.Headers carries auto-added entries (content-type, content-length).
  • BaseMetadata.url is coerced to str (previously str on requests.Response; now str(httpx.Response.url)).
  • API_USGS_CONCURRENT defaults to 16 (parallel). Set =1 to opt back into the sequential path.

Test plan

  • 367 mocked tests pass after migrating to pytest-httpx. tests/conftest.py (new) is a requests_mock-shaped shim over httpx_mock (URL-prefix match, complete_qs strict-mode parity, request-history view); an autouse fixture pins API_USGS_CONCURRENT=1 so the historical mocked suite stays on the deterministic serial path.
  • Async-path coverage: four new async-mode test functions (one parametrized over running-loop and missing-async) in tests/waterdata_chunking_test.py opt out of the serial-pin autouse and exercise (a) successful fan-out, (b) probe-first RequestExceedsQuota, (c) resumable ServiceInterrupted.call after a mid-fan-out failure (resume runs serially on the unfinished indices), (d) the running-event-loop fallback emits a UserWarning and runs serial, (e) the missing-fetch_async warning fires when the env asks for parallel.
  • ruff check and ruff format --check pass.
  • Live-API CI sweep — most pre-existing column-drift failures (get_daily / get_stats_* / get_channel) are resolved by the schema-aware _get_resp_data / _handle_stats_nesting rewrite incorporated via the rebase.

Out of scope (follow-ups)

  • High-concurrency memory: _fan_out_async materializes all (df, response) pairs before combining. Consider streaming-combine via asyncio.as_completed if users push concurrency very high.
  • NEWS.md entry — left for the merger to draft.

🤖 Generated with Claude Code

@thodson-usgs thodson-usgs force-pushed the httpx-migration branch 16 times, most recently from c8d648a to 77f8601 Compare May 23, 2026 23:12
Replace ``requests`` with ``httpx`` package-wide and add an opt-in
async parallel fan-out for the multi-value chunker, gated on the
``API_USGS_CONCURRENT`` env var.

* ``httpx`` ships sync and async clients on a unified API, so the
  same request shape powers both the synchronous getters callers
  use today and the new ``_fan_out_async`` parallel path; the
  unmaintained ``requests`` had no async story.
* ``API_USGS_CONCURRENT=1`` (default in tests) keeps the serial
  ``ChunkedCall.resume()`` path over one shared ``httpx.Client``.
  ``API_USGS_CONCURRENT=N`` (N > 1; default 16 in production) or
  ``unbounded`` fans the plan out through ``_fan_out_async`` over
  one shared ``httpx.AsyncClient``, bounded by
  ``asyncio.Semaphore(N)``.
* Both paths publish their client on a ``ContextVar``
  (``_chunked_session`` / ``_chunked_async_session``) so paginated
  helpers downstream reuse the connection pool across every
  sub-request of a chunked call.
* The parallel path preserves the same safety contracts as the
  serial path: it probes the first sub-request alone to read
  ``x-ratelimit-remaining`` before fanning out the rest
  (``RequestExceedsQuota``), and uses ``asyncio.gather(
  return_exceptions=True)`` so a transient failure surfaces as a
  ``ChunkInterrupted`` whose ``.call`` is a ``ChunkedCall`` holding
  the sparse-indexed completed sub-requests; ``exc.call.resume()``
  re-issues only the unfinished ones via the sync path.
* The wrapper falls back to the serial path (with a
  ``UserWarning``) when ``asyncio.get_running_loop()`` returns —
  so Jupyter / IPython kernels and async apps don't see a
  confusing ``RuntimeError`` — and when the decorator was set up
  without a ``fetch_async=`` sibling.
* Three defensive helpers smooth over httpx behaviours that
  ``requests`` didn't have: ``_safe_request_bytes`` swallows
  ``httpx.InvalidURL`` so the planner's halving loop keeps
  shrinking past httpx's 64 KB URL cap; ``_safe_elapsed`` falls
  back to ``timedelta(0)`` when ``.elapsed`` is missing (mock
  transports); ``_set_response_url`` rewrites the URL via the
  bound request, since httpx makes ``Response.url`` read-only.

Tests: ``pyproject.toml`` switches ``requests``/``requests-mock``
to ``httpx``/``pytest-httpx``; ``tests/conftest.py`` adds a
``requests_mock``-shaped shim over ``httpx_mock`` and an autouse
fixture pinning ``API_USGS_CONCURRENT=1`` so historical tests
stay on the deterministic serial path. New async-mode tests cover
the parallel fan-out, the probe-first quota check, the resumable
``ChunkInterrupted.call`` after a mid-fan-out failure, the
running-event-loop fallback, and the missing-``fetch_async``
warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant