Skip to content

1.4.1: fix central forward-sink OOM on backlog + unforwardable ai_gateway_messages#100

Merged
philcunliffe merged 3 commits into
worktree-remote-config-join-flowfrom
fix/central-forward-oom
Jun 13, 2026
Merged

1.4.1: fix central forward-sink OOM on backlog + unforwardable ai_gateway_messages#100
philcunliffe merged 3 commits into
worktree-remote-config-join-flowfrom
fix/central-forward-oom

Conversation

@philcunliffe

Copy link
Copy Markdown
Contributor

Summary

Two bugs in the v1.4.0 @hypaware/central forward sink left it unable to forward a real backlog. Found while standing up a fleet against a live central server: a gateway joining with a large existing cache (~552k rows) OOM-crash-looped the daemon every ~71s, and even after that was fixed the AI-gateway dataset turned out to have never been forwardable.

1. Unbatched export → OOM (the crash)

central/src/sink.js materialized each partition's entire row set into an in-memory array and then rows.map(JSON.stringify).join('\n') into one giant NDJSON string. On a large backlog this exhausts the heap (FATAL ERROR: Reached heap limit); the string would also exceed V8's max length. KeepAlive restarts the daemon, which OOMs again — a loop. The recorder flushes per-exchange synchronously, so local capture/hyp query kept working, which masked it; only scheduled exports failed and the server got 0 rows.

Fix: stream each partition through bounded chunks (row-count + byte budget, both far under the server's max body), never holding the whole table. Each chunk is an independent POST carrying a content-derived X-Hyp-Batch-Id, so the server's idempotency ledger dedupes re-sends — a partition re-handed after a transport failure converges to exactly-once (the prior code sent no batch id at all → at-least-once).

2. ai_gateway_messages had no sourceSignal (masked by the OOM)

The dataset declared no sourceSignal, so the forward sink fell back to the dataset name as the signal and threw unknown signal 'ai_gateway_messages'. The v1.4.0 smoke flow only exercised a synthetic logs dataset with an explicit signal, so this was never caught — the AI-gateway dataset has never been forwardable.

Fix: sourceSignal: 'proxy' on the ai-gateway dataset registration. The server already maps proxyai_gateway_messages.

Validation

  • New unit test test/plugins/central-forward-chunking.test.js (5 cases): chunk boundaries, one-POST-per-small-partition, deterministic/idempotent batch-ids, partition-level retry on transport failure, empty-batch no-op.
  • Existing central + sink suites green; typecheck clean on changed files.
  • End-to-end against a live server: 60,645 rows forwarded and queryable on the central server, daemon stable, zero new OOM.

Notes

  • proto.md updated for the chunked, idempotency-keyed ingest.
  • Version bumped to 1.4.1.
  • Targeting worktree-remote-config-join-flow (the line v1.4.0 shipped from), not master — the join flow has diverged from master.

Known follow-up (not in this PR)

Bulk backfill trips the server's per-gateway byte_rate 429. The forward sink aborts the whole partition on 429 and doesn't honor Retry-After (proto.md says it should). Needs client-side pacing and/or a higher HYPSERVER_BYTE_RATE_PER_GATEWAY. New (post-join) traffic is small and unaffected.

🤖 Generated with Claude Code

philcunliffe and others added 2 commits June 12, 2026 15:45
…ateway_messages under proxy

The central forward sink materialized each partition's full row set into
one in-memory NDJSON string before POSTing. A gateway joining a fleet
with a large existing cache (here ~552k rows) OOM-crash-looped the daemon
on its first export — the recorder kept flushing per-exchange so local
capture survived, but no scheduled export ever completed and the server
received zero rows.

Stream each partition through bounded chunks instead (row-count + byte
budget, both far under the server's max body), never holding the whole
table. Each chunk POSTs with a content-derived X-Hyp-Batch-Id so the
server's idempotency ledger dedupes re-sends: when the driver re-hands a
partition after a transport failure, already-delivered chunks ack 202 and
only the remainder is stored, converging a partial-then-retried partition
to exactly-once (the prior code sent no batch id at all — at-least-once).

Also give the ai_gateway_messages dataset `sourceSignal: 'proxy'`. It had
none, so the forward sink fell back to the dataset name as the signal and
threw `unknown signal 'ai_gateway_messages'` — the AI-gateway dataset has
never been forwardable. The OOM masked this; chunking exposed it. The
server already maps the `proxy` signal to ai_gateway_messages.

Validated end-to-end against a live server: 60k+ rows forwarded and
queryable, daemon stable, no OOM. proto.md updated for the chunked,
idempotent ingest. Adds test/plugins/central-forward-chunking.test.js.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@philcunliffe

Copy link
Copy Markdown
Contributor Author

Dual-agent review — request_changes

  • Verdict: request_changes
  • Risk class: medium
  • Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source Finding (severity, evidence) Intersects
Codex Batch-id collision for byte-identical chunks (major, sink.js:171,201,282; proto.md:144) Wire-contract risk (server ledger not in repo); Targets: batchIdForBody
Claude sourceSignal: 'proxy' fix has zero regression coverage (major, dataset.js:212) Risk bullet 3; Targets: aiGatewayDatasetRegistration
Claude MAX_CHUNK_BYTES bound never exercised (major, test:31 vs sink.js:180) Risk bullet 2; Targets: forwardPartition
Claude Stale "groups by signal" docs (minor, sink.js:22; proto.md:127) Targets: proto.md wire contract
Claude "LLP 0001" citation points at wrong doc (minor, sink.js:134) Targets: sink.js
Claude No chunk-level telemetry (minor, sink.js:164,79) Error-path observability of the new chunk loop
Claude 401-refresh same-batch-id resend untested (minor, sink.js:255) Wire-contract risk; Targets: postNdjson
Claude proto.md claims unimplemented Retry-After handling (minor, proto.md:153) Risk bullet 4
Codex review

Fix Validations

Central forward sink materialized whole partitions and OOMed on backlog

  • Status: correct
  • Evidence: hypaware-core/plugins-workspace/central/src/sink.js:164, hypaware-core/plugins-workspace/central/src/sink.js:180, hypaware-core/plugins-workspace/central/src/sink.js:184
  • Assessment: The sink now streams rows and flushes bounded chunks instead of materializing every partition row into one array/body. Existing sink-driver retry handling would not have fixed the OOM by itself because it only persists retry partitions after exportBatch returns or throws: src/core/sinks/driver.js:89, src/core/sinks/driver.js:123.

ai_gateway_messages forwarded under an unknown signal

  • Status: correct
  • Evidence: hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:205, hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:214, hypaware-core/plugins-workspace/central/src/sink.js:150
  • Assessment: The AI-gateway dataset now resolves to the known proxy signal, avoiding the previous fallback to ai_gateway_messages. Existing retry handling would only have retried the same unknown-signal failure.

Findings

1) Behavioral Correctness

  • Severity: major
  • Confidence: high
  • Evidence: hypaware-core/plugins-workspace/central/src/sink.js:171, hypaware-core/plugins-workspace/central/src/sink.js:201, hypaware-core/plugins-workspace/central/src/sink.js:282, hypaware-core/plugins-workspace/central/proto.md:144
  • Why it matters: X-Hyp-Batch-Id is derived only from signal + body, so two legitimate byte-identical chunks for the same signal get the same id and the server contract says the repeated key is acked without re-storing, silently dropping the later chunk.
  • Suggested fix: Include stable occurrence context in the idempotency key, such as partition identity plus chunk index plus body hash, and add a regression test with two byte-identical chunks that must produce distinct batch ids while preserving retry determinism.

No Finding

  1. Contract & Interface Fidelity
  2. Change Impact / Blast Radius
  3. Concurrency, Ordering & State Safety
  4. Error Handling & Resilience
  5. Security Surface
  6. Resource Lifecycle & Cleanup
  7. Release Safety
  8. Test Evidence Quality, aside from the duplicate-chunk regression tied to finding 1
  9. Architectural Consistency
  10. Debuggability & Operability

Evidence Bundle

  • Changed hot paths: central forward sink export loop and POST path; AI-gateway dataset registration; central ingest protocol docs; package version.
  • Impacted callers: hypaware-core/plugins-workspace/central/index.js:63; src/core/sinks/driver.js:89; src/core/sinks/driver.js:123; src/core/sinks/driver.js:127.
  • Impacted tests: test/plugins/central-forward-chunking.test.js:97; test/plugins/central-forward-chunking.test.js:122; test/plugins/central-forward-chunking.test.js:132; hypaware-core/smoke/flows/central_forward_outbox.js:112.
  • Unresolved uncertainty: I did not inspect the server ledger implementation, only the updated protocol contract; I did not run the test suite for this review.
Claude review

Claude review

Bug #2 fix (sourceSignal: 'proxy') has zero regression coverage

  • Severity: major
  • Confidence: 87
  • Evidence: hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:212, test/core/ai-gateway-dataset.test.js:38, test/plugins/central-forward-chunking.test.js:44
  • Why it matters: The one-line dataset fix the PR calls load-bearing ("Without this ... AI-gateway rows never leave the gateway") is asserted nowhere — ai-gateway-dataset.test.js checks cachePartitioning but not sourceSignal, and the new chunking tests hardcode makeQuery('logs'), so neither the proxy signal path nor the unknown-signal fallback throw is ever exercised; deleting the field would silently restore the never-forwarded failure mode this PR exists to fix (which originally shipped because of exactly this kind of coverage gap).
  • Suggested fix: Add assert.equal(dataset.sourceSignal, 'proxy') to test/core/ai-gateway-dataset.test.js, plus a chunking-test case where getDataset returns no sourceSignal and the export result is failed with the partition in retryPartitions and an "unknown signal" error.

MAX_CHUNK_BYTES bound is never exercised by the new tests

  • Severity: major
  • Confidence: 88
  • Evidence: test/plugins/central-forward-chunking.test.js:31 vs hypaware-core/plugins-workspace/central/src/sink.js:180
  • Why it matters: Test rows are ~50 bytes, so 5000-row chunks are ~250 KB and only the row bound ever trips; the byte budget — the bound that actually prevents the OOM/oversized-body scenario for wide rows (large content_text, the dataset that crashed the daemon) — is untested, including the pendingBytes reset in flushChunk and the line.length (UTF-16 units) vs UTF-8 byte approximation.
  • Suggested fix: Add a test with few but wide rows (e.g. 10 rows of ~1 MiB content_text) asserting the partition splits into multiple POSTs each under MAX_CHUNK_BYTES plus one row.

Stale "groups partitions by signal" docs left from the pre-rewrite design

  • Severity: minor
  • Confidence: 95
  • Evidence: hypaware-core/plugins-workspace/central/src/sink.js:22, hypaware-core/plugins-workspace/central/proto.md:127
  • Why it matters: The PR deleted groupBySignal and forwards each partition independently in chunks, yet the createForwardSink header JSDoc still claims exportBatch "groups the driver's partitions by signal ... serializes each group as NDJSON", and proto.md's Ingest intro still says the kernel groups partitions by sourceSignal before posting — both contradicting the rewritten Batch-boundaries section in the same PR, in a repo whose CLAUDE.md mandates living docs (found independently by three reviewers).
  • Suggested fix: Rewrite the createForwardSink JSDoc summary and proto.md's Ingest intro to describe per-partition streaming in bounded chunks, one signal per POST resolved via sourceSignal.

"LLP 0001 idempotency ledger" citation points at the wrong document

  • Severity: minor
  • Confidence: 90
  • Evidence: hypaware-core/plugins-workspace/central/src/sink.js:134, llp/0001-adopting-llp.plan.md:1
  • Why it matters: This repo's LLP 0001 is "Adopting Linked Literate Programming in HypAware" (a process plan); no LLP in the corpus documents a server idempotency ledger, and proto.md's convention for the server repo's corpus is the qualified form "server LLP NNNN" — so the new unqualified citation misdirects readers and violates the repo's "keep refs honest" rule (found independently by four reviewers).
  • Suggested fix: Qualify it (e.g. "server LLP 0001", after verifying that number) or drop the LLP citation and point at proto.md's X-Hyp-Batch-Id section instead.

New chunked POSTs have no chunk-level telemetry, so a failed chunk is unidentifiable

  • Severity: minor
  • Confidence: 82
  • Evidence: hypaware-core/plugins-workspace/central/src/sink.js:164 (no log in flushChunk), :79 (failure log lacks batch id / chunk position)
  • Why it matters: CLAUDE.md's Log-Driven Development section requires structured logs around external calls, retries, and error paths; the PR turns one POST per signal-group into N idempotent POSTs per partition keyed by X-Hyp-Batch-Id, but emits nothing per chunk, and the central.forward.failed warn carries neither the failing chunk's batch id nor how many chunks had already succeeded — exactly what's needed to debug the new dedup/retry behavior against the server ledger.
  • Suggested fix: Emit a structured debug log per chunk (e.g. central.forward.chunk with hyp_sink_signal, hyp_dataset, batch_id, rows, bytes), and include the failing chunk's batch_id plus a chunks-sent counter in the central.forward.failed warn fields.

401-refresh resend with the same X-Hyp-Batch-Id is documented but untested

  • Severity: minor
  • Confidence: 84
  • Evidence: hypaware-core/plugins-workspace/central/src/sink.js:255, test/plugins/central-forward-chunking.test.js:48
  • Why it matters: The rewritten postNdjson doc promises the 401 retry "re-send[s] the same body + key, so the retry stays idempotent", but no test issues a 401 (the harness's makeIdentity().refreshes getter is built and never asserted), so a regression that drops the header or recomputes the id on retry would pass the suite.
  • Suggested fix: Add a case whose responder returns 401 then 202, asserting identityClient.refreshes === 1, exactly two fetch calls, and identical x-hyp-batch-id and body across both.

proto.md still claims Retry-After/poison handling the client doesn't implement

  • Severity: minor
  • Confidence: 82
  • Evidence: hypaware-core/plugins-workspace/central/proto.md:153 vs hypaware-core/plugins-workspace/central/src/sink.js:294
  • Why it matters: The claim predates this PR, but the PR edits the same ingest section two paragraphs below and its chunking multiplies request volume, making the documented-but-unimplemented Retry-After handling (the PR's own admitted 429 follow-up) much more likely to bite; the doc says Retry-After "is honored when present" while the sink throws uniformly on any non-2xx.
  • Suggested fix: Mark those proto.md response-handling paragraphs as aspirational/not-yet-implemented in this PR (or implement them), so the protocol doc matches the admitted gap.

Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr-100/dual-review/pr-100

Addresses the dual-agent review on PR #100.

sink.js
- X-Hyp-Batch-Id now keys on (signal, tablePath, chunkIndex, body)
  instead of (signal, body). Byte-identical chunks for one signal no
  longer alias onto a single server-ledger entry (which would 202-ack
  and silently drop the later chunk). Retry idempotency is preserved:
  re-streaming a partition reproduces identical boundaries, so prefix
  chunks still hash to the same ids. (Codex major.)
- Per-chunk telemetry: emit central.forward.chunk (signal, dataset,
  batch_id, chunk_index, rows, bytes); the central.forward.failed warn
  now carries the failing chunk's batch_id and a chunks_sent counter.
- Count chunk bytes in UTF-8, not UTF-16 code units, so the byte budget
  bounds real wire size for multibyte payloads.
- Fix header JSDoc (per-partition streaming, not group-by-signal) and
  the idempotency-ledger citation (server LLP 0001, not this repo's).

proto.md
- Rewrite the Ingest intro + Batch-boundaries + X-Hyp-Batch-Id text to
  describe per-partition chunked streaming and the position-keyed id.
- Add a "Client status (v1.4.x)" note: poison-drop (400/422) and
  Retry-After (429/503) are target-only; the sink retries uniformly on
  any non-2xx today.

tests
- central-forward-chunking: byte-identical chunks get distinct ids;
  no-sourceSignal dataset fails the partition with "unknown signal";
  wide rows split on the byte budget (not row count); 401 re-sends the
  same body + batch-id after one refresh; chunk + failure telemetry.
- ai-gateway-dataset: assert sourceSignal === 'proxy'.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@philcunliffe philcunliffe merged commit 5df8533 into worktree-remote-config-join-flow Jun 13, 2026
6 checks passed
@philcunliffe philcunliffe deleted the fix/central-forward-oom branch June 13, 2026 02:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant