1.4.1: fix central forward-sink OOM on backlog + unforwardable ai_gateway_messages#100
Merged
philcunliffe merged 3 commits intoJun 13, 2026
Conversation
…ateway_messages under proxy The central forward sink materialized each partition's full row set into one in-memory NDJSON string before POSTing. A gateway joining a fleet with a large existing cache (here ~552k rows) OOM-crash-looped the daemon on its first export — the recorder kept flushing per-exchange so local capture survived, but no scheduled export ever completed and the server received zero rows. Stream each partition through bounded chunks instead (row-count + byte budget, both far under the server's max body), never holding the whole table. Each chunk POSTs with a content-derived X-Hyp-Batch-Id so the server's idempotency ledger dedupes re-sends: when the driver re-hands a partition after a transport failure, already-delivered chunks ack 202 and only the remainder is stored, converging a partial-then-retried partition to exactly-once (the prior code sent no batch id at all — at-least-once). Also give the ai_gateway_messages dataset `sourceSignal: 'proxy'`. It had none, so the forward sink fell back to the dataset name as the signal and threw `unknown signal 'ai_gateway_messages'` — the AI-gateway dataset has never been forwardable. The OOM masked this; chunking exposed it. The server already maps the `proxy` signal to ai_gateway_messages. Validated end-to-end against a live server: 60k+ rows forwarded and queryable, daemon stable, no OOM. proto.md updated for the chunked, idempotent ingest. Adds test/plugins/central-forward-chunking.test.js. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Contributor
Author
Dual-agent review —
|
| Source | Finding (severity, evidence) | Intersects |
|---|---|---|
| Codex | Batch-id collision for byte-identical chunks (major, sink.js:171,201,282; proto.md:144) | Wire-contract risk (server ledger not in repo); Targets: batchIdForBody |
| Claude | sourceSignal: 'proxy' fix has zero regression coverage (major, dataset.js:212) |
Risk bullet 3; Targets: aiGatewayDatasetRegistration |
| Claude | MAX_CHUNK_BYTES bound never exercised (major, test:31 vs sink.js:180) |
Risk bullet 2; Targets: forwardPartition |
| Claude | Stale "groups by signal" docs (minor, sink.js:22; proto.md:127) | Targets: proto.md wire contract |
| Claude | "LLP 0001" citation points at wrong doc (minor, sink.js:134) | Targets: sink.js |
| Claude | No chunk-level telemetry (minor, sink.js:164,79) | Error-path observability of the new chunk loop |
| Claude | 401-refresh same-batch-id resend untested (minor, sink.js:255) | Wire-contract risk; Targets: postNdjson |
| Claude | proto.md claims unimplemented Retry-After handling (minor, proto.md:153) | Risk bullet 4 |
Codex review
Fix Validations
Central forward sink materialized whole partitions and OOMed on backlog
- Status: correct
- Evidence: hypaware-core/plugins-workspace/central/src/sink.js:164, hypaware-core/plugins-workspace/central/src/sink.js:180, hypaware-core/plugins-workspace/central/src/sink.js:184
- Assessment: The sink now streams rows and flushes bounded chunks instead of materializing every partition row into one array/body. Existing sink-driver retry handling would not have fixed the OOM by itself because it only persists retry partitions after
exportBatchreturns or throws: src/core/sinks/driver.js:89, src/core/sinks/driver.js:123.
ai_gateway_messages forwarded under an unknown signal
- Status: correct
- Evidence: hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:205, hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:214, hypaware-core/plugins-workspace/central/src/sink.js:150
- Assessment: The AI-gateway dataset now resolves to the known
proxysignal, avoiding the previous fallback toai_gateway_messages. Existing retry handling would only have retried the same unknown-signal failure.
Findings
1) Behavioral Correctness
- Severity: major
- Confidence: high
- Evidence: hypaware-core/plugins-workspace/central/src/sink.js:171, hypaware-core/plugins-workspace/central/src/sink.js:201, hypaware-core/plugins-workspace/central/src/sink.js:282, hypaware-core/plugins-workspace/central/proto.md:144
- Why it matters:
X-Hyp-Batch-Idis derived only fromsignal + body, so two legitimate byte-identical chunks for the same signal get the same id and the server contract says the repeated key is acked without re-storing, silently dropping the later chunk. - Suggested fix: Include stable occurrence context in the idempotency key, such as partition identity plus chunk index plus body hash, and add a regression test with two byte-identical chunks that must produce distinct batch ids while preserving retry determinism.
No Finding
- Contract & Interface Fidelity
- Change Impact / Blast Radius
- Concurrency, Ordering & State Safety
- Error Handling & Resilience
- Security Surface
- Resource Lifecycle & Cleanup
- Release Safety
- Test Evidence Quality, aside from the duplicate-chunk regression tied to finding 1
- Architectural Consistency
- Debuggability & Operability
Evidence Bundle
- Changed hot paths: central forward sink export loop and POST path; AI-gateway dataset registration; central ingest protocol docs; package version.
- Impacted callers: hypaware-core/plugins-workspace/central/index.js:63; src/core/sinks/driver.js:89; src/core/sinks/driver.js:123; src/core/sinks/driver.js:127.
- Impacted tests: test/plugins/central-forward-chunking.test.js:97; test/plugins/central-forward-chunking.test.js:122; test/plugins/central-forward-chunking.test.js:132; hypaware-core/smoke/flows/central_forward_outbox.js:112.
- Unresolved uncertainty: I did not inspect the server ledger implementation, only the updated protocol contract; I did not run the test suite for this review.
Claude review
Claude review
Bug #2 fix (sourceSignal: 'proxy') has zero regression coverage
- Severity: major
- Confidence: 87
- Evidence: hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:212, test/core/ai-gateway-dataset.test.js:38, test/plugins/central-forward-chunking.test.js:44
- Why it matters: The one-line dataset fix the PR calls load-bearing ("Without this ... AI-gateway rows never leave the gateway") is asserted nowhere —
ai-gateway-dataset.test.jscheckscachePartitioningbut notsourceSignal, and the new chunking tests hardcodemakeQuery('logs'), so neither theproxysignal path nor the unknown-signal fallback throw is ever exercised; deleting the field would silently restore the never-forwarded failure mode this PR exists to fix (which originally shipped because of exactly this kind of coverage gap). - Suggested fix: Add
assert.equal(dataset.sourceSignal, 'proxy')to test/core/ai-gateway-dataset.test.js, plus a chunking-test case wheregetDatasetreturns nosourceSignaland the export result isfailedwith the partition inretryPartitionsand an "unknown signal" error.
MAX_CHUNK_BYTES bound is never exercised by the new tests
- Severity: major
- Confidence: 88
- Evidence: test/plugins/central-forward-chunking.test.js:31 vs hypaware-core/plugins-workspace/central/src/sink.js:180
- Why it matters: Test rows are ~50 bytes, so 5000-row chunks are ~250 KB and only the row bound ever trips; the byte budget — the bound that actually prevents the OOM/oversized-body scenario for wide rows (large
content_text, the dataset that crashed the daemon) — is untested, including thependingBytesreset influshChunkand theline.length(UTF-16 units) vs UTF-8 byte approximation. - Suggested fix: Add a test with few but wide rows (e.g. 10 rows of ~1 MiB
content_text) asserting the partition splits into multiple POSTs each under MAX_CHUNK_BYTES plus one row.
Stale "groups partitions by signal" docs left from the pre-rewrite design
- Severity: minor
- Confidence: 95
- Evidence: hypaware-core/plugins-workspace/central/src/sink.js:22, hypaware-core/plugins-workspace/central/proto.md:127
- Why it matters: The PR deleted
groupBySignaland forwards each partition independently in chunks, yet thecreateForwardSinkheader JSDoc still claimsexportBatch"groups the driver's partitions by signal ... serializes each group as NDJSON", and proto.md's Ingest intro still says the kernel groups partitions bysourceSignalbefore posting — both contradicting the rewritten Batch-boundaries section in the same PR, in a repo whose CLAUDE.md mandates living docs (found independently by three reviewers). - Suggested fix: Rewrite the
createForwardSinkJSDoc summary and proto.md's Ingest intro to describe per-partition streaming in bounded chunks, one signal per POST resolved viasourceSignal.
"LLP 0001 idempotency ledger" citation points at the wrong document
- Severity: minor
- Confidence: 90
- Evidence: hypaware-core/plugins-workspace/central/src/sink.js:134, llp/0001-adopting-llp.plan.md:1
- Why it matters: This repo's LLP 0001 is "Adopting Linked Literate Programming in HypAware" (a process plan); no LLP in the corpus documents a server idempotency ledger, and proto.md's convention for the server repo's corpus is the qualified form "server LLP NNNN" — so the new unqualified citation misdirects readers and violates the repo's "keep refs honest" rule (found independently by four reviewers).
- Suggested fix: Qualify it (e.g. "server LLP 0001", after verifying that number) or drop the LLP citation and point at proto.md's
X-Hyp-Batch-Idsection instead.
New chunked POSTs have no chunk-level telemetry, so a failed chunk is unidentifiable
- Severity: minor
- Confidence: 82
- Evidence: hypaware-core/plugins-workspace/central/src/sink.js:164 (no log in
flushChunk), :79 (failure log lacks batch id / chunk position) - Why it matters: CLAUDE.md's Log-Driven Development section requires structured logs around external calls, retries, and error paths; the PR turns one POST per signal-group into N idempotent POSTs per partition keyed by
X-Hyp-Batch-Id, but emits nothing per chunk, and thecentral.forward.failedwarn carries neither the failing chunk's batch id nor how many chunks had already succeeded — exactly what's needed to debug the new dedup/retry behavior against the server ledger. - Suggested fix: Emit a structured debug log per chunk (e.g.
central.forward.chunkwithhyp_sink_signal,hyp_dataset,batch_id,rows,bytes), and include the failing chunk'sbatch_idplus a chunks-sent counter in thecentral.forward.failedwarn fields.
401-refresh resend with the same X-Hyp-Batch-Id is documented but untested
- Severity: minor
- Confidence: 84
- Evidence: hypaware-core/plugins-workspace/central/src/sink.js:255, test/plugins/central-forward-chunking.test.js:48
- Why it matters: The rewritten
postNdjsondoc promises the 401 retry "re-send[s] the same body + key, so the retry stays idempotent", but no test issues a 401 (the harness'smakeIdentity().refreshesgetter is built and never asserted), so a regression that drops the header or recomputes the id on retry would pass the suite. - Suggested fix: Add a case whose responder returns 401 then 202, asserting
identityClient.refreshes === 1, exactly two fetch calls, and identicalx-hyp-batch-idand body across both.
proto.md still claims Retry-After/poison handling the client doesn't implement
- Severity: minor
- Confidence: 82
- Evidence: hypaware-core/plugins-workspace/central/proto.md:153 vs hypaware-core/plugins-workspace/central/src/sink.js:294
- Why it matters: The claim predates this PR, but the PR edits the same ingest section two paragraphs below and its chunking multiplies request volume, making the documented-but-unimplemented
Retry-Afterhandling (the PR's own admitted 429 follow-up) much more likely to bite; the doc saysRetry-After"is honored when present" while the sink throws uniformly on any non-2xx. - Suggested fix: Mark those proto.md response-handling paragraphs as aspirational/not-yet-implemented in this PR (or implement them), so the protocol doc matches the admitted gap.
Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr-100/dual-review/pr-100
Addresses the dual-agent review on PR #100. sink.js - X-Hyp-Batch-Id now keys on (signal, tablePath, chunkIndex, body) instead of (signal, body). Byte-identical chunks for one signal no longer alias onto a single server-ledger entry (which would 202-ack and silently drop the later chunk). Retry idempotency is preserved: re-streaming a partition reproduces identical boundaries, so prefix chunks still hash to the same ids. (Codex major.) - Per-chunk telemetry: emit central.forward.chunk (signal, dataset, batch_id, chunk_index, rows, bytes); the central.forward.failed warn now carries the failing chunk's batch_id and a chunks_sent counter. - Count chunk bytes in UTF-8, not UTF-16 code units, so the byte budget bounds real wire size for multibyte payloads. - Fix header JSDoc (per-partition streaming, not group-by-signal) and the idempotency-ledger citation (server LLP 0001, not this repo's). proto.md - Rewrite the Ingest intro + Batch-boundaries + X-Hyp-Batch-Id text to describe per-partition chunked streaming and the position-keyed id. - Add a "Client status (v1.4.x)" note: poison-drop (400/422) and Retry-After (429/503) are target-only; the sink retries uniformly on any non-2xx today. tests - central-forward-chunking: byte-identical chunks get distinct ids; no-sourceSignal dataset fails the partition with "unknown signal"; wide rows split on the byte budget (not row count); 401 re-sends the same body + batch-id after one refresh; chunk + failure telemetry. - ai-gateway-dataset: assert sourceSignal === 'proxy'. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
5df8533
into
worktree-remote-config-join-flow
6 checks passed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two bugs in the v1.4.0
@hypaware/centralforward sink left it unable to forward a real backlog. Found while standing up a fleet against a live central server: a gateway joining with a large existing cache (~552k rows) OOM-crash-looped the daemon every ~71s, and even after that was fixed the AI-gateway dataset turned out to have never been forwardable.1. Unbatched export → OOM (the crash)
central/src/sink.jsmaterialized each partition's entire row set into an in-memory array and thenrows.map(JSON.stringify).join('\n')into one giant NDJSON string. On a large backlog this exhausts the heap (FATAL ERROR: Reached heap limit); the string would also exceed V8's max length. KeepAlive restarts the daemon, which OOMs again — a loop. The recorder flushes per-exchange synchronously, so local capture/hyp querykept working, which masked it; only scheduled exports failed and the server got 0 rows.Fix: stream each partition through bounded chunks (row-count + byte budget, both far under the server's max body), never holding the whole table. Each chunk is an independent POST carrying a content-derived
X-Hyp-Batch-Id, so the server's idempotency ledger dedupes re-sends — a partition re-handed after a transport failure converges to exactly-once (the prior code sent no batch id at all → at-least-once).2.
ai_gateway_messageshad nosourceSignal(masked by the OOM)The dataset declared no
sourceSignal, so the forward sink fell back to the dataset name as the signal and threwunknown signal 'ai_gateway_messages'. The v1.4.0 smoke flow only exercised a syntheticlogsdataset with an explicit signal, so this was never caught — the AI-gateway dataset has never been forwardable.Fix:
sourceSignal: 'proxy'on the ai-gateway dataset registration. The server already mapsproxy→ai_gateway_messages.Validation
test/plugins/central-forward-chunking.test.js(5 cases): chunk boundaries, one-POST-per-small-partition, deterministic/idempotent batch-ids, partition-level retry on transport failure, empty-batch no-op.Notes
proto.mdupdated for the chunked, idempotency-keyed ingest.worktree-remote-config-join-flow(the line v1.4.0 shipped from), notmaster— the join flow has diverged from master.Known follow-up (not in this PR)
Bulk backfill trips the server's per-gateway
byte_rate429. The forward sink aborts the whole partition on 429 and doesn't honorRetry-After(proto.md says it should). Needs client-side pacing and/or a higherHYPSERVER_BYTE_RATE_PER_GATEWAY. New (post-join) traffic is small and unaffected.🤖 Generated with Claude Code