1.4.1: fix central forward-sink OOM on backlog + unforwardable ai_gateway_messages by philcunliffe · Pull Request #100 · hyparam/hypaware

philcunliffe · 2026-06-12T23:43:07Z

Summary

Two bugs in the v1.4.0 @hypaware/central forward sink left it unable to forward a real backlog. Found while standing up a fleet against a live central server: a gateway joining with a large existing cache (~552k rows) OOM-crash-looped the daemon every ~71s, and even after that was fixed the AI-gateway dataset turned out to have never been forwardable.

1. Unbatched export → OOM (the crash)

central/src/sink.js materialized each partition's entire row set into an in-memory array and then rows.map(JSON.stringify).join('\n') into one giant NDJSON string. On a large backlog this exhausts the heap (FATAL ERROR: Reached heap limit); the string would also exceed V8's max length. KeepAlive restarts the daemon, which OOMs again — a loop. The recorder flushes per-exchange synchronously, so local capture/hyp query kept working, which masked it; only scheduled exports failed and the server got 0 rows.

Fix: stream each partition through bounded chunks (row-count + byte budget, both far under the server's max body), never holding the whole table. Each chunk is an independent POST carrying a content-derived X-Hyp-Batch-Id, so the server's idempotency ledger dedupes re-sends — a partition re-handed after a transport failure converges to exactly-once (the prior code sent no batch id at all → at-least-once).

2. `ai_gateway_messages` had no `sourceSignal` (masked by the OOM)

The dataset declared no sourceSignal, so the forward sink fell back to the dataset name as the signal and threw unknown signal 'ai_gateway_messages'. The v1.4.0 smoke flow only exercised a synthetic logs dataset with an explicit signal, so this was never caught — the AI-gateway dataset has never been forwardable.

Fix: sourceSignal: 'proxy' on the ai-gateway dataset registration. The server already maps proxy → ai_gateway_messages.

Validation

New unit test test/plugins/central-forward-chunking.test.js (5 cases): chunk boundaries, one-POST-per-small-partition, deterministic/idempotent batch-ids, partition-level retry on transport failure, empty-batch no-op.
Existing central + sink suites green; typecheck clean on changed files.
End-to-end against a live server: 60,645 rows forwarded and queryable on the central server, daemon stable, zero new OOM.

Notes

proto.md updated for the chunked, idempotency-keyed ingest.
Version bumped to 1.4.1.
Targeting worktree-remote-config-join-flow (the line v1.4.0 shipped from), not master — the join flow has diverged from master.

Known follow-up (not in this PR)

Bulk backfill trips the server's per-gateway byte_rate 429. The forward sink aborts the whole partition on 429 and doesn't honor Retry-After (proto.md says it should). Needs client-side pacing and/or a higher HYPSERVER_BYTE_RATE_PER_GATEWAY. New (post-join) traffic is small and unaffected.

🤖 Generated with Claude Code

…ateway_messages under proxy The central forward sink materialized each partition's full row set into one in-memory NDJSON string before POSTing. A gateway joining a fleet with a large existing cache (here ~552k rows) OOM-crash-looped the daemon on its first export — the recorder kept flushing per-exchange so local capture survived, but no scheduled export ever completed and the server received zero rows. Stream each partition through bounded chunks instead (row-count + byte budget, both far under the server's max body), never holding the whole table. Each chunk POSTs with a content-derived X-Hyp-Batch-Id so the server's idempotency ledger dedupes re-sends: when the driver re-hands a partition after a transport failure, already-delivered chunks ack 202 and only the remainder is stored, converging a partial-then-retried partition to exactly-once (the prior code sent no batch id at all — at-least-once). Also give the ai_gateway_messages dataset `sourceSignal: 'proxy'`. It had none, so the forward sink fell back to the dataset name as the signal and threw `unknown signal 'ai_gateway_messages'` — the AI-gateway dataset has never been forwardable. The OOM masked this; chunking exposed it. The server already maps the `proxy` signal to ai_gateway_messages. Validated end-to-end against a live server: 60k+ rows forwarded and queryable, daemon stable, no OOM. proto.md updated for the chunked, idempotent ingest. Adds test/plugins/central-forward-chunking.test.js. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

philcunliffe · 2026-06-13T01:45:09Z

Dual-agent review — `request_changes`

Verdict: request_changes
Risk class: medium
Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source	Finding (severity, evidence)	Intersects
Codex	Batch-id collision for byte-identical chunks (major, sink.js:171,201,282; proto.md:144)	Wire-contract risk (server ledger not in repo); Targets: `batchIdForBody`
Claude	`sourceSignal: 'proxy'` fix has zero regression coverage (major, dataset.js:212)	Risk bullet 3; Targets: `aiGatewayDatasetRegistration`
Claude	`MAX_CHUNK_BYTES` bound never exercised (major, test:31 vs sink.js:180)	Risk bullet 2; Targets: `forwardPartition`
Claude	Stale "groups by signal" docs (minor, sink.js:22; proto.md:127)	Targets: proto.md wire contract
Claude	"LLP 0001" citation points at wrong doc (minor, sink.js:134)	Targets: sink.js
Claude	No chunk-level telemetry (minor, sink.js:164,79)	Error-path observability of the new chunk loop
Claude	401-refresh same-batch-id resend untested (minor, sink.js:255)	Wire-contract risk; Targets: `postNdjson`
Claude	proto.md claims unimplemented Retry-After handling (minor, proto.md:153)	Risk bullet 4

Codex review

Fix Validations

Central forward sink materialized whole partitions and OOMed on backlog

Status: correct
Evidence: hypaware-core/plugins-workspace/central/src/sink.js:164, hypaware-core/plugins-workspace/central/src/sink.js:180, hypaware-core/plugins-workspace/central/src/sink.js:184
Assessment: The sink now streams rows and flushes bounded chunks instead of materializing every partition row into one array/body. Existing sink-driver retry handling would not have fixed the OOM by itself because it only persists retry partitions after exportBatch returns or throws: src/core/sinks/driver.js:89, src/core/sinks/driver.js:123.

`ai_gateway_messages` forwarded under an unknown signal

Status: correct
Evidence: hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:205, hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:214, hypaware-core/plugins-workspace/central/src/sink.js:150
Assessment: The AI-gateway dataset now resolves to the known proxy signal, avoiding the previous fallback to ai_gateway_messages. Existing retry handling would only have retried the same unknown-signal failure.

Findings

1) Behavioral Correctness

Severity: major
Confidence: high
Evidence: hypaware-core/plugins-workspace/central/src/sink.js:171, hypaware-core/plugins-workspace/central/src/sink.js:201, hypaware-core/plugins-workspace/central/src/sink.js:282, hypaware-core/plugins-workspace/central/proto.md:144
Why it matters: X-Hyp-Batch-Id is derived only from signal + body, so two legitimate byte-identical chunks for the same signal get the same id and the server contract says the repeated key is acked without re-storing, silently dropping the later chunk.
Suggested fix: Include stable occurrence context in the idempotency key, such as partition identity plus chunk index plus body hash, and add a regression test with two byte-identical chunks that must produce distinct batch ids while preserving retry determinism.

No Finding

Contract & Interface Fidelity
Change Impact / Blast Radius
Concurrency, Ordering & State Safety
Error Handling & Resilience
Security Surface
Resource Lifecycle & Cleanup
Release Safety
Test Evidence Quality, aside from the duplicate-chunk regression tied to finding 1
Architectural Consistency
Debuggability & Operability

Evidence Bundle

Changed hot paths: central forward sink export loop and POST path; AI-gateway dataset registration; central ingest protocol docs; package version.
Impacted callers: hypaware-core/plugins-workspace/central/index.js:63; src/core/sinks/driver.js:89; src/core/sinks/driver.js:123; src/core/sinks/driver.js:127.
Impacted tests: test/plugins/central-forward-chunking.test.js:97; test/plugins/central-forward-chunking.test.js:122; test/plugins/central-forward-chunking.test.js:132; hypaware-core/smoke/flows/central_forward_outbox.js:112.
Unresolved uncertainty: I did not inspect the server ledger implementation, only the updated protocol contract; I did not run the test suite for this review.

Claude review

Claude review

Bug #2 fix (`sourceSignal: 'proxy'`) has zero regression coverage

Severity: major
Confidence: 87
Evidence: hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:212, test/core/ai-gateway-dataset.test.js:38, test/plugins/central-forward-chunking.test.js:44
Why it matters: The one-line dataset fix the PR calls load-bearing ("Without this ... AI-gateway rows never leave the gateway") is asserted nowhere — ai-gateway-dataset.test.js checks cachePartitioning but not sourceSignal, and the new chunking tests hardcode makeQuery('logs'), so neither the proxy signal path nor the unknown-signal fallback throw is ever exercised; deleting the field would silently restore the never-forwarded failure mode this PR exists to fix (which originally shipped because of exactly this kind of coverage gap).
Suggested fix: Add assert.equal(dataset.sourceSignal, 'proxy') to test/core/ai-gateway-dataset.test.js, plus a chunking-test case where getDataset returns no sourceSignal and the export result is failed with the partition in retryPartitions and an "unknown signal" error.

MAX_CHUNK_BYTES bound is never exercised by the new tests

Severity: major
Confidence: 88
Evidence: test/plugins/central-forward-chunking.test.js:31 vs hypaware-core/plugins-workspace/central/src/sink.js:180
Why it matters: Test rows are ~50 bytes, so 5000-row chunks are ~250 KB and only the row bound ever trips; the byte budget — the bound that actually prevents the OOM/oversized-body scenario for wide rows (large content_text, the dataset that crashed the daemon) — is untested, including the pendingBytes reset in flushChunk and the line.length (UTF-16 units) vs UTF-8 byte approximation.
Suggested fix: Add a test with few but wide rows (e.g. 10 rows of ~1 MiB content_text) asserting the partition splits into multiple POSTs each under MAX_CHUNK_BYTES plus one row.

Stale "groups partitions by signal" docs left from the pre-rewrite design

Severity: minor
Confidence: 95
Evidence: hypaware-core/plugins-workspace/central/src/sink.js:22, hypaware-core/plugins-workspace/central/proto.md:127
Why it matters: The PR deleted groupBySignal and forwards each partition independently in chunks, yet the createForwardSink header JSDoc still claims exportBatch "groups the driver's partitions by signal ... serializes each group as NDJSON", and proto.md's Ingest intro still says the kernel groups partitions by sourceSignal before posting — both contradicting the rewritten Batch-boundaries section in the same PR, in a repo whose CLAUDE.md mandates living docs (found independently by three reviewers).
Suggested fix: Rewrite the createForwardSink JSDoc summary and proto.md's Ingest intro to describe per-partition streaming in bounded chunks, one signal per POST resolved via sourceSignal.

"LLP 0001 idempotency ledger" citation points at the wrong document

Severity: minor
Confidence: 90
Evidence: hypaware-core/plugins-workspace/central/src/sink.js:134, llp/0001-adopting-llp.plan.md:1
Why it matters: This repo's LLP 0001 is "Adopting Linked Literate Programming in HypAware" (a process plan); no LLP in the corpus documents a server idempotency ledger, and proto.md's convention for the server repo's corpus is the qualified form "server LLP NNNN" — so the new unqualified citation misdirects readers and violates the repo's "keep refs honest" rule (found independently by four reviewers).
Suggested fix: Qualify it (e.g. "server LLP 0001", after verifying that number) or drop the LLP citation and point at proto.md's X-Hyp-Batch-Id section instead.

New chunked POSTs have no chunk-level telemetry, so a failed chunk is unidentifiable

Severity: minor
Confidence: 82
Evidence: hypaware-core/plugins-workspace/central/src/sink.js:164 (no log in flushChunk), :79 (failure log lacks batch id / chunk position)
Why it matters: CLAUDE.md's Log-Driven Development section requires structured logs around external calls, retries, and error paths; the PR turns one POST per signal-group into N idempotent POSTs per partition keyed by X-Hyp-Batch-Id, but emits nothing per chunk, and the central.forward.failed warn carries neither the failing chunk's batch id nor how many chunks had already succeeded — exactly what's needed to debug the new dedup/retry behavior against the server ledger.
Suggested fix: Emit a structured debug log per chunk (e.g. central.forward.chunk with hyp_sink_signal, hyp_dataset, batch_id, rows, bytes), and include the failing chunk's batch_id plus a chunks-sent counter in the central.forward.failed warn fields.

401-refresh resend with the same X-Hyp-Batch-Id is documented but untested

Severity: minor
Confidence: 84
Evidence: hypaware-core/plugins-workspace/central/src/sink.js:255, test/plugins/central-forward-chunking.test.js:48
Why it matters: The rewritten postNdjson doc promises the 401 retry "re-send[s] the same body + key, so the retry stays idempotent", but no test issues a 401 (the harness's makeIdentity().refreshes getter is built and never asserted), so a regression that drops the header or recomputes the id on retry would pass the suite.
Suggested fix: Add a case whose responder returns 401 then 202, asserting identityClient.refreshes === 1, exactly two fetch calls, and identical x-hyp-batch-id and body across both.

proto.md still claims Retry-After/poison handling the client doesn't implement

Severity: minor
Confidence: 82
Evidence: hypaware-core/plugins-workspace/central/proto.md:153 vs hypaware-core/plugins-workspace/central/src/sink.js:294
Why it matters: The claim predates this PR, but the PR edits the same ingest section two paragraphs below and its chunking multiplies request volume, making the documented-but-unimplemented Retry-After handling (the PR's own admitted 429 follow-up) much more likely to bite; the doc says Retry-After "is honored when present" while the sink throws uniformly on any non-2xx.
Suggested fix: Mark those proto.md response-handling paragraphs as aspirational/not-yet-implemented in this PR (or implement them), so the protocol doc matches the admitted gap.

Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr-100/dual-review/pr-100

Addresses the dual-agent review on PR #100. sink.js - X-Hyp-Batch-Id now keys on (signal, tablePath, chunkIndex, body) instead of (signal, body). Byte-identical chunks for one signal no longer alias onto a single server-ledger entry (which would 202-ack and silently drop the later chunk). Retry idempotency is preserved: re-streaming a partition reproduces identical boundaries, so prefix chunks still hash to the same ids. (Codex major.) - Per-chunk telemetry: emit central.forward.chunk (signal, dataset, batch_id, chunk_index, rows, bytes); the central.forward.failed warn now carries the failing chunk's batch_id and a chunks_sent counter. - Count chunk bytes in UTF-8, not UTF-16 code units, so the byte budget bounds real wire size for multibyte payloads. - Fix header JSDoc (per-partition streaming, not group-by-signal) and the idempotency-ledger citation (server LLP 0001, not this repo's). proto.md - Rewrite the Ingest intro + Batch-boundaries + X-Hyp-Batch-Id text to describe per-partition chunked streaming and the position-keyed id. - Add a "Client status (v1.4.x)" note: poison-drop (400/422) and Retry-After (429/503) are target-only; the sink retries uniformly on any non-2xx today. tests - central-forward-chunking: byte-identical chunks get distinct ids; no-sourceSignal dataset fails the partition with "unknown signal"; wide rows split on the byte budget (not row count); 401 re-sends the same body + batch-id after one refresh; chunk + failure telemetry. - ai-gateway-dataset: assert sourceSignal === 'proxy'. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

philcunliffe and others added 2 commits June 12, 2026 15:45

Bump version to 1.4.1

c9c1098

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

philcunliffe merged commit 5df8533 into worktree-remote-config-join-flow Jun 13, 2026
6 checks passed

philcunliffe deleted the fix/central-forward-oom branch June 13, 2026 02:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1.4.1: fix central forward-sink OOM on backlog + unforwardable ai_gateway_messages#100

1.4.1: fix central forward-sink OOM on backlog + unforwardable ai_gateway_messages#100
philcunliffe merged 3 commits into
worktree-remote-config-join-flowfrom
fix/central-forward-oom

philcunliffe commented Jun 12, 2026

Uh oh!

philcunliffe commented Jun 13, 2026

Fix Validations

Central forward sink materialized whole partitions and OOMed on backlog

`ai_gateway_messages` forwarded under an unknown signal

Findings

1) Behavioral Correctness

No Finding

Evidence Bundle

Claude review

Bug #2 fix (`sourceSignal: 'proxy'`) has zero regression coverage

MAX_CHUNK_BYTES bound is never exercised by the new tests

Stale "groups partitions by signal" docs left from the pre-rewrite design

"LLP 0001 idempotency ledger" citation points at the wrong document

New chunked POSTs have no chunk-level telemetry, so a failed chunk is unidentifiable

401-refresh resend with the same X-Hyp-Batch-Id is documented but untested

proto.md still claims Retry-After/poison handling the client doesn't implement

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

philcunliffe commented Jun 12, 2026

Summary

1. Unbatched export → OOM (the crash)

2. ai_gateway_messages had no sourceSignal (masked by the OOM)

Validation

Notes

Known follow-up (not in this PR)

Uh oh!

philcunliffe commented Jun 13, 2026

Dual-agent review — request_changes

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Fix Validations

Central forward sink materialized whole partitions and OOMed on backlog

ai_gateway_messages forwarded under an unknown signal

Findings

1) Behavioral Correctness

No Finding

Evidence Bundle

Claude review

Bug #2 fix (sourceSignal: 'proxy') has zero regression coverage

MAX_CHUNK_BYTES bound is never exercised by the new tests

Stale "groups partitions by signal" docs left from the pre-rewrite design

"LLP 0001 idempotency ledger" citation points at the wrong document

New chunked POSTs have no chunk-level telemetry, so a failed chunk is unidentifiable

401-refresh resend with the same X-Hyp-Batch-Id is documented but untested

proto.md still claims Retry-After/poison handling the client doesn't implement

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

2. `ai_gateway_messages` had no `sourceSignal` (masked by the OOM)

Dual-agent review — `request_changes`

`ai_gateway_messages` forwarded under an unknown signal

Bug #2 fix (`sourceSignal: 'proxy'`) has zero regression coverage