Skip to content

Remote config and join flow (LLP 0025)#98

Merged
philcunliffe merged 8 commits into
masterfrom
remote-config-join-flow
Jun 12, 2026
Merged

Remote config and join flow (LLP 0025)#98
philcunliffe merged 8 commits into
masterfrom
remote-config-join-flow

Conversation

@philcunliffe

Copy link
Copy Markdown
Contributor

Implements the client half of centrally-managed gateway configuration, per LLP 0023 (grilled + reviewed before implementation; review record in notes-archive/llp-reviews/).

What this adds

Spec + supporting docs (first two commits)

  • LLP 0023: join sequence, config pull loop, seed-config mode, staged-restart apply semantics, hash-pinned install-on-config, last-known-good rollback with post-apply probation. Renumbered from 0022 after feat(sinks): partition iceberg exports by day with conversation sort #91 took that slot. Open questions settled at implementation: poll default 300s, probation floor 120s, max document size 1 MiB. Status flipped to Active.
  • LLP 0003 (apply engine added to core-owns), LLP 0011 (join as non-interactive entry), LLP 0017 (staged restart + restart exit code 75), central proto.md (policy tokens, running-config If-None-Match convergence semantics, hash-pinned plugins, 404 demoted to legacy).

Kernel

  • src/core/config/apply.js — the config apply engine: validate → install pinned plugins → persist to A/B slots → atomic pointer flip → staged restart; last-known-good rollback, single remembered bad etag with re-apply backoff, structured rollback reasons. Each slot carries its served etag in a sidecar written before the flip, so the document and its etag commit on the same rename in both directions (apply and rollback).
  • Narrow ctx.configControl facade (stage / confirmPoll / runningEtag), present only in daemon mode — CLI boots never fire config polls.
  • Kernel-owned probation watchdog: window max(3 × poll_interval_seconds, 120s) sized from the staged document; cleared only by a confirmed authenticated poll; expiry also evaluated at boot before plugin activation (crashloop case); orphaned markers from a crash before the flip are discarded.
  • Staged restart: DAEMON_RESTART_EXIT_CODE (75) for foreground invokers; installers already render KeepAlive / Restart=always, now pinned by tests. Sinks are closed on daemon shutdown. Restart requests that race daemon startup are parked and replayed rather than dropped.
  • Hash-pinned install-on-config through the existing LLP 0007 path (pin verified against the staged artifact before the install commits); bundled first-party plugins are version-strict with the hash check skipped.
  • hypaware join <url> [token] [--token-file|stdin] [--no-daemon] — writes the mode-0600 seed config and runs the non-interactive daemon install; a wrapper over the two existing steps, not a second code path.
  • hypaware status gains a remote-config section: probation state, last rollback + structured reason, remembered bad etag, running etag (text + JSON).

Central plugin (transport only)

  • central/src/config_client.js — the config pull loop: immediate pull on bootstrap success, steady timer, If-None-Match always reflecting the running config, 401 refresh-and-retry, legacy-404 backoff, 429/503 Retry-After + linear backoff, 1 MiB body cap.
  • Removed the never-wired config_etag_path option (the running etag is kernel-managed, read through the facade).

Testing

  • 35 new traditional tests: apply-engine state machine (18 — apply / rollback / bad-etag backoff / probation expiry at boot / watchdog / crash-orphan cleanup), pull loop (10), join (5), installer relaunch defaults + exit-code distinctness. Full suite: 865 pass / 0 fail.
  • New hermetic smoke join_flow_remote_config: join → seed boot → identity bootstrap → pull (200) → kernel apply → staged restart (exit 75) → relaunch on the served config → probation cleared by a 304, against a stub central server — asserting wire-level convergence (If-None-Match transitions), token retirement from disk, seed preserved as the rollback slot, and the config.apply / config.applied / config.probation_cleared telemetry.
  • daemon_foreground_start_stop, core_boot_noop, gateway_codex_capture all pass.
  • All 94 @ref annotations validate against the LLP corpus.

Notes for reviewers

  • ⚠️ central_forward_outbox fails identically on a clean origin/master checkout (empty ingest rows) — pre-existing, likely from feat(sinks): partition iceberg exports by day with conversation sort #91's day-partitioning change, not touched here.
  • The one deliberately open question (strict version pins for bundled plugins vs rolling kernel upgrades) stays open in LLP 0023 — strict now, relax if upgrade thrash shows up.
  • Server side (hypaware-server LLP 0009) lands separately and ships dark; nothing here blocks on it except end-to-end testing.

🤖 Generated with Claude Code

philcunliffe and others added 4 commits June 12, 2026 12:57
Specifies the client half of centrally-managed gateway configuration:
join sequence, config pull loop, seed-config mode, staged-restart apply
semantics, hash-pinned install-on-config, and last-known-good rollback
with post-apply probation.

Supporting amendments:
- LLP 0003: config apply engine added to the core-owns list
- LLP 0011: join added as a non-interactive entry point
- LLP 0017: staged restart for config replacement + installer relaunch
  requirement
- central proto.md: policy tokens, running-config If-None-Match
  convergence semantics, hash-pinned plugins, 404 demoted to legacy
- LLP 0000: subsystem map updated
- notes-archive: round-1 review record for LLP 0022

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Origin master took 0022 for iceberg-export-partitioning (#91) while the
join-flow spec was drafted on a local branch; renumber to the next free
slot and update all cross-references.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Kernel side:
- Config apply engine (src/core/config/apply.js): validate -> install
  pinned plugins -> persist to A/B slots -> atomic pointer flip ->
  staged restart, with last-known-good rollback, bad-etag re-apply
  backoff, and structured rollback reasons. The served etag lives in a
  per-slot sidecar written before the flip, so the document and its
  etag commit on the same rename in both directions.
- Narrow ctx.configControl facade (stage / confirmPoll / runningEtag)
  exposed only in daemon mode; CLI boots leave it undefined so
  transport plugins keep their pull loops off.
- Kernel-owned probation watchdog: window max(3 x poll_interval, 120s)
  sized from the staged document, cleared only by a confirmed poll,
  expiry also evaluated at boot before plugin activation (crashloop
  case), orphaned markers from a crash before the flip are discarded.
- Staged restart: DAEMON_RESTART_EXIT_CODE (75) for foreground
  invokers; service managers already relaunch via KeepAlive /
  Restart=always (now pinned by tests). Sinks are closed on shutdown
  so plugin timers stop.
- Hash-pinned install-on-config through the LLP 0007 path; pin
  verified against the staged artifact before the install commits.
  Bundled first-party plugins: version strict, hash skipped.
- Config shape: plugin entries accept version / artifact_hash / source
  pins.
- hypaware status: remote-config section (probation, last rollback +
  reason, remembered bad etag, running etag) in text and JSON.
- hypaware join <url> [token] [--token-file|stdin] [--no-daemon]:
  writes the mode-0600 seed config and runs the daemon install — a
  wrapper over the two existing steps, not a second code path.

Central plugin (transport only):
- Config pull loop (central/src/config_client.js): immediate pull on
  bootstrap success, steady timer, If-None-Match always the running
  config's etag, 401 refresh-retry, 404 legacy backoff, 429/503
  Retry-After + linear backoff, 1 MiB body cap.
- Dropped the never-wired config_etag_path option (the etag is
  kernel-managed and read through the facade).

Settled LLP 0023 open questions: poll default 300s, probation floor
120s, max document size 1 MiB. proto.md sidecar wording updated; the
restart exit code recorded in LLP 0017.

Tests: apply-engine state machine (18), pull loop (10), join command
(5), installer relaunch defaults; join_flow_remote_config hermetic
smoke drives seed -> bootstrap -> pull -> apply -> restart ->
probation clear against a stub server with convergence assertions.

Note: central_forward_outbox was already failing on origin/master
(empty ingest rows) before this change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@philcunliffe

Copy link
Copy Markdown
Contributor Author

Dual-agent review — request_changes

  • Verdict: request_changes
  • Risk class: high
  • Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source Finding (severity, evidence) Intersects
Codex Install-on-config validation-before-install (major, apply.js:311/322, apply_deps.js:51-74) Risks #1; Direct callers installPinnedPlugins
Codex 1 MiB cap enforced after full buffering (major, config_client.js:146-147) Risks #3; Concurrency surface (response buffering)
Codex Poll fetch lacks abort/timeout, shutdown can hang (major, config_client.js:232-259, runtime.js:421) Risks #2; Concurrency surface (pull loop)
Claude typecheck broken: bad type-import path (major, test/core/config-apply.test.js:69) Risks #5
Claude typecheck: literal-type exit-code comparisons (minor, test/core/daemon.test.js:199) Risks #5; Direct callers DAEMON_RESTART_EXIT_CODE
Claude Pin-enforcement logic untested (major, apply_deps.js:74-141) Risks #4; Config field chain artifact_hash
Claude Inline import() types ban violated (minor, apply.js:265 et al.) Targets (new config files)
Claude Size-cap comment/spec claims pre-buffering drop (minor, config_client.js:18) Risks #3
Claude 429/503 + parseRetryAfter untested (minor, config_client.js:209-216) Config field chain config_poll_interval_seconds
Claude closeAllSinks comment contradicts lazy-refresh comments (nit, runtime.js:611-613) Direct callers closeAllSinks
Codex review

Fix Validations

No explicit bug-fix claims were in scope; reviewed as a feature implementation.

Findings

1) Behavioral Correctness

  • Severity: major
  • Confidence: high
  • Evidence: src/core/config/apply.js:311, src/core/config/apply.js:322, src/core/config/apply_deps.js:51, src/core/config/apply_deps.js:56, src/core/config/apply_deps.js:74
  • Why it matters: Install-on-config cannot reliably work for a remotely served plugin that is not already installed, because catalog-backed validation runs before the pinned install path.
  • Suggested fix: Shape-parse first, install pinned non-bundled plugins from the parsed plugin entries, rebuild the catalog, then run full validateConfig; add a test with a non-bundled, initially absent pinned plugin.

6) Security Surface

  • Severity: major
  • Confidence: high
  • Evidence: hypaware-core/plugins-workspace/central/src/config_client.js:15, hypaware-core/plugins-workspace/central/src/config_client.js:146, hypaware-core/plugins-workspace/central/src/config_client.js:147
  • Why it matters: The advertised 1 MiB transport cap is enforced only after response.text() buffers the full body, so a bad or compromised central endpoint can still force unbounded memory use.
  • Suggested fix: Read response.body as a stream with a byte counter and cancel once the cap is exceeded; optionally pre-reject oversized Content-Length.

7) Resource Lifecycle & Cleanup

  • Severity: major
  • Confidence: medium
  • Evidence: hypaware-core/plugins-workspace/central/src/config_client.js:232, hypaware-core/plugins-workspace/central/src/config_client.js:234, hypaware-core/plugins-workspace/central/src/config_client.js:252, hypaware-core/plugins-workspace/central/src/config_client.js:259, src/core/daemon/runtime.js:421
  • Why it matters: stop() waits for an in-flight poll, but the fetch has no abort signal or timeout; daemon shutdown or staged restart can hang indefinitely behind a stalled config GET.
  • Suggested fix: Create an AbortController per poll, pass signal to fetch, abort it in stop(), and add a bounded request timeout plus a test with a never-resolving fetch.

No Finding

  1. Contract & Interface Fidelity
  2. Change Impact / Blast Radius
  3. Concurrency, Ordering & State Safety
  4. Error Handling & Resilience
  5. Release Safety
  6. Test Evidence Quality
  7. Architectural Consistency
  8. Debuggability & Operability

Evidence Bundle

  • Changed hot paths: central config pull loop; kernel config apply engine; pinned plugin install deps; daemon boot/probation/restart; join CLI; remote-config status.
  • Impacted callers: hypaware-core/plugins-workspace/central/index.js:60; src/core/daemon/runtime.js:213; src/core/daemon/runtime.js:260; src/core/daemon/runtime.js:421; src/core/cli/core_commands.js:2811.
  • Impacted tests: test/core/config-apply.test.js:91; test/plugins/central-config-pull.test.js:85; test/core/join-command.test.js:41; test/core/daemon.test.js:176; hypaware-core/smoke/flows/join_flow_remote_config.js:47.
  • Unresolved uncertainty: I did not inspect validateConfig internals because of the 5-file review cap; finding 1 is based on the changed pipeline and its explicit knownPlugins catalog dependency. Tests were not run.
Claude review

Claude review

PR breaks npm run typecheck: broken type-import path in new test

  • Severity: major
  • Confidence: 95
  • Evidence: test/core/config-apply.test.js:69
  • Why it matters: The inline import('../../../collectivus-plugin-kernel-types.d.ts') resolves one directory too high (file is at test/core/, repo root is ../../), producing TS2307 and turning a previously clean npm run typecheck red (verified: 4 errors at PR head, 0 at base).
  • Suggested fix: Add PluginConfigInstance to a top-of-file @import from '../../collectivus-plugin-kernel-types.d.ts' and reference the bare name in the @param.

PR breaks npm run typecheck: literal-type exit-code comparisons flagged by tsc

  • Severity: minor
  • Confidence: 90
  • Evidence: test/core/daemon.test.js:199
  • Why it matters: DAEMON_RESTART_EXIT_CODE is inferred as the literal type 75, so !== 0 && !== 1 && !== 2 raises three TS2367 "comparison appears to be unintentional" errors, contributing to the now-failing npm run typecheck gate.
  • Suggested fix: Widen the value before comparing (e.g. const code = Number(DAEMON_RESTART_EXIT_CODE) or annotate /** @type {number} */) so the distinctness assertions typecheck.

Pin-enforcement logic in apply_deps.js has no traditional tests

  • Severity: major
  • Confidence: 85
  • Evidence: src/core/config/apply_deps.js:74-141
  • Why it matters: The apply engine tests mock installPinnedPlugins entirely, so the real deterministic pin-validation decisions — the bundled_version_mismatch strict-equality check, the satisfied-lock skip predicate, the confirm-callback hash rejection, and the remote_install_rejectedartifact_hash_mismatch errorKind mapping — are asserted nowhere, and the hash pin is the PR's core "nothing may substitute code after the config was authored" security property; CLAUDE.md requires traditional tests for config-validation logic like this.
  • Suggested fix: Add test/core/config-apply-deps.test.js covering: pinned bundled plugin with mismatched version → bundled_version_mismatch; matching bundled version accepted without install; already-installed lock entry matching version+hash skipped; hash-pin mismatch → artifact_hash_mismatch. Alternatively make the install/discover dependencies injectable like the engine's ConfigApplyDeps.

Inline import() types in new code despite explicit ban

  • Severity: minor
  • Confidence: 90
  • Evidence: src/core/config/apply.js:265, test/core/config-apply.test.js:55, test/core/config-apply.test.js:69, hypaware-core/smoke/flows/join_flow_remote_config.js:330
  • Why it matters: CLAUDE.md Code Style says "Never use inline import('...') types. Declare type imports at the top of the file with @import JSDoc comments" — these are all new files that even have @import blocks at the top, so the rule was known and inconsistently applied.
  • Suggested fix: Add ConfigApplyErrorKind to apply.js's existing top-of-file @import block; add PinnedInstallResult/PluginConfigInstance to the test's @import block; add @import { AddressInfo } from 'node:net' to the smoke flow.

Transport size-cap comment and spec claim a pre-buffering drop the code doesn't implement

  • Severity: minor
  • Confidence: 80
  • Evidence: hypaware-core/plugins-workspace/central/src/config_client.js:18 (vs config_client.js:146-147 and llp/0023 "Config pull loop")
  • Why it matters: The comment says "an oversized body should be dropped before it is buffered whole," but the implementation calls await response.text() — buffering the entire response — before checking size, so the stated memory bound does not exist at the transport layer and the spec/code disagreement violates the "keep refs honest" discipline.
  • Suggested fix: Either stream the body and abort once 1 MiB is exceeded (or check Content-Length before reading), or soften the comment and the LLP 0023 transport parenthetical to say the cap is checked after download but before parse/stage.

429/503 throttle handling and parseRetryAfter have no tests

  • Severity: minor
  • Confidence: 80
  • Evidence: hypaware-core/plugins-workspace/central/src/config_client.js:209-216, 268-275
  • Why it matters: test/plugins/central-config-pull.test.js covers 200/304/401/404/oversize/bad-JSON/transport-error branches but not the 429/503 branch or parseRetryAfter, pure deterministic parsing with three distinct paths where a wrong negative/NaN result would silently produce a zero-delay poll loop against a throttling server.
  • Suggested fix: Add cases: 429 with Retry-After: 7 schedules without confirming the poll and logs config_poll_throttled with retry_after_seconds: 7; 503 with an HTTP-date and with a garbage header falls back to the backoff ladder.

closeAllSinks comment contradicts the PR's own "identity refresh is lazy" comments

  • Severity: nit
  • Confidence: 85
  • Evidence: src/core/daemon/runtime.js:611-613
  • Why it matters: The new comment says the central plugin's "identity refresh timers stop in its close()", but two other comments added in the same PR (sink.js:103-105, config_client.js:38-40) state identity refresh is lazy and has no timer, sending a future maintainer hunting for a timer that does not exist.
  • Suggested fix: Reword the closeAllSinks JSDoc to mention only the config pull loop, e.g. "The central plugin's config pull loop stops in its close() (identity refresh is lazy and has no timer)".

Reports: /Users/phil/workspace/hypaware/.git/worktrees/remote-config-join-flow/dual-review/pr-98

Apply engine (LLP 0023):
- Reorder stage() to shape-check -> install pinned plugins -> full
  validation, so a served config can name a not-yet-installed plugin
  (catalog-backed validation only knows a plugin once it is installed).
  LLP 0023 install-on-config section updated with the ordering rationale.

Config pull transport:
- Enforce the 1 MiB cap before buffering: oversized Content-Length is
  rejected without reading, chunked bodies stream through a byte counter
  that cancels at the cap.
- Per-poll AbortController with a 30s request deadline covering request
  and body read; stop() aborts an in-flight poll after a 1s drain grace
  so a stalled config GET cannot wedge daemon shutdown.

Tests:
- New test/core/config-apply-deps.test.js covering the real pin
  enforcement: bundled version mismatch, bundled hash exemption,
  satisfied-lock skip, artifact hash mismatch via a local git fixture,
  and install-then-validate over a fresh catalog.
- Pull-loop tests for Content-Length pre-reject, chunked cap cancel,
  stop() abort, request deadline, 429/503 Retry-After, and
  parseRetryAfter (now exported).
- Engine ordering test (install before validate) and shape-gate test.

Typecheck and style:
- Fix broken inline type-import path and literal exit-code comparisons
  that broke npm run typecheck; replace remaining inline import() types
  with top-of-file @import blocks.
- Reword closeAllSinks JSDoc: identity refresh is lazy and has no timer.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@philcunliffe

Copy link
Copy Markdown
Contributor Author

Addressed the dual-review findings in 17665e7:

Majors

  • Install-on-config ordering (Codex [codex] Add durable cache spool #1): stage() now runs shape-check → installPinnedPlugins → full validation, so a served config can name a not-yet-installed plugin; the catalog is rediscovered fresh after install. LLP 0023's install-on-config section documents the ordering and why the shape gate + hash pin bound it. Covered by an engine ordering test and an end-to-end deps test (validateDocument rejects the plugin before install, accepts after).
  • 1 MiB cap after buffering (Codex Feature: github-install #6): the transport now pre-rejects oversized Content-Length without reading and streams chunked bodies through a byte counter that cancels at the cap. Tests cover both paths (including an endless chunked stream).
  • Poll fetch abort/timeout (Codex Feature: s3-sink #7): every poll runs under its own AbortController with a 30s deadline covering request + body read; stop() lets an in-flight poll drain for 1s then aborts, so shutdown is bounded even against a fetchFn that ignores signals. Tested with a never-resolving fetch.
  • Pin enforcement untested (Claude): new test/core/config-apply-deps.test.js exercises the real buildConfigApplyDeps — bundled version mismatch, bundled hash exemption, satisfied-lock skip, artifact_hash_mismatch via a local git fixture, disabled-entry skip.

Minors/nits

  • npm run typecheck is green again (bad import path in config-apply test, literal exit-code comparisons widened).
  • Inline import() types replaced with top-of-file @import blocks (apply.js, tests, smoke flow).
  • Size-cap comment now matches the implementation; LLP 0023 transport bullet updated, and the 30s deadline / 1s drain grace added to "Settled at implementation".
  • 429/503 + parseRetryAfter covered (delta-seconds, HTTP-date, garbage → backoff ladder; parseRetryAfter exported for unit tests).
  • closeAllSinks JSDoc reworded — identity refresh is lazy, no timer.

Verified: npm test (882 tests), npm run typecheck, npm run lint, and the join_flow_remote_config + core_boot_noop smokes all pass.

philcunliffe and others added 3 commits June 12, 2026 14:31
An unref'd timer can't fire once a wedged fetch is the only live
handle: the event loop drains, so the deadline/grace abort never
happens — in CI this surfaced as node:test cancelling the pull-loop
tests with 'Promise resolution is still pending but the event loop has
already resolved'. Both timers are cleared as soon as the poll
settles, and the loop's policy is no-unref anyway.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Master's context-graph work (#97) took the 0023 slot, so the
remote-config spec moves to llp/0024-remote-config-join-flow.spec.md.
All @ref annotations and prose references updated; context-graph's own
LLP 0023 references are untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@philcunliffe philcunliffe changed the title Remote config and join flow (LLP 0023) Remote config and join flow (LLP 0025) Jun 12, 2026
@philcunliffe philcunliffe merged commit 0603841 into master Jun 12, 2026
6 checks passed
@philcunliffe philcunliffe deleted the remote-config-join-flow branch June 12, 2026 21:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant