fix(gastown): unblock /agents/start during boot hydration; preserve mayor tools on prewarm #3151

Open
jrf0110 wants to merge 1 commit into main from gastown-staging
jrf0110 commented May 9, 2026

Summary

Three independent fixes for the startAgentInContainer timeout regression observed after #2974, plus a tighter container-instance cap.

Symptoms. Production logs have been filling with two error patterns since the last gastown-staging → main promotion:

[<DOMAIN>] startAgentInContainer: EXCEPTION for agent <UUID>: TimeoutError: The operation was aborted due to timeout
timeout after 6000ms: ensureSDKServer for <agentId>

Root cause. The control server starts accepting requests immediately at boot (main.ts:83), while bootHydration() runs concurrently and serialises every registry agent + the new mayor prewarm through the global sdkServerLock (createKilo reads process.cwd()/process.env). Fresh /agents/start, /refresh-token, and PATCH /agents/:id/model requests queued behind that work and the DO-side AbortSignal.timeout(60s) (resp. REFRESH_AGENT_TIMEOUT_MS=6_000) fired before they ever got the lock.

The mayor prewarm added in #3122 made things worse on two axes:

  1. It built KILO_CONFIG_CONTENT from hardcoded model defaults, so the real /agents/start with the user's actual model triggered ensureSDKServer's "config mismatch — evicting prewarmed server" path on every warm restart, doubling lock-holding time on the critical path the prewarm was supposed to speed up.
  2. It was missing GASTOWN_AGENT_ROLE, GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID from the prewarm env. kilo serve snapshots process.env at spawn, and plugin/index.ts:66 keys mayor-tool registration off GASTOWN_AGENT_ROLE === 'mayor'. Without those, the prewarmed server booted with no mayor tools, and the cache hit on the next /agents/start handed that defective instance back to the user — manifesting as "mayor tools became unavailable."
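To make the second point concrete, here is a minimal model of "env is snapshotted at spawn, and tool registration keys off the role". The `snapshotEnv` and `shouldRegisterMayorTools` helpers are hypothetical stand-ins, not the real kilo serve or plugin code; only the `GASTOWN_AGENT_ROLE === 'mayor'` comparison comes from the description above.

```typescript
// Hypothetical model of the failure mode; not the real kilo serve or plugin code.
type Env = Readonly<Record<string, string | undefined>>;

// Copy the environment once, the way a spawned process sees it at spawn time.
function snapshotEnv(env: Record<string, string | undefined>): Env {
  return Object.freeze({ ...env });
}

// The comparison the description attributes to plugin/index.ts:66.
function shouldRegisterMayorTools(spawnEnv: Env): boolean {
  return spawnEnv.GASTOWN_AGENT_ROLE === 'mayor';
}

const parentEnv: Record<string, string | undefined> = { KILOCODE_FEATURE: 'gastown' };

// Prewarm spawned before the role was set: the snapshot never sees it.
const prewarmedWithoutRole = snapshotEnv(parentEnv);

// Setting the role afterwards only affects later spawns.
parentEnv.GASTOWN_AGENT_ROLE = 'mayor';
const prewarmedWithRole = snapshotEnv(parentEnv);
```

Under this model a server spawned from `prewarmedWithoutRole` never registers mayor tools, no matter what the parent process sets later, which is why a cache hit on that instance surfaces as missing tools.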

Changes

1. Hydration gate (control-server.ts, process-manager.ts)

New awaitHydration() exported from process-manager.ts: a promise that bootHydration replaces on entry and resolves in a finally. Awaited at the top of /agents/start, /refresh-token, and PATCH /agents/:id/model (before any process.env mutation in the model PATCH path so concurrent requests can't race on env writes before holding the SDK lock). Default-resolved at module init so test/dev contexts that never run hydration aren't blocked.
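A rough sketch of the gate, for readers who want the shape without opening the diff. The function names follow the description above, but the bodies are assumptions: the hydration work is passed in as a `work` callback (hypothetical), and the resolver is kept in a local rather than in a module-level slot, which may differ from what the PR actually does.

```typescript
// Sketch only: names follow the PR description, bodies are assumptions.
// Default-resolved so contexts that never hydrate (tests, dev) are not blocked.
let hydrationComplete: Promise<void> = Promise.resolve();

// Handlers await this before touching process.env or the SDK lock.
function awaitHydration(): Promise<void> {
  return hydrationComplete;
}

// `work` is a hypothetical stand-in for the real hydration body.
async function bootHydration(work: () => Promise<void>): Promise<void> {
  let resolve!: () => void;
  // Replace the gate on entry so new requests queue behind hydration.
  hydrationComplete = new Promise<void>(r => {
    resolve = r;
  });
  try {
    await work();
  } finally {
    // Always release waiters, even if hydration throws.
    resolve();
  }
}
```

Because the gate is replaced synchronously on entry, any handler that calls `awaitHydration()` after `bootHydration` has started observes the pending promise, not the default-resolved one.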

2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts, process-manager.ts)

New getMayorPrewarmContext() on TownDO returns { agentId, model, smallModel, kilocodeToken, organizationId } resolved the same way _ensureMayor resolves them (config.resolveModel(townConfig, null, 'mayor')). The /api/towns/:townId/mayor-id endpoint now returns that whole context so the container builds a KILO_CONFIG_CONTENT byte-identical to what the next /agents/start will send. Falls back to the bare { agentId } shape for back-compat; the container skips prewarm when model/token aren't available rather than building a config that's guaranteed to mismatch.
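The "skip rather than mismatch" decision might look roughly like the following. This is a hand-rolled stand-in for the container's Zod schema so the snippet stays self-contained; `decidePrewarm`, `PrewarmDecision`, and the choice to treat `smallModel` as optional are all assumptions, not the PR's code.

```typescript
// Hypothetical stand-in for the container-side schema; field names follow the PR.
interface MayorPrewarmContext {
  agentId: string;
  model: string;
  smallModel?: string;
  kilocodeToken: string;
  organizationId: string | null | undefined;
}

type PrewarmDecision =
  | { kind: 'prewarm'; context: MayorPrewarmContext }
  | { kind: 'skip'; reason: string };

function nonEmptyString(v: unknown): v is string {
  return typeof v === 'string' && v.length > 0;
}

// Accepts both the full-context shape and the legacy { agentId } shape.
// Unknown extra keys are ignored, loosely mirroring a passthrough schema.
function decidePrewarm(body: unknown): PrewarmDecision {
  if (typeof body !== 'object' || body === null) {
    return { kind: 'skip', reason: 'response is not an object' };
  }
  const { agentId, model, smallModel, kilocodeToken, organizationId } =
    body as Record<string, unknown>;
  if (!nonEmptyString(agentId)) {
    return { kind: 'skip', reason: 'no mayor agent id' };
  }
  if (!nonEmptyString(model) || !nonEmptyString(kilocodeToken)) {
    // Legacy shape: prewarming would build a config that is guaranteed to mismatch.
    return { kind: 'skip', reason: 'model or token unavailable' };
  }
  return {
    kind: 'prewarm',
    context: {
      agentId,
      model,
      smallModel: nonEmptyString(smallModel) ? smallModel : undefined,
      kilocodeToken,
      // Preserve the undefined/null distinction for the org fallback chain.
      organizationId:
        typeof organizationId === 'string' || organizationId === null
          ? organizationId
          : undefined,
    },
  };
}
```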

3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)

  • Exported ensureMayorWorkspaceForTown(townId) so prewarmMayorSDK materialises the workspace before ensureSDKServer's process.chdir (was throwing ENOENT on cold containers).
  • buildPrewarmEnv now mirrors the mayor-shaped subset of buildAgentEnv: GASTOWN_AGENT_ID, GASTOWN_AGENT_ROLE='mayor', GASTOWN_TOWN_ID, KILOCODE_FEATURE='gastown', KILO_TEST_HOME, XDG_DATA_HOME. New end-to-end test intercepts createKilo and asserts those keys are visible to the spawn.

4. wrangler.jsonc

Lowered TownContainerDO.max_instances from 800 → 500 (manual change).

Verification

  • Unit-tested the hydration gate end-to-end with a fetch barrier (asserts awaiters block while bootHydration is in flight, release when it returns).
  • Unit-tested the prewarm env shape end-to-end (drives bootHydration with a /mayor-id fetch mock, intercepts createKilo, asserts GASTOWN_AGENT_ID, GASTOWN_AGENT_ROLE='mayor', GASTOWN_TOWN_ID, GASTOWN_CONTAINER_TOKEN, and a non-empty KILO_CONFIG_CONTENT are all visible at spawn time).
  • Reviewed the _ensureMayor model-resolution path to confirm resolveModel(townConfig, null, 'mayor') is byte-identical to what /agents/start will send (mayor role ignores rigOverride entirely in config.resolveModel).
  • Manual production verification deferred — these changes target a hot path that's hard to reproduce locally; will monitor Sentry / AE mayor.ensure_decision: short_circuit_warm and agent.startup_phase after merge.

Visual Changes

N/A

Reviewer Notes

  • The /api/towns/:townId/mayor-id response shape is back-compat: the container's Zod schema (MayorPrewarmResponse) accepts both the new full-context shape and the legacy { agentId } shape with .passthrough(), and falls back to "skip prewarm" on missing fields.
  • The organizationId fallback chain in buildPrewarmEnv distinguishes undefined (older worker, fall back to process.env) from null (worker authoritatively says "no org") so a stale env-var value can't override an authoritative null.
  • The hydration gate is a single global promise — bootHydration is currently single-call from main.ts. If we ever add periodic re-hydration, the resolver capture should move to a local inside bootHydration (called out in code review as a SUGGESTION, deferred).
  • Two SUGGESTION-level findings deferred from code review: (a) prewarmMayorSDK warns but doesn't bail on workdir-mismatch (cheap to harden later), (b) one negative-case timing assertion in the new test relies on a 10ms setTimeout (test still validates the positive case deterministically).
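For the organizationId note above, the undefined-versus-null distinction can be sketched as follows. The helper name and the `GASTOWN_ORGANIZATION_ID` variable are hypothetical, chosen only for illustration; the real fallback lives inside buildPrewarmEnv and may read a different key.

```typescript
// Hypothetical helper; the env var name is an assumption for illustration only.
function resolveOrganizationId(
  fromWorker: string | null | undefined,
  env: Record<string, string | undefined>,
): string | null {
  if (fromWorker === undefined) {
    // Older worker that does not send the field: fall back to the process env.
    return env.GASTOWN_ORGANIZATION_ID ?? null;
  }
  // A string or an explicit null is authoritative; a stale env value must not override it.
  return fromWorker;
}

const staleEnv = { GASTOWN_ORGANIZATION_ID: 'org-stale' };
const whenOmitted = resolveOrganizationId(undefined, staleEnv);
const whenNull = resolveOrganizationId(null, staleEnv);
const whenSet = resolveOrganizationId('org-live', staleEnv);
```

Note that `fromWorker ?? env.GASTOWN_ORGANIZATION_ID` would be wrong here, since `??` treats null and undefined alike and would let the stale value win.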

…ayor tools on prewarm

Three independent fixes for the startAgentInContainer timeout
regression introduced by #2974, plus a tighter container-instance cap.

1. Hydration gate (control-server.ts, process-manager.ts)
   The control server starts accepting requests immediately at boot,
   while bootHydration runs concurrently and serialises every registry
   agent + the mayor prewarm through the global sdkServerLock. Fresh
   /agents/start, /refresh-token, and PATCH /agents/:id/model requests
   queued behind that work and the DO-side AbortSignal.timeout(60s)
   fired before they ever got the lock — surfacing as
   "TimeoutError: aborted due to timeout" and "timeout after 6000ms:
   ensureSDKServer for <agentId>". A new awaitHydration() promise is
   awaited at the top of those handlers (before any process.env
   mutation in the model PATCH path) so they don't compound the queue.

2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts,
   process-manager.ts)
   buildPrewarmEnv was constructing KILO_CONFIG_CONTENT from hardcoded
   defaults (anthropic/claude-sonnet-4.6 / claude-haiku-4.5), so the
   real /agents/start with the user's actual model triggered
   ensureSDKServer's "config mismatch, evicting prewarmed server" path
   on every warm restart — doubling lock-holding time on the critical
   path the prewarm was supposed to speed up. The /api/towns/:id/mayor-id
   endpoint now returns the full prewarm context (model, smallModel,
   kilocodeToken, organizationId) resolved the same way _ensureMayor
   resolves it, and the container builds the prewarm KILO_CONFIG_CONTENT
   to match. Falls back gracefully to a skip when the worker hasn't
   deployed the richer endpoint yet.

3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)
   prewarmMayorSDK called mayorWorkdirForTown (which only returns a
   string) and went straight to ensureSDKServer's process.chdir,
   throwing ENOENT on cold containers because createMayorWorkspace
   only ran from runAgent. Exported ensureMayorWorkspaceForTown so
   prewarm materialises the workspace first.

   More critically, buildPrewarmEnv was missing GASTOWN_AGENT_ROLE,
   GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID — env vars the kilo serve
   plugin (plugin/index.ts) reads at spawn to decide whether to
   register mayor tools. Without them the prewarmed server booted with
   NO mayor tools, and the cache hit on the next /agents/start handed
   that defective instance back to the user. Now mirrors the mayor-
   shaped subset of buildAgentEnv. Added an end-to-end test that
   intercepts createKilo and asserts the env at spawn time.

4. wrangler.jsonc: lower TownContainerDO max_instances from 800 to 500.

Verified with pnpm --filter gastown-container test (67/67 pass),
pnpm --filter cloudflare-gastown typecheck, oxlint, and pnpm format.
@@ -264,6 +265,12 @@ app.post('/refresh-token', async c => {
}
process.env.GASTOWN_CONTAINER_TOKEN = body.token;

WARNING: process.env.GASTOWN_CONTAINER_TOKEN is mutated before awaitHydration(), inconsistent with the PATCH /agents/:id/model handler which gates awaitHydration() first.

During boot hydration, buildPrewarmEnv pulls GASTOWN_CONTAINER_TOKEN directly from process.env (the keys loop in process-manager.ts:2697). A concurrent /refresh-token arriving mid-hydration will overwrite the env var before hydration finishes spawning the mayor SDK server — the prewarmed server's captured token will then differ from the rest of the hydration flow's captured token local. In practice this is low-severity (the new token is still valid), but the ordering is explicitly the inverse of what the PATCH handler does and the intent of the comment on line 268 suggests the guard should come first.

Consider moving the process.env.GASTOWN_CONTAINER_TOKEN = body.token assignment to after await awaitHydration(), matching the PATCH handler pattern.
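As an illustration of the suggested ordering, here is a framework-free sketch. `handleRefreshToken` and its parameters are hypothetical; the real handler is a route in control-server.ts that writes to process.env directly.

```typescript
// Hypothetical, framework-free version of the /refresh-token ordering.
// `env` stands in for process.env; `awaitHydration` is injected for testability.
async function handleRefreshToken(
  env: Record<string, string | undefined>,
  token: string,
  awaitHydration: () => Promise<void>,
): Promise<{ success: true }> {
  // Gate first, so hydration keeps reading the token it started with.
  await awaitHydration();
  // Mutate only once hydration has finished spawning the prewarmed server.
  env.GASTOWN_CONTAINER_TOKEN = token;
  return { success: true };
}
```

With this ordering, a refresh that arrives mid-hydration leaves the env untouched until the gate resolves, matching the PATCH /agents/:id/model pattern.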

if (ctx) {
return c.json({ success: true, ...ctx });
}
const agentId = await town.getMayorAgentId();

SUGGESTION: Double RPC call to TownDO when getMayorPrewarmContext() returns null.

getMayorPrewarmContext() already queries agents.listAgents(this.sql, { role: 'mayor' }) internally. When it returns null (either no mayor, or mayor exists but no token), this falls through to a second getMayorAgentId() call which runs the same agents.listAgents query again over another RPC hop.

You could have getMayorPrewarmContext() return a discriminated result that also carries the agentId for the fallback case, or have the worker route call only getMayorPrewarmContext() and derive the { agentId } fallback response from it. Minor since this endpoint is not on a hot path, but worth noting.
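One possible shape for that discriminated result is sketched below. The `MayorLookup` type, the `toMayorIdResponse` helper, and the `agentId: null` response for the no-mayor case are assumptions for illustration, not the current endpoint contract.

```typescript
// Hypothetical result type: one TownDO call covers all three outcomes.
type MayorLookup =
  | { kind: 'none' }
  | { kind: 'id-only'; agentId: string }
  | {
      kind: 'full';
      agentId: string;
      model: string;
      smallModel: string;
      kilocodeToken: string;
      organizationId: string | null;
    };

// Derive the worker response from a single lookup, with no second RPC hop.
function toMayorIdResponse(lookup: MayorLookup): Record<string, unknown> {
  switch (lookup.kind) {
    case 'full':
      return {
        success: true,
        agentId: lookup.agentId,
        model: lookup.model,
        smallModel: lookup.smallModel,
        kilocodeToken: lookup.kilocodeToken,
        organizationId: lookup.organizationId,
      };
    case 'id-only':
      // Mayor exists but no token: legacy { agentId } shape, container skips prewarm.
      return { success: true, agentId: lookup.agentId };
    case 'none':
      // Assumed representation of "no mayor"; the real endpoint may differ.
      return { success: true, agentId: null };
  }
}
```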

// the promise immediately so non-hydrating containers (tests, dev)
// don't block; bootHydration replaces it on entry and resolves it on exit.
let _hydrationComplete: Promise<void> = Promise.resolve();
let _resolveHydration: () => void = () => {};

SUGGESTION: The _resolveHydration stale-capture pattern is fragile if bootHydration() is ever called more than once.

If a second bootHydration() call begins before the first completes (or immediately after in a re-hydration scenario), it overwrites _hydrationComplete with a new promise and _resolveHydration with a new resolver. Any caller that already called awaitHydration() and captured the first promise will never see it resolved (the first resolver is now orphaned — _resolveHydration points to the second promise's resolver, and the first promise's resolve is lost).

The PR's reviewer notes acknowledge this and defer it. Since bootHydration is currently single-call from main.ts, it does not bite in production today. When/if periodic re-hydration is added, the resolver should be captured as a local variable inside bootHydration() itself rather than in a module-level slot, e.g.:

export async function bootHydration(): Promise<void> {
  let resolve!: () => void;
  _hydrationComplete = new Promise<void>(r => { resolve = r; });
  try {
    await bootHydrationImpl('[boot-hydration]');
  } finally {
    resolve();
  }
}

This eliminates the _resolveHydration global entirely.

kilo-code-bot Bot commented May 9, 2026

Code Review Summary

Status: 3 Issues Found | Recommendation: Address before merge (1 WARNING, 2 SUGGESTIONs)

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 1 |
| SUGGESTION | 2 |

Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| services/gastown/container/src/control-server.ts | 266 | process.env.GASTOWN_CONTAINER_TOKEN mutated before awaitHydration() in /refresh-token, inconsistent with PATCH /agents/:id/model which gates first. Mid-hydration token refresh can cause buildPrewarmEnv to pick up a different token than the one hydration captured locally. |

SUGGESTION

| File | Line | Issue |
| --- | --- | --- |
| services/gastown/src/gastown.worker.ts | 688 | Double RPC call to TownDO when getMayorPrewarmContext() returns null: it falls through to getMayorAgentId(), which re-runs the same SQL query over another RPC hop. |
| services/gastown/container/src/process-manager.ts | 81 | _resolveHydration module-global stale-capture: a second concurrent bootHydration() call would orphan the first promise's resolver. Deferred per reviewer notes; consider removing the global by capturing resolve as a local inside bootHydration(). |

Files Reviewed (7 files)
  • services/gastown/container/src/agent-runner.ts — no issues
  • services/gastown/container/src/control-server.ts — 1 issue
  • services/gastown/container/src/process-manager.test.ts — no issues
  • services/gastown/container/src/process-manager.ts — 1 issue
  • services/gastown/src/dos/Town.do.ts — no issues
  • services/gastown/src/gastown.worker.ts — 1 issue
  • services/gastown/wrangler.jsonc — no issues

Reviewed by claude-sonnet-4.6 · 1,231,594 tokens
