fix(gastown): unblock /agents/start during boot hydration; preserve mayor tools on prewarm#3151
fix(gastown): unblock /agents/start during boot hydration; preserve mayor tools on prewarm#3151
Conversation
…ayor tools on prewarm Three independent fixes for the startAgentInContainer timeout regression introduced by #2974, plus a tighter container-instance cap. 1. Hydration gate (control-server.ts, process-manager.ts) The control server starts accepting requests immediately at boot, while bootHydration runs concurrently and serialises every registry agent + the mayor prewarm through the global sdkServerLock. Fresh /agents/start, /refresh-token, and PATCH /agents/:id/model requests queued behind that work and the DO-side AbortSignal.timeout(60s) fired before they ever got the lock — surfacing as "TimeoutError: aborted due to timeout" and "timeout after 6000ms: ensureSDKServer for <agentId>". A new awaitHydration() promise is awaited at the top of those handlers (before any process.env mutation in the model PATCH path) so they don't compound the queue. 2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts, process-manager.ts) buildPrewarmEnv was constructing KILO_CONFIG_CONTENT from hardcoded defaults (anthropic/claude-sonnet-4.6 / claude-haiku-4.5), so the real /agents/start with the user's actual model triggered ensureSDKServer's "config mismatch, evicting prewarmed server" path on every warm restart — doubling lock-holding time on the critical path the prewarm was supposed to speed up. The /api/towns/:id/mayor-id endpoint now returns the full prewarm context (model, smallModel, kilocodeToken, organizationId) resolved the same way _ensureMayor resolves it, and the container builds the prewarm KILO_CONFIG_CONTENT to match. Falls back gracefully to a skip when the worker hasn't deployed the richer endpoint yet. 3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts) prewarmMayorSDK called mayorWorkdirForTown (which only returns a string) and went straight to ensureSDKServer's process.chdir, throwing ENOENT on cold containers because createMayorWorkspace only ran from runAgent. Exported ensureMayorWorkspaceForTown so prewarm materialises the workspace first. More critically, buildPrewarmEnv was missing GASTOWN_AGENT_ROLE, GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID — env vars the kilo serve plugin (plugin/index.ts) reads at spawn to decide whether to register mayor tools. Without them the prewarmed server booted with NO mayor tools, and the cache hit on the next /agents/start handed that defective instance back to the user. Now mirrors the mayor- shaped subset of buildAgentEnv. Added an end-to-end test that intercepts createKilo and asserts the env at spawn time. 4. wrangler.jsonc: lower TownContainerDO max_instances from 800 to 500. Verified with pnpm --filter gastown-container test (67/67 pass), pnpm --filter cloudflare-gastown typecheck, oxlint, and pnpm format.
| @@ -264,6 +265,12 @@ app.post('/refresh-token', async c => { | |||
| } | |||
| process.env.GASTOWN_CONTAINER_TOKEN = body.token; | |||
There was a problem hiding this comment.
WARNING: process.env.GASTOWN_CONTAINER_TOKEN is mutated before awaitHydration(), inconsistent with the PATCH /agents/:id/model handler which gates awaitHydration() first.
During boot hydration, buildPrewarmEnv pulls GASTOWN_CONTAINER_TOKEN directly from process.env (the keys loop in process-manager.ts:2697). A concurrent /refresh-token arriving mid-hydration will overwrite the env var before hydration finishes spawning the mayor SDK server — the prewarmed server's captured token will then differ from the rest of the hydration flow's captured token local. In practice this is low-severity (the new token is still valid), but the ordering is explicitly the inverse of what the PATCH handler does and the intent of the comment on line 268 suggests the guard should come first.
Consider moving the process.env.GASTOWN_CONTAINER_TOKEN = body.token assignment to after await awaitHydration(), matching the PATCH handler pattern.
| if (ctx) { | ||
| return c.json({ success: true, ...ctx }); | ||
| } | ||
| const agentId = await town.getMayorAgentId(); |
There was a problem hiding this comment.
SUGGESTION: Double RPC call to TownDO when getMayorPrewarmContext() returns null.
getMayorPrewarmContext() already queries agents.listAgents(this.sql, { role: 'mayor' }) internally. When it returns null (either no mayor, or mayor exists but no token), this falls through to a second getMayorAgentId() call which runs the same agents.listAgents query again over another RPC hop.
You could have getMayorPrewarmContext() return a discriminated result that also carries the agentId for the fallback case, or have the worker route call only getMayorPrewarmContext() and derive the { agentId } fallback response from it. Minor since this endpoint is not on a hot path, but worth noting.
| // the promise immediately so non-hydrating containers (tests, dev) | ||
| // don't block; bootHydration replaces it on entry and resolves it on exit. | ||
| let _hydrationComplete: Promise<void> = Promise.resolve(); | ||
| let _resolveHydration: () => void = () => {}; |
There was a problem hiding this comment.
SUGGESTION: The _resolveHydration stale-capture pattern is fragile if bootHydration() is ever called more than once.
If a second bootHydration() call begins before the first completes (or immediately after in a re-hydration scenario), it overwrites _hydrationComplete with a new promise and _resolveHydration with a new resolver. Any caller that already called awaitHydration() and captured the first promise will never see it resolved (the first resolver is now orphaned — _resolveHydration points to the second promise's resolver, and the first promise's resolve is lost).
The PR's reviewer notes acknowledge this and defer it. Since bootHydration is currently single-call from main.ts, it does not bite in production today. When/if periodic re-hydration is added, the resolver should be captured as a local variable inside bootHydration() itself rather than in a module-level slot, e.g.:
export async function bootHydration(): Promise<void> {
let resolve!: () => void;
_hydrationComplete = new Promise<void>(r => { resolve = r; });
try {
await bootHydrationImpl('[boot-hydration]');
} finally {
resolve();
}
}This eliminates the _resolveHydration global entirely.
Code Review SummaryStatus: 3 Issues Found | Recommendation: Address before merge (1 WARNING, 2 SUGGESTIONs) Overview
Issue Details (click to expand)WARNING
SUGGESTION
Files Reviewed (7 files)
Fix these issues in Kilo Cloud Reviewed by claude-sonnet-4.6 · 1,231,594 tokens |
Summary
Three independent fixes for the
startAgentInContainertimeout regression observed after #2974, plus a tighter container-instance cap.Symptoms. Production logs were filling with two error patterns since the last
gastown-staging→mainpromotion:Root cause. The control server starts accepting requests immediately at boot (
main.ts:83), whilebootHydration()runs concurrently and serialises every registry agent + the new mayor prewarm through the globalsdkServerLock(createKilo readsprocess.cwd()/process.env). Fresh/agents/start,/refresh-token, and PATCH/agents/:id/modelrequests queued behind that work and the DO-sideAbortSignal.timeout(60s)(resp.REFRESH_AGENT_TIMEOUT_MS=6_000) fired before they ever got the lock.The mayor prewarm added in #3122 made things worse on two axes:
KILO_CONFIG_CONTENTfrom hardcoded model defaults, so the real/agents/startwith the user's actual model triggeredensureSDKServer's "config mismatch — evicting prewarmed server" path on every warm restart, doubling lock-holding time on the critical path the prewarm was supposed to speed up.GASTOWN_AGENT_ROLE,GASTOWN_AGENT_ID, andGASTOWN_TOWN_IDfrom the prewarm env.kilo servesnapshotsprocess.envat spawn, andplugin/index.ts:66keys mayor-tool registration offGASTOWN_AGENT_ROLE === 'mayor'. Without those, the prewarmed server booted with no mayor tools, and the cache hit on the next/agents/starthanded that defective instance back to the user — manifesting as "mayor tools became unavailable."Changes
1. Hydration gate (
control-server.ts,process-manager.ts)New
awaitHydration()exported fromprocess-manager.ts: a promise thatbootHydrationreplaces on entry and resolves in afinally. Awaited at the top of/agents/start,/refresh-token, and PATCH/agents/:id/model(before anyprocess.envmutation in the model PATCH path so concurrent requests can't race on env writes before holding the SDK lock). Default-resolved at module init so test/dev contexts that never run hydration aren't blocked.2. Prewarm config matches
/agents/start(Town.do.ts,gastown.worker.ts,process-manager.ts)New
getMayorPrewarmContext()onTownDOreturns{ agentId, model, smallModel, kilocodeToken, organizationId }resolved the same way_ensureMayorresolves them (config.resolveModel(townConfig, null, 'mayor')). The/api/towns/:townId/mayor-idendpoint now returns that whole context so the container builds aKILO_CONFIG_CONTENTbyte-identical to what the next/agents/startwill send. Falls back to the bare{ agentId }shape for back-compat; the container skips prewarm when model/token aren't available rather than building a config that's guaranteed to mismatch.3. Mayor workdir + plugin env (
agent-runner.ts,process-manager.ts)ensureMayorWorkspaceForTown(townId)soprewarmMayorSDKmaterialises the workspace beforeensureSDKServer'sprocess.chdir(was throwingENOENTon cold containers).buildPrewarmEnvnow mirrors the mayor-shaped subset ofbuildAgentEnv:GASTOWN_AGENT_ID,GASTOWN_AGENT_ROLE='mayor',GASTOWN_TOWN_ID,KILOCODE_FEATURE='gastown',KILO_TEST_HOME,XDG_DATA_HOME. New end-to-end test interceptscreateKiloand asserts those keys are visible to the spawn.4.
wrangler.jsoncLowered
TownContainerDO.max_instancesfrom 800 → 500 (manual change).Verification
bootHydrationis in flight, release when it returns).bootHydrationwith a/mayor-idfetch mock, interceptscreateKilo, assertsGASTOWN_AGENT_ID,GASTOWN_AGENT_ROLE='mayor',GASTOWN_TOWN_ID,GASTOWN_CONTAINER_TOKEN, and a non-emptyKILO_CONFIG_CONTENTare all visible at spawn time)._ensureMayormodel-resolution path to confirmresolveModel(townConfig, null, 'mayor')is byte-identical to what/agents/startwill send (mayor role ignoresrigOverrideentirely inconfig.resolveModel).mayor.ensure_decision: short_circuit_warmandagent.startup_phaseafter merge.Visual Changes
N/A
Reviewer Notes
/api/towns/:townId/mayor-idresponse shape is back-compat: the container's Zod schema (MayorPrewarmResponse) accepts both the new full-context shape and the legacy{ agentId }shape with.passthrough(), and rolls back to "skip prewarm" on missing fields.organizationIdfallback chain inbuildPrewarmEnvdistinguishesundefined(older worker, fall back toprocess.env) fromnull(worker authoritatively says "no org") so a stale env-var value can't override an authoritativenull.bootHydrationis currently single-call frommain.ts. If we ever add periodic re-hydration, the resolver capture should move to a local insidebootHydration(called out in code review as a SUGGESTION, deferred).prewarmMayorSDKwarns but doesn't bail on workdir-mismatch (cheap to harden later), (b) one negative-case timing assertion in the new test relies on a 10mssetTimeout(test still validates the positive case deterministically).