feat: add hub-managed port pool for agent containers #1
Closed
zeroasterisk wants to merge 75 commits into
…rm#293)

* fix(scion-chat-app): set channel="gchat" on ask_user dialog responses
handleDialogSubmit was using the simple SendMessage API which doesn't support structured message fields, so inbound ask_user responses arrived at the hub with no channel set (defaulting to "web"). Switch to SendStructuredMessage with Channel="gchat" to match the pattern already used by cmdMessage.

* fix: channel filtering and thread-id routing for chat channel replies
Two bugs in the chat channel routing feature:
1. Channel filtering: broker plugins now check msg.Channel and skip messages targeted at a different channel. The hub injects plugin_name into broker credentials so each plugin knows its own channel identity. This prevents cross-channel delivery (e.g., Telegram replies leaking to Google Chat).
2. Thread-id routing: the Telegram plugin now passes msg.ThreadID as message_thread_id to the Telegram Bot API when sending outbound messages. Previously, thread-id was captured on inbound messages but never forwarded on outbound, causing replies to land in the wrong forum topic.
Added SendOption variadic parameter to SendMessage, SendMessageWithKeyboard, and SendQueue.Send for backward-compatible thread-id support.

* feat(scion-chat-app): add Google Chat thread context support
Propagate thread IDs end-to-end so agents can participate in Google Chat threads:
- Inbound: auto-set ThreadID on StructuredMessage from the Google Chat event's thread context when no explicit --thread flag is used
- Inbound: propagate ThreadID on dialog submit (ask_user responses)
- Outbound: pass ThreadID from StructuredMessage to SendMessageRequest so agent replies land in the correct Google Chat thread

* fix: route outbound messages to chat-app via ChannelID
The FanOutEventBus matched msg.Channel against the bus Name, but the chat-app plugin is registered as "chat-app" while its messages use channel="gchat". Add a ChannelID field to NamedEventBus and PluginInfo so plugins can declare the channel they handle independently of their registered name. The chat-app now reports ChannelID="gchat" via GetInfo(), and the hub reads it at startup to wire routing correctly.

* design: per-topic /default agent scoping for Telegram forums
Explores how to let /default set a different default agent per forum topic (message_thread_id) rather than per-chat. Conclusion: ~85 lines of changes across store, commands, callbacks, and routing.

* feat(scion-telegram): per-topic /default agent scoping for forum groups
Add support for setting a different default agent per Telegram forum topic/thread, with the chat-wide default as fallback.
- New topic_defaults table keyed on (chat_id, thread_id)
- /default in a topic sets/shows the topic-level override
- Callback data extended: dflt:<slug>:<threadID> for topic scope
- Routing resolves topic default before chat default for both @bot-mention and unaddressed message fallback paths

* fix: address PR GoogleCloudPlatform#293 review feedback
- Add !no_sqlite build tag to resource_import_handler_test.go to fix CI vet failure (mockRoundTripper undefined when template_bootstrap_test.go is excluded)
- Guard debug log in broker.go Publish against nil msg to prevent panic
- Add fitCallback to preserve threadID suffix in Telegram callback_data when the 64-byte limit is exceeded, truncating agentSlug instead
- Add slog warning to truncateCallback when truncation occurs

* fix: address second round of PR GoogleCloudPlatform#293 review feedback
- Remove redundant channel filters from chat-app and Telegram Publish() methods: the FanOutEventBus already routes by ChannelID, and comparing against the plugin's registered name would silently drop messages
- Log errors from GetTopicDefault instead of silently ignoring them
- Return distinct error messages in chat-app when ResolveOrAutoRegister fails with a real error vs a nil mapping

* fix: address third round of PR GoogleCloudPlatform#293 review feedback
- Add early return for nil msg at top of Publish() to prevent panics in downstream handlers that dereference msg fields
- Add thread-safe ChannelName() getter on BrokerServer
- Use dynamic ChannelName() in GetInfo() instead of hardcoded "gchat"
- Use dynamic ChannelName() in both commands.go call sites

* fix: use callback_lookups for long callback data instead of truncation
Replace fitCallback() which corrupted agent slugs by truncating them to fit Telegram's 64-byte limit. Long callback payloads are now stored in the callback_lookups table with a short cblu:<id> reference. HandleCallback resolves lookup IDs before routing.
Also add defensive check for empty HubUserEmail in chat-app to prevent constructing invalid "user:" sender strings.

* fix: address fifth round of PR GoogleCloudPlatform#293 review feedback
- Use local interface instead of concrete *BrokerRPCClient type assertion in pluginChannelID() and isObserverBroker() so in-process brokers and mocks are handled correctly.
- Add nil guard for msg in fanout channel routing check.

---------
Co-authored-by: Scion <agent@scion.dev>
…eCloudPlatform#296)

* Fix test suite leaking Hub credentials, corrupting agent state (GoogleCloudPlatform#123)
Tests that spawn sciontool (e.g., TestInitCommand_Integration) inherited live Hub env vars from the agent container, causing the subprocess to talk to the real Hub and reset the agent phase to "starting."
- Add scrubHubEnv(t) helpers that use t.Setenv to clear Hub env vars (SCION_HUB_ENDPOINT, SCION_HUB_URL, SCION_AUTH_TOKEN, SCION_AGENT_ID, SCION_AGENT_MODE) with automatic restore on test cleanup
- Filter Hub env vars from subprocess Cmd.Env in TestInitCommand_Integration as belt-and-suspenders protection
- Convert os.Setenv/os.Unsetenv to t.Setenv throughout hub_test.go and client_test.go for crash-safe env var isolation

* Add project log entry for issue GoogleCloudPlatform#123 fix

* Address PR GoogleCloudPlatform#296 review feedback in init_test.go
Replace hardcoded /tmp/sciontool-test path with t.TempDir() to avoid permission conflicts and test races. Replace map allocation in filterHubEnv with slices.Contains on the static hubEnvVars slice.
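A minimal sketch of the subprocess-side filter described above, assuming filterHubEnv takes and returns "KEY=value" entries as used by Cmd.Env; the actual signature in init_test.go is not shown in this thread. The variable names come from the commit message.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// hubEnvVars lists the Hub variables named in the commit message.
var hubEnvVars = []string{
	"SCION_HUB_ENDPOINT",
	"SCION_HUB_URL",
	"SCION_AUTH_TOKEN",
	"SCION_AGENT_ID",
	"SCION_AGENT_MODE",
}

// filterHubEnv returns env (a list of "KEY=value" entries) with every Hub
// variable removed, suitable for assigning to exec.Cmd.Env.
func filterHubEnv(env []string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		key, _, _ := strings.Cut(kv, "=")
		if slices.Contains(hubEnvVars, key) {
			continue // drop live Hub credentials and identity
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := []string{"PATH=/usr/bin", "SCION_AUTH_TOKEN=secret", "HOME=/root"}
	fmt.Println(filterHubEnv(env))
}
```

In a test, the in-process half would be the t.Setenv calls the commit mentions; this filter only covers what the spawned process inherits.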
…oogleCloudPlatform#299)

Three new documentation pages:
- External Channels: covers Telegram (bidirectional group chat), Discord (outbound webhooks), and A2A protocol bridge in one page. Summarizes concepts and links to detailed READMEs in extras/.
- Hub Setup on GCE: step-by-step walkthrough of deploying a hub using the starter-hub scripts. Covers provisioning, repo setup, TLS, and post-setup next steps.
- Multi-Broker Setup: how to connect multiple machines to a single hub for distributed agent execution. Covers architecture, broker registration, selection, and cross-broker considerations.

Sidebar updated to include all three pages.
* Add sort and filter capabilities to agent list view (GoogleCloudPlatform#71)
CLI: add --phase, --activity, --template filter flags and --sort, --reverse sort flags to 'scion list'. Validates flag values against known phases/activities. Passes phase filter server-side in hub mode for efficiency.
Web UI: add phase filter chips (All/Running/Stopped/Suspended/Error), sortable table headers (Name, Status, Updated), and sort dropdown for grid view. Filter and sort state persists to localStorage.
Closes GoogleCloudPlatform#71

* Address review feedback: input canonicalization and validation
- CLI: canonicalize --phase/--activity/--sort to lowercase in validateListFlags, remove redundant empty check on filterActivity
- Web UI: validate localStorage phase filter against known values instead of raw cast
- Web UI: validate localStorage sort config field/dir values before applying
- Web UI: handle invalid date strings in formatRelativeTime with isNaN guard
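The CLI canonicalization step above could be sketched as follows. The allowed value lists and the canonicalize helper are assumptions for illustration; the real validateListFlags and its known phases and sort keys are not shown in this thread.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// Assumed value sets, loosely based on the filter chips and table headers
// named in the commit; the real CLI lists may differ.
var (
	knownPhases = []string{"running", "stopped", "suspended", "error"}
	knownSorts  = []string{"name", "status", "updated"}
)

// canonicalize lowercases a flag value and checks it against allowed.
// An empty value means "flag not set" and is accepted as-is.
func canonicalize(flag, value string, allowed []string) (string, error) {
	v := strings.ToLower(strings.TrimSpace(value))
	if v == "" {
		return "", nil
	}
	if !slices.Contains(allowed, v) {
		return "", fmt.Errorf("invalid --%s %q (allowed: %s)",
			flag, value, strings.Join(allowed, ", "))
	}
	return v, nil
}

func main() {
	phase, err := canonicalize("phase", "Running", knownPhases)
	fmt.Println(phase, err)

	_, err = canonicalize("sort", "size", knownSorts)
	fmt.Println(err)
}
```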
…rm#295)

* Add prominent disconnected overlay to web terminal
When the WebSocket connection drops, a full-terminal overlay now appears with 50% black opacity and large red "DISCONNECTED" text centered on it. The overlay appears immediately on disconnect and disappears when the connection is re-established. The small status indicator in the toolbar remains as a secondary signal.
Fixes GoogleCloudPlatform#77

* Move disconnected overlay to be a sibling of xterm container
The overlay was a child of .terminal-container, whose DOM is managed by xterm.js. Lit re-rendering the overlay on connect/disconnect state changes conflicts with xterm's DOM management.
Fix: introduce .terminal-wrapper as the relative-positioning context, make .terminal-container absolutely positioned inside it, and render the overlay as a sibling, outside xterm's managed subtree.

* Use wasConnected flag instead of terminal ref for overlay reactivity
Replace the non-reactive `this.terminal` reference in the overlay condition with a new `@state() wasConnected` flag. This fixes two issues:
1. Lit reactivity: `this.terminal` lacked `@state()` so changes to it didn't trigger re-renders. The new `wasConnected` is properly decorated as reactive state.
2. Initial connection: using `this.terminal` would flash the overlay during the brief window between terminal init and WebSocket open. `wasConnected` is only set true after the first successful connect, so the overlay only appears after a genuine disconnection.
…tore port, LISTEN/NOTIFY (GoogleCloudPlatform#304)

* P0-1: switch Postgres driver from lib/pq to pgx/v5 stdlib
- Add github.com/jackc/pgx/v5/stdlib (registers as "pgx")
- driver_postgres.go: blank import pgx stdlib instead of lib/pq
- OpenPostgres: open via sql.Open("pgx", dsn) + entsql.OpenDB
- Introduce PoolConfig (applied to *sql.DB); thread through OpenSQLite/OpenPostgres and update all callers
- go mod tidy drops lib/pq

* P0-2: add connection pool config to DatabaseConfig
- DatabaseConfig gains MaxOpenConns / MaxIdleConns / ConnMaxLifetime plus ConnMaxLifetimeDuration() helper
- DefaultGlobalConfig sets sqlite pool defaults (MaxOpenConns=1, load-bearing for write serialization)
- applyDatabasePoolDefaults fills postgres defaults (20/5/30m) and forces sqlite MaxOpenConns=1; called in both load paths
- Mirror fields in V1DatabaseConfig + both conversion directions
- Wire pool settings into entc.OpenSQLite in initStore

* P0-3/P0-4: CRUD-parity test harness + spec-driven fixture generator
P0-3: pkg/store/storetest/: backend-agnostic, table-driven CRUD oracle. A Factory(t) -> store.Store is injected; generic Domain[T] descriptors drive Create/Read/Update/Delete (+optional soft-delete)/List-paginate/List-filter. Ships group + policy domains and runs green against today's CompositeStore (SQLite base + Ent DB). Ready to accept a postgresFactory for P3-2.
P0-4: internal/fixturegen/: Go-defined spec seeding >=1 row per table across all 30 domain tables, with edge cases (NULL optionals, max-length strings, nested/unicode JSON, soft-deleted agent, BLOB). Deterministic. 'go run ./internal/fixturegen' emits testdata/hub-v46-fixture.db, prints a 30-table coverage report, and caches the blob to the scratchpad mount. CI gate fails if any table has zero rows.
* feat(ent): add 23 new Ent schemas for full table parity (P1-2 + P1-3)

* feat(observability): add Cloud Monitoring scaffolding for LISTEN/NOTIFY metrics (P0-5)

* P2: port notification + gcp/github/token domains to Ent entadapter
Add Ent-backed implementations of the notification, GCP service account, GitHub App installation, and user access token store sub-interfaces:
- notification_store.go: NotificationStore (subscriptions, notifications, templates). Dispatch uses an atomic conditional update as the multi-replica claim primitive, and an optional NotificationPublisher designs in the LISTEN/NOTIFY fan-out for created/dispatched events.
- external_store.go: GCPServiceAccountStore + GitHubInstallationStore + UserAccessTokenStore. GitHub create is idempotent (INSERT OR IGNORE semantics), repositories/scopes are JSON, default_scopes is CSV, and tokens support key-hash lookup. Legacy api_keys is intentionally not surfaced.
- storetest: add GCPServiceAccount, SubscriptionTemplate, and NotificationSubscription CRUD-parity domains.
Does not modify composite.go.

* P2: port schedule, maintenance, message domains to Ent entadapter
- schedule_store.go: ScheduleStore + ScheduledEventStore sub-interfaces with dialect-aware SELECT FOR UPDATE SKIP LOCKED claim helper for the ListDueSchedules / ListPendingScheduledEvents job-claim paths (plain SELECT on SQLite, SKIP LOCKED on Postgres).
- maintenance_store.go: run-state RMW, AbortRunningMaintenanceOps, Go-side seed (uuid.New) replacing SQLite randomblob() UUID seeds.
- message_store.go: CRUD, read flags, PurgeOldMessages, design-in PublishUserMessage hook for Postgres LISTEN/NOTIFY.
- pkg/ent/client_driver.go: hand-written Client.Driver() accessor for dialect detection + raw locking queries.

* feat(entadapter): port user + allowlist/invite domains to Ent (P2)
Implements the Ent-backed store adapters for the user and allowlist/invite domains, plus their CRUD-parity oracle descriptors.
pkg/store/entadapter/user_store.go (store.UserStore):
- CreateUser/GetUser/GetUserByEmail/UpdateUser/UpdateUserLastSeen/DeleteUser/ListUsers.
- Case-insensitive email: emails are normalized to lower case on write (so the plain unique index enforces case-insensitive uniqueness, equivalent to the legacy UNIQUE COLLATE NOCASE) and matched with EmailEqualFold (lower(email)=lower($1)) on read. ent codegen + AutoMigrate cannot emit a real lower(email) functional index across both SQLite (tests) and Postgres, so the invariant is enforced at the port layer.
- Offset-based pagination matching the legacy SQLite store.

pkg/store/entadapter/allowlist_store.go (store.AllowListStore + store.InviteCodeStore):
- Full allow-list + invite-code CRUD.
- BulkAddAllowListEntries uses CreateBulk + OnConflictColumns(email).Ignore() for race-safe INSERT-OR-IGNORE; added/skipped counts mirror the legacy per-row semantics (existing + within-batch dups skipped).
- IncrementInviteUseCount is a single atomic conditional UPDATE (revoked=false AND not expired AND (max_uses=0 OR use_count<max_uses)), which is race-free on both backends without SELECT...FOR UPDATE. The sql/lock feature is enabled and ForUpdate is available for genuine multi-statement RMW paths.
- ListAllowListEntriesWithInvites batch-joins invite codes (invite_id is a plain column, not an Ent edge).

Schema:
- pkg/ent/schema/user.go: add nillable last_seen field (+ index) needed by UpdateUserLastSeen / lastSeen sort; document the case-insensitive email strategy.
- pkg/ent/generate.go: enable --feature sql/upsert,sql/lock (required for OnConflict and ForUpdate).

Tests (all passing):
- pkg/store/storetest/domains_user.go: UserDomain, AllowListDomain, InviteCodeDomain oracle descriptors (kept in a separate file to avoid contending on domains.go).
- entadapter oracle test runs the shared CRUD-parity suite directly against the new adapters; behavior tests cover case-insensitivity, bulk idempotency, conditional increment, stats, and the invite join.

NOTE: Generated Ent code under pkg/ent/** is intentionally NOT included. This is a shared worktree where sibling port agents concurrently modify schemas and the same feature flags; the generated code must be regenerated at wave integration via:
go generate ./pkg/ent/...
Verified locally that regeneration + full build + tests pass.
Per P2 scope: composite.go wiring and ensureEntUser shadow removal are deferred to P2-collapse.

* P2: port secret/env_var + template/harness_config domains to Ent
Add Ent-backed store implementations for the secret/env and template/harness domains, mirroring the legacy SQLite semantics:
- entadapter/secret_store.go: SecretStore implementing store.SecretStore + store.EnvVarStore. Polymorphic (scope, scope_id) addressing, COALESCE target->key projection, version bump on update, get-then-update upsert, and transitive ListProgenySecrets via a created_by IN-list over the ancestor set (user scope + allow_progeny only; encrypted value withheld).
- entadapter/template_store.go: TemplateStore implementing store.TemplateStore + store.HarnessConfigStore. base_template hierarchy, scope/project_id backwards-compat lookups, content_hash, JSON config/files columns, DeleteByScope. Subscription templates are owned by NotificationStore.
- Direct Ent unit tests incl. a progeny-inheritance parity test.
- storetest: Template/HarnessConfig/Secret/EnvVar domain descriptors wired into RunStoreSuite for cross-backend CRUD parity.

* P2: port project/broker + brokersecret domains to Ent
Port the project/broker domain (projects, runtime_brokers, project_contributors, project_sync_state) and the broker-auth domain (broker_secrets, broker_join_tokens) from raw SQL to Ent adapters.
- pkg/store/entadapter/project_store.go: implements ProjectStore, RuntimeBrokerStore, ProjectProviderStore and ProjectSyncStateStore.
  * provider + sync-state upserts use Ent OnConflict().UpdateNewValues() (sql/upsert) keyed on the (project_id, broker_id) unique index.
  * runtime broker heartbeat/update use an optimistic version-CAS loop on a new internal lock_version token, serializing concurrent writers portably across SQLite (tests) and Postgres without SELECT ... FOR UPDATE.
  * slug lookups support case-insensitive matching (EqualFold).
  * project computed fields (AgentCount, ActiveBrokerCount, ProjectType) are derived via Ent queries, matching the legacy SQLite store.
- pkg/store/entadapter/brokersecret_store.go: implements BrokerSecretStore (per-broker HMAC secrets + short-lived join tokens, expiry cleanup).
- Project Ent schema: add operational fields for full parity (default_runtime_broker_id, shared_dirs, github_*, git_identity).
- RuntimeBroker Ent schema: relax vestigial type column to Optional, add internal lock_version concurrency token.
- Regenerate Ent with sql/upsert,sql/lock features.
- storetest: add Project, RuntimeBroker, BrokerSecret and BrokerJoinToken CRUD-parity domains.
- Unit tests for both adapters.
Per the integration plan, composite.go wiring and ensureEntProject shadow removal are deferred to P2-collapse.

* P2: port agent domain to Ent entadapter (XL)

* chore(ent): regenerate Ent code for all 30 entity schemas
Regenerated with --feature sql/upsert,sql/lock to support OnConflict upserts and ForUpdate/SKIP LOCKED job claims.

* P2-collapse: collapse dual-DB into single Ent store
Wire all Ent-backed sub-stores into CompositeStore via embedding, removing the raw-SQL base store and the User/Agent/Project shadow-sync machinery (ensureEntUser/ensureEntAgent/ensureEntProject). CompositeStore now serves every domain from a single Ent client and implements Close/Ping/Migrate directly.
Collapse initStore() to open one Ent SQLite DB (no _ent shadow DSN, no MigrateGroveToProjectData, no raw sqlite.New). Register the User, AllowList, and InviteCode domains in the storetest CRUD-parity suite. Update entadapter tests for the single-DB NewCompositeStore(client) signature.
go build ./... green; go test ./pkg/store/entadapter/... ./pkg/store/storetest/... green.

* P2-delete: remove raw-SQL store implementation
Delete the ~6k-LOC raw-SQL store (sqlite.go) and its per-domain sibling files (brokersecret, gcp_service_account, github_installation, maintenance, messages, notification, project_sync_state, schedule, scheduled_event) plus their tests, including the inline schema-migration scaffold. Keep driver.go, which registers the pure-Go SQLite driver used by Ent's SQLite backend.
Repoint the two non-test consumers to the Ent-backed store:
- cmd/hub_secret_migrate.go now opens an Ent client + CompositeStore.
- internal/fixturegen opens via entc and seeds the Ent schema's *sql.DB.
go build ./... green; no remaining production references to the raw store.

* test: compile-migrate downstream suites to Ent store + fix signing-key PK
Replace the removed raw-SQL store in downstream tests with an Ent-backed newTestStore helper (pkg/hub, pkg/secret) and update cmd/server_test.go and internal/fixturegen tests. Port the 8 raw-SQL DB() access sites in hub tests via a new CompositeStore.DB() escape-hatch accessor.
Fix a production bug surfaced by the collapse: hub/server.go signingKeySecretID generated a non-UUID secret primary key, which the Ent secret store rejects; it now derives a deterministic UUIDv5.
go build ./... green; entadapter and storetest suites green.
NOTE: hub/secret/fixturegen suites now COMPILE but many tests still fail because their fixtures seed non-UUID string IDs that the UUID-PK Ent schema rejects; addressed in follow-up commits (tid() helper).
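The tid() helper referenced above is described only as a deterministic UUIDv5 mapping from human-readable test IDs. A standard-library sketch of that idea follows; the namespace (the RFC 4122 DNS namespace) is an arbitrary choice made here, and the real helper presumably uses a UUID library and its own namespace.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// namespace is the RFC 4122 DNS namespace UUID
// (6ba7b810-9dad-11d1-80b4-00c04fd430c8). Using it here is an assumption;
// any fixed namespace gives the same determinism.
var namespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// tid maps a readable fixture name to a UUIDv5 string: SHA-1 over
// namespace||name, truncated to 16 bytes, with version and variant bits set.
func tid(name string) string {
	h := sha1.New()
	h.Write(namespace[:])
	h.Write([]byte(name))
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant

	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

func main() {
	// The same name always yields the same UUID, so cross-references between
	// fixtures and ID-equality assertions keep working.
	fmt.Println(tid("agent-1"))
	fmt.Println(tid("agent-1") == tid("agent-1"), tid("agent-1") == tid("agent-2"))
}
```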
* test(hub): map non-UUID fixture IDs to UUIDs via tid() helper
Wrap human-readable test identifiers in tid() (deterministic UUIDv5) so the UUID-PK Ent store accepts them while preserving cross-reference consistency and ID-equality assertions. Reduces pkg/hub failures from 611 to 79; remaining failures are behavioral, not ID-format, and are addressed separately.
# Conflicts:
# pkg/hub/handlers_project_test.go
# pkg/hub/httpdispatcher_test.go

* fix(store): seed maintenance ops in Migrate; initStore uses Migrate
Restore raw-SQL parity: CompositeStore.Migrate now runs AutoMigrate and seeds built-in maintenance operations (the raw store seeded these in its migrations). initStore and hub test helpers call s.Migrate() so production and tests seed consistently. Fixes the maintenance-operation hub tests (404 'Operation not found'). pkg/hub failures 79 -> 71.

* test(hub): satisfy Ent NotEmpty validators in fixtures
Add slugs/broker names to test fixtures that previously relied on the raw store's lenient (no-validator) inserts: project/agent slugs in the logs test helper, broker slugs in embedded/profile/authz fixtures, and BrokerName on envgather ProjectProvider literals. pkg/hub failures 71 -> 57.

* test(secret): map non-UUID fixture IDs to UUIDs via tid()
Apply the tid() helper to pkg/secret fixtures (including a dynamically built secret ID) so the UUID-PK Ent store accepts them. pkg/secret now fully green.

* test(cmd): map non-UUID fixture IDs to UUIDs via tid(); add broker slug/name
Wrap broker/grove/agent IDs passed to registerGlobalProjectAndBroker and the dispatcher tests in tid(), and supply RuntimeBroker.slug / ProjectContributor broker_name to satisfy Ent validators. cmd now green except TestDeleteStopped_RequiresGroveContext, which requires the 'docker' binary (absent in this sandbox) and is unrelated to the store migration.
# Conflicts:
# cmd/server_dispatcher_test.go

* test(hub): wrap remaining latent non-UUID fixture IDs
Catch IDs that surfaced behind earlier failures (stale-agent-*, agent-visible-authz, agent-profile-hb, env-owner-1). No more UUID-parse errors in pkg/hub; the remaining ~56 failures are behavioral (URL paths built from old raw IDs, assertion mismatches), addressed next.

* fix(entadapter): Get-by-id returns ErrNotFound for non-UUID identifiers
Restore raw-SQL store parity: a malformed identifier cannot match any UUID primary key, so get-by-id lookups now report store.ErrNotFound instead of store.ErrInvalidInput. This matches the raw store (a lookup with a bad id simply returned no row) and is what callers depend on; e.g. resolveTemplate passes a template *name* to GetTemplate and relies on ErrNotFound to fall back to slug-based resolution. New parseGetID helper applied across all 17 get-by-id methods. pkg/hub failures 56 -> 40; entadapter/storetest stay green.

* test(hub): fix store-less id wraps and project-route URL paths
- controlchannel_client_test: revert tid() wraps (store-less path-builder test; IDs must match the expected literal paths).
- github/envgather: project-scoped route handlers resolve the project by UUID id, so build paths with tid(rawID) via fmt.Sprintf instead of the old raw-id literal.
pkg/hub failures 40 -> 32.

* test(hub): unwrap projectIDFromServiceAccountEmail expectation
The tid() sweep over-wrapped a non-ID expected value in a pure-function test; restore the literal GCP project id.

* fix(ent): GCPServiceAccount.project_id is a string, not a UUID
The GCP service account project_id holds the GCP *cloud project* identifier (e.g. 'my-project-123'), a free-form string, not a UUID. The schema declared it field.UUID, so entadapter CreateGCPServiceAccount/Update did parseUUID(sa.ProjectID) and rejected real GCP project ids, breaking SA mint/create with a 400 in production (storetest masked it by passing a UUID).
Change the schema field to field.String, regenerate Ent, and store/read project_id as a string in external_store.go. Fixes ~7 hub GCP tests; pkg/hub 31 -> 23.

* test(hub): fix GCP SA project-id assertion and project-settings id
Unwrap the over-wrapped 'my-project' expectation now that project_id is a string, and wrap the dynamic project-settings project ID with tid().

* test(hub): fix bootstrap sync-to-finalize agent paths and storage keys
Build the finalize request path from the agent's tid() UUID and seed mock storage under WorkspaceStoragePath(projectID, agent.ID): the handler derives the workspace key from the agent's real id, not the old raw name. pkg/hub 23 -> 19.

* test(hub): revert tid() over-wraps in store-less events_test
events_test exercises the in-memory ChannelEventPublisher directly; its ProjectID/IDs are subject-string components, not stored UUIDs. The tid() sweep wrongly rewrote them so published subjects no longer matched the subscriptions (timeouts). Restore the literal values. pkg/hub 19 -> 12.

* test(hub): fix maintenance-run path and notifications agentId queries
Use tid() UUIDs in the maintenance run-detail path and the notifications agentId query params; guard list indexing with require.Len so a mismatch fails cleanly instead of panicking (panics truncate the package run).

* test(hub): wrap remaining fixture IDs revealed after panic-cascade cleared
Panics ([0] on empty lists) had been truncating the package run, hiding many failures and starving the tid() sweep. With those guarded, sweep the newly reached tests: wrap dynamic rune-suffix IDs and the setupProjectWithBroker / seedCreatedAgentForHarnessTest helper IDs, and convert raw query-param project IDs to tid(). No UUID-parse errors remain in pkg/hub.
* test(hub): unwrap tid() in scheduler_test (mock store, raw ids)
scheduler_test uses an in-memory mockScheduledEventStore, not the Ent store, so its ids need no UUIDs; the erroneous tid() wraps broke raw getEvent lookups and caused a nil-pointer panic that truncated the package run.

* fix(ent): Template.harness may be empty (raw-store parity)
A template imported from a directory that declares no harness type has an empty harness; the raw-SQL store stored it, but the Ent NotEmpty validator made BootstrapTemplatesFromDir silently skip such templates. Drop NotEmpty and regenerate. Removing the [0]-on-empty panics this caused un-truncates the hub package run (true failure count now visible).

* test(hub): wrap dynamic fixture IDs in wake/workspace/signing-key tests
Wrap tid() around the wake_test, setupWorkspaceProject, and empty-value signing-key secret IDs now reachable after panic removal. No panics in the hub package run.

* test(hub): convert raw-id URL path segments to tid()
Build GET/PUT/DELETE paths for agents/projects/brokers/templates/harness-configs and workspace sync routes from tid(rawID) so the by-id handlers resolve the entity (raw ids no longer match the UUID PKs). pkg/hub 93 -> 80.

* fix(entadapter)+test(hub): FK error mapping + permissions FK fixtures
mapError now distinguishes foreign-key violations (-> ErrInvalidInput, a bad reference) from unique-constraint violations (-> ErrAlreadyExists); previously both surfaced as a misleading 'already exists'/409. Seed the users/agents that group memberships and policy bindings reference (the Ent store enforces user/agent FK edges the raw store lacked), wrap remaining raw fixture/URL ids in tid(), and give the AddAgent fixtures slugs. All pkg/hub permissions tests pass.
* fix(hub): seed creator users for agent-created agents; cascade-delete subscriptions on hard agent delete

* test(hub): seed broker slug/name in dispatcher and project_cache fixtures (Ent validators)

* test(hub): use tid() in principal/agent URL paths; broker slug in template_bootstrap

* fix(entadapter): cascade-delete agents on project delete (raw-store parity); test(hub): seed FK users, broker_name, deterministic UUIDs

* test(hub): MaxOpenConns=1 for SQLite test store (serialize writes); tid() URLs + FK user seeds in events/stopall

* test(hub): unwrap over-wrapped tid() in unit tests (workspace/logfilter/gcp/web); valid-UUID NotFound cases; tid() scheduled-event URLs

* fix(ent): allow empty display_name (raw-store NOT NULL parity, email fallback); test(hub): seed FK owner users, UUID policy/broker/agent IDs in authz remediation

* feat(migrate): add Migration β tool (Ent-SQLite → Ent-Postgres)
Implements 'scion server migrate --from sqlite://... --to postgres://...' per postgres-strategy.md §7.3.
- entc.OpenSQLiteReadOnly: opens source with PRAGMA query_only=ON (no WAL write), MaxOpenConns=1 so the source is never mutated.
- entc.MigrateData: generic reflection-based, dependency-ordered copy of all 30 Ent entities (FK-ordered core first), idempotent (skips rows whose PK already exists), atomic per entity (txn), chunked CreateBulk, source/dest row-count verification after each entity, plus the Group.child_groups M2M edge. FK columns are plain fields so edges are preserved via setters.
- cmd/server migrate: DSN parsing (sqlite://, file:, bare path; postgres URL or keyword form), --keep-source default / --drop-source cutover, progress logging.
Verified end-to-end against live CloudSQL Postgres 16 (integration test + real CLI run): full copy, idempotent re-run, FK + M2M + value round-trips, --drop-source removal.
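The DSN forms listed for the migrate command (sqlite://, file:, bare path; postgres URL or keyword form) suggest a classifier along these lines. This is a rough sketch with hypothetical function names; the keyword-form heuristic in particular is an assumption and does not handle quoted values containing spaces.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyDSN guesses which backend a DSN addresses.
func classifyDSN(dsn string) (string, error) {
	dsn = strings.TrimSpace(dsn)
	switch {
	case dsn == "":
		return "", fmt.Errorf("empty DSN")
	case strings.HasPrefix(dsn, "postgres://"), strings.HasPrefix(dsn, "postgresql://"):
		return "postgres", nil
	case strings.HasPrefix(dsn, "sqlite://"), strings.HasPrefix(dsn, "file:"):
		return "sqlite", nil
	case isKeywordForm(dsn):
		return "postgres", nil
	default:
		return "sqlite", nil // bare filesystem path
	}
}

// isKeywordForm reports whether dsn looks like "host=... dbname=...".
// Heuristic (assumption): every whitespace-separated field is key=value and
// no key contains a path separator, so "/tmp/a=b.db" stays a bare path.
func isKeywordForm(dsn string) bool {
	fields := strings.Fields(dsn)
	if len(fields) == 0 {
		return false
	}
	for _, f := range fields {
		key, _, ok := strings.Cut(f, "=")
		if !ok || key == "" || strings.ContainsAny(key, `/\`) {
			return false
		}
	}
	return true
}

func main() {
	for _, dsn := range []string{
		"sqlite:///var/lib/scion/hub.db",
		"hub.db",
		"postgres://scion@db.example/scion",
		"host=db.example dbname=scion user=scion",
	} {
		kind, err := classifyDSN(dsn)
		fmt.Println(kind, err)
	}
}
```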
* feat(concurrency): dialect-aware multi-replica primitives for Postgres (P3-3..6)
Add cluster-coordination primitives so N stateless hub processes can share one Postgres, each degrading to a no-op on single-writer SQLite:
- store.AdvisoryLocker + entadapter TryAdvisoryLock (pg_try_advisory_lock on a dedicated conn); Scheduler.RegisterRecurringSingleton gates the heartbeat, stalled, purge, schedule-evaluator and github-health sweeps to one replica/tick.
- store.ScheduledEventClaimer + ClaimScheduledEvent atomic claim; fireEvent claims one-shot events before side effects (dedup across replica startup recovery).
- CompositeStore.RunSerializable: SERIALIZABLE + retry on 40001/40P01 (single run on SQLite) for future multi-row invariants.
- dbmetrics.StartPoolSampler feeds DB connection-pool gauges to the P0-5 scaffold; wired into StartBackgroundServices via SetDBMetrics.
Verified existing primitives correct (agent StateVersion CAS, FOR UPDATE sweeps, notification atomic dispatch). Found and documented the schedule SKIP LOCKED early-commit gap (lock released before the status transition), closed by the singleton evaluator. Audit + budget docs in scratchpad.
Tests: locking_test.go (advisory no-op, serializable, claim exactly-once incl. 8-way concurrent), pool_sampler_test.go.

* feat(hub): widen events to EventPublisher interface + Postgres LISTEN/NOTIFY publisher
P3-7: Decouple call sites from the concrete *ChannelEventPublisher.
- Add Subscribe(patterns...) (<-chan Event, func()) to the EventPublisher interface; implement it on noopEventPublisher (nil channel); *ChannelEventPublisher already had it.
- Factor the Publish* methods into a shared eventBuilder (sink func) so every backend emits identical subjects/payloads; ChannelEventPublisher embeds it.
- web.go (field + SetEventPublisher), messagebroker.go and notifications.go (field + constructor) now take EventPublisher; handlers_messages.go gates SSE on "not the no-op publisher" instead of a concrete type assertion.
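The P3-7 widening above might be pictured as follows: an interface with Subscribe, a no-op publisher that returns a nil channel, and a gate that asks "is this the no-op?" rather than asserting a concrete backend. Only the Subscribe signature and the noopEventPublisher name come from the commit; the single Publish method, the Event fields, and canSubscribe are simplifications assumed for this sketch.

```go
package main

import "fmt"

// Event is a simplified stand-in for the hub's event type.
type Event struct {
	Subject string
}

// EventPublisher is the widened interface: any backend can publish and
// hand out subscriptions. The real interface has several Publish* methods.
type EventPublisher interface {
	Publish(e Event)
	Subscribe(patterns ...string) (<-chan Event, func())
}

// noopEventPublisher drops events; Subscribe returns a nil channel, which
// never delivers, plus a no-op cancel func.
type noopEventPublisher struct{}

func (noopEventPublisher) Publish(Event) {}
func (noopEventPublisher) Subscribe(...string) (<-chan Event, func()) {
	return nil, func() {}
}

// chanPublisher is a toy in-memory backend; it ignores patterns.
type chanPublisher struct {
	ch chan Event
}

func (c *chanPublisher) Publish(e Event) {
	select {
	case c.ch <- e:
	default: // drop when the buffer is full (fire-and-forget)
	}
}

func (c *chanPublisher) Subscribe(...string) (<-chan Event, func()) {
	return c.ch, func() {}
}

// canSubscribe gates features such as SSE on "not nil and not the no-op",
// so any real backend qualifies. It checks only for a nil interface value.
func canSubscribe(p EventPublisher) bool {
	if p == nil {
		return false
	}
	_, isNoop := p.(noopEventPublisher)
	return !isNoop
}

func main() {
	var pubs = []EventPublisher{noopEventPublisher{}, &chanPublisher{ch: make(chan Event, 1)}}
	for _, p := range pubs {
		fmt.Printf("%T subscribable: %v\n", p, canSubscribe(p))
	}
}
```

Gating on the no-op rather than on *ChannelEventPublisher is what lets a second backend be added without touching the call sites.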
P3-8: PostgresEventPublisher over pgx LISTEN/NOTIFY (cross-replica delivery).
- Per-grove channels plus a global channel (flat exact-match); event type in the JSON envelope. Grove-scoped subjects publish to both the grove channel and the global channel; subscriptions group their patterns by resolved channel so an event is matched only against patterns that opted into the arriving channel (no double delivery).
- 8 KB NOTIFY limit handled by reference-and-refetch via scion_event_payloads (TTL-swept so every replica can refetch).
- PublishTx enrolls the NOTIFY in a caller transaction (atomic write+publish; rollback => no deliver). Delivery flows exclusively through the listener.
- Listener goroutine reconnects with backoff and re-LISTENs (resubscribe); dynamic LISTEN/UNLISTEN applied on a poll (WaitForNotification timeout does not invalidate the pgconn connection).
- Emits pkg/observability/dbmetrics signals (published/delivered/dropped, payload size, publish->deliver latency, reconnects, pool stats).
- cmd: newEventPublisher selects the backend by database driver (postgres => PostgresEventPublisher, else ChannelEventPublisher) with safe fallback.
Tests: routing/registry/payload-offload/metrics/transactional-executor unit tests run without a DB; cross-replica delivery, oversized round-trip, transactional rollback, and reconnect+resubscribe are gated behind SCION_TEST_POSTGRES_DSN.
go build ./... green; full pkg/hub suite green.
Note: server.go's equivalent type-assertion cleanup is left in the working tree (co-edited with concurrent P0-5/scheduler work) and is functionally optional; HEAD server.go already compiles against the widened interface.

* test(store): parameterize store suites over {sqlite, postgres} (P3-2)
Add pkg/store/enttest: a backend-selecting Ent client factory for the store test suites.
Default is in-memory SQLite; built with -tags integration and SCION_TEST_POSTGRES_URL set, it provisions a per-package ephemeral Postgres database (created/dropped via TestMain) and isolates each test in its own schema (search_path) so tests never observe each other's rows. Falls back to SQLite when the env var is unset. Route all entadapter and storetest helpers through enttest.NewClient so the same CRUD-parity oracle runs unchanged against either backend. Fix two real Postgres bugs surfaced by the new path: - entadapter/dialect.go ancestryContains: emit the bind parameter via Builder.Arg ($n on Postgres) instead of a literal '?' through ExprP, which was not rebound and produced a syntax error; and use jsonb_array_elements_text (the column is jsonb on Postgres, not json). - schedule_store_test ClaimPath: make the concurrent-claim assertion backend-aware. SQLite serializes (MaxOpenConns=1, no SKIP LOCKED) so every caller sees both due rows; Postgres uses FOR UPDATE SKIP LOCKED so concurrent callers may observe a disjoint subset (0..2) and must only never error or exceed 2. Verified: full SQLite suite green; storetest CRUD parity green on CloudSQL Postgres; entadapter green on Postgres (schedule ClaimPath fix confirmed). * fix(hub): start dispatcher/broker for any subscription-capable EventPublisher Wave C integration: newEventPublisher can now return a PostgresEventPublisher (LISTEN/NOTIFY) in addition to ChannelEventPublisher. The dispatcher/broker startup previously hard-asserted *ChannelEventPublisher, which silently skipped starting them under Postgres. Gate on (not noop and not nil) instead, matching the existing pattern in handlers_messages.go. 
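The "not noop and not nil" gate described in the commit above can be sketched as follows. The types here are simplified stand-ins, not the hub's actual definitions, and `canSubscribe` is a hypothetical helper name.

```go
package main

// Event and EventPublisher are simplified stand-ins for the hub's types.
type Event struct{ Subject string }

type EventPublisher interface {
	Subscribe(patterns ...string) (<-chan Event, func())
}

// noopEventPublisher returns a nil channel, as the commit message describes.
type noopEventPublisher struct{}

func (noopEventPublisher) Subscribe(...string) (<-chan Event, func()) {
	return nil, func() {}
}

// channelEventPublisher stands in for any subscription-capable backend.
type channelEventPublisher struct{}

func (*channelEventPublisher) Subscribe(...string) (<-chan Event, func()) {
	ch := make(chan Event)
	return ch, func() { close(ch) }
}

// canSubscribe gates on capability instead of asserting one concrete type,
// so a new backend starts the dispatcher/broker without further changes.
// Note: a typed-nil pointer stored in the interface is not caught by the
// nil check below.
func canSubscribe(p EventPublisher) bool {
	if p == nil {
		return false
	}
	_, isNoop := p.(noopEventPublisher)
	return !isNoop
}
```

Gating on the negative case means any backend added later is treated as subscription-capable by default.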
* fix(hub): harden Postgres event publish + verify wiring; lower PG pool default Task 1 — LISTEN/NOTIFY publish path: - Add TestPostgresIntegration_HandlerCreateProjectEmitsNotify: drives the real POST /api/v1/projects handler with a PostgresEventPublisher and asserts a pg_notify lands on scion_ev_global via an independent raw LISTEN — the exact capability the multi-replica live test probed. Verified PASSING against live CloudSQL, proving the handler -> s.events -> pg_notify wiring is correct end to end (the four pre-existing SCION_TEST_POSTGRES_DSN integration tests also pass). The multi-hub 'no NOTIFY' symptom was not reproducible against the current tree. - Bound the autocommit publish (Publish* methods) with publishTimeout (5s). These run synchronously on the caller's (request handler) goroutine and acquire from the event pool; on a connection-starved instance that acquire could block indefinitely, stalling CRUD and silently never emitting NOTIFY. The timeout converts that into a logged error + dropped event (publishing is fire-and-forget). PublishTx (transactional path) is unaffected. Task 2 — connection budget: - Lower the default Postgres MaxOpenConns 20 -> 10 so multiple replicas fit a modest connection budget (see CONNECTION-BUDGET.md). CloudSQL instance scion-postgres-test resized db-f1-micro -> db-g1-small and max_connections set to 100 (out of band). * test(store): add Postgres stress/integration suite (contention, isolation, pool, NOTIFY, migration, schema, multi-process) Add pkg/store/integrationtest/: a Postgres-only suite that exercises behavior the SQLite parity suites cannot reach. Gated by //go:build integration and SCION_TEST_POSTGRES_URL; skips cleanly otherwise. Coverage: - Contention: state_version CAS race (no lost updates, >=N-1 retries, final version==1+N), SKIP LOCKED / conditional-UPDATE event claim (single winner + disjoint drain), unique-key races (project slug, user email, agent slug). 
- Isolation: SERIALIZABLE conflict + RunSerializable retry recovery, REPEATABLE READ no-phantom snapshot, READ COMMITTED dirty-read prevention. - Pool: exhaustion + queued recovery, saturated pool honoring context deadline, long txn not starving short queries, healing after pg_terminate_backend. - LISTEN/NOTIFY: ordered burst no-drop, 8000B payload limit, listener reconnect/resume, cross-channel isolation. - Migration: 1000+ row counts + bounded-memory listing, idempotent re-migration. - Schema: NULL semantics, unicode/emoji, nested JSON + special chars, large-text non-truncation, TIMESTAMPTZ microsecond precision. - Multi-process: forks the test binary for cross-process advisory-lock exclusivity and cross-process NOTIFY delivery. Configurable concurrency via SCION_TEST_CONCURRENCY (default 10). Extend pkg/store/enttest with Active() and NewSchemaURL() so tests can open custom-pool clients and share a DSN with forked child processes; non-integration stubs keep the package API stable. * fix(db): recycle stale conns + keepalives; skip singleton tick on lock error Stale-connection pool stalls (CloudSQL drops idle conns after ~10m): - Add ConnMaxIdleTime to DatabaseConfig/PoolConfig (default 5m pg, 0 sqlite) and apply SetConnMaxIdleTime on the database/sql pool. - OpenPostgres now parses the DSN with pgx and opens via stdlib.OpenDB with TCP keepalive GUCs (idle 60s / interval 15s / count 4) and a 10s connect timeout, so a silently-dropped peer is detected instead of the first query after idle hanging on a dead socket. - pgx event pool (events_postgres.go): set keepalives + connect timeout on both the pool's ConnConfig and the dedicated listener connection, plus MaxConnIdleTime 5m / MaxConnLifetime 30m. Advisory-lock leader election (scheduler.go): - A lock-acquisition error no longer falls open to running the handler unguarded (which would duplicate singleton work across replicas); the tick is skipped and retried next interval. Added regression tests. 
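The fail-closed behaviour described for the singleton tick can be sketched as below. `tryLockFunc` and `runSingletonTick` are hypothetical names for illustration; the actual scheduler code is not shown in this PR description.

```go
package main

import "errors"

// tryLockFunc is a hypothetical signature: acquired reports whether this
// replica holds the lock; release frees it and is only used when acquired.
type tryLockFunc func() (acquired bool, release func(), err error)

// errLockUnavailable is an example error for a failed lock attempt.
var errLockUnavailable = errors.New("lock backend unavailable")

// runSingletonTick runs handler only when the lock was actually acquired.
// On a lock error it fails closed: the tick is skipped and the error is
// returned so the caller can log it and retry on the next interval.
func runSingletonTick(tryLock tryLockFunc, handler func()) (ran bool, err error) {
	acquired, release, err := tryLock()
	if err != nil {
		return false, err // never run the handler unguarded
	}
	if !acquired {
		return false, nil // another replica holds the lock this tick
	}
	defer release()
	handler()
	return true, nil
}
```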
Test harness (enttest/integrationtest): - Accept libpq keyword/value DSNs (not just URL form) when deriving the ephemeral db/schema/params; add WithConnParam helper. - Fix migration idempotency test's per-pass row-count expectation. * fix(store): bound advisory-lock conn checkout + unlock with short timeout TryAdvisoryLock checked a connection out of the pool and ran the unlock on the full 55s scheduler-handler context (acquire) and an unbounded context.Background() (release). On a pool that could not promptly serve a healthy connection, db.Conn() blocked for the entire 55s before failing with 'context deadline exceeded' on every tick; with several singleton handlers firing each 60s tick, those long-blocked goroutines and their pending pool connection requests piled up across ticks and kept the pool jammed (checked out client-side, idle server-side). The unbounded unlock was a second leak vector: if the held connection died mid critical-section, ExecContext could hang forever, so conn.Close() never ran and the connection leaked out of the pool permanently. Bind both the acquire (db.Conn + pg_try_advisory_lock) and the release (pg_advisory_unlock) to a 5s timeout so a bad tick fails fast and retries next tick instead of parking a goroutine for ~55s, and so a dead connection can never block release from freeing the conn. Lock semantics are unchanged: cancelling the acquire context tears down only that context, not the checked-out session that holds the lock. * feat(migrate): in-process migration α (legacy raw-SQL hub.db → Ent) Upgrade a legacy raw-SQL Hub database (the ~53-migration, 30-table schema from the removed pkg/store/sqlite store) to the consolidated Ent-backed SQLite schema, in-process on first boot, behind an automatic backup. pkg/ent/entc/migrate_alpha.go: - IsLegacyRawSQLSchema: detect via the schema_migrations sentinel + the legacy-only agents.agent_id column (no-op for an Ent/empty/absent file). 
- MigrateAlphaSQLite: backup (checkpoint WAL + copy to hub.db.bak.<ts>), AutoMigrate a fresh Ent schema, ATTACH the legacy file, copy every table with INSERT…SELECT (foreign_keys OFF), verify per-table row counts, then atomically swap the migrated file into place. - Data-driven column mapping (created_at→created, updated_at→updated, agents.agent_id→slug, policies→access_policies); bespoke SQL for the group_members/policy_bindings polymorphic splits and surrogate ids; groups.parent_id→group_child_groups edge. - Deterministic UUIDv5 remap for legacy non-UUID primary keys (internal signing-key secrets; plugin runtime-broker ids) with consistent rewrite of every foreign-key reference via a TEMP _id_remap table. - Tolerates missing legacy tables (older schema versions). cmd/server_foreground.go: detect + migrate in initStore's sqlite path, with a --no-auto-migrate operator opt-out (cmd/server.go). Validated end-to-end against four production hub.db files (scion-integration, -integration2, -demo, -gteam): exact row-count parity (up to ~19k rows), every entity reads back through the live Ent store, idempotent re-runs, and broker FK references resolve post-remap. Pre-existing dangling agent created_by/owner_id refs are faithfully preserved (loader runs FK-off). * fix(config): apply real Postgres pool size (leaked SQLite default of 1 starved the pool) The struct-level default for Database.MaxOpenConns/MaxIdleConns is 1 — the value SQLite REQUIRES to serialize writes. applyDatabasePoolDefaults only bumped postgres to a real pool when the value was <= 0, but a postgres deployment configured via env/driver override inherits the embedded default of 1, so the guard never fired and the Ent pool ran with a SINGLE connection. 
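The guard described above, next to the relaxed variant this commit adopts (values of 1 or less count as unset for Postgres, default 10, SQLite pinned to 1), might look roughly like this. Function names and the driver strings are assumptions for illustration.

```go
package main

// defaultPostgresMaxOpenConns is the pool default named in this PR.
const defaultPostgresMaxOpenConns = 10

// legacyMaxOpenConns models the earlier guard: only values <= 0 were
// treated as unset, so the embedded SQLite default of 1 leaked through.
func legacyMaxOpenConns(driver string, configured int) int {
	if driver == "postgres" && configured <= 0 {
		return defaultPostgresMaxOpenConns
	}
	return configured
}

// effectiveMaxOpenConns models the relaxed guard: for Postgres anything
// <= 1 counts as unset; explicit sizes of 2 or more are respected.
// SQLite stays pinned to a single connection.
func effectiveMaxOpenConns(driver string, configured int) int {
	switch driver {
	case "sqlite":
		return 1
	case "postgres":
		if configured <= 1 {
			return defaultPostgresMaxOpenConns
		}
		return configured
	default:
		return configured
	}
}
```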
Effect in production (both integration hubs): every singleton scheduler tick checks out the lone pool connection to hold its advisory lock, then blocks waiting for a second connection to do its work — a self-deadlock that resolves only at the 55s handler context deadline. All API requests serialize behind the one connection, so GET /api/v1/* served in ~55s across the board. Note env overrides could not paper over this: envKeyToConfigKey splits on every underscore, so SCION_SERVER_DATABASE_MAX_OPEN_CONNS maps to database.max.open.conns, not database.max_open_conns — silently ignored. Treat the leaked SQLite default (<= 1) as 'unset' for postgres so the pool default (10) applies; explicit sizing of 2+ is still respected. SQLite remains pinned to 1. Adds regression tests for all three cases. * docs: add multi-node broker dispatch and NFS workspace designs - broker-dispatch.md: DB-as-state-machine + LISTEN/NOTIFY pattern for cross-replica broker command routing and agent lifecycle dispatch - nfs-workspace.md: NFS workspace coordination for VM (host bind-mount) and K8s/Cloud Run (per-pod mount) runtime models * fix(store): address PR GoogleCloudPlatform#304 review — context leaks and DSN parsing Thread the server's cancellable context into initStore and initWebServer instead of using context.Background(), so that: - DB migrations and the health-check ping cancel on Ctrl+C during startup (medium-priority review comment). - The Postgres LISTEN/NOTIFY event publisher goroutine shuts down cleanly when the server exits, preventing connection leaks (high-priority review comment). Also fix parseSQLiteSourceDSN to handle the file:// prefix before the file: prefix, so that file:///var/lib/hub.db correctly resolves to /var/lib/hub.db instead of ///var/lib/hub.db. Add test cases for file:// and file:/// DSN forms. * docs: add project log for PR GoogleCloudPlatform#304 review fixes * fix(store): context leak in legacy migration & double file: prefix 1. 
Thread the server's cancellable context through maybeMigrateLegacySQLite → MigrateAlphaSQLite so that Ctrl+C during first-boot legacy migration aborts it instead of running with an uncancellable context.Background().
2. Guard against a double "file:" prefix when constructing the SQLite DSN. If the operator's database.url already starts with "file:", we no longer blindly prepend another "file:" prefix. Also correctly appends cache=shared with "&" when the DSN already contains query parameters.
* fix(store): rename ProjectTypeHubNative → ProjectTypeHubManaged (rebase fixup)
Upstream renamed hub-native to hub-managed while the PR was in flight. Update the two remaining references that the rebase conflict resolution missed.
---------
Co-authored-by: Scion <agent@scion.dev>
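The two DSN fixes in the commits above (checking file:// before file:, and avoiding a double "file:" prefix while choosing "?" or "&" for cache=shared) can be sketched as below. `sqliteSourcePath` and `buildSQLiteDSN` are hypothetical names; the real parseSQLiteSourceDSN may handle more forms.

```go
package main

import "strings"

// sqliteSourcePath strips a scheme prefix from a SQLite source DSN.
// The longer "file://" prefix is tried before "file:" so that
// file:///var/lib/hub.db yields /var/lib/hub.db, not ///var/lib/hub.db.
func sqliteSourcePath(dsn string) string {
	for _, prefix := range []string{"file://", "file:"} {
		if strings.HasPrefix(dsn, prefix) {
			return strings.TrimPrefix(dsn, prefix)
		}
	}
	return dsn
}

// buildSQLiteDSN prepends "file:" only when it is missing and appends
// cache=shared with "&" if the DSN already carries query parameters.
// It does not check whether cache=shared is already present.
func buildSQLiteDSN(url string) string {
	dsn := url
	if !strings.HasPrefix(dsn, "file:") {
		dsn = "file:" + dsn
	}
	sep := "?"
	if strings.Contains(dsn, "?") {
		sep = "&"
	}
	return dsn + sep + "cache=shared"
}
```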
…t token
TestClient_StartTokenRefresh exercised RefreshToken -> WriteTokenFile without isolating the token home, so running the suite inside a live agent container overwrote the real ~/.scion/scion-token with the test stub "refreshed-token". Every subsequent Hub call then 401'd with "compact JWS format must have three parts" / "unrecognized token format".
- Add SetTokenHome(t.TempDir()) to the test, matching its siblings.
- Guard WriteTokenFile: panic under `go test` unless SetTokenHome was called, so a forgotten isolation can never corrupt live state again. Reads remain unguarded (harmless; return empty when absent).
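The write guard described above can be sketched as follows. `tokenWriter` and its `underTest` field are hypothetical; the sketch keeps tokens in a map instead of touching the filesystem, and how the real code detects `go test` is not stated in the commit message.

```go
package main

// tokenWriter is a hypothetical stand-in for the client's token persistence.
type tokenWriter struct {
	underTest bool              // assumed flag: true when running under `go test`
	home      string            // empty until SetTokenHome is called
	written   map[string]string // path -> token, replaces real file writes here
}

func newTokenWriter(underTest bool) *tokenWriter {
	return &tokenWriter{underTest: underTest, written: map[string]string{}}
}

// SetTokenHome redirects token writes to an isolated directory.
func (w *tokenWriter) SetTokenHome(dir string) { w.home = dir }

// WriteTokenFile panics under test when no isolated home was set, so a
// forgotten SetTokenHome fails loudly instead of overwriting live state.
func (w *tokenWriter) WriteTokenFile(token string) {
	if w.underTest && w.home == "" {
		panic("WriteTokenFile called under test without SetTokenHome")
	}
	home := w.home
	if home == "" {
		home = "~/.scion" // default location named in the commit message
	}
	w.written[home+"/scion-token"] = token
}
```

One option for the detection itself, assuming Go 1.21 or later, is the standard library's `testing.Testing()`.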
…ecycle + message routing (GoogleCloudPlatform#305) * Add canonical engineering glossary (GLOSSARY.md) (#102) * Add engineering glossary (GLOSSARY.md) with canonical terms and cleanup tracker Add a root-level GLOSSARY.md capturing canonical Scion terminology in the ubiquitous-language format (preferred term + synonyms to avoid), grouped by domain cluster, plus an Exceptions & Future Cleanup section tracking known naming-convergence work. Link it from agents.md as the canonical engineering glossary. * Revise glossary: broker reframe, Event Bus, Hub-managed, and term refinements Refine entries from review: redefine Message Broker as the pluggable messaging-integration system (add Broker plugin, Built-in broker); add Event Bus for the NATS real-time/event capability; collapse hub-native/Hub Workspace into Hub-managed project/workspace; tighten Template (harness-agnostic, optional default harness-config), Skill (template-only, Agent Skills link), Profile (named runtime-broker settings bundle), Harness/Harness-config; reframe Hub as the control plane in both modes; add Group and Message Group. Expand Exceptions & Future Cleanup to nine tracked items. * Glossary: restructure headings, add cross-refs, modes table, and new terms - Retitle to "Scion Glossary"; drop the "Language" wrapper and promote the thematic categories to top-level sections - Add an Operations section (Attach, Dispatch) and move Profile next to Runtime Broker - Add a Local/Workstation/Hosted comparison table and "See also" cross-refs across the main confusable term clusters - Reframe the intro around the three-way broker collision (incl. 
Event Bus) and defer to the disambiguation rule; sentence-case "Shared directory" - Add canonical entries for Secret, Notification, and Schedule - Add a "Potential Future Additions" section cataloguing candidate terms * Glossary: remove Exceptions & Future Cleanup tracker The cleanup items are now tracked by dedicated agents that open GitHub issues and implementation PRs, so the staged tracker no longer lives in the glossary. Reword the two intro/disambiguation references that pointed at the removed section to point at GitHub issues instead. --------- Co-authored-by: Preston Holmes <ptone@google.com> * P0-1: switch Postgres driver from lib/pq to pgx/v5 stdlib - Add github.com/jackc/pgx/v5/stdlib (registers as "pgx") - driver_postgres.go: blank import pgx stdlib instead of lib/pq - OpenPostgres: open via sql.Open("pgx", dsn) + entsql.OpenDB - Introduce PoolConfig (applied to *sql.DB); thread through OpenSQLite/OpenPostgres and update all callers - go mod tidy drops lib/pq * P0-2: add connection pool config to DatabaseConfig - DatabaseConfig gains MaxOpenConns / MaxIdleConns / ConnMaxLifetime plus ConnMaxLifetimeDuration() helper - DefaultGlobalConfig sets sqlite pool defaults (MaxOpenConns=1, load-bearing for write serialization) - applyDatabasePoolDefaults fills postgres defaults (20/5/30m) and forces sqlite MaxOpenConns=1; called in both load paths - Mirror fields in V1DatabaseConfig + both conversion directions - Wire pool settings into entc.OpenSQLite in initStore * P0-3/P0-4: CRUD-parity test harness + spec-driven fixture generator P0-3: pkg/store/storetest/ — backend-agnostic, table-driven CRUD oracle. A Factory(t) -> store.Store is injected; generic Domain[T] descriptors drive Create/Read/Update/Delete (+optional soft-delete)/List-paginate/List-filter. Ships group + policy domains and runs green against today's CompositeStore (SQLite base + Ent DB). Ready to accept a postgresFactory for P3-2. 
P0-4: internal/fixturegen/ — Go-defined spec seeding >=1 row per table across all 30 domain tables, with edge cases (NULL optionals, max-length strings, nested/unicode JSON, soft-deleted agent, BLOB). Deterministic. 'go run ./internal/fixturegen' emits testdata/hub-v46-fixture.db, prints a 30-table coverage report, and caches the blob to the scratchpad mount. CI gate fails if any table has zero rows. * feat(ent): add 23 new Ent schemas for full table parity (P1-2 + P1-3) * P2: port notification + gcp/github/token domains to Ent entadapter Add Ent-backed implementations of the notification, GCP service account, GitHub App installation, and user access token store sub-interfaces: - notification_store.go: NotificationStore (subscriptions, notifications, templates). Dispatch uses an atomic conditional update as the multi-replica claim primitive, and an optional NotificationPublisher designs in the LISTEN/NOTIFY fan-out for created/dispatched events. - external_store.go: GCPServiceAccountStore + GitHubInstallationStore + UserAccessTokenStore. GitHub create is idempotent (INSERT OR IGNORE semantics), repositories/scopes are JSON, default_scopes is CSV, and tokens support key-hash lookup. Legacy api_keys is intentionally not surfaced. - storetest: add GCPServiceAccount, SubscriptionTemplate, and NotificationSubscription CRUD-parity domains. Does not modify composite.go. * P2: port schedule, maintenance, message domains to Ent entadapter - schedule_store.go: ScheduleStore + ScheduledEventStore sub-interfaces with dialect-aware SELECT FOR UPDATE SKIP LOCKED claim helper for the ListDueSchedules / ListPendingScheduledEvents job-claim paths (plain SELECT on SQLite, SKIP LOCKED on Postgres). - maintenance_store.go: run-state RMW, AbortRunningMaintenanceOps, Go-side seed (uuid.New) replacing SQLite randomblob() UUID seeds. - message_store.go: CRUD, read flags, PurgeOldMessages, design-in PublishUserMessage hook for Postgres LISTEN/NOTIFY. 
- pkg/ent/client_driver.go: hand-written Client.Driver() accessor for dialect detection + raw locking queries. * feat(entadapter): port user + allowlist/invite domains to Ent (P2) Implements the Ent-backed store adapters for the user and allowlist/invite domains, plus their CRUD-parity oracle descriptors. pkg/store/entadapter/user_store.go (store.UserStore): - CreateUser/GetUser/GetUserByEmail/UpdateUser/UpdateUserLastSeen/ DeleteUser/ListUsers. - Case-insensitive email: emails are normalized to lower case on write (so the plain unique index enforces case-insensitive uniqueness, equivalent to the legacy UNIQUE COLLATE NOCASE) and matched with EmailEqualFold (lower(email)=lower($1)) on read. ent codegen + AutoMigrate cannot emit a real lower(email) functional index across both SQLite (tests) and Postgres, so the invariant is enforced at the port layer. - Offset-based pagination matching the legacy SQLite store. pkg/store/entadapter/allowlist_store.go (store.AllowListStore + store.InviteCodeStore): - Full allow-list + invite-code CRUD. - BulkAddAllowListEntries uses CreateBulk + OnConflictColumns(email). Ignore() for race-safe INSERT-OR-IGNORE; added/skipped counts mirror the legacy per-row semantics (existing + within-batch dups skipped). - IncrementInviteUseCount is a single atomic conditional UPDATE (revoked=false AND not expired AND (max_uses=0 OR use_count<max_uses)), which is race-free on both backends without SELECT...FOR UPDATE. The sql/lock feature is enabled and ForUpdate is available for genuine multi-statement RMW paths. - ListAllowListEntriesWithInvites batch-joins invite codes (invite_id is a plain column, not an Ent edge). Schema: - pkg/ent/schema/user.go: add nillable last_seen field (+ index) needed by UpdateUserLastSeen / lastSeen sort; document the case-insensitive email strategy. - pkg/ent/generate.go: enable --feature sql/upsert,sql/lock (required for OnConflict and ForUpdate). 
Tests (all passing): - pkg/store/storetest/domains_user.go: UserDomain, AllowListDomain, InviteCodeDomain oracle descriptors (kept in a separate file to avoid contending on domains.go). - entadapter oracle test runs the shared CRUD-parity suite directly against the new adapters; behavior tests cover case-insensitivity, bulk idempotency, conditional increment, stats, and the invite join. NOTE: Generated Ent code under pkg/ent/** is intentionally NOT included. This is a shared worktree where sibling port agents concurrently modify schemas and the same feature flags; the generated code must be regenerated at wave integration via: go generate ./pkg/ent/... Verified locally that regeneration + full build + tests pass. Per P2 scope: composite.go wiring and ensureEntUser shadow removal are deferred to P2-collapse. * P2: port secret/env_var + template/harness_config domains to Ent Add Ent-backed store implementations for the secret/env and template/harness domains, mirroring the legacy SQLite semantics: - entadapter/secret_store.go: SecretStore implementing store.SecretStore + store.EnvVarStore. Polymorphic (scope, scope_id) addressing, COALESCE target->key projection, version bump on update, get-then-update upsert, and transitive ListProgenySecrets via a created_by IN-list over the ancestor set (user scope + allow_progeny only; encrypted value withheld). - entadapter/template_store.go: TemplateStore implementing store.TemplateStore + store.HarnessConfigStore. base_template hierarchy, scope/project_id backwards-compat lookups, content_hash, JSON config/files columns, DeleteByScope. Subscription templates are owned by NotificationStore. - Direct Ent unit tests incl. a progeny-inheritance parity test. - storetest: Template/HarnessConfig/Secret/EnvVar domain descriptors wired into RunStoreSuite for cross-backend CRUD parity. 
* P2: port project/broker + brokersecret domains to Ent Port the project/broker domain (projects, runtime_brokers, project_contributors, project_sync_state) and the broker-auth domain (broker_secrets, broker_join_tokens) from raw SQL to Ent adapters. - pkg/store/entadapter/project_store.go: implements ProjectStore, RuntimeBrokerStore, ProjectProviderStore and ProjectSyncStateStore. * provider + sync-state upserts use Ent OnConflict().UpdateNewValues() (sql/upsert) keyed on the (project_id, broker_id) unique index. * runtime broker heartbeat/update use an optimistic version-CAS loop on a new internal lock_version token, serializing concurrent writers portably across SQLite (tests) and Postgres without SELECT ... FOR UPDATE. * slug lookups support case-insensitive matching (EqualFold). * project computed fields (AgentCount, ActiveBrokerCount, ProjectType) are derived via Ent queries, matching the legacy SQLite store. - pkg/store/entadapter/brokersecret_store.go: implements BrokerSecretStore (per-broker HMAC secrets + short-lived join tokens, expiry cleanup). - Project Ent schema: add operational fields for full parity (default_runtime_broker_id, shared_dirs, github_*, git_identity). - RuntimeBroker Ent schema: relax vestigial type column to Optional, add internal lock_version concurrency token. - Regenerate Ent with sql/upsert,sql/lock features. - storetest: add Project, RuntimeBroker, BrokerSecret and BrokerJoinToken CRUD-parity domains. - Unit tests for both adapters. Per the integration plan, composite.go wiring and ensureEntProject shadow removal are deferred to P2-collapse. * P2: port agent domain to Ent entadapter (XL) * chore(ent): regenerate Ent code for all 30 entity schemas Regenerated with --feature sql/upsert,sql/lock to support OnConflict upserts and ForUpdate/SKIP LOCKED job claims. 
* P2-collapse: collapse dual-DB into single Ent store Wire all Ent-backed sub-stores into CompositeStore via embedding, removing the raw-SQL base store and the User/Agent/Project shadow-sync machinery (ensureEntUser/ensureEntAgent/ensureEntProject). CompositeStore now serves every domain from a single Ent client and implements Close/Ping/Migrate directly. Collapse initStore() to open one Ent SQLite DB (no _ent shadow DSN, no MigrateGroveToProjectData, no raw sqlite.New). Register the User, AllowList, and InviteCode domains in the storetest CRUD-parity suite. Update entadapter tests for the single-DB NewCompositeStore(client) signature. go build ./... green; go test ./pkg/store/entadapter/... ./pkg/store/storetest/... green. * P2-delete: remove raw-SQL store implementation Delete the ~6k-LOC raw-SQL store (sqlite.go) and its per-domain sibling files (brokersecret, gcp_service_account, github_installation, maintenance, messages, notification, project_sync_state, schedule, scheduled_event) plus their tests, including the inline schema-migration scaffold. Keep driver.go, which registers the pure-Go SQLite driver used by Ent's SQLite backend. Repoint the two non-test consumers to the Ent-backed store: - cmd/hub_secret_migrate.go now opens an Ent client + CompositeStore. - internal/fixturegen opens via entc and seeds the Ent schema's *sql.DB. go build ./... green; no remaining production references to the raw store. * test: compile-migrate downstream suites to Ent store + fix signing-key PK Replace the removed raw-SQL store in downstream tests with an Ent-backed newTestStore helper (pkg/hub, pkg/secret) and update cmd/server_test.go and internal/fixturegen tests. Port the 8 raw-SQL DB() access sites in hub tests via a new CompositeStore.DB() escape-hatch accessor. Fix a production bug surfaced by the collapse: hub/server.go signingKeySecretID generated a non-UUID secret primary key, which the Ent secret store rejects; it now derives a deterministic UUIDv5. go build ./... 
green; entadapter and storetest suites green. NOTE: hub/secret/fixturegen suites now COMPILE but many tests still fail because their fixtures seed non-UUID string IDs that the UUID-PK Ent schema rejects; addressed in follow-up commits (tid() helper). * test(hub): map non-UUID fixture IDs to UUIDs via tid() helper Wrap human-readable test identifiers in tid() (deterministic UUIDv5) so the UUID-PK Ent store accepts them while preserving cross-reference consistency and ID-equality assertions. Reduces pkg/hub failures from 611 to 79; remaining failures are behavioral, not ID-format, and are addressed separately. * fix(store): seed maintenance ops in Migrate; initStore uses Migrate Restore raw-SQL parity: CompositeStore.Migrate now runs AutoMigrate and seeds built-in maintenance operations (the raw store seeded these in its migrations). initStore and hub test helpers call s.Migrate() so production and tests seed consistently. Fixes the maintenance-operation hub tests (404 'Operation not found'). pkg/hub failures 79 -> 71. * test(hub): satisfy Ent NotEmpty validators in fixtures Add slugs/broker names to test fixtures that previously relied on the raw store's lenient (no-validator) inserts: project/agent slugs in the logs test helper, broker slugs in embedded/profile/authz fixtures, and BrokerName on envgather ProjectProvider literals. pkg/hub failures 71 -> 57. * fix(entadapter): Get-by-id returns ErrNotFound for non-UUID identifiers Restore raw-SQL store parity: a malformed identifier cannot match any UUID primary key, so get-by-id lookups now report store.ErrNotFound instead of store.ErrInvalidInput. This matches the raw store (a lookup with a bad id simply returned no row) and is what callers depend on — e.g. resolveTemplate passes a template *name* to GetTemplate and relies on ErrNotFound to fall back to slug-based resolution. New parseGetID helper applied across all 17 get-by-id methods. pkg/hub failures 56 -> 40; entadapter/storetest stay green. 
* test(hub): fix store-less id wraps and project-route URL paths - controlchannel_client_test: revert tid() wraps (store-less path-builder test; IDs must match the expected literal paths). - github/envgather: project-scoped route handlers resolve the project by UUID id, so build paths with tid(rawID) via fmt.Sprintf instead of the old raw-id literal. pkg/hub failures 40 -> 32. * test(hub): unwrap projectIDFromServiceAccountEmail expectation The tid() sweep over-wrapped a non-ID expected value in a pure-function test; restore the literal GCP project id. * fix(ent): GCPServiceAccount.project_id is a string, not a UUID The GCP service account project_id holds the GCP *cloud project* identifier (e.g. 'my-project-123'), a free-form string — not a UUID. The schema declared it field.UUID, so entadapter CreateGCPServiceAccount/Update did parseUUID(sa.ProjectID) and rejected real GCP project ids, breaking SA mint/create with a 400 in production (storetest masked it by passing a UUID). Change the schema field to field.String, regenerate Ent, and store/read project_id as a string in external_store.go. Fixes ~7 hub GCP tests; pkg/hub 31 -> 23. * test(hub): fix GCP SA project-id assertion and project-settings id Unwrap the over-wrapped 'my-project' expectation now that project_id is a string, and wrap the dynamic project-settings project ID with tid(). * test(hub): revert tid() over-wraps in store-less events_test events_test exercises the in-memory ChannelEventPublisher directly; its ProjectID/IDs are subject-string components, not stored UUIDs. The tid() sweep wrongly rewrote them so published subjects no longer matched the subscriptions (timeouts). Restore the literal values. pkg/hub 19 -> 12. 
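The tid() helper referenced throughout these fixture commits is described as a deterministic UUIDv5 mapping for human-readable test ids. A minimal standard-library sketch of that idea is below; the choice of the RFC 4122 DNS namespace is an assumption, since the namespace the project uses is not stated.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
)

// nsDNS is the RFC 4122 DNS namespace UUID; using it here is an assumption.
var nsDNS = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// tid maps a readable fixture name to a UUIDv5 string: SHA-1 over the
// namespace followed by the name, truncated to 16 bytes, with the version
// and variant bits set. The same name always yields the same UUID, so
// cross-references between fixtures stay consistent.
func tid(name string) string {
	h := sha1.New()
	h.Write(nsDNS[:])
	h.Write([]byte(name))
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant

	s := hex.EncodeToString(u[:])
	return s[0:8] + "-" + s[8:12] + "-" + s[12:16] + "-" + s[16:20] + "-" + s[20:32]
}
```

Determinism is why the over-wrapped cases above broke: a store-less test comparing against a literal id no longer matches once the id has been hashed.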
* test(hub): fix maintenance-run path and notifications agentId queries
Use tid() UUIDs in the maintenance run-detail path and the notifications agentId query params; guard list indexing with require.Len so a mismatch fails cleanly instead of panicking (panics truncate the package run).
* test(hub): wrap remaining fixture IDs revealed after panic-cascade cleared
Panics ([0] on empty lists) had been truncating the package run, hiding many failures and starving the tid() sweep. With those guarded, sweep the newly reached tests: wrap dynamic rune-suffix IDs and the setupProjectWithBroker / seedCreatedAgentForHarnessTest helper IDs, and convert raw query-param project IDs to tid(). No UUID-parse errors remain in pkg/hub.
* test(hub): unwrap tid() in scheduler_test (mock store, raw ids)
scheduler_test uses an in-memory mockScheduledEventStore, not the Ent store, so its ids need no UUIDs; the erroneous tid() wraps broke raw getEvent lookups and caused a nil-pointer panic that truncated the package run.
* fix(ent): Template.harness may be empty (raw-store parity)
A template imported from a directory that declares no harness type has an empty harness; the raw-SQL store stored it, but the Ent NotEmpty validator made BootstrapTemplatesFromDir silently skip such templates. Drop NotEmpty and regenerate. Removing the [0]-on-empty panics this caused un-truncates the hub package run (true failure count now visible).
* test(hub): wrap dynamic fixture IDs in wake/workspace/signing-key tests
Wrap tid() around the wake_test, setupWorkspaceProject, and empty-value signing-key secret IDs now reachable after panic removal. No panics in the hub package run.
* test(hub): convert raw-id URL path segments to tid()
Build GET/PUT/DELETE paths for agents/projects/brokers/templates/harness-configs and workspace sync routes from tid(rawID) so the by-id handlers resolve the entity (raw ids no longer match the UUID PKs). pkg/hub 93 -> 80.
* fix(hub): seed creator users for agent-created agents; cascade-delete subscriptions on hard agent delete * test(hub): seed broker slug/name in dispatcher and project_cache fixtures (Ent validators) * fix(entadapter): cascade-delete agents on project delete (raw-store parity); test(hub): seed FK users, broker_name, deterministic UUIDs * test(hub): MaxOpenConns=1 for SQLite test store (serialize writes); tid() URLs + FK user seeds in events/stopall * test(hub): unwrap over-wrapped tid() in unit tests (workspace/logfilter/gcp/web); valid-UUID NotFound cases; tid() scheduled-event URLs * fix(ent): allow empty display_name (raw-store NOT NULL parity, email fallback); test(hub): seed FK owner users, UUID policy/broker/agent IDs in authz remediation * feat(migrate): add Migration β tool (Ent-SQLite → Ent-Postgres) Implements 'scion server migrate --from sqlite://... --to postgres://...' per postgres-strategy.md §7.3. - entc.OpenSQLiteReadOnly: opens source with PRAGMA query_only=ON (no WAL write), MaxOpenConns=1 so the source is never mutated. - entc.MigrateData: generic reflection-based, dependency-ordered copy of all 30 Ent entities (FK-ordered core first), idempotent (skips rows whose PK already exists), atomic per entity (txn), chunked CreateBulk, source/dest row-count verification after each entity, plus the Group.child_groups M2M edge. FK columns are plain fields so edges are preserved via setters. - cmd/server migrate: DSN parsing (sqlite://, file:, bare path; postgres URL or keyword form), --keep-source default / --drop-source cutover, progress logging. Verified end-to-end against live CloudSQL Postgres 16 (integration test + real CLI run): full copy, idempotent re-run, FK + M2M + value round-trips, --drop-source removal. 
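The per-entity copy described for Migration β (chunked, idempotent by primary key, with a row-count check afterwards) can be sketched in memory as below. `row`, `table` and `copyEntity` are hypothetical names; the per-entity transaction and the Ent CreateBulk call are omitted, and the count check assumes the destination holds no rows beyond those in the source.

```go
package main

import "fmt"

// row and table are in-memory stand-ins for one entity's rows.
type row struct{ ID, Value string }

type table map[string]row // keyed by primary key

// copyEntity copies src into dst in chunks, skips rows whose primary key
// already exists (so a re-run copies nothing), and then verifies that the
// destination row count matches the source.
func copyEntity(src []row, dst table, chunkSize int) (copied int, err error) {
	if chunkSize < 1 {
		chunkSize = 1
	}
	for start := 0; start < len(src); start += chunkSize {
		end := start + chunkSize
		if end > len(src) {
			end = len(src)
		}
		for _, r := range src[start:end] {
			if _, exists := dst[r.ID]; exists {
				continue // idempotent: already migrated
			}
			dst[r.ID] = r
			copied++
		}
	}
	if len(dst) != len(src) {
		return copied, fmt.Errorf("row count mismatch: source %d, destination %d", len(src), len(dst))
	}
	return copied, nil
}
```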
* feat(concurrency): dialect-aware multi-replica primitives for Postgres (P3-3..6) Add cluster-coordination primitives so N stateless hub processes can share one Postgres, each degrading to a no-op on single-writer SQLite: - store.AdvisoryLocker + entadapter TryAdvisoryLock (pg_try_advisory_lock on a dedicated conn); Scheduler.RegisterRecurringSingleton gates the heartbeat, stalled, purge, schedule-evaluator and github-health sweeps to one replica/tick. - store.ScheduledEventClaimer + ClaimScheduledEvent atomic claim; fireEvent claims one-shot events before side effects (dedup across replica startup recovery). - CompositeStore.RunSerializable: SERIALIZABLE + retry on 40001/40P01 (single run on SQLite) for future multi-row invariants. - dbmetrics.StartPoolSampler feeds DB connection-pool gauges to the P0-5 scaffold; wired into StartBackgroundServices via SetDBMetrics. Verified existing primitives correct (agent StateVersion CAS, FOR UPDATE sweeps, notification atomic dispatch). Found and documented the schedule SKIP LOCKED early-commit gap (lock released before the status transition), closed by the singleton evaluator. Audit + budget docs in scratchpad. Tests: locking_test.go (advisory no-op, serializable, claim exactly-once incl. 8-way concurrent), pool_sampler_test.go. * feat(hub): widen events to EventPublisher interface + Postgres LISTEN/NOTIFY publisher P3-7: Decouple call sites from the concrete *ChannelEventPublisher. - Add Subscribe(patterns...) (<-chan Event, func()) to the EventPublisher interface; implement it on noopEventPublisher (nil channel) — *ChannelEventPublisher already had it. - Factor the Publish* methods into a shared eventBuilder (sink func) so every backend emits identical subjects/payloads; ChannelEventPublisher embeds it. - web.go (field + SetEventPublisher), messagebroker.go and notifications.go (field + constructor) now take EventPublisher; handlers_messages.go gates SSE on "not the no-op publisher" instead of a concrete type assertion. 
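The P3-7 change can be pictured with a trimmed interface. The sketch below is an assumption-laden illustration: the `Event` fields, the `canStream` helper, and `fakePublisher` are hypothetical, and only `Subscribe`'s shape and the nil-channel behavior of the no-op publisher come from the commit text.

```go
package main

import "fmt"

// Event is a stand-in for the hub's event type; the fields are assumed.
type Event struct {
	Subject string
	Payload []byte
}

// EventPublisher is a trimmed version of the widened interface, keeping only
// the Subscribe method described in the commit message.
type EventPublisher interface {
	Subscribe(patterns ...string) (<-chan Event, func())
}

// noopEventPublisher delivers nothing.
type noopEventPublisher struct{}

// Subscribe returns a nil channel and a no-op unsubscribe. A receive from a
// nil channel blocks forever, so callers should not wait on it unconditionally.
func (noopEventPublisher) Subscribe(patterns ...string) (<-chan Event, func()) {
	return nil, func() {}
}

// fakePublisher is a hypothetical live backend used only for this example.
type fakePublisher struct{ ch chan Event }

func (f fakePublisher) Subscribe(patterns ...string) (<-chan Event, func()) {
	return f.ch, func() {}
}

// canStream is a hypothetical gate: stream only when the publisher is
// present and is not the no-op implementation.
func canStream(p EventPublisher) bool {
	if p == nil {
		return false
	}
	_, isNoop := p.(noopEventPublisher)
	return !isNoop
}

func main() {
	fmt.Println("noop can stream:", canStream(noopEventPublisher{}))
	fmt.Println("fake can stream:", canStream(fakePublisher{ch: make(chan Event, 1)}))
}
```

Gating on "is not the no-op" rather than "is the channel publisher" lets a second backend, such as the LISTEN/NOTIFY publisher, pass the same check without changes at the call site.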
P3-8: PostgresEventPublisher over pgx LISTEN/NOTIFY (cross-replica delivery). - Per-grove channels plus a global channel (flat exact-match); event type in the JSON envelope. Grove-scoped subjects publish to both the grove channel and the global channel; subscriptions group their patterns by resolved channel so an event is matched only against patterns that opted into the arriving channel (no double delivery). - 8 KB NOTIFY limit handled by reference-and-refetch via scion_event_payloads (TTL-swept so every replica can refetch). - PublishTx enrolls the NOTIFY in a caller transaction (atomic write+publish; rollback => no deliver). Delivery flows exclusively through the listener. - Listener goroutine reconnects with backoff and re-LISTENs (resubscribe); dynamic LISTEN/UNLISTEN applied on a poll (WaitForNotification timeout does not invalidate the pgconn connection). - Emits pkg/observability/dbmetrics signals (published/delivered/dropped, payload size, publish->deliver latency, reconnects, pool stats). - cmd: newEventPublisher selects the backend by database driver (postgres => PostgresEventPublisher, else ChannelEventPublisher) with safe fallback. Tests: routing/registry/payload-offload/metrics/transactional-executor unit tests run without a DB; cross-replica delivery, oversized round-trip, transactional rollback, and reconnect+resubscribe are gated behind SCION_TEST_POSTGRES_DSN. go build ./... green; full pkg/hub suite green. Note: server.go's equivalent type-assertion cleanup is left in the working tree (co-edited with concurrent P0-5/scheduler work) and is functionally optional — HEAD server.go already compiles against the widened interface. * test(store): parameterize store suites over {sqlite, postgres} (P3-2) Add pkg/store/enttest: a backend-selecting Ent client factory for the store test suites. 
Default is in-memory SQLite; built with -tags integration and SCION_TEST_POSTGRES_URL set, it provisions a per-package ephemeral Postgres database (created/dropped via TestMain) and isolates each test in its own schema (search_path) so tests never observe each other's rows. Falls back to SQLite when the env var is unset. Route all entadapter and storetest helpers through enttest.NewClient so the same CRUD-parity oracle runs unchanged against either backend. Fix two real Postgres bugs surfaced by the new path: - entadapter/dialect.go ancestryContains: emit the bind parameter via Builder.Arg ($n on Postgres) instead of a literal '?' through ExprP, which was not rebound and produced a syntax error; and use jsonb_array_elements_text (the column is jsonb on Postgres, not json). - schedule_store_test ClaimPath: make the concurrent-claim assertion backend-aware. SQLite serializes (MaxOpenConns=1, no SKIP LOCKED) so every caller sees both due rows; Postgres uses FOR UPDATE SKIP LOCKED so concurrent callers may observe a disjoint subset (0..2) and must only never error or exceed 2. Verified: full SQLite suite green; storetest CRUD parity green on CloudSQL Postgres; entadapter green on Postgres (schedule ClaimPath fix confirmed). * fix(hub): harden Postgres event publish + verify wiring; lower PG pool default Task 1 — LISTEN/NOTIFY publish path: - Add TestPostgresIntegration_HandlerCreateProjectEmitsNotify: drives the real POST /api/v1/projects handler with a PostgresEventPublisher and asserts a pg_notify lands on scion_ev_global via an independent raw LISTEN — the exact capability the multi-replica live test probed. Verified PASSING against live CloudSQL, proving the handler -> s.events -> pg_notify wiring is correct end to end (the four pre-existing SCION_TEST_POSTGRES_DSN integration tests also pass). The multi-hub 'no NOTIFY' symptom was not reproducible against the current tree. - Bound the autocommit publish (Publish* methods) with publishTimeout (5s). 
These run synchronously on the caller's (request handler) goroutine and acquire from the event pool; on a connection-starved instance that acquire could block indefinitely, stalling CRUD and silently never emitting NOTIFY. The timeout converts that into a logged error + dropped event (publishing is fire-and-forget). PublishTx (transactional path) is unaffected. Task 2 — connection budget: - Lower the default Postgres MaxOpenConns 20 -> 10 so multiple replicas fit a modest connection budget (see CONNECTION-BUDGET.md). CloudSQL instance scion-postgres-test resized db-f1-micro -> db-g1-small and max_connections set to 100 (out of band). * test(store): add Postgres stress/integration suite (contention, isolation, pool, NOTIFY, migration, schema, multi-process) Add pkg/store/integrationtest/: a Postgres-only suite that exercises behavior the SQLite parity suites cannot reach. Gated by //go:build integration and SCION_TEST_POSTGRES_URL; skips cleanly otherwise. Coverage: - Contention: state_version CAS race (no lost updates, >=N-1 retries, final version==1+N), SKIP LOCKED / conditional-UPDATE event claim (single winner + disjoint drain), unique-key races (project slug, user email, agent slug). - Isolation: SERIALIZABLE conflict + RunSerializable retry recovery, REPEATABLE READ no-phantom snapshot, READ COMMITTED dirty-read prevention. - Pool: exhaustion + queued recovery, saturated pool honoring context deadline, long txn not starving short queries, healing after pg_terminate_backend. - LISTEN/NOTIFY: ordered burst no-drop, 8000B payload limit, listener reconnect/resume, cross-channel isolation. - Migration: 1000+ row counts + bounded-memory listing, idempotent re-migration. - Schema: NULL semantics, unicode/emoji, nested JSON + special chars, large-text non-truncation, TIMESTAMPTZ microsecond precision. - Multi-process: forks the test binary for cross-process advisory-lock exclusivity and cross-process NOTIFY delivery. 
Configurable concurrency via SCION_TEST_CONCURRENCY (default 10). Extend pkg/store/enttest with Active() and NewSchemaURL() so tests can open custom-pool clients and share a DSN with forked child processes; non-integration stubs keep the package API stable. * fix(db): recycle stale conns + keepalives; skip singleton tick on lock error Stale-connection pool stalls (CloudSQL drops idle conns after ~10m): - Add ConnMaxIdleTime to DatabaseConfig/PoolConfig (default 5m pg, 0 sqlite) and apply SetConnMaxIdleTime on the database/sql pool. - OpenPostgres now parses the DSN with pgx and opens via stdlib.OpenDB with TCP keepalive GUCs (idle 60s / interval 15s / count 4) and a 10s connect timeout, so a silently-dropped peer is detected instead of the first query after idle hanging on a dead socket. - pgx event pool (events_postgres.go): set keepalives + connect timeout on both the pool's ConnConfig and the dedicated listener connection, plus MaxConnIdleTime 5m / MaxConnLifetime 30m. Advisory-lock leader election (scheduler.go): - A lock-acquisition error no longer falls open to running the handler unguarded (which would duplicate singleton work across replicas); the tick is skipped and retried next interval. Added regression tests. Test harness (enttest/integrationtest): - Accept libpq keyword/value DSNs (not just URL form) when deriving the ephemeral db/schema/params; add WithConnParam helper. - Fix migration idempotency test's per-pass row-count expectation. * fix(store): bound advisory-lock conn checkout + unlock with short timeout TryAdvisoryLock checked a connection out of the pool and ran the unlock on the full 55s scheduler-handler context (acquire) and an unbounded context.Background() (release). 
On a pool that could not promptly serve a healthy connection, db.Conn() blocked for the entire 55s before failing with 'context deadline exceeded' on every tick; with several singleton handlers firing each 60s tick, those long-blocked goroutines and their pending pool connection requests piled up across ticks and kept the pool jammed (checked out client-side, idle server-side). The unbounded unlock was a second leak vector: if the held connection died mid critical-section, ExecContext could hang forever, so conn.Close() never ran and the connection leaked out of the pool permanently. Bind both the acquire (db.Conn + pg_try_advisory_lock) and the release (pg_advisory_unlock) to a 5s timeout so a bad tick fails fast and retries next tick instead of parking a goroutine for ~55s, and so a dead connection can never block release from freeing the conn. Lock semantics are unchanged: cancelling the acquire context tears down only that context, not the checked-out session that holds the lock. * feat(migrate): in-process migration α (legacy raw-SQL hub.db → Ent) Upgrade a legacy raw-SQL Hub database (the ~53-migration, 30-table schema from the removed pkg/store/sqlite store) to the consolidated Ent-backed SQLite schema, in-process on first boot, behind an automatic backup. pkg/ent/entc/migrate_alpha.go: - IsLegacyRawSQLSchema: detect via the schema_migrations sentinel + the legacy-only agents.agent_id column (no-op for an Ent/empty/absent file). - MigrateAlphaSQLite: backup (checkpoint WAL + copy to hub.db.bak.<ts>), AutoMigrate a fresh Ent schema, ATTACH the legacy file, copy every table with INSERT…SELECT (foreign_keys OFF), verify per-table row counts, then atomically swap the migrated file into place. - Data-driven column mapping (created_at→created, updated_at→updated, agents.agent_id→slug, policies→access_policies); bespoke SQL for the group_members/policy_bindings polymorphic splits and surrogate ids; groups.parent_id→group_child_groups edge. 
- Deterministic UUIDv5 remap for legacy non-UUID primary keys (internal signing-key secrets; plugin runtime-broker ids) with consistent rewrite of every foreign-key reference via a TEMP _id_remap table. - Tolerates missing legacy tables (older schema versions). cmd/server_foreground.go: detect + migrate in initStore's sqlite path, with a --no-auto-migrate operator opt-out (cmd/server.go). Validated end-to-end against four production hub.db files (scion-integration, -integration2, -demo, -gteam): exact row-count parity (up to ~19k rows), every entity reads back through the live Ent store, idempotent re-runs, and broker FK references resolve post-remap. Pre-existing dangling agent created_by/owner_id refs are faithfully preserved (loader runs FK-off). * fix(config): apply real Postgres pool size (leaked SQLite default of 1 starved the pool) The struct-level default for Database.MaxOpenConns/MaxIdleConns is 1 — the value SQLite REQUIRES to serialize writes. applyDatabasePoolDefaults only bumped postgres to a real pool when the value was <= 0, but a postgres deployment configured via env/driver override inherits the embedded default of 1, so the guard never fired and the Ent pool ran with a SINGLE connection. Effect in production (both integration hubs): every singleton scheduler tick checks out the lone pool connection to hold its advisory lock, then blocks waiting for a second connection to do its work — a self-deadlock that resolves only at the 55s handler context deadline. All API requests serialize behind the one connection, so GET /api/v1/* served in ~55s across the board. Note env overrides could not paper over this: envKeyToConfigKey splits on every underscore, so SCION_SERVER_DATABASE_MAX_OPEN_CONNS maps to database.max.open.conns, not database.max_open_conns — silently ignored. Treat the leaked SQLite default (<= 1) as 'unset' for postgres so the pool default (10) applies; explicit sizing of 2+ is still respected. SQLite remains pinned to 1. 
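The three cases in the pool-size fix can be written as one small function. This is a sketch of the rule as the commit describes it, not the real `applyDatabasePoolDefaults`; `effectiveMaxOpenConns` and the constant name are hypothetical.

```go
package main

import "fmt"

// defaultPostgresMaxOpenConns is the Postgres pool default cited in the commit.
const defaultPostgresMaxOpenConns = 10

// effectiveMaxOpenConns is a hypothetical helper showing the sizing rule:
//   - postgres with a value <= 1 (unset, or the leaked SQLite default) gets the pool default
//   - postgres with an explicit value of 2 or more is respected
//   - any other driver (SQLite) stays pinned to a single connection
func effectiveMaxOpenConns(driver string, configured int) int {
	if driver == "postgres" {
		if configured <= 1 {
			return defaultPostgresMaxOpenConns
		}
		return configured
	}
	return 1
}

func main() {
	fmt.Println(effectiveMaxOpenConns("postgres", 1))
	fmt.Println(effectiveMaxOpenConns("postgres", 4))
	fmt.Println(effectiveMaxOpenConns("sqlite", 20))
}
```

The trade-off is that a Postgres deployment can no longer request a single-connection pool on purpose; the commit accepts that because a pool of one self-deadlocks as soon as a singleton tick holds its advisory-lock connection.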
Adds regression tests for all three cases. * feat(hub): per-process instanceID on Server (B1-1) Add a unique per-process instanceID to Server, generated at construction via uuid.NewString(). Optionally prefixed with POD_NAME env var for log readability, but uniqueness is always guaranteed by the UUID. This ID serves as the affinity key for broker dispatch (design §4.1) and is intentionally distinct from config.ResolveHubID, which is shareable across replicas. * feat(schema): affinity columns on runtime_brokers (B1-2) Add 3 nullable fields to the runtime_brokers ent schema and store model for tracking which hub instance holds the control-channel socket: - connected_hub_id (TEXT, optional/nullable) - connected_session_id (TEXT, optional/nullable) - connected_at (TIMESTAMPTZ, optional/nullable) Dialect-neutral (no Postgres-only annotations) — AutoMigrate works on both SQLite and CloudSQL Postgres per postgres-strategy.md §6.4. Wire the fields through the ent<->store conversion code in both directions (entBrokerToStore, CreateRuntimeBroker, UpdateRuntimeBroker). Regenerated ent code included. * feat(store): Claim/Release runtime-broker affinity CAS methods (B1-3) Mirrors UpdateRuntimeBrokerHeartbeat's lock_version CAS loop. - ClaimRuntimeBrokerConnection: newest-wins, sets affinity + status=online + heartbeat in one write - ReleaseRuntimeBrokerConnection: compare-and-clear, returns cleared=false (no-op) if affinity moved (disconnect-race fix) Tests cover claim/overwrite/clear/no-op + A->B flap (design 9.4). * fix(hub): thread sessionID through connect + fix onDisconnect clobber race (B1-4, B1-5) B1-4: HandleUpgrade returns sessionID; markBrokerOnline(brokerID, sessionID) now calls ClaimRuntimeBrokerConnection(brokerID, instanceID, sessionID), recording affinity + online + heartbeat in one CAS write. 
B1-5: SetOnDisconnect callback gains sessionID; the handler compare-and-clears via ReleaseRuntimeBrokerConnection and skips the offline stamp when affinity has moved (flap). removeConnection now only removes/fires for the matching session, so an old connection's teardown can't drop a newer live socket. * feat(schema): broker_dispatch intent table + messages dispatch-state (B2-1, B2-2) B2-1: new BrokerDispatch ent entity (table broker_dispatch) — id, broker_id, agent_id(null), agent_slug, project_id(null), op, args(JSON), state, result, claimed_by, attempts, error, created_at/updated_at, deadline_at(null); index (broker_id,state). store.BrokerDispatch model + state constants. B2-2: messages.dispatch_state (default 'pending') + dispatched_at; wired through store.Message + entadapter conversion/create. Dialect-neutral. * feat(hub): PostgresCommandBus LISTEN/NOTIFY signal listener on scion_broker_cmd (B2-4) Introduce a CommandBus interface and PostgresCommandBus implementation that listens on the new global channel scion_broker_cmd for broker dispatch wakeup signals. This is a sibling of PostgresEventPublisher, reusing the same connect/reconnect/keepalive helpers but maintaining its own independent pgx connection and pool (design §5.1). Key components: - PostgresCommandBus: LISTEN loop with backoff-reconnect on its own dedicated connection; filters signals by local broker ownership via an injected ownsLocally func (wired to ControlChannelManager.IsConnected); invokes an injected onSignal reconcile callback (to be wired to the reconcile drain in B2-5). - NotifyBrokerCmd: issues NOTIFY inside the caller's transaction so the signal commits atomically with the durable intent row (mirrors PublishTx). - NoopCommandBus: safe no-op for the SQLite backend (single-process, all brokers are local). - Backend selection in newCommandBus mirrors newEventPublisher: Postgres driver → PostgresCommandBus; otherwise → NoopCommandBus. 
- Server.SetCommandBus/CommandBus() setter/getter; cleanup in both Shutdown and CleanupResources paths. * feat(store): BrokerDispatch store methods + message dispatch CAS (B2-3) BrokerDispatchStore: Insert/Claim(CAS pending->in_progress)/Complete/Fail/ ListPendingDispatch + MarkMessageDispatched(CAS)/ListPendingMessages (via agent runtime_broker_id). Wired into CompositeStore + store.Store. Tests: concurrent claim single-winner (exactly-once), drain pending-only, message CAS dedupe, complete/fail transitions, pending-messages-by-broker-agent. * feat(hub): reconcile-on-connect drain wired to bus + markBrokerOnline (B2-5) Server.reconcileBroker drains pending broker_dispatch rows (CAS-claim -> exec -> done/fail) and pending messages (CAS MarkMessageDispatched -> deliver) for a broker this node owns. Exactly-once via store CAS; idempotent + concurrent-safe. Wired as durability backstop into markBrokerOnline (async on reconnect) and as the command-bus signal handler (SetOnSignal -> ReconcileBroker). Op executors are seams (executeDispatch/deliverMessage) that Phase 3/4 fill with local tunnel ops. * feat(hub): route() decision in HybridBrokerClient (B3-1) routeLocal (IsConnected, unchanged fast path) | routeForward (affinity owner alive) | routeHTTP (broker endpoint set) | routeUndeliverable. Affinity is a hint only (StoreAffinityLookup over connected_hub_id + last_heartbeat freshness), injectable for testing. Not yet wired into dispatch (B3-2 wires message path). Table-driven tests over all branches incl. local-precedence + nil-affinity. * feat(hub): cross-node message dispatch via route()+intent+signal+owner drain (B3-2, B3-3) Route-gate the message send path: HybridBrokerClient.MessageAgent now uses route(brokerID, endpoint) to decide delivery. routeLocal and routeHTTP follow existing paths unchanged. routeForward/routeUndeliverable return ErrMessageDeferred — the message row (already persisted with dispatch_state=pending) is the durable intent. 
All call sites (handleAgentMessage, set[], broadcastDirect, messagebroker, notifications, scheduler) catch the sentinel, emit a best-effort NOTIFY wakeup via SignalBrokerCmd, and return 202 Accepted (or log as deferred). Fill the deliverMessage seam in reconcile.go: resolves the agent from the message's AgentID, obtains the dispatcher, and calls DispatchAgentMessage for local tunnel delivery. reconcileBroker already CAS-marks dispatched before calling this. Wire SetAffinityLookup(StoreAffinityLookup(store, 0)) on the HybridBrokerClient in CreateAuthenticatedDispatcher so route() can return routeForward when another node owns the broker. Add SignalBrokerCmd to the CommandBus interface — a best-effort NOTIFY using the bus's own pool, used by the message path where the durable intent is the message row itself and the NOTIFY is only a wakeup hint. * feat(hub): lifecycle dispatch (rolling-timeout wait + cross-node start/stop/restart) (B4-1, B4-2) B4-1: Rolling-timeout wait helper (dispatch_wait.go) - waitForAgentTransition subscribes to agent.<id>.status events and loops with a rolling window (dispatchRollingTimeout=90s) that resets on ANY AgentStatusEvent (phase/activity/detail change). - Terminal phase → return phase, nil. Window expiry → ErrDispatchFailed. Context cancellation → ctx.Err(). - Caller subscribes BEFORE writing intent, passes the channel + unsub. B4-2: Cross-node start/stop/restart dispatch - Route-gated HybridBrokerClient.StartAgent/StopAgent/RestartAgent exactly like MessageAgent: routeLocal → control-channel tunnel (unchanged fast path), routeHTTP → HTTP fallback, routeForward/routeUndeliverable → ErrLifecycleDeferred. - Dispatch args structs (dispatch_args.go): StartDispatchArgs captures task, resolvedEnv, resolvedSecrets, inlineConfig, sharedDirs, sharedWorkspace, projectPath, projectSlug, harnessConfig. RestartDispatchArgs captures resolvedEnv. StopDispatchArgs is empty. All JSON-serializable for broker_dispatch.args column. 
- Owner-side executeDispatch (reconcile.go): start/stop/restart cases deserialize args, load agent from store, call local DispatchAgentStart/Stop/Restart via the dispatcher. Unknown ops (delete, finalize_env, etc.) still fail cleanly for B4-3/B4-4. Tests: waitForAgentTransition (terminal, error, rolling reset, silence expiry, context cancel, unsub); route-gating of Start/Stop/Restart returns ErrLifecycleDeferred when non-local; executeDispatch lifecycle cases invoke the local dispatcher; args round-trip (serialize→deserialize) is lossless; reconcile end-to-end lifecycle path. * feat(hub): wire originator-side cross-node lifecycle dispatch (B4-2 complete) The originator-side orchestration was missing: ErrLifecycleDeferred was returned by HybridBrokerClient but nothing caught it. Now the full cross-node start/stop/restart flow works transparently to all handler call sites. Originator side (HTTPAgentDispatcher): - DispatchAgentStart/Stop/Restart catch ErrLifecycleDeferred after env/secret resolution and invoke deferredLifecycle: 1. Subscribe("agent.<id>.status") BEFORE writing intent 2. InsertBrokerDispatch{op, agent_id, broker_id, args} 3. Best-effort SignalBrokerCmd (row is durable backstop) 4. waitForAgentTransition with terminal set per op 5. Return nil on success, error on error-phase/timeout - SetCrossNodeDeps(events, commandBus) wired in server.go's getOrCreateDispatcher, so all handler call sites get cross-node for free with synchronous semantics preserved. - Local path (routeLocal) is unchanged at zero added latency — no subscribe, no intent row, no wait. Args decision: owner RE-RESOLVES env/secrets via DispatchAgentStart (all hub instances share the same store + secret backend), so StartDispatchArgs carries only {Task}. RestartDispatchArgs and StopDispatchArgs are empty. This avoids serializing potentially large env/secrets into the DB while remaining correct because all hubs read from the same shared store. 
waitForAgentTransition refactored to a standalone function (no Server receiver) so the dispatcher can call it directly. Tests: - TestDeferredStart_WritesIntentAndWaits: deferred start writes a broker_dispatch row, waits, returns success on "running" event - TestDeferredStart_ReturnsErrorOnErrorPhase: error phase → error - TestLocalStart_SkipsIntentRow: local path calls tunnel directly, no intent row written - All existing tests pass (no regressions) * fix(hub): make web session replica-portable to fix OAuth state_mismatch OAuth login behind the load balancer intermittently failed with state_mismatch: the CSRF state token (and the entire web session) was stored in a gorilla FilesystemStore on the handling replica's local disk, while the browser only carried a session-ID cookie. When the LB routed /auth/login and /auth/callback to different replicas, the callback replica had no matching session file -> empty state -> state_mismatch. It only "worked" when both hops happened to hit the same backend. The same flaw affected the post-login session: sessionToBearerMiddleware reads the Hub access/refresh JWTs from that disk-local store on every API request, so sessions silently dropped whenever a follow-up request landed on a different replica. Replace the FilesystemStore with an encrypted, signed gorilla CookieStore so the whole session lives in the client's cookie and any replica sharing SESSION_SECRET can read it. Keys are derived deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte AES-256 encryption key, domain-separated). No DB, no migration; works with N replicas. The original switch to disk was motivated by a "JWT tokens exceed 4096 bytes" concern. Measured against the current compact HS256 tokens the full session (identity + access + refresh) encodes to ~2.6 KB, well under the browser's ~4 KB per-cookie cap, so the securecookie length limit is left in force (oversize would now error+log, not silently drop). 
Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica round-trip regression test (cookie minted by replica A decodes on replica B with the same secret; carries OAuth state + post-login tokens) plus a negative test (a different secret cannot decode the cookie). * feat(hub): cross-node delete + create-time data ops dispatch (B4-3, B4-4) Route-gate HybridBrokerClient.DeleteAgent, CheckAgentPrompt, CreateAgentWithGather, and FinalizeEnv through route() so routeForward/routeUndeliverable return ErrLifecycleDeferred (matching start/stop/restart pattern from B4-2). B4-3 (delete dispatch): - deferredDelete on ErrLifecycleDeferred: subscribe broker.dispatch.<id>.done → InsertBrokerDispatch{op:delete} → SignalBrokerCmd → waitForDispatchDone (reads DB row, authoritative). - Owner executeDispatch case "delete": deserializes DeleteDispatchArgs → local DispatchAgentDelete (idempotent, 404 ok). - DeleteDispatchArgs struct + UnmarshalDeleteArgs for args round-trip. B4-4 (create-time data ops): - deferredDataOp/deferredDataOpResult: common originator flow for ops that return results via the dispatch row (design §6.3). Subscribe to broker.dispatch.<id>.done BEFORE writing intent, insert dispatch, signal, waitForDispatchDone, read result from GetBrokerDispatch. - deferredCheckPrompt: returns bool from CheckPromptResult in row. - deferredFinalizeEnv: fire-and-forget via deferredDataOp. - deferredCreateWithGather: returns envRequirements from row result. - Owner executeDispatch cases: check_prompt, finalize_env, create — run local op, marshal result JSON, return it. - PublishDispatchDone on EventPublisher: slim completion event broker.dispatch.<id>.done emitted by reconcile loop on complete/fail. - waitForDispatchDone: event-driven wait with bounded re-read at rolling timeout (missed event recovery, design §6.3). - GetBrokerDispatch added to BrokerDispatchStore interface + entadapter. Local fast path unchanged (routeLocal → zero added latency). 
* feat(hub): stale-affinity + stuck-dispatch reaper singleton (B5-1) * feat(hub): pending-message sweep + dispatch metrics (B5-2) Add observability for the multi-node broker dispatch pipeline: Sweep: - CountStuckPendingMessages store method (messages pending > threshold) - brokerMessageSweepHandler registered as RecurringSingleton with LockBrokerMessageSweep (0x5C100007), runs every 1m Metrics (pkg/observability/dispatchmetrics): - Counters: dispatch published/claimed/done/failed, message dispatched - Gauge: message stuck (pending beyond 5m threshold) - Histograms: intent-to-done latency, reconcile drain duration - Counter: command bus reconnects Emit sites: - InsertBrokerDispatch → IncPublished (httpdispatcher.go) - ClaimBrokerDispatch → IncClaimed (reconcile.go) - CompleteBrokerDispatch → IncDone + RecordDispatchLatency (reconcile.go) - FailBrokerDispatch → IncFailed (reconcile.go) - MarkMessageDispatched → IncMessageDispatched (reconcile.go) - reconcileBroker → RecordReconcileDrainDuration (reconcile.go) - command bus reconnect → IncCmdBusReconnects (command_bus.go) - sweep handler → ObserveMessageStuck (sweep.go) * fix(hub): derive JWT signing keys from shared SESSION_SECRET to fix cross-replica login loop The cookie-store fix (0515e2a8) made the web session replica-portable, but the Hub JWT *inside* the cookie is still signed with a per-replica key: ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and hubID = sha256(hostname)[:12]. The integration env runs two replicas of one logical hub behind a single LB, sharing one Postgres DB and one SESSION_SECRET but with different hostnames -> different hubIDs -> different HS256 signing keys. 
So a user JWT minted on replica A failed signature verification on replica B (go-jose: error in cryptographic primitive); refresh failed too (refresh token signed with the same foreign key), so sessionToBearerMiddleware declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1) and returned session_expired. The cookie deletion turns it into a redirect loop: dashboard flashes, then /login?error=session_expired. Fix: extend the 0515e2a8 approach (replica-portable via the shared secret) from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret; when set, ensureSigningKey derives the agent and user signing keys deterministically from it (domain-separated by key name) and bypasses per-host secret-backend storage. cmd feeds the same --session-secret / SESSION_SECRET value into both the web cookie store and the hub config via a new resolveSessionSecret() helper. Empty secret keeps the existing per-hub behavior (no regression for single-node/local dev). Tests: cross-replica round trip (different hubID + same secret -> identical keys, token minted on A validates on B; different secret cannot) plus pre-configured-key precedence. Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so existing web/CLI tokens are invalidated once and users re-login. 
* docs: project log for B5-3 chaos gate — GB5 PASSED (GA gate for broker dispatch) * fix(hub): align fakeHTTPClient.CleanupProject with interface (3 params, not 4) * fix(hub): address PR #305 review feedback - server_migrate.go: use nil-checked deferred close for src DB, and explicitly close src before dropSQLiteFile to prevent Windows sharing violations - server_migrate.go: handle file:// prefix before file: to correctly parse file:///path/to/db URLs - server_foreground.go: evaluate GetControlChannelManager() inside the ownsLocally closure to avoid capturing a stale nil value - server_migrate_test.go: add test case for file:/// URL format - server_test.go: sanitize t.Name() slashes in newTestStore to prevent SQLite path errors in subtests * docs: add project log for PR #305 review feedback fixes * fix(hub): prevent duplicate message delivery, guard dispatch state transitions C1: Call MarkMessageDispatched after successful local dispatch in messagebroker.go and handlers.go (single-recipient, set[], broadcast). Without this, successfully dispatched messages remained dispatch_state=pending and were re-delivered on every broker reconnect via reconcileBroker. C2: Return immediately in messagebroker.go deliverToAgent when CreateMessage fails — without a durable row, a deferred signal has nothing for the owning node to reconcile. C3: Guard CompleteBrokerDispatch and FailBrokerDispatch with state=in_progress CAS predicate so a done dispatch cannot be flipped to failed or vice versa. Update tests to claim before completing/failing to match the new CAS guard. 
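The C3 guard amounts to making every transition a compare-and-set on the current state. Below is an in-memory sketch of that idea; `dispatchStore` is hypothetical, and the real store applies the same predicate in a database write rather than under a mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// dispatchStore is a hypothetical in-memory stand-in for the broker_dispatch table.
type dispatchStore struct {
	mu    sync.Mutex
	state map[string]string
}

// transition moves id from one state to another only if it is currently in
// the expected state, and reports whether the write happened.
func (s *dispatchStore) transition(id, from, to string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.state[id] != from {
		return false
	}
	s.state[id] = to
	return true
}

func (s *dispatchStore) Claim(id string) bool    { return s.transition(id, "pending", "in_progress") }
func (s *dispatchStore) Complete(id string) bool { return s.transition(id, "in_progress", "done") }
func (s *dispatchStore) Fail(id string) bool     { return s.transition(id, "in_progress", "failed") }

func main() {
	s := &dispatchStore{state: map[string]string{"d1": "pending"}}
	fmt.Println("complete before claim:", s.Complete("d1"))
	fmt.Println("claim:", s.Claim("d1"))
	fmt.Println("complete:", s.Complete("d1"))
	fmt.Println("fail after done:", s.Fail("d1"))
	fmt.Println("final state:", s.state["d1"])
}
```

This also shows why the tests had to change: a dispatch that was never claimed is still pending, so Complete and Fail now report a no-op until Claim has run.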
* fix(hub): reconcile broker→eventbus and hub-native→hub-managed renames after rebase

Post-rebase fixups to align the feature branch with main's refactoring:
- broker package → eventbus package rename (types, imports, methods)
- SetRecipient → GroupRecipient, SetMessageResponse → GroupMessageResponse
- hubNativeProjectPath → hubManagedProjectPath
- ProjectTypeHubNative → ProjectTypeHubManaged
- populateAgentConfig gains ctx parameter
- Add missing handleResourcesImport and handleMessageChannels handlers
- Add ListChannels method to MessageBrokerProxy
- Wire newCommandBus in server_foreground.go
- Restore main's test fixtures for renamed APIs

---------

Co-authored-by: scion-gteam[bot] <271067763+scion-gteam[bot]@users.noreply.github.com>
Co-authored-by: Scion <agent@scion.dev>
…A Docker + Model B GKE) (GoogleCloudPlatform#306)
…GoogleCloudPlatform#303) * fix: atomic session-guarded broker disconnect to prevent reconnect race (GoogleCloudPlatform#131) The onDisconnect callback previously used separate ReleaseRuntimeBrokerConnection and UpdateRuntimeBrokerHeartbeat calls. When a broker disconnects and reconnects rapidly, the stale disconnect's offline stamp can clobber the new connection's online status because UpdateRuntimeBrokerHeartbeat has no session guard — it unconditionally overwrites status. Provider statuses are also clobbered and never restored by heartbeats, leaving the broker permanently invisible until hub restart. Add ReleaseAndMarkBrokerOffline which atomically clears affinity AND stamps status=offline in a single CAS write. If a concurrent reconnect has already claimed the broker with a new session, the compare fails and the callback is a no-op. Also add a re-check guard before updating provider statuses. * docs: add project log for broker disconnect race fix unification
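The session guard in that fix can be illustrated with a small in-memory model. This is a sketch under assumptions: `brokerStore`, `brokerRecord`, and the lower-case `releaseAndMarkOffline` are hypothetical stand-ins, and a mutex plays the role of the single CAS write that the real `ReleaseAndMarkBrokerOffline` performs against the database.

```go
package main

import (
	"fmt"
	"sync"
)

// brokerRecord is a hypothetical row: which session owns the broker, and its status.
type brokerRecord struct {
	SessionID string
	Status    string
}

type brokerStore struct {
	mu      sync.Mutex
	brokers map[string]*brokerRecord
}

// Connect claims the broker for a new session and marks it online.
func (s *brokerStore) Connect(brokerID, sessionID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.brokers[brokerID] = &brokerRecord{SessionID: sessionID, Status: "online"}
}

// releaseAndMarkOffline clears affinity and stamps offline in one step,
// but only if the broker is still owned by the disconnecting session.
// It reports whether the write was applied.
func (s *brokerStore) releaseAndMarkOffline(brokerID, sessionID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	b, ok := s.brokers[brokerID]
	if !ok || b.SessionID != sessionID {
		return false // a newer session owns the broker: stale disconnect is a no-op
	}
	b.SessionID = ""
	b.Status = "offline"
	return true
}

func main() {
	s := &brokerStore{brokers: map[string]*brokerRecord{}}
	s.Connect("broker-1", "session-A")
	s.Connect("broker-1", "session-B") // rapid reconnect wins the row

	applied := s.releaseAndMarkOffline("broker-1", "session-A") // late disconnect
	fmt.Println(applied, s.brokers["broker-1"].Status)
}
```

Because the ownership check and the status write happen together, the stale callback cannot overwrite the status set by the newer session.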
…rm#301) * docs(design): reduced resource clone/delete design (resolved review) * refactor: remove dead Locked field from Template and HarnessConfig models Remove the Locked bool field, all 16 enforcement sites across 6 handler files, the force query parameter from delete endpoints, 3 locked-template tests, and add a DB migration to drop the column. No production code ever set Locked=true — this simplifies the handlers for the upcoming clone/delete feature. * feat: add harness-config clone endpoint, authz hardening, and slug uniqueness - Add handleHarnessConfigClone mirroring template clone - Add CheckAccess authz to deleteTemplateV2, handleTemplateClone, deleteHarnessConfig, handleHarnessConfigClone - Add DB migration V55: UNIQUE constraint on (slug, scope, scope_id) - Return 409 Conflict on slug collision during clone - Add clone failure cleanup - Add tests for clone, authz, and slug collision * feat(web): add Clone/Delete row actions and clone-from-global to resource list - Add Clone and Delete action menu to shared resource-list component - Add delete confirmation dialog with deleteFiles checkbox (default on) - Add clone dialog with name input and 409 collision handling - Add clone-from-global picker in project settings view - Unify on resource-changed event (migrate resource-imported) - Gate actions on capabilities (canClone, canDelete properties) * fix: address PR review — cleanup orphaned files on DB create failure, remove redundant clone method - Add stor.DeletePrefix cleanup when CreateTemplate/CreateHarnessConfig fails after files were already copied (prevents orphaned storage files) - Remove redundant confirmCloneFromGlobal method — confirmClone already handles cross-scope clone via the component's scope/scopeId properties * fix: adapt Locked removal and slug constraint to Ent-based schema Remove Locked references from entadapter, remove stale sqlite.go (replaced by Ent ORM upstream), add UNIQUE(slug, scope, scope_id) to Ent schema indexes, and 
regenerate Ent code. * fix: adapt tests and entadapter for Ent-based store (UUID IDs, no Locked) - Use api.NewUUID() for all test entity IDs (Ent enforces UUID format) - Remove Locked field from entadapter create/update calls - Remove stale sqlite.go (replaced by Ent ORM upstream) - Add UNIQUE(slug, scope, scope_id) to Ent schema indexes
…form#309) * fix(hub): make web session replica-portable to fix OAuth state_mismatch OAuth login behind the load balancer intermittently failed with state_mismatch: the CSRF state token (and the entire web session) was stored in a gorilla FilesystemStore on the handling replica's local disk, while the browser only carried a session-ID cookie. When the LB routed /auth/login and /auth/callback to different replicas, the callback replica had no matching session file -> empty state -> state_mismatch. It only "worked" when both hops happened to hit the same backend. The same flaw affected the post-login session: sessionToBearerMiddleware reads the Hub access/refresh JWTs from that disk-local store on every API request, so sessions silently dropped whenever a follow-up request landed on a different replica. Replace the FilesystemStore with an encrypted, signed gorilla CookieStore so the whole session lives in the client's cookie and any replica sharing SESSION_SECRET can read it. Keys are derived deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte AES-256 encryption key, domain-separated). No DB, no migration; works with N replicas. The original switch to disk was motivated by a "JWT tokens exceed 4096 bytes" concern. Measured against the current compact HS256 tokens the full session (identity + access + refresh) encodes to ~2.6 KB, well under the browser's ~4 KB per-cookie cap, so the securecookie length limit is left in force (oversize would now error+log, not silently drop). Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica round-trip regression test (cookie minted by replica A decodes on replica B with the same secret; carries OAuth state + post-login tokens) plus a negative test (a different secret cannot decode the cookie). 
* fix(hub): derive JWT signing keys from shared SESSION_SECRET to fix cross-replica login loop The cookie-store fix (0515e2a) made the web session replica-portable, but the Hub JWT *inside* the cookie is still signed with a per-replica key: ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and hubID = sha256(hostname)[:12]. The integration env runs two replicas of one logical hub behind a single LB, sharing one Postgres DB and one SESSION_SECRET but with different hostnames -> different hubIDs -> different HS256 signing keys. So a user JWT minted on replica A failed signature verification on replica B (go-jose: error in cryptographic primitive); refresh failed too (refresh token signed with the same foreign key), so sessionToBearerMiddleware declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1) and returned session_expired. The cookie deletion turns it into a redirect loop: dashboard flashes, then /login?error=session_expired. Fix: extend the 0515e2a approach (replica-portable via the shared secret) from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret; when set, ensureSigningKey derives the agent and user signing keys deterministically from it (domain-separated by key name) and bypasses per-host secret-backend storage. cmd feeds the same --session-secret / SESSION_SECRET value into both the web cookie store and the hub config via a new resolveSessionSecret() helper. Empty secret keeps the existing per-hub behavior (no regression for single-node/local dev). Tests: cross-replica round trip (different hubID + same secret -> identical keys, token minted on A validates on B; different secret cannot) plus pre-configured-key precedence. Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so existing web/CLI tokens are invalidated once and users re-login. --------- Co-authored-by: Scion <agent@scion.dev>
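Both fixes rely on the same idea: every replica derives its keys deterministically from the shared secret, with a distinct label per key so the keys differ from each other. The sketch below uses HMAC-SHA256 over a label as the derivation. That choice, the `deriveKey` helper, and the label strings are assumptions for illustration; the commit messages do not say which construction the hub actually uses.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// deriveKey returns a 32-byte key bound to both the shared secret and a label.
// Hypothetical helper: the label provides domain separation between keys.
func deriveKey(secret []byte, label string) []byte {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(label))
	return mac.Sum(nil)
}

func main() {
	secret := []byte("example-session-secret") // stands in for SESSION_SECRET

	// Two replicas with different hostnames but the same secret.
	replicaA := deriveKey(secret, "user-signing-key")
	replicaB := deriveKey(secret, "user-signing-key")
	agentKey := deriveKey(secret, "agent-signing-key")

	fmt.Println("same key across replicas:", hmac.Equal(replicaA, replicaB))
	fmt.Println("agent key differs from user key:", !hmac.Equal(replicaA, agentKey))
	fmt.Println("key length:", len(replicaA))
}
```

Since nothing host-specific enters the derivation, a token signed on one replica verifies on any other replica configured with the same secret.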
…events (GoogleCloudPlatform#312)

A rapid session.start → session.end sequence from a spurious sciontool could permanently reset an agent's phase even while the agent works normally. This adds two guards:

1. Phase regression guard: rejects transitions that would move an agent backward in its forward-progress lifecycle (e.g. running → starting) in both the status update handler and broker heartbeat handler.
2. Activity-driven phase auto-correction: when an activity that implies the agent is running (working, thinking, executing, etc.) arrives but the phase is pre-running, auto-promotes the phase to running.

Fixes GoogleCloudPlatform#124
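The two guards can be sketched as a pair of small pure functions. The phase ordering, the `created` phase name, and both function names are assumptions for this example; the commit message confirms only the running → starting case and the working/thinking/executing activities.

```go
package main

import "fmt"

// phaseRank orders the forward-progress phases (hypothetical ordering).
// Phases not listed here are never treated as regressions by this sketch.
var phaseRank = map[string]int{"created": 0, "starting": 1, "running": 2}

// runningActivities lists activities that imply the agent is running.
var runningActivities = map[string]bool{"working": true, "thinking": true, "executing": true}

// isRegression reports whether cur -> next moves backward in the lifecycle.
func isRegression(cur, next string) bool {
	c, okCur := phaseRank[cur]
	n, okNext := phaseRank[next]
	return okCur && okNext && n < c
}

// phaseForActivity promotes a pre-running phase to running when the
// reported activity implies the agent is already doing work.
func phaseForActivity(phase, activity string) string {
	if !runningActivities[activity] {
		return phase
	}
	if rank, ok := phaseRank[phase]; ok && rank < phaseRank["running"] {
		return "running"
	}
	return phase
}

func main() {
	fmt.Println(isRegression("running", "starting"))    // backward move is rejected
	fmt.Println(isRegression("starting", "running"))    // forward move is allowed
	fmt.Println(phaseForActivity("starting", "working")) // promoted to running
	fmt.Println(phaseForActivity("starting", "idle"))    // left unchanged
}
```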
…GoogleCloudPlatform#313) Also unset SCION_PROJECT_ID when clearing hub context env vars, since IsHubContext() checks all four env vars and a leftover SCION_PROJECT_ID causes FindProjectRoot() to return a synthetic path instead of failing.
…tform#311) * Fix agent list task overflow and unify action buttons Task cell in list view used inline span styling that silently ignored max-width/overflow constraints, allowing long task text to push action buttons off-screen. Switch to display:-webkit-box with line-clamp:2 so text wraps to at most two lines with ellipsis. Card view action buttons now render icon-only (matching list view), with sl-tooltip and aria-label for accessibility. Both views share a single renderActionButtons helper, eliminating the duplicated button logic. Color-coded hover effects added to action buttons in both views: red for stop/delete, amber for suspend, green for resume/start. Closes GoogleCloudPlatform#134 Closes GoogleCloudPlatform#135 * Fix agent list task overflow and unify action buttons Task cell in list view used inline span styling that silently ignored max-width/overflow constraints, allowing long task text to push action buttons off-screen. Switch to display:-webkit-box with line-clamp:2 so text wraps to at most two lines with ellipsis. Card view action buttons now render icon-only (matching list view), with sl-tooltip and aria-label for accessibility. Both views share a single renderActionButtons helper, eliminating the duplicated button logic. Color-coded hover effects use translucent rgba backgrounds that work in both light and dark mode: red for stop/delete, amber for suspend, green for resume/start. Closes GoogleCloudPlatform#134 Closes GoogleCloudPlatform#135 * Add before/after screenshots for PR review Screenshots captured from the real running app (Vite dev server + fetch mock for agent data). Shows before/after for both issues in light mode and dark mode. * Fix hover on disabled buttons and tooltip on disabled terminal Add :not([disabled]) to hover CSS selectors so color-coded hover effects don't apply to disabled action buttons. 
Wrap the Terminal button in an inline-flex span inside sl-tooltip so the tooltip remains accessible even when the button has pointer-events:none.
* docs(design): auth proxy mode (Google IAP) architecture Add design for an exclusive proxy human-auth mode that derives the user from a verified Google IAP signed header (X-Goog-IAP-JWT-Assertion), reusing the existing domain/allowlist/admin provisioning controls. Also specifies a hub-minted transport-auth layer (dedicated SA, generalizing PR GoogleCloudPlatform#307) so agents can traverse the IAP / Cloud Run-invoker front door, with a generalized array-based token refresh. * refactor(hub): extract provisionUser, dedupe OAuth find-or-create Extract the duplicated find-or-create-user block from four OAuth handlers (handleAuthLogin, handleAuthToken, handleCLIAuthToken, completeOAuthLogin) into a single provisionUser method on Server. The new method encapsulates: 1. Authorization check (isUserAuthorized) with audit logging 2. GetUserByEmail / CreateUser (find-or-create) 3. Profile backfill (DisplayName, AvatarURL when empty) 4. Admin promotion (when admin list changes) 5. Hub membership enrollment (ensureHubMembership) Introduces ExternalUserInfo struct (decoupled from OAuthUserInfo) and ErrAccessDenied sentinel error for caller-side HTTP response mapping. This is Phase 0 of the auth-proxy-mode feature — pure refactor with no behavior change. The proxy middleware (Phase 1) will call the same provisionUser method. NOTE: No suspended-user check is added. The existing OAuth flow does not check user.Status == "suspended" either; adding it here would change behavior. This gap is documented for Phase 1. * docs(project-log): record provisionUser extraction findings * feat(auth): implement proxy auth mode with IAP JWT verification (Phase 1) Add exclusive proxy auth mode for Google IAP signed-header authentication: - pkg/hub/proxyauth.go (NEW): ProxyAuthenticator interface, IAPAuthenticator with ES256 JWT verification via go-jose/v4, JWKS lazy-fetch cache with periodic refresh + on-miss refresh for unknown kids + transient failure tolerance (last-good keys). 
- pkg/config: auth.mode selector (oauth|proxy|dev), auth.proxy section with provider/iap.audience/overrides in both DevAuthConfig (GlobalConfig) and V1AuthConfig (settings.yaml). Wire conversion in both directions. - pkg/hub/auth.go: Replace IP-only extractProxyUser branch with ProxyAuthenticator path. Add 60s resolution cache (ProxyUserCache) wrapping provisionUser — signature verification runs every request, only the store lookup is cached. Legacy extractProxyUser preserved when no authenticator is configured. - pkg/hub/handlers_auth.go: Add suspended-user gate to provisionUser — rejects Status=="suspended" with ErrUserSuspended. This is an intentional behavior change sanctioned by the design doc, closing the pre-existing OAuth suspended-login gap documented in Phase 0. - pkg/hub/web.go: In proxy mode, handleAuthProviders returns no OAuth providers; handleLogout redirects to IAP's clear_login_cookie endpoint. - cmd/server_foreground.go: Construct IAPAuthenticator when mode==proxy && provider==iap, wire into ServerConfig.ProxyAuth. Security: audience binding is mandatory; only the signed JWT assertion is authoritative (X-Goog-Authenticated-User-* headers ignored); clock skew ±30s; JWKS cache handles key rotation and transient fetch failures. 
* test(auth): add comprehensive IAPAuthenticator unit tests Tests using self-generated ES256 key pair + httptest JWKS server: - Valid assertion -> correct ProxyUserInfo (subject/email stripped, lowercased) - Bad signature -> error - Wrong audience -> error (mandatory binding) - Wrong issuer -> error - Expired token (past 30s skew) -> error - Missing header -> (nil, nil) fall-through - Unknown kid triggers JWKS refresh and succeeds - Custom issuer override for testing - HD (hosted domain) claim extraction - Email lowercasing - JWKS cache transient failure tolerance (serves last-good keys) * style: fix gofmt formatting in proxyauth_test.go and settings_v1.go * docs(project-log): record auth-proxy-mode Phase 1 implementation * config: add auth.transport config for outbound transport auth Add TransportAuthConfig (hub_config.go) and V1TransportConfig (settings_v1.go) for the transport-layer auth that lets agents traverse IAP / Cloud Run invoker front doors. Config supports mode (none|cloudrun_invoker|iap), oidcAudience, and platformAuthSA fields. Wire into V1↔GlobalConfig conversion and env key mapping. Phase 2 item 6 of auth-proxy-mode. * hub: add TransportTokenMinter interface and implementations Introduce the TransportTokenMinter interface for minting Google OIDC ID tokens that let agents traverse platform guards (IAP / Cloud Run invoker). Three implementations: - gcpTransportMinter: production impl using IAM Credentials API (generateIdToken) to impersonate a dedicated platform-auth SA. Uses already-vendored google.golang.org/api/iamcredentials/v1. - noopTransportMinter: returns error when transport auth is disabled. - FakeTransportMinter: exported test double for other packages. Also adds RefreshTokenEntry type for the generalized tokens[] array and parseJWTExpiry for extracting expiry from ID tokens. All tests pass with no live GCP dependency (httptest fakes). Phase 2 item 6 of auth-proxy-mode. 
* hub: wire transport token minter into ServerConfig and dispatch Add TransportMode, TransportAudience, TransportMinter fields to ServerConfig and wire them through to the Server struct and HTTPAgentDispatcher. Transport tokens are injected as env vars (SCION_TRANSPORT_TOKEN, SCION_TRANSPORT_AUDIENCE, SCION_TRANSPORT_TOKEN_EXPIRY) into agent dispatch payloads in all three dispatch paths (Create, Start, Restart). server_foreground.go constructs a gcpTransportMinter from auth.transport config, deriving audience from hubEndpoint for cloudrun_invoker mode. When transport mode is "none" or unset, no minter is created and no transport tokens are injected — zero impact on existing deployments. Phase 2 item 6 of auth-proxy-mode. * hub: extend token refresh response with generalized tokens[] array The agent token refresh handler now returns a tokens[] array alongside the existing token/expires_at fields for backward compatibility. Old clients ignore tokens[]; new clients use it to apply both app-layer and transport-layer tokens. When transport auth is configured (transportMinter != nil), the response includes a google_oidc transport token entry with the configured audience. When disabled, only the app scion_access entry appears. Transport token minting errors are logged but don't fail the refresh — the app token is always returned. Phase 2 item 7 of auth-proxy-mode. * sciontool: add pluggable OIDC transport for agent outbound auth Implement the agent-side transport-layer auth with two pluggable token sources: - injectedTokenSource: uses the hub-provided SCION_TRANSPORT_TOKEN env var (cold start), then refreshed via the tokens[] array on subsequent refresh calls. - metadataTokenSource: fetches OIDC from the GCE metadata server (passthrough/on-GCE mode, the PR GoogleCloudPlatform#307 pattern). Selection logic: SCION_TRANSPORT_TOKEN env → injected mode; else if on GCE → metadata mode; else → no OIDC transport. 
The oidcTransport RoundTripper injects Authorization: Bearer on outbound hub requests. Graceful degradation: if token fetch fails, the request proceeds without the header (the hub can still auth via X-Scion-Agent-Token). Client changes: - Add oidcSource field and configureOIDCTransport() in NewClient() - Update RefreshTokenResponse with tokens[] array (backward compat) - RefreshToken() applies transport tokens via applyRefreshTokens() - Refresh scheduling uses shortest-lived entry (5-min margin for transport tokens vs 2h for scion tokens) 23 new tests covering both sources, transport, configuration, end-to-end dual-header, and refresh token application. Phase 2 item 8 of auth-proxy-mode. * docs(project-log): record auth-proxy-mode Phase 2 implementation * docs: add IAP proxy auth deployment guide (Phase 3) Add comprehensive deployment documentation for the IAP + Cloud Run invoker topology, covering inbound human IAP authentication, outbound agent transport auth (dual-layer OIDC + scion token), security considerations, and an end-to-end GCP setup checklist. All config keys and env vars verified against shipped code. * fix: prevent JWKS cache stampede and add HTTP client timeout - resolveHTTPClient() now returns a client with 10s timeout instead of http.DefaultClient (which has no timeout), preventing hangs on JWKS fetches. Tests that inject their own HTTPClient are unaffected. - JWKS cache refresh now debounces on lastAttempted (set at the start of every attempt, success or failure) instead of lastFetched (success only). This prevents stampedes during persistent JWKS outages where every cache-miss would trigger an unbounded refresh. - Added a refreshing guard to prevent concurrent in-flight refreshes (proactive background refresh + synchronous miss-refresh could race). - Network I/O is now performed outside the write lock to avoid holding the mutex across HTTP requests. 
- Added TestJWKSCache_StampedePreventionDuringOutage to verify that repeated misses during an outage do not cause repeated fetches within the debounce window. * fix: replace custom splitJWT with strings.Split and cache IAM service - Replace the hand-rolled splitJWT function with strings.Split(token, "."). Behavior is identical for well-formed JWTs; the custom function is deleted. - Cache the IAM credentials service client in gcpTransportMinter using sync.Once so it is created once and reused across MintIDToken calls instead of creating a new HTTP client/service on every invocation. Uses context.Background() for the long-lived client construction; per-call ctx continues to be passed to .Context(ctx).Do(). FakeTransportMinter is unaffected.
…oogleCloudPlatform#302) * fix: resolve workspace file browser to groves/ instead of projects/ The Hub UI file browser was showing the wrong directory contents. The hubManagedProjectPath() function resolved workspace paths to ~/.scion/projects/<slug>/ (project metadata) instead of ~/.scion/groves/<slug>/ (the actual git checkout mounted as /workspace in agents). Reverse the lookup priority: check groves/ first, fall back to projects/, and default to groves/ when neither has content. Fixes GoogleCloudPlatform#130 * docs: add project log for issue GoogleCloudPlatform#130 workspace path fix * fix: guard hubManagedProjectPath against empty slug Prevent hubManagedProjectPath from resolving to the parent directory when called with an empty slug. Add unit test for this case.
…by/owner_id) The Agent Ent schema modeled created_by/owner_id as foreign keys to the users table. When an agent creates a sub-agent, those columns hold the *creating agent's* ID, which has no users-table row, so Postgres rejected the insert with a foreign-key violation. mapError maps that to ErrInvalidInput, surfacing as a detail-free "validation_error: Invalid input (status: 400)" on every agent-initiated `scion start`. User-created agents were unaffected, masking the regression (introduced when GoogleCloudPlatform#304 ported the agent store onto Ent). created_by/owner_id are polymorphic principal references (user OR agent), like ancestry. Drop the User-typed edges and keep them as plain principal UUID fields; resolve the delegation creator by ID and tolerate "no such user". Atlas AutoMigrate drops the two FK constraints on existing DBs at next boot. Tests: the sole sub-agent creation test only passed because it seeded a fake user row sharing the agent's ID — an impossible production state. Remove that workaround so it exercises the real path, and add store/ent regression tests asserting a non-user principal ID is accepted.
…o agent containers (GoogleCloudPlatform#322) * Add sciontool doctor and agent auth reset infrastructure When an agent's hub JWT expires and the refresh loop fails (e.g. hub signing key rotation), the agent becomes a zombie: running locally but invisible to the hub. This adds two features to diagnose and recover: 1. `sciontool doctor` command — runs inside the agent container to check env vars, token validity/expiry, hub connectivity, auth status, and GCP metadata/GitHub token health. Prints actionable remediation. 2. Auth reset mechanism — allows pushing a fresh token into a running agent without restarting. The flow is: - Hub generates a new agent JWT via DispatchAgentResetAuth - Broker's /reset-auth endpoint writes the token file via exec - Broker sends SIGUSR2 to sciontool init (PID 1) - Init re-reads the token, updates the hub client, restarts the token refresh loop, and sends an immediate heartbeat Also adds Client.SetToken() for in-memory token updates. * Add scion reset-auth CLI command and hub API endpoint Adds the user-facing `scion reset-auth <agent>` command that triggers an auth reset on a running agent via the Hub. Also adds: - Hub handler for POST /api/v1/agents/{id}/reset-auth - hubclient AgentService.ResetAuth() method --------- Co-authored-by: Scion Agent (eng-manager) <agent@scion.dev>
Adds a "Reset Auth" button in the agent detail header actions area,
visible when the agent is running. Clicking it calls the hub's
POST /api/v1/agents/{id}/reset-auth endpoint, which generates a
fresh JWT and pushes it into the running container without restart.
GoogleCloudPlatform#323) * Make SIGUSR2 signal best-effort in reset-auth handler The kill -USR2 step can fail (e.g. PID 1 is not sciontool init, or the process doesn't handle the signal). Since the token file write already succeeded and the refresh loop will pick up the new token without the signal, treat signal failure as a warning rather than returning a 500 error. * Add admin bulk reset-auth endpoint POST /api/v1/admin/agents/reset-auth-all lists all running agents and dispatches an auth reset for each, returning a per-agent success/failure summary. Admin role required. * Add Reset Auth All button to admin maintenance page Adds a Quick Actions section with a "Reset Auth — All Running Agents" button that calls POST /api/v1/admin/agents/reset-auth-all and displays a per-agent success/failure summary inline. --------- Co-authored-by: Scion Agent (eng-manager) <agent@scion.dev>
…gleCloudPlatform#340) The _linkedDiscordId @State() property was declared and assigned but never read in the template, causing TS6133 and failing the Verify Web Types CI step.
…udPlatform#342) * feat(agent-viz): add Agent Communications transcript panel Render a scrolling, human-readable transcript of inter-agent messages alongside the force-graph. The panel consumes the same `message` playback events that already drive the on-graph pulse lines, so it stays in sync with playback and is rebuilt from the snapshot on seek — no extra data source or backend change required. Broadcasts are highlighted and de-duplicated (by event time), timestamps are shown relative to playback start, and the panel is collapsible. - web/src/comms.ts: new CommsPanel component - web/src/main.ts: wire panel into the message / snapshot / reset paths - web/index.html: panel styling - README.md: document the panel * fix(agent-viz): avoid layout thrashing in CommsPanel during snapshot replay On seek, addMessage runs in a tight loop with animate=false. Reading scrollTop/clientHeight/scrollHeight before each appendChild forced a synchronous reflow per message — layout thrashing that can freeze the UI on large logs. The measurement is now skipped when not animating, and a single scroll-to-bottom is deferred to requestAnimationFrame after the replay loop (tracked and cancelled on reset).
Move the entire Provision body into a vendor-agnostic free function provisionShared(in ProvisionInput) error in workspace_provision.go.

Relocate helpers as free functions (drop nfsBackend receiver):
- acquireProvisionLock, gitCloneWorkspace, ensureWorktree, chownProjectTree, resolveUID, resolveGID
- Constants: provisionSentinelFile, provisionLockRetries, provisionLockRetryDelay
- Free helpers: writeSentinel, sanitizeBranchName (already free, just relocated for cohesion)

resolveUID/resolveGID now default NFSUID/NFSGID → 1000 without reading cfg. Callers that need the cfg fallback should apply cfg defaults to ProvisionInput before calling provisionShared.

Part of GoogleCloudPlatform#169: Storage provisioning Phase 0.
- Delete Provision method from WorkspaceBackend interface
- Delete localBackend.Provision (was a no-op)
- Delete nfsBackend.Provision and all moved method definitions (acquireProvisionLock, gitCloneWorkspace, ensureWorktree, chownProjectTree, resolveUID, resolveGID, writeSentinel, sanitizeBranchName, constants), which now live in workspace_provision.go
- Update interface doc comment to reflect Resolve+Realize only, with a note that provisioning is now the standalone provisionShared

Provision had NO production callers (only tests called it).

Part of GoogleCloudPlatform#169: Storage provisioning Phase 0.
- Change all b.Provision(...) calls to provisionShared(...) in workspace_provision_test.go
- Replace TestLocalBackendProvision_NoOp (deleted method) and TestNFSBackendProvision_NonGit with TestProvisionShared_NonGit in workspace_backend_test.go
- Update TestAcquireProvisionLock_ContextCancellation in k8s_nfs_test.go to call free function acquireProvisionLock
- All assertions remain identical; tests pass green

Part of GoogleCloudPlatform#169: Storage provisioning Phase 0.
Add comments near MountDescriptor.Type and api.VolumeMount.Type noting that future vendor mount types (e.g. "cloudrun-volume", "gke-shared-volume") will be added as new Type values, while "nfs" remains the literal NFS protocol mount. No new enum values or behavior change. Part of GoogleCloudPlatform#169: Storage provisioning Phase 0.
Provision was removed from the WorkspaceBackend interface in this PR; the comment still referenced it. Review finding from PR GoogleCloudPlatform#170 (cosmetic).
Rename the unexported provisionShared function to ProvisionShared so that the new sciontool provision subcommand (in cmd/) can call it. Update all in-package callers and tests. No behavior change. Part of GoogleCloudPlatform#169 storage provisioning PR2.
In k8s init containers only the workspace dir is mounted (PVC subPath), not its parent. The sentinel file must therefore be written inside the workspace dir rather than its parent. Add ProvisionInput.SentinelDir: when empty, defaults to filepath.Dir(HostPath) preserving existing broker-side behavior; when set, the sentinel is placed there instead. Add tests proving default placement is unchanged and custom SentinelDir is honored with idempotency. Part of GoogleCloudPlatform#169 storage provisioning PR2.
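The defaulting rule for the sentinel location is small enough to show directly. In this sketch the `ProvisionInput` struct is reduced to the two relevant fields, and the sentinel file name and the `sentinelPath` helper are hypothetical; only the `SentinelDir` and `HostPath` names and the `filepath.Dir(HostPath)` default come from the commit message.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// sentinelFile is a placeholder name; the real constant's value is not shown in the log.
const sentinelFile = ".provisioned"

// ProvisionInput is trimmed to the fields this sketch needs.
type ProvisionInput struct {
	HostPath    string // workspace directory
	SentinelDir string // optional override for where the sentinel lives
}

// sentinelPath applies the default: the parent of HostPath unless SentinelDir is set.
func sentinelPath(in ProvisionInput) string {
	dir := in.SentinelDir
	if dir == "" {
		dir = filepath.Dir(in.HostPath)
	}
	return filepath.Join(dir, sentinelFile)
}

func main() {
	// Broker side: parent of the workspace is available, so the default applies.
	fmt.Println(sentinelPath(ProvisionInput{HostPath: "/srv/projects/demo/workspace"}))

	// Init container: only /workspace is mounted, so the sentinel goes inside it.
	fmt.Println(sentinelPath(ProvisionInput{HostPath: "/workspace", SentinelDir: "/workspace"}))
}
```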
Add a new `sciontool provision` subcommand that replaces the bespoke shell scripts previously used in NFS workspace init containers. Clone mode (default): reads SCION_CLONE_URL and SCION_CLONE_BRANCH from env vars (injection-safe), builds a ProvisionInput with SentinelDir set to the workspace path (required because only the workspace subPath is mounted), and calls runtime.ProvisionShared. Idempotent via sentinel. Wait mode (--wait-for-sentinel): polls for the sentinel file at 2s intervals with a 300s timeout, replicating nfsWaitForSentinelScript. Also exports ProvisionSentinelFile constant for use by the CLI. Part of GoogleCloudPlatform#169 storage provisioning PR2.
Replace the sh -c initScript command with a direct sciontool provision invocation:
- Lock winner: sciontool provision --depth <N>
- Lock loser: sciontool provision --wait-for-sentinel

URL and branch continue to be passed via env vars (SCION_CLONE_URL, SCION_CLONE_BRANCH) for injection safety. Depth is a numeric flag.

Init container name, image, securityContext, and volumeMounts unchanged. Broker-side lock winner/loser selection unchanged.

Tests will be updated in the next commit to assert the new command format.

Part of GoogleCloudPlatform#169 storage provisioning PR2.
Remove nfsInitProvisionScript, nfsWaitForSentinelScript, and nfsInitProvisionEnv; they are replaced by nfsProvisionCommand and nfsProvisionEnv, which build sciontool provision invocations instead of shell scripts.

Update all k8s_nfs_test.go assertions: instead of checking shell script text for git/sentinel/env-var references, tests now verify that:
- Winner init container runs: sciontool provision --depth <N>
- Loser init container runs: sciontool provision --wait-for-sentinel
- URL/branch are passed via env vars, never in command args
- Winner/loser selection, env injection safety, and no-clone-config cases are all preserved with equivalent coverage.

Part of GoogleCloudPlatform#169 storage provisioning PR2.
Move ProvisionShared, its helpers, and the ProvisionInput/ResolvedWorkspace/ ResolvedSharedDir types from pkg/runtime to a new pkg/provision package that depends only on stdlib + pkg/api + pkg/store (no pkg/config). This fixes TestInitProjectDataIsolation: the lean sciontool binary can now import pkg/provision for its `provision` subcommand without transitively pulling in pkg/config (filesystem-based project path resolution). Backward-compatible type aliases and a thin ProvisionShared wrapper remain in pkg/runtime so existing callers (nfsBackend, broker) keep compiling. Tests for moved symbols (sanitizeBranchName, writeSentinel, validation, acquireProvisionLock context cancellation) moved to pkg/provision. Integration tests using nfsTestBackend stay in pkg/runtime.
…ntainer In a k8s init container only the workspace dir is mounted (subPath), so filepath.Dir(HostPath) resolves to "/" for HostPath=/workspace — the chown would recurse over the entire container root (chown -R 1000:1000 /). Today it's masked by dropped capabilities + a non-root security context (chown fails silently, provisioning still succeeds), but it's a correctness defect and a latent security hazard if that security context is ever relaxed. Extract the target computation into a testable chownTarget() helper that falls back to the workspace dir itself when the parent is "/" or ".", preserving broker-side behavior (chown the project root). Add unit coverage. Review finding from PR GoogleCloudPlatform#172 (medium severity, required fix).
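The fallback described for chownTarget() might look roughly like this. The signature is a guess, since the commit message gives only the helper's name and its behavior: use the parent directory, except when the parent is "/" or ".", in which case use the workspace directory itself.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// chownTarget picks the directory to chown recursively (hypothetical signature).
// It normally returns the project root (parent of the workspace), but never
// the filesystem root or the current directory.
func chownTarget(hostPath string) string {
	clean := filepath.Clean(hostPath)
	parent := filepath.Dir(clean)
	if parent == "/" || parent == "." {
		return clean
	}
	return parent
}

func main() {
	// Broker side: chown the project root.
	fmt.Println(chownTarget("/srv/projects/demo/workspace"))

	// Init container: parent would be "/", so stay inside the workspace.
	fmt.Println(chownTarget("/workspace"))
}
```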
…o 'ProvisionShared:' The Tier-1 body moved out of nfsBackend into the standalone ProvisionShared function, but its log/error message prefixes still said 'nfsBackend.Provision:', which is misleading now that it has real (non-NFS) callers like sciontool. Pure string change, no logic change. Review finding from PR GoogleCloudPlatform#170 (cosmetic), applied in PR GoogleCloudPlatform#172 where the code lives.
Address gemini-code-assist review on GoogleCloudPlatform#344:
- Add optional ProvisionInput.Ctx (defaults to context.Background() when nil, so ProvisionShared's signature is unchanged for existing callers).
- Thread ctx through gitCloneWorkspace, ensureWorktree, and chownProjectTree, switching their git/chown invocations to exec.CommandContext so a cancelled or timed-out context kills the child process instead of orphaning it.
- gitCloneWorkspace now self-heals an incomplete prior clone: when the target dir is non-empty with no .git, clear its contents (removeDirContents keeps the dir, which may be a k8s mount point) and retry the clone once.
- sciontool 'provision': import context, pass cmd.Context() into runProvision and runWaitForSentinel; set ProvisionInput.Ctx; make the sentinel poll loop select on ctx.Done()/time.After (was time.Sleep) and use time.Since.
- Add TestProvisionCmd_WaitForSentinel_ContextCancel covering prompt cancellation of the poll loop.
…udPlatform#346)
* fix(ci): handle unchecked error returns flagged by errcheck
Explicitly discard error returns from os.Remove (cleanup in error paths) and os.WriteFile (test helper) so golangci-lint's errcheck linter passes.
* style(ci): apply gofmt to unformatted source files
Run gofmt on discord plugin and hub client files that were failing the CI format check.
…tform#169)
Add Tier-3 vendor mount types to both MountDescriptor (runtime-level) and api.VolumeMount (config-level) discriminators. Each new type has its own validation requirements:
- cloudrun-volume: requires volume_name (Cloud Run managed volume)
- gke-shared-volume: requires volume_name (GKE Filestore CSI PVC)
"nfs" remains the literal NFS protocol mount (server + export only).
MountDescriptor gains a VolumeName field for the new types. api.VolumeMount gains a VolumeName field and updated Validate() switch.
Unit tests cover valid + missing-required-field cases for both new types.
…volume (GoogleCloudPlatform#169)
Introduce two new WorkspaceBackend implementations:
- cloudrunVolumeBackend: emits MountDescriptor Type "cloudrun-volume" with VolumeName and SubPath for Cloud Run managed volumes.
- gkeSharedVolumeBackend: emits MountDescriptor Type "gke-shared-volume" with VolumeName, PVClaimName, and SubPath for GKE Filestore CSI PVCs.
Add config types V1CloudRunVolumeConfig and V1GKESharedVolumeConfig to V1WorkspaceStorageConfig. Extend SelectWorkspaceBackend to route "cloudrun-volume" and "gke-shared-volume" backend values (SharedPlain and WorktreePerAgent modes; ClonePerAgent still escapes to local).
Unit tests cover Resolve, Realize, error cases, backend selection, and default target for all new backends.
…udPlatform#169) Implement CloudRunRuntime satisfying the Runtime interface, registered as "cloudrun" in factory.go GetRuntime. Key design decisions: Broker-side direct provisioning: Cloud Run with a host-mounted share calls provisionShared (Tier 1) DIRECTLY broker-side — no init container needed. The runtime's Run method provisions the workspace before the (deferred) service deployment step. For cloudrun-volume backends the platform provisions the volume, so provisionShared is skipped. Lifecycle methods (deploy/exec/logs via Cloud Run Admin API) return descriptive "not yet implemented" errors — the full container lifecycle is deferred to a follow-up PR. The provisioning and mount-realization wiring is complete and tested. Config: add V1CloudRunConfig (project, region) nested under V1RuntimeConfig.CloudRun. Factory wires WorkspaceStorage from server config for backend selection. Tests cover: runtime name/user, config plumbing, factory selection, broker-side provisioning with NFS (verifies directory + sentinel), cloudrun-volume skip path, missing ProjectID, and all lifecycle methods returning not-implemented errors.
…ionWorkspace Document that the hardcoded shared-plain mode is intentional for the initial Cloud Run runtime scope; per-agent worktrees are a follow-up. No logic change. Review finding from PR GoogleCloudPlatform#171 (low severity, observation 2).
Agents that run local web servers (dev servers, preview apps, test
harnesses) need guaranteed unique host ports. This adds a port pool
managed by the runtime broker that allocates unique ports from a
configurable range at agent startup and releases them on stop/delete.
Each agent receives environment variables:
- AVAILABLE_LOCALHOST_PORT_A, _B (port numbers)
- AVAILABLE_LOCALHOST_URL_A, _B (full URLs, when host_url is configured)
Ports are published via docker -p so they're reachable from the host.
Configuration in settings.yaml:
  server:
    broker:
      port_pool:
        range: "8000-9000"
        ports_per_agent: 2
        host_url: "http://broker-host-ip"
Defaults: range 8000-9000, 2 ports per agent, disabled unless configured.
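As a rough illustration of the allocator this commit describes, here is a minimal sketch of a mutex-guarded pool with lowest-available-port selection. The names PortPool, NewPortPool, Allocate, and Release appear elsewhere in this PR, but the signatures, fields, and error messages below are assumptions and not the code in pkg/runtime/portpool.go.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// PortPool hands out unique host ports per agent (hypothetical shape).
type PortPool struct {
	mu       sync.Mutex
	lo, hi   int
	perAgent int
	inUse    map[int]bool     // port -> taken
	byAgent  map[string][]int // agent name -> its ports
}

// NewPortPool validates the range and per-agent count up front.
func NewPortPool(lo, hi, perAgent int) (*PortPool, error) {
	if lo < 1 || hi > 65535 || lo > hi {
		return nil, fmt.Errorf("invalid port range %d-%d", lo, hi)
	}
	if perAgent <= 0 {
		return nil, errors.New("perAgent must be positive")
	}
	return &PortPool{lo: lo, hi: hi, perAgent: perAgent,
		inUse: map[int]bool{}, byAgent: map[string][]int{}}, nil
}

// Allocate reserves the lowest free ports for agent, all or nothing.
// Callers are expected to call Release first when restarting an agent.
func (p *PortPool) Allocate(agent string) ([]int, error) {
	if agent == "" {
		return nil, errors.New("agent name required")
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	ports := make([]int, 0, p.perAgent)
	for port := p.lo; port <= p.hi && len(ports) < p.perAgent; port++ {
		if !p.inUse[port] {
			ports = append(ports, port)
		}
	}
	if len(ports) < p.perAgent {
		return nil, errors.New("port pool exhausted")
	}
	for _, port := range ports {
		p.inUse[port] = true
	}
	p.byAgent[agent] = ports
	return ports, nil
}

// Release frees an agent's ports; it is a no-op for unknown agents.
func (p *PortPool) Release(agent string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, port := range p.byAgent[agent] {
		delete(p.inUse, port)
	}
	delete(p.byAgent, agent)
}

func main() {
	pool, err := NewPortPool(8000, 9000, 2)
	if err != nil {
		panic(err)
	}
	a, _ := pool.Allocate("agent-a")
	b, _ := pool.Allocate("agent-b")
	pool.Release("agent-a")
	c, _ := pool.Allocate("agent-c") // reuses the ports agent-a gave back
	fmt.Println(a, b, c)
}
```

Scanning from the bottom of the range keeps allocation deterministic and makes released ports the first to be reused, at the cost of a linear scan per allocation.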
Fixes from code review:
1. Port leak on agent restart: when an agent is stopped and restarted, Start() allocated new ports without releasing the old ones (which weren't freed because the previous run didn't go through Delete). Fix: call Release(name) before Allocate, which is an idempotent no-op if no prior allocation exists.
2. Invalid URL with trailing slash: if host_url was configured as "http://example.com/", the constructed URL became "http://example.com/:8042". Fix: trim trailing slashes on init.
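To make the trailing-slash fix concrete, here is a sketch of how the env vars might be built from an allocation. The variable names come from this PR; the function portEnv and its signature are hypothetical. The sketch also trims when building the URL, whereas the commit says the trim happens once at init.

```go
package main

import (
	"fmt"
	"strings"
)

// portEnv builds KEY=VALUE entries for one agent's allocated ports.
// Hypothetical helper. URL variables are emitted only when hostURL is set.
func portEnv(ports []int, hostURL string) []string {
	// Without this trim, "http://example.com/" would yield
	// "http://example.com/:8042".
	base := strings.TrimRight(hostURL, "/")
	env := make([]string, 0, 2*len(ports))
	for i, port := range ports {
		suffix := string(rune('A' + i)) // A, B, ... (assumes at most 26 ports)
		env = append(env, fmt.Sprintf("AVAILABLE_LOCALHOST_PORT_%s=%d", suffix, port))
		if base != "" {
			env = append(env, fmt.Sprintf("AVAILABLE_LOCALHOST_URL_%s=%s:%d", suffix, base, port))
		}
	}
	return env
}

func main() {
	for _, kv := range portEnv([]int{8042, 8043}, "http://example.com/") {
		fmt.Println(kv)
	}
}
```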
… review
Debug/refactor findings:
- NewPortPool now returns error on invalid inputs (bounds, min>max, perAgent<=0)
- Allocate validates agent name and count
- Added ParsePortRange unit tests
- Added validation error tests for PortPool constructor
- Updated callers to handle new error return
- 267 lines of improvements, all tests pass
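A sketch of the range parsing and validation mentioned above, assuming the "min-max" string form used in settings.yaml. ParsePortRange is named in this PR, but its real signature and error messages are not shown, so the ones below are guesses.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParsePortRange parses "lo-hi" and enforces 1 <= lo <= hi <= 65535.
// Assumed signature: the PR only names the function and the 1-65535 bound.
func ParsePortRange(s string) (lo, hi int, err error) {
	loStr, hiStr, ok := strings.Cut(s, "-")
	if !ok {
		return 0, 0, fmt.Errorf("port range %q: want \"min-max\"", s)
	}
	if lo, err = strconv.Atoi(strings.TrimSpace(loStr)); err != nil {
		return 0, 0, fmt.Errorf("port range %q: bad min: %w", s, err)
	}
	if hi, err = strconv.Atoi(strings.TrimSpace(hiStr)); err != nil {
		return 0, 0, fmt.Errorf("port range %q: bad max: %w", s, err)
	}
	if lo < 1 || hi > 65535 || lo > hi {
		return 0, 0, fmt.Errorf("port range %q: need 1 <= min <= max <= 65535", s)
	}
	return lo, hi, nil
}

func main() {
	fmt.Println(ParsePortRange("8000-9000"))
	_, _, err := ParsePortRange("9000-8000")
	fmt.Println(err != nil) // inverted ranges are rejected
}
```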
2cb45e9 to a0d14ea
Closing: upstream PR GoogleCloudPlatform#298 is open.
Why
Agents frequently need to run local web servers — dev servers, preview apps, browser-based verification — but have no guaranteed way to get unique, non-colliding host ports. With multiple agents running concurrently on the same broker, hardcoded ports collide silently. Agents also have no way to tell users a URL where they can access the webserver.
What
A broker-managed port pool that allocates unique host ports to each agent at container creation time, exposed as environment variables:
- AVAILABLE_LOCALHOST_PORT_A, AVAILABLE_LOCALHOST_PORT_B: raw port numbers
- AVAILABLE_LOCALHOST_URL_A, AVAILABLE_LOCALHOST_URL_B: full URLs (when host_url is configured)
Ports are published via docker -p so they're reachable from the host. Allocation happens on scion start, release on scion stop / scion delete / start failure.
Configuration
Agent usage
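The usage snippet originally shown here did not survive extraction. As a stand-in, the sketch below shows one way a process inside an agent container might consume the variables. It is not the PR author's example, and binding to all interfaces (":port") is an assumption about how the published port is reached.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
)

// listenAddr turns the broker-provided port variable into a listen address.
// Hypothetical helper: getenv is injected so the lookup can be tested.
func listenAddr(getenv func(string) string) (string, error) {
	raw := getenv("AVAILABLE_LOCALHOST_PORT_A")
	if raw == "" {
		return "", fmt.Errorf("AVAILABLE_LOCALHOST_PORT_A is not set (is the port pool configured?)")
	}
	port, err := strconv.Atoi(raw)
	if err != nil || port < 1 || port > 65535 {
		return "", fmt.Errorf("AVAILABLE_LOCALHOST_PORT_A=%q is not a valid port", raw)
	}
	return fmt.Sprintf(":%d", port), nil
}

func main() {
	addr, err := listenAddr(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// The URL variable is only present when host_url is configured.
	if url := os.Getenv("AVAILABLE_LOCALHOST_URL_A"); url != "" {
		fmt.Println("preview available at", url)
	}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello from the agent")
	})
	if err := http.ListenAndServe(addr, nil); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```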
How
New files
- pkg/runtime/portpool.go: Thread-safe port allocator (sync.Mutex, lowest-available-port selection, allocate/release/query)
- pkg/runtime/portpool_test.go: Table-driven tests covering allocation, multi-agent, exhaustion, release/reuse, concurrent access (50 goroutines)
Modified files
- pkg/runtime/interface.go: AllocatedPorts and PortHostURL fields on RunConfig
- pkg/runtime/common.go: Env var injection + -p port publishing in buildCommonRunArgs() (covers Docker, Podman, Apple Container)
- pkg/runtime/k8s_runtime.go: Env var injection in buildPod() for Kubernetes
- pkg/api/types.go: PortPoolConfig struct + ParsePortRange() with validation (1-65535)
- pkg/config/settings_v1.go: V1PortPoolConfig on V1BrokerConfig for settings.yaml
- pkg/agent/run.go: Port allocation before RunConfig construction, release on start failure
- pkg/agent/manager.go: Port release in Stop() and Delete() paths, PortPool field
- cmd/server_foreground.go: Pool initialization from broker settings at startup
Design decisions
Test plan