
feat: add hub-managed port pool for agent containers#1

Closed
zeroasterisk wants to merge 75 commits into main from feature/port-assignment


@zeroasterisk

Owner

Why

Agents frequently need to run local web servers (dev servers, preview apps, browser-based verification) but have no guaranteed way to get unique, non-colliding host ports. With multiple agents running concurrently on the same broker, hardcoded ports collide silently. Agents also have no way to give users a URL at which the web server can be reached.

What

A broker-managed port pool that allocates unique host ports to each agent at container creation time, exposed as environment variables:

  • AVAILABLE_LOCALHOST_PORT_A, AVAILABLE_LOCALHOST_PORT_B — raw port numbers
  • AVAILABLE_LOCALHOST_URL_A, AVAILABLE_LOCALHOST_URL_B — full URLs (when host_url is configured)

Ports are published via docker -p so they're reachable from the host. Ports are allocated on scion start and released on scion stop, scion delete, or start failure.
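
As a rough illustration of the mapping described above, the sketch below turns a list of allocated ports into `-e` and `-p` arguments. The helper name `portRunArgs`, the `host_url:port` URL shape, and the choice to publish each host port to the same container port are assumptions made for the example; the PR text does not spell them out.

```go
package main

import (
	"fmt"
	"strings"
)

// portRunArgs is a hypothetical helper (not taken from the PR). It turns
// allocated host ports into `-e` and `-p` arguments for a docker-style
// run command.
func portRunArgs(ports []int, hostURL string) []string {
	var args []string
	for i, port := range ports {
		suffix := string(rune('A' + i)) // index 0 -> "A", 1 -> "B", ...
		args = append(args, "-e", fmt.Sprintf("AVAILABLE_LOCALHOST_PORT_%s=%d", suffix, port))
		if hostURL != "" {
			// Assumption: the URL is simply host_url with the port appended.
			url := fmt.Sprintf("%s:%d", strings.TrimRight(hostURL, "/"), port)
			args = append(args, "-e", fmt.Sprintf("AVAILABLE_LOCALHOST_URL_%s=%s", suffix, url))
		}
		// Assumption: the container listens on the same port number as the host.
		args = append(args, "-p", fmt.Sprintf("%d:%d", port, port))
	}
	return args
}

func main() {
	args := portRunArgs([]int{8000, 8001}, "http://35.232.118.211")
	fmt.Println(strings.Join(args, " "))
}
```

With an empty `hostURL` the URL variables are simply omitted, which matches the "when host_url is configured" note above.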

Configuration

server:
  broker:
    port_pool:
      enabled: true            # default: true when section is present
      range: "8000-9000"       # default
      ports_per_agent: 2       # default
      host_url: "http://35.232.118.211"  # optional, enables URL env vars

Agent usage

# Start a dev server on the assigned port
npm run dev -- --port "$AVAILABLE_LOCALHOST_PORT_A"

# Tell the user where to find it
echo "Visit $AVAILABLE_LOCALHOST_URL_A"

How

New files

  • pkg/runtime/portpool.go — Thread-safe port allocator (sync.Mutex, lowest-available-port selection, allocate/release/query)
  • pkg/runtime/portpool_test.go — Table-driven tests covering allocation, multi-agent, exhaustion, release/reuse, concurrent access (50 goroutines)

Modified files

  • pkg/runtime/interface.go — AllocatedPorts and PortHostURL fields on RunConfig
  • pkg/runtime/common.go — Env var injection + -p port publishing in buildCommonRunArgs() (covers Docker, Podman, Apple Container)
  • pkg/runtime/k8s_runtime.go — Env var injection in buildPod() for Kubernetes
  • pkg/api/types.go — PortPoolConfig struct + ParsePortRange() with validation (1-65535)
  • pkg/config/settings_v1.go — V1PortPoolConfig on V1BrokerConfig for settings.yaml
  • pkg/agent/run.go — Port allocation before RunConfig construction, release on start failure
  • pkg/agent/manager.go — Port release in Stop() and Delete() paths, PortPool field
  • cmd/server_foreground.go — Pool initialization from broker settings at startup
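
The range validation mentioned for ParsePortRange() could look roughly like the sketch below. The signature and error wording are assumptions; only the "min-max" string form and the 1-65535 bounds come from the description above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePortRange parses a "min-max" string such as "8000-9000".
// This is a sketch; the real ParsePortRange may differ in signature.
func parsePortRange(s string) (lo, hi int, err error) {
	loStr, hiStr, ok := strings.Cut(strings.TrimSpace(s), "-")
	if !ok {
		return 0, 0, fmt.Errorf("port range %q: want \"min-max\"", s)
	}
	if lo, err = strconv.Atoi(strings.TrimSpace(loStr)); err != nil {
		return 0, 0, fmt.Errorf("port range %q: bad lower bound: %w", s, err)
	}
	if hi, err = strconv.Atoi(strings.TrimSpace(hiStr)); err != nil {
		return 0, 0, fmt.Errorf("port range %q: bad upper bound: %w", s, err)
	}
	if lo < 1 || hi > 65535 || lo > hi {
		return 0, 0, fmt.Errorf("port range %q: need 1 <= min <= max <= 65535", s)
	}
	return lo, hi, nil
}

func main() {
	for _, s := range []string{"8000-9000", "0-100", "9000-8000", "8000"} {
		lo, hi, err := parsePortRange(s)
		fmt.Println(s, lo, hi, err)
	}
}
```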

Design decisions

  • Opt-in: Pool is nil unless port_pool section is present in settings
  • Broker-scoped: Each broker manages its own pool independently
  • In-memory: No persistence — ports reclaimed on broker restart
  • All runtimes: Docker/Podman/Apple via buildCommonRunArgs, K8s via buildPod

Test plan

  • go build ./... — clean
  • go vet ./pkg/... ./cmd/... — clean
  • go test ./pkg/runtime/... — all pass including new portpool tests
  • go test ./pkg/agent/... ./pkg/api/... — all pass
  • Concurrency test: 50 goroutines allocating simultaneously, no duplicates
  • No double env injection
  • Port release on all exit paths (stop, delete, start failure)

ptone and others added 30 commits June 2, 2026 05:46
…rm#293)

* fix(scion-chat-app): set channel="gchat" on ask_user dialog responses

handleDialogSubmit was using the simple SendMessage API which doesn't
support structured message fields, so inbound ask_user responses arrived
at the hub with no channel set (defaulting to "web"). Switch to
SendStructuredMessage with Channel="gchat" to match the pattern already
used by cmdMessage.

* fix: channel filtering and thread-id routing for chat channel replies

Two bugs in the chat channel routing feature:

1. Channel filtering: broker plugins now check msg.Channel and skip
   messages targeted at a different channel. The hub injects plugin_name
   into broker credentials so each plugin knows its own channel identity.
   This prevents cross-channel delivery (e.g., Telegram replies leaking
   to Google Chat).

2. Thread-id routing: the Telegram plugin now passes msg.ThreadID as
   message_thread_id to the Telegram Bot API when sending outbound
   messages. Previously, thread-id was captured on inbound messages but
   never forwarded on outbound, causing replies to land in the wrong
   forum topic. Added SendOption variadic parameter to SendMessage,
   SendMessageWithKeyboard, and SendQueue.Send for backward-compatible
   thread-id support.

* feat(scion-chat-app): add Google Chat thread context support

Propagate thread IDs end-to-end so agents can participate in
Google Chat threads:

- Inbound: auto-set ThreadID on StructuredMessage from the Google Chat
  event's thread context when no explicit --thread flag is used
- Inbound: propagate ThreadID on dialog submit (ask_user responses)
- Outbound: pass ThreadID from StructuredMessage to SendMessageRequest
  so agent replies land in the correct Google Chat thread

* fix: route outbound messages to chat-app via ChannelID

The FanOutEventBus matched msg.Channel against the bus Name, but the
chat-app plugin is registered as "chat-app" while its messages use
channel="gchat". Add a ChannelID field to NamedEventBus and PluginInfo
so plugins can declare the channel they handle independently of their
registered name. The chat-app now reports ChannelID="gchat" via
GetInfo(), and the hub reads it at startup to wire routing correctly.

* design: per-topic /default agent scoping for Telegram forums

Explores how to let /default set a different default agent per
forum topic (message_thread_id) rather than per-chat. Conclusion:
~85 lines of changes across store, commands, callbacks, and routing.

* feat(scion-telegram): per-topic /default agent scoping for forum groups

Add support for setting a different default agent per Telegram forum
topic/thread, with the chat-wide default as fallback.

- New topic_defaults table keyed on (chat_id, thread_id)
- /default in a topic sets/shows the topic-level override
- Callback data extended: dflt:<slug>:<threadID> for topic scope
- Routing resolves topic default before chat default for both
  @bot-mention and unaddressed message fallback paths

* fix: address PR GoogleCloudPlatform#293 review feedback

- Add !no_sqlite build tag to resource_import_handler_test.go to fix CI
  vet failure (mockRoundTripper undefined when template_bootstrap_test.go
  is excluded)
- Guard debug log in broker.go Publish against nil msg to prevent panic
- Add fitCallback to preserve threadID suffix in Telegram callback_data
  when the 64-byte limit is exceeded, truncating agentSlug instead
- Add slog warning to truncateCallback when truncation occurs

* fix: address second round of PR GoogleCloudPlatform#293 review feedback

- Remove redundant channel filters from chat-app and Telegram Publish()
  methods — the FanOutEventBus already routes by ChannelID, and comparing
  against the plugin's registered name would silently drop messages
- Log errors from GetTopicDefault instead of silently ignoring them
- Return distinct error messages in chat-app when ResolveOrAutoRegister
  fails with a real error vs a nil mapping

* fix: address third round of PR GoogleCloudPlatform#293 review feedback

- Add early return for nil msg at top of Publish() to prevent panics
  in downstream handlers that dereference msg fields
- Add thread-safe ChannelName() getter on BrokerServer
- Use dynamic ChannelName() in GetInfo() instead of hardcoded "gchat"
- Use dynamic ChannelName() in both commands.go call sites

* fix: use callback_lookups for long callback data instead of truncation

Replace fitCallback() which corrupted agent slugs by truncating them
to fit Telegram's 64-byte limit. Long callback payloads are now stored
in the callback_lookups table with a short cblu:<id> reference.
HandleCallback resolves lookup IDs before routing.

Also add defensive check for empty HubUserEmail in chat-app to prevent
constructing invalid "user:" sender strings.
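
The lookup approach in this commit can be sketched as follows. An in-memory map stands in for the callback_lookups table, and the type and method names are hypothetical; only the 64-byte limit and the cblu:<id> reference form come from the commit message. The sketch is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const (
	maxCallbackBytes = 64      // Telegram's callback_data size limit
	lookupPrefix     = "cblu:" // short reference prefix from the commit message
)

// lookupStore is an in-memory stand-in for the callback_lookups table.
type lookupStore struct {
	next    int
	payload map[string]string
}

func newLookupStore() *lookupStore {
	return &lookupStore{payload: make(map[string]string)}
}

// encode returns data unchanged when it fits within the limit; otherwise it
// stores the payload and returns a short cblu:<id> reference.
func (s *lookupStore) encode(data string) string {
	if len(data) <= maxCallbackBytes {
		return data
	}
	s.next++
	id := strconv.Itoa(s.next)
	s.payload[id] = data
	return lookupPrefix + id
}

// resolve expands a cblu:<id> reference; any other data passes through.
// The boolean is false when a reference points at no stored payload.
func (s *lookupStore) resolve(data string) (string, bool) {
	id, isRef := strings.CutPrefix(data, lookupPrefix)
	if !isRef {
		return data, true
	}
	full, ok := s.payload[id]
	return full, ok
}

func main() {
	s := newLookupStore()
	long := "dflt:" + strings.Repeat("very-long-agent-slug-", 4) + ":12345"
	ref := s.encode(long)
	full, ok := s.resolve(ref)
	fmt.Println(ref, ok, full == long)
}
```

Unlike truncation, the reference round-trips the full payload, so the agent slug and thread ID both survive intact.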

* fix: address fifth round of PR GoogleCloudPlatform#293 review feedback

- Use local interface instead of concrete *BrokerRPCClient type assertion
  in pluginChannelID() and isObserverBroker() so in-process brokers and
  mocks are handled correctly.
- Add nil guard for msg in fanout channel routing check.

---------

Co-authored-by: Scion <agent@scion.dev>
…eCloudPlatform#296)

* Fix test suite leaking Hub credentials, corrupting agent state (GoogleCloudPlatform#123)

Tests that spawn sciontool (e.g., TestInitCommand_Integration) inherited
live Hub env vars from the agent container, causing the subprocess to
talk to the real Hub and reset the agent phase to "starting."

- Add scrubHubEnv(t) helpers that use t.Setenv to clear Hub env vars
  (SCION_HUB_ENDPOINT, SCION_HUB_URL, SCION_AUTH_TOKEN, SCION_AGENT_ID,
  SCION_AGENT_MODE) with automatic restore on test cleanup
- Filter Hub env vars from subprocess Cmd.Env in TestInitCommand_Integration
  as belt-and-suspenders protection
- Convert os.Setenv/os.Unsetenv to t.Setenv throughout hub_test.go and
  client_test.go for crash-safe env var isolation

* Add project log entry for issue GoogleCloudPlatform#123 fix

* Address PR GoogleCloudPlatform#296 review feedback in init_test.go

Replace hardcoded /tmp/sciontool-test path with t.TempDir() to avoid
permission conflicts and test races. Replace map allocation in
filterHubEnv with slices.Contains on the static hubEnvVars slice.
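
A minimal sketch of the slices.Contains variant described above, assuming the environment is a list of KEY=VALUE strings as in Cmd.Env. The variable list is the one named in the commit message; the exact function signature is an assumption.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// hubEnvVars lists the Hub variables named in the commit message.
var hubEnvVars = []string{
	"SCION_HUB_ENDPOINT",
	"SCION_HUB_URL",
	"SCION_AUTH_TOKEN",
	"SCION_AGENT_ID",
	"SCION_AGENT_MODE",
}

// filterHubEnv returns env (KEY=VALUE entries) without any Hub variables.
func filterHubEnv(env []string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		key, _, _ := strings.Cut(kv, "=")
		if !slices.Contains(hubEnvVars, key) {
			out = append(out, kv)
		}
	}
	return out
}

func main() {
	env := []string{"PATH=/usr/bin", "SCION_AUTH_TOKEN=secret", "HOME=/root"}
	fmt.Println(filterHubEnv(env))
}
```

For a five-element static list, a linear slices.Contains scan avoids allocating a map on every call.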
…oogleCloudPlatform#299)

Three new documentation pages:

- External Channels: covers Telegram (bidirectional group chat),
  Discord (outbound webhooks), and A2A protocol bridge in one page.
  Summarizes concepts and links to detailed READMEs in extras/.

- Hub Setup on GCE: step-by-step walkthrough of deploying a hub
  using the starter-hub scripts. Covers provisioning, repo setup,
  TLS, and post-setup next steps.

- Multi-Broker Setup: how to connect multiple machines to a single
  hub for distributed agent execution. Covers architecture, broker
  registration, selection, and cross-broker considerations.

Sidebar updated to include all three pages.
* Add sort and filter capabilities to agent list view (GoogleCloudPlatform#71)

CLI: add --phase, --activity, --template filter flags and --sort,
--reverse sort flags to 'scion list'. Validates flag values against
known phases/activities. Passes phase filter server-side in hub mode
for efficiency.

Web UI: add phase filter chips (All/Running/Stopped/Suspended/Error),
sortable table headers (Name, Status, Updated), and sort dropdown for
grid view. Filter and sort state persists to localStorage.

Closes GoogleCloudPlatform#71

* Address review feedback: input canonicalization and validation

- CLI: canonicalize --phase/--activity/--sort to lowercase in
  validateListFlags, remove redundant empty check on filterActivity
- Web UI: validate localStorage phase filter against known values
  instead of raw cast
- Web UI: validate localStorage sort config field/dir values before
  applying
- Web UI: handle invalid date strings in formatRelativeTime with
  isNaN guard
…rm#295)

* Add prominent disconnected overlay to web terminal

When the WebSocket connection drops, a full-terminal overlay now appears
with 50% black opacity and large red "DISCONNECTED" text centered on it.
The overlay appears immediately on disconnect and disappears when the
connection is re-established. The small status indicator in the toolbar
remains as a secondary signal.

Fixes GoogleCloudPlatform#77

* Move disconnected overlay to be a sibling of xterm container

The overlay was a child of .terminal-container, whose DOM is managed by
xterm.js. Lit re-rendering the overlay on connect/disconnect state
changes conflicts with xterm's DOM management.

Fix: introduce .terminal-wrapper as the relative-positioning context,
make .terminal-container absolutely positioned inside it, and render
the overlay as a sibling — outside xterm's managed subtree.

* Use wasConnected flag instead of terminal ref for overlay reactivity

Replace the non-reactive `this.terminal` reference in the overlay
condition with a new `@state() wasConnected` flag. This fixes two issues:

1. Lit reactivity: `this.terminal` lacked `@state()` so changes to it
   didn't trigger re-renders. The new `wasConnected` is properly
   decorated as reactive state.

2. Initial connection: using `this.terminal` would flash the overlay
   during the brief window between terminal init and WebSocket open.
   `wasConnected` is only set true after the first successful connect,
   so the overlay only appears after a genuine disconnection.
…tore port, LISTEN/NOTIFY (GoogleCloudPlatform#304)

* P0-1: switch Postgres driver from lib/pq to pgx/v5 stdlib

- Add github.com/jackc/pgx/v5/stdlib (registers as "pgx")
- driver_postgres.go: blank import pgx stdlib instead of lib/pq
- OpenPostgres: open via sql.Open("pgx", dsn) + entsql.OpenDB
- Introduce PoolConfig (applied to *sql.DB); thread through
  OpenSQLite/OpenPostgres and update all callers
- go mod tidy drops lib/pq

* P0-2: add connection pool config to DatabaseConfig

- DatabaseConfig gains MaxOpenConns / MaxIdleConns / ConnMaxLifetime
  plus ConnMaxLifetimeDuration() helper
- DefaultGlobalConfig sets sqlite pool defaults (MaxOpenConns=1,
  load-bearing for write serialization)
- applyDatabasePoolDefaults fills postgres defaults (20/5/30m) and
  forces sqlite MaxOpenConns=1; called in both load paths
- Mirror fields in V1DatabaseConfig + both conversion directions
- Wire pool settings into entc.OpenSQLite in initStore
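
The defaulting rules above might be expressed along these lines. The struct layout, the string form of ConnMaxLifetime, the driver names, and the reading of "20/5/30m" as MaxOpenConns/MaxIdleConns/ConnMaxLifetime are assumptions for the sketch.

```go
package main

import (
	"fmt"
	"time"
)

// DatabaseConfig mirrors the pool fields named in the commit message.
// Field types are assumptions.
type DatabaseConfig struct {
	Driver          string
	MaxOpenConns    int
	MaxIdleConns    int
	ConnMaxLifetime string // duration string such as "30m"
}

// ConnMaxLifetimeDuration parses ConnMaxLifetime, returning 0 if it is
// empty or malformed.
func (c DatabaseConfig) ConnMaxLifetimeDuration() time.Duration {
	d, err := time.ParseDuration(c.ConnMaxLifetime)
	if err != nil {
		return 0
	}
	return d
}

// applyDatabasePoolDefaults fills unset Postgres values and always forces
// a single open connection for SQLite so writes stay serialized.
func applyDatabasePoolDefaults(c *DatabaseConfig) {
	switch c.Driver {
	case "sqlite":
		c.MaxOpenConns = 1
	case "postgres":
		if c.MaxOpenConns == 0 {
			c.MaxOpenConns = 20
		}
		if c.MaxIdleConns == 0 {
			c.MaxIdleConns = 5
		}
		if c.ConnMaxLifetime == "" {
			c.ConnMaxLifetime = "30m"
		}
	}
}

func main() {
	pg := DatabaseConfig{Driver: "postgres"}
	applyDatabasePoolDefaults(&pg)
	fmt.Println(pg.MaxOpenConns, pg.MaxIdleConns, pg.ConnMaxLifetimeDuration())
}
```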

* P0-3/P0-4: CRUD-parity test harness + spec-driven fixture generator

P0-3: pkg/store/storetest/ — backend-agnostic, table-driven CRUD oracle.
A Factory(t) -> store.Store is injected; generic Domain[T] descriptors drive
Create/Read/Update/Delete (+optional soft-delete)/List-paginate/List-filter.
Ships group + policy domains and runs green against today's CompositeStore
(SQLite base + Ent DB). Ready to accept a postgresFactory for P3-2.

P0-4: internal/fixturegen/ — Go-defined spec seeding >=1 row per table across
all 30 domain tables, with edge cases (NULL optionals, max-length strings,
nested/unicode JSON, soft-deleted agent, BLOB). Deterministic. 'go run
./internal/fixturegen' emits testdata/hub-v46-fixture.db, prints a 30-table
coverage report, and caches the blob to the scratchpad mount. CI gate fails if
any table has zero rows.

* feat(ent): add 23 new Ent schemas for full table parity (P1-2 + P1-3)

* feat(observability): add Cloud Monitoring scaffolding for LISTEN/NOTIFY metrics (P0-5)

* P2: port notification + gcp/github/token domains to Ent entadapter

Add Ent-backed implementations of the notification, GCP service account,
GitHub App installation, and user access token store sub-interfaces:

- notification_store.go: NotificationStore (subscriptions, notifications,
  templates). Dispatch uses an atomic conditional update as the multi-replica
  claim primitive, and an optional NotificationPublisher designs in the
  LISTEN/NOTIFY fan-out for created/dispatched events.
- external_store.go: GCPServiceAccountStore + GitHubInstallationStore +
  UserAccessTokenStore. GitHub create is idempotent (INSERT OR IGNORE
  semantics), repositories/scopes are JSON, default_scopes is CSV, and tokens
  support key-hash lookup. Legacy api_keys is intentionally not surfaced.
- storetest: add GCPServiceAccount, SubscriptionTemplate, and
  NotificationSubscription CRUD-parity domains.

Does not modify composite.go.

* P2: port schedule, maintenance, message domains to Ent entadapter

- schedule_store.go: ScheduleStore + ScheduledEventStore sub-interfaces with
  dialect-aware SELECT FOR UPDATE SKIP LOCKED claim helper for the
  ListDueSchedules / ListPendingScheduledEvents job-claim paths (plain SELECT
  on SQLite, SKIP LOCKED on Postgres).
- maintenance_store.go: run-state RMW, AbortRunningMaintenanceOps, Go-side
  seed (uuid.New) replacing SQLite randomblob() UUID seeds.
- message_store.go: CRUD, read flags, PurgeOldMessages, design-in
  PublishUserMessage hook for Postgres LISTEN/NOTIFY.
- pkg/ent/client_driver.go: hand-written Client.Driver() accessor for
  dialect detection + raw locking queries.

* feat(entadapter): port user + allowlist/invite domains to Ent (P2)

Implements the Ent-backed store adapters for the user and
allowlist/invite domains, plus their CRUD-parity oracle descriptors.

pkg/store/entadapter/user_store.go (store.UserStore):
- CreateUser/GetUser/GetUserByEmail/UpdateUser/UpdateUserLastSeen/
  DeleteUser/ListUsers.
- Case-insensitive email: emails are normalized to lower case on write
  (so the plain unique index enforces case-insensitive uniqueness,
  equivalent to the legacy UNIQUE COLLATE NOCASE) and matched with
  EmailEqualFold (lower(email)=lower($1)) on read. ent codegen +
  AutoMigrate cannot emit a real lower(email) functional index across
  both SQLite (tests) and Postgres, so the invariant is enforced at the
  port layer.
- Offset-based pagination matching the legacy SQLite store.

pkg/store/entadapter/allowlist_store.go (store.AllowListStore +
store.InviteCodeStore):
- Full allow-list + invite-code CRUD.
- BulkAddAllowListEntries uses CreateBulk + OnConflictColumns(email).
  Ignore() for race-safe INSERT-OR-IGNORE; added/skipped counts mirror
  the legacy per-row semantics (existing + within-batch dups skipped).
- IncrementInviteUseCount is a single atomic conditional UPDATE
  (revoked=false AND not expired AND (max_uses=0 OR use_count<max_uses)),
  which is race-free on both backends without SELECT...FOR UPDATE. The
  sql/lock feature is enabled and ForUpdate is available for genuine
  multi-statement RMW paths.
- ListAllowListEntriesWithInvites batch-joins invite codes (invite_id is
  a plain column, not an Ent edge).

Schema:
- pkg/ent/schema/user.go: add nillable last_seen field (+ index) needed
  by UpdateUserLastSeen / lastSeen sort; document the case-insensitive
  email strategy.
- pkg/ent/generate.go: enable --feature sql/upsert,sql/lock (required for
  OnConflict and ForUpdate).

Tests (all passing):
- pkg/store/storetest/domains_user.go: UserDomain, AllowListDomain,
  InviteCodeDomain oracle descriptors (kept in a separate file to avoid
  contending on domains.go).
- entadapter oracle test runs the shared CRUD-parity suite directly
  against the new adapters; behavior tests cover case-insensitivity,
  bulk idempotency, conditional increment, stats, and the invite join.

NOTE: Generated Ent code under pkg/ent/** is intentionally NOT included.
This is a shared worktree where sibling port agents concurrently modify
schemas and the same feature flags; the generated code must be
regenerated at wave integration via:
    go generate ./pkg/ent/...
Verified locally that regeneration + full build + tests pass.

Per P2 scope: composite.go wiring and ensureEntUser shadow removal are
deferred to P2-collapse.

* P2: port secret/env_var + template/harness_config domains to Ent

Add Ent-backed store implementations for the secret/env and
template/harness domains, mirroring the legacy SQLite semantics:

- entadapter/secret_store.go: SecretStore implementing store.SecretStore
  + store.EnvVarStore. Polymorphic (scope, scope_id) addressing, COALESCE
  target->key projection, version bump on update, get-then-update upsert,
  and transitive ListProgenySecrets via a created_by IN-list over the
  ancestor set (user scope + allow_progeny only; encrypted value withheld).
- entadapter/template_store.go: TemplateStore implementing
  store.TemplateStore + store.HarnessConfigStore. base_template hierarchy,
  scope/project_id backwards-compat lookups, content_hash, JSON config/files
  columns, DeleteByScope. Subscription templates are owned by NotificationStore.
- Direct Ent unit tests incl. a progeny-inheritance parity test.
- storetest: Template/HarnessConfig/Secret/EnvVar domain descriptors wired
  into RunStoreSuite for cross-backend CRUD parity.

* P2: port project/broker + brokersecret domains to Ent

Port the project/broker domain (projects, runtime_brokers, project_contributors,
project_sync_state) and the broker-auth domain (broker_secrets,
broker_join_tokens) from raw SQL to Ent adapters.

- pkg/store/entadapter/project_store.go: implements ProjectStore,
  RuntimeBrokerStore, ProjectProviderStore and ProjectSyncStateStore.
  * provider + sync-state upserts use Ent OnConflict().UpdateNewValues()
    (sql/upsert) keyed on the (project_id, broker_id) unique index.
  * runtime broker heartbeat/update use an optimistic version-CAS loop on a
    new internal lock_version token, serializing concurrent writers portably
    across SQLite (tests) and Postgres without SELECT ... FOR UPDATE.
  * slug lookups support case-insensitive matching (EqualFold).
  * project computed fields (AgentCount, ActiveBrokerCount, ProjectType) are
    derived via Ent queries, matching the legacy SQLite store.
- pkg/store/entadapter/brokersecret_store.go: implements BrokerSecretStore
  (per-broker HMAC secrets + short-lived join tokens, expiry cleanup).
- Project Ent schema: add operational fields for full parity
  (default_runtime_broker_id, shared_dirs, github_*, git_identity).
- RuntimeBroker Ent schema: relax vestigial type column to Optional, add
  internal lock_version concurrency token.
- Regenerate Ent with sql/upsert,sql/lock features.
- storetest: add Project, RuntimeBroker, BrokerSecret and BrokerJoinToken
  CRUD-parity domains.
- Unit tests for both adapters.

Per the integration plan, composite.go wiring and ensureEntProject shadow
removal are deferred to P2-collapse.

* P2: port agent domain to Ent entadapter (XL)

* chore(ent): regenerate Ent code for all 30 entity schemas

Regenerated with --feature sql/upsert,sql/lock to support
OnConflict upserts and ForUpdate/SKIP LOCKED job claims.

* P2-collapse: collapse dual-DB into single Ent store

Wire all Ent-backed sub-stores into CompositeStore via embedding, removing
the raw-SQL base store and the User/Agent/Project shadow-sync machinery
(ensureEntUser/ensureEntAgent/ensureEntProject). CompositeStore now serves
every domain from a single Ent client and implements Close/Ping/Migrate
directly.

Collapse initStore() to open one Ent SQLite DB (no _ent shadow DSN, no
MigrateGroveToProjectData, no raw sqlite.New). Register the User, AllowList,
and InviteCode domains in the storetest CRUD-parity suite. Update entadapter
tests for the single-DB NewCompositeStore(client) signature.

go build ./... green; go test ./pkg/store/entadapter/... ./pkg/store/storetest/... green.

* P2-delete: remove raw-SQL store implementation

Delete the ~6k-LOC raw-SQL store (sqlite.go) and its per-domain sibling
files (brokersecret, gcp_service_account, github_installation, maintenance,
messages, notification, project_sync_state, schedule, scheduled_event) plus
their tests, including the inline schema-migration scaffold. Keep driver.go,
which registers the pure-Go SQLite driver used by Ent's SQLite backend.

Repoint the two non-test consumers to the Ent-backed store:
  - cmd/hub_secret_migrate.go now opens an Ent client + CompositeStore.
  - internal/fixturegen opens via entc and seeds the Ent schema's *sql.DB.

go build ./... green; no remaining production references to the raw store.

* test: compile-migrate downstream suites to Ent store + fix signing-key PK

Replace the removed raw-SQL store in downstream tests with an Ent-backed
newTestStore helper (pkg/hub, pkg/secret) and update cmd/server_test.go and
internal/fixturegen tests. Port the 8 raw-SQL DB() access sites in hub tests
via a new CompositeStore.DB() escape-hatch accessor.

Fix a production bug surfaced by the collapse: hub/server.go signingKeySecretID
generated a non-UUID secret primary key, which the Ent secret store rejects;
it now derives a deterministic UUIDv5. go build ./... green; entadapter and
storetest suites green.

NOTE: hub/secret/fixturegen suites now COMPILE but many tests still fail
because their fixtures seed non-UUID string IDs that the UUID-PK Ent schema
rejects; addressed in follow-up commits (tid() helper).

* test(hub): map non-UUID fixture IDs to UUIDs via tid() helper

Wrap human-readable test identifiers in tid() (deterministic UUIDv5) so the
UUID-PK Ent store accepts them while preserving cross-reference consistency and
ID-equality assertions. Reduces pkg/hub failures from 611 to 79; remaining
failures are behavioral, not ID-format, and are addressed separately.
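
To make the tid() idea concrete, here is a standard-library-only sketch of a deterministic UUIDv5 mapping. The real helper presumably uses a UUID library, and its namespace is not stated; the namespace below (the RFC 4122 DNS namespace) is an arbitrary choice for the example.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// testNamespace is the RFC 4122 DNS namespace UUID, used here only as an
// arbitrary fixed namespace. The namespace of the real tid() is unknown.
var testNamespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// tid maps a human-readable test identifier to a deterministic UUIDv5
// string: SHA-1 over namespace+name, truncated to 16 bytes, with the
// version and variant bits set.
func tid(name string) string {
	h := sha1.New()
	h.Write(testNamespace[:])
	h.Write([]byte(name))
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum)
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant

	s := hex.EncodeToString(u[:])
	return s[0:8] + "-" + s[8:12] + "-" + s[12:16] + "-" + s[16:20] + "-" + s[20:32]
}

func main() {
	fmt.Println(tid("agent-1"))
	fmt.Println(tid("agent-1") == tid("agent-1")) // same input, same UUID
}
```

Because the mapping is a pure function of the name, two fixtures that refer to the same readable ID still agree after wrapping, which is what keeps cross-references and ID-equality assertions intact.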

# Conflicts:
#	pkg/hub/handlers_project_test.go
#	pkg/hub/httpdispatcher_test.go

* fix(store): seed maintenance ops in Migrate; initStore uses Migrate

Restore raw-SQL parity: CompositeStore.Migrate now runs AutoMigrate and seeds
built-in maintenance operations (the raw store seeded these in its migrations).
initStore and hub test helpers call s.Migrate() so production and tests seed
consistently. Fixes the maintenance-operation hub tests (404 'Operation not
found'). pkg/hub failures 79 -> 71.

* test(hub): satisfy Ent NotEmpty validators in fixtures

Add slugs/broker names to test fixtures that previously relied on the raw
store's lenient (no-validator) inserts: project/agent slugs in the logs test
helper, broker slugs in embedded/profile/authz fixtures, and BrokerName on
envgather ProjectProvider literals. pkg/hub failures 71 -> 57.

* test(secret): map non-UUID fixture IDs to UUIDs via tid()

Apply the tid() helper to pkg/secret fixtures (including a dynamically built
secret ID) so the UUID-PK Ent store accepts them. pkg/secret now fully green.

* test(cmd): map non-UUID fixture IDs to UUIDs via tid(); add broker slug/name

Wrap broker/grove/agent IDs passed to registerGlobalProjectAndBroker and the
dispatcher tests in tid(), and supply RuntimeBroker.slug / ProjectContributor
broker_name to satisfy Ent validators. cmd now green except
TestDeleteStopped_RequiresGroveContext, which requires the 'docker' binary
(absent in this sandbox) and is unrelated to the store migration.

# Conflicts:
#	cmd/server_dispatcher_test.go

* test(hub): wrap remaining latent non-UUID fixture IDs

Catch IDs that surfaced behind earlier failures (stale-agent-*, agent-visible-authz,
agent-profile-hb, env-owner-1). No more UUID-parse errors in pkg/hub; the
remaining ~56 failures are behavioral (URL paths built from old raw IDs,
assertion mismatches), addressed next.

* fix(entadapter): Get-by-id returns ErrNotFound for non-UUID identifiers

Restore raw-SQL store parity: a malformed identifier cannot match any UUID
primary key, so get-by-id lookups now report store.ErrNotFound instead of
store.ErrInvalidInput. This matches the raw store (a lookup with a bad id simply
returned no row) and is what callers depend on — e.g. resolveTemplate passes a
template *name* to GetTemplate and relies on ErrNotFound to fall back to
slug-based resolution. New parseGetID helper applied across all 17 get-by-id
methods. pkg/hub failures 56 -> 40; entadapter/storetest stay green.
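
The behavior described above could be sketched as below. The helper name parseGetID comes from the commit message, but its signature, the hand-rolled UUID parsing, and the local ErrNotFound variable (standing in for store.ErrNotFound) are assumptions.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// ErrNotFound stands in for store.ErrNotFound in this sketch.
var ErrNotFound = errors.New("not found")

// parseGetID parses a canonical UUID string for a get-by-id lookup.
// A malformed identifier cannot match any UUID primary key, so it is
// reported as ErrNotFound rather than as invalid input.
func parseGetID(id string) ([16]byte, error) {
	var u [16]byte
	if len(id) != 36 || id[8] != '-' || id[13] != '-' || id[18] != '-' || id[23] != '-' {
		return u, ErrNotFound
	}
	raw, err := hex.DecodeString(strings.ReplaceAll(id, "-", ""))
	if err != nil || len(raw) != 16 {
		return u, ErrNotFound
	}
	copy(u[:], raw)
	return u, nil
}

func main() {
	_, err := parseGetID("my-template") // a template name, not a UUID
	fmt.Println(errors.Is(err, ErrNotFound))

	_, err = parseGetID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")
	fmt.Println(err)
}
```

Callers that pass a name where an ID is expected then fall through to their existing not-found handling, such as slug-based resolution.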

* test(hub): fix store-less id wraps and project-route URL paths

- controlchannel_client_test: revert tid() wraps (store-less path-builder test;
  IDs must match the expected literal paths).
- github/envgather: project-scoped route handlers resolve the project by UUID id,
  so build paths with tid(rawID) via fmt.Sprintf instead of the old raw-id
  literal. pkg/hub failures 40 -> 32.

* test(hub): unwrap projectIDFromServiceAccountEmail expectation

The tid() sweep over-wrapped a non-ID expected value in a pure-function test;
restore the literal GCP project id.

* fix(ent): GCPServiceAccount.project_id is a string, not a UUID

The GCP service account project_id holds the GCP *cloud project* identifier
(e.g. 'my-project-123'), a free-form string — not a UUID. The schema declared
it field.UUID, so entadapter CreateGCPServiceAccount/Update did
parseUUID(sa.ProjectID) and rejected real GCP project ids, breaking SA
mint/create with a 400 in production (storetest masked it by passing a UUID).

Change the schema field to field.String, regenerate Ent, and store/read
project_id as a string in external_store.go. Fixes ~7 hub GCP tests; pkg/hub
31 -> 23.

* test(hub): fix GCP SA project-id assertion and project-settings id

Unwrap the over-wrapped 'my-project' expectation now that project_id is a
string, and wrap the dynamic project-settings project ID with tid().

* test(hub): fix bootstrap sync-to-finalize agent paths and storage keys

Build the finalize request path from the agent's tid() UUID and seed mock
storage under WorkspaceStoragePath(projectID, agent.ID) — the handler derives
the workspace key from the agent's real id, not the old raw name. pkg/hub
23 -> 19.

* test(hub): revert tid() over-wraps in store-less events_test

events_test exercises the in-memory ChannelEventPublisher directly; its
ProjectID/IDs are subject-string components, not stored UUIDs. The tid() sweep
wrongly rewrote them so published subjects no longer matched the subscriptions
(timeouts). Restore the literal values. pkg/hub 19 -> 12.

* test(hub): fix maintenance-run path and notifications agentId queries

Use tid() UUIDs in the maintenance run-detail path and the notifications
agentId query params; guard list indexing with require.Len so a mismatch fails
cleanly instead of panicking (panics truncate the package run).

* test(hub): wrap remaining fixture IDs revealed after panic-cascade cleared

Panics ([0] on empty lists) had been truncating the package run, hiding many
failures and starving the tid() sweep. With those guarded, sweep the newly
reached tests: wrap dynamic rune-suffix IDs and the setupProjectWithBroker /
seedCreatedAgentForHarnessTest helper IDs, and convert raw query-param project
IDs to tid(). No UUID-parse errors remain in pkg/hub.

* test(hub): unwrap tid() in scheduler_test (mock store, raw ids)

scheduler_test uses an in-memory mockScheduledEventStore, not the Ent store, so
its ids need no UUIDs; the erroneous tid() wraps broke raw getEvent lookups and
caused a nil-pointer panic that truncated the package run.

* fix(ent): Template.harness may be empty (raw-store parity)

A template imported from a directory that declares no harness type has an empty
harness; the raw-SQL store stored it, but the Ent NotEmpty validator made
BootstrapTemplatesFromDir silently skip such templates. Drop NotEmpty and
regenerate. Removing the [0]-on-empty panics this caused un-truncates the hub
package run (true failure count now visible).

* test(hub): wrap dynamic fixture IDs in wake/workspace/signing-key tests

Wrap tid() around the wake_test, setupWorkspaceProject, and empty-value
signing-key secret IDs now reachable after panic removal. No panics in the hub
package run.

* test(hub): convert raw-id URL path segments to tid()

Build GET/PUT/DELETE paths for agents/projects/brokers/templates/harness-configs
and workspace sync routes from tid(rawID) so the by-id handlers resolve the
entity (raw ids no longer match the UUID PKs). pkg/hub 93 -> 80.

* fix(entadapter)+test(hub): FK error mapping + permissions FK fixtures

mapError now distinguishes foreign-key violations (-> ErrInvalidInput, a bad
reference) from unique-constraint violations (-> ErrAlreadyExists); previously
both surfaced as a misleading 'already exists'/409.

Seed the users/agents that group memberships and policy bindings reference
(the Ent store enforces user/agent FK edges the raw store lacked), wrap
remaining raw fixture/URL ids in tid(), and give the AddAgent fixtures slugs.
All pkg/hub permissions tests pass.

* fix(hub): seed creator users for agent-created agents; cascade-delete subscriptions on hard agent delete

* test(hub): seed broker slug/name in dispatcher and project_cache fixtures (Ent validators)

* test(hub): use tid() in principal/agent URL paths; broker slug in template_bootstrap

* fix(entadapter): cascade-delete agents on project delete (raw-store parity); test(hub): seed FK users, broker_name, deterministic UUIDs

* test(hub): MaxOpenConns=1 for SQLite test store (serialize writes); tid() URLs + FK user seeds in events/stopall

* test(hub): unwrap over-wrapped tid() in unit tests (workspace/logfilter/gcp/web); valid-UUID NotFound cases; tid() scheduled-event URLs

* fix(ent): allow empty display_name (raw-store NOT NULL parity, email fallback); test(hub): seed FK owner users, UUID policy/broker/agent IDs in authz remediation

* feat(migrate): add Migration β tool (Ent-SQLite → Ent-Postgres)

Implements 'scion server migrate --from sqlite://... --to postgres://...'
per postgres-strategy.md §7.3.

- entc.OpenSQLiteReadOnly: opens source with PRAGMA query_only=ON (no WAL
  write), MaxOpenConns=1 so the source is never mutated.
- entc.MigrateData: generic reflection-based, dependency-ordered copy of all
  30 Ent entities (FK-ordered core first), idempotent (skips rows whose PK
  already exists), atomic per entity (txn), chunked CreateBulk, source/dest
  row-count verification after each entity, plus the Group.child_groups M2M
  edge. FK columns are plain fields so edges are preserved via setters.
- cmd/server migrate: DSN parsing (sqlite://, file:, bare path; postgres URL
  or keyword form), --keep-source default / --drop-source cutover, progress
  logging.

Verified end-to-end against live CloudSQL Postgres 16 (integration test +
real CLI run): full copy, idempotent re-run, FK + M2M + value round-trips,
--drop-source removal.

* feat(concurrency): dialect-aware multi-replica primitives for Postgres (P3-3..6)

Add cluster-coordination primitives so N stateless hub processes can share one
Postgres, each degrading to a no-op on single-writer SQLite:

- store.AdvisoryLocker + entadapter TryAdvisoryLock (pg_try_advisory_lock on a
  dedicated conn); Scheduler.RegisterRecurringSingleton gates the heartbeat,
  stalled, purge, schedule-evaluator and github-health sweeps to one
  replica/tick.
- store.ScheduledEventClaimer + ClaimScheduledEvent atomic claim; fireEvent
  claims one-shot events before side effects (dedup across replica startup
  recovery).
- CompositeStore.RunSerializable: SERIALIZABLE + retry on 40001/40P01 (single
  run on SQLite) for future multi-row invariants.
- dbmetrics.StartPoolSampler feeds DB connection-pool gauges to the P0-5
  scaffold; wired into StartBackgroundServices via SetDBMetrics.
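
The retry-on-40001/40P01 behaviour of RunSerializable can be sketched as below. This is an illustration under assumptions, not the CompositeStore code: runWithRetry, isRetryable and sqlStateError are hypothetical names, and the callback is assumed to open its own SERIALIZABLE transaction and roll it back on error.

```go
package main

import (
	"errors"
	"fmt"
)

// sqlStater is satisfied by driver errors that expose a SQLSTATE code.
type sqlStater interface{ SQLState() string }

// isRetryable reports whether err carries SQLSTATE 40001
// (serialization_failure) or 40P01 (deadlock_detected).
func isRetryable(err error) bool {
	var se sqlStater
	if !errors.As(err, &se) {
		return false
	}
	code := se.SQLState()
	return code == "40001" || code == "40P01"
}

// runWithRetry calls fn until it succeeds, returns a non-retryable error,
// or maxAttempts is exhausted. Values below 1 are treated as 1.
func runWithRetry(maxAttempts int, fn func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(); err == nil || !isRetryable(err) {
			return err
		}
	}
	return err
}

// sqlStateError is a test double that carries only a SQLSTATE code.
type sqlStateError struct{ code string }

func (e sqlStateError) Error() string    { return "SQLSTATE " + e.code }
func (e sqlStateError) SQLState() string { return e.code }

func main() {
	calls := 0
	err := runWithRetry(5, func() error {
		calls++
		if calls < 3 {
			return sqlStateError{"40001"} // simulated serialization conflict
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

A production version would normally add a context and a short backoff between attempts; both are left out here to keep the loop visible.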

Verified existing primitives correct (agent StateVersion CAS, FOR UPDATE sweeps,
notification atomic dispatch). Found and documented the schedule SKIP LOCKED
early-commit gap (lock released before the status transition), closed by the
singleton evaluator. Audit + budget docs in scratchpad.

Tests: locking_test.go (advisory no-op, serializable, claim exactly-once incl.
8-way concurrent), pool_sampler_test.go.

* feat(hub): widen events to EventPublisher interface + Postgres LISTEN/NOTIFY publisher

P3-7: Decouple call sites from the concrete *ChannelEventPublisher.
- Add Subscribe(patterns...) (<-chan Event, func()) to the EventPublisher
  interface; implement it on noopEventPublisher (nil channel) — *ChannelEventPublisher
  already had it.
- Factor the Publish* methods into a shared eventBuilder (sink func) so every
  backend emits identical subjects/payloads; ChannelEventPublisher embeds it.
- web.go (field + SetEventPublisher), messagebroker.go and notifications.go
  (field + constructor) now take EventPublisher; handlers_messages.go gates SSE
  on "not the no-op publisher" instead of a concrete type assertion.

P3-8: PostgresEventPublisher over pgx LISTEN/NOTIFY (cross-replica delivery).
- Per-grove channels plus a global channel (flat exact-match); event type in the
  JSON envelope. Grove-scoped subjects publish to both the grove channel and the
  global channel; subscriptions group their patterns by resolved channel so an
  event is matched only against patterns that opted into the arriving channel
  (no double delivery).
- 8 KB NOTIFY limit handled by reference-and-refetch via scion_event_payloads
  (TTL-swept so every replica can refetch).
- PublishTx enrolls the NOTIFY in a caller transaction (atomic write+publish;
  rollback => no deliver). Delivery flows exclusively through the listener.
- Listener goroutine reconnects with backoff and re-LISTENs (resubscribe);
  dynamic LISTEN/UNLISTEN applied on a poll (WaitForNotification timeout does
  not invalidate the pgconn connection).
- Emits pkg/observability/dbmetrics signals (published/delivered/dropped,
  payload size, publish->deliver latency, reconnects, pool stats).
- cmd: newEventPublisher selects the backend by database driver (postgres =>
  PostgresEventPublisher, else ChannelEventPublisher) with safe fallback.
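
The reference-and-refetch handling of the NOTIFY size limit can be sketched as follows. This is a simplified model, not the publisher itself: an in-memory map stands in for the scion_event_payloads table, the 7999-byte threshold assumes Postgres's default limit of under 8000 bytes, and envelope, encodeEvent and decodeEvent are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// maxNotifyBytes is the largest message this sketch sends inline (assumed).
const maxNotifyBytes = 7999

// envelope is what travels over NOTIFY: the payload itself or a reference.
type envelope struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload,omitempty"`
	Ref     string          `json:"ref,omitempty"`
}

// payloadStore stands in for a shared table every replica can read.
type payloadStore map[string][]byte

// encodeEvent returns the message to publish, offloading the payload to the
// store when the inline form would exceed the limit. payload must be JSON.
func encodeEvent(store payloadStore, id, eventType string, payload []byte) ([]byte, error) {
	inline, err := json.Marshal(envelope{Type: eventType, Payload: payload})
	if err != nil {
		return nil, err
	}
	if len(inline) <= maxNotifyBytes {
		return inline, nil
	}
	store[id] = payload
	return json.Marshal(envelope{Type: eventType, Ref: id})
}

// decodeEvent resolves a received message, refetching offloaded payloads.
func decodeEvent(store payloadStore, msg []byte) (string, []byte, error) {
	var env envelope
	if err := json.Unmarshal(msg, &env); err != nil {
		return "", nil, err
	}
	if env.Ref == "" {
		return env.Type, env.Payload, nil
	}
	payload, ok := store[env.Ref]
	if !ok {
		return "", nil, fmt.Errorf("payload %q missing or expired", env.Ref)
	}
	return env.Type, payload, nil
}

func main() {
	store := payloadStore{}
	msg, _ := encodeEvent(store, "evt-1", "agent.updated", []byte(`{"id":"a1"}`))
	fmt.Println(string(msg))
}
```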

Tests: routing/registry/payload-offload/metrics/transactional-executor unit
tests run without a DB; cross-replica delivery, oversized round-trip,
transactional rollback, and reconnect+resubscribe are gated behind
SCION_TEST_POSTGRES_DSN. go build ./... green; full pkg/hub suite green.

Note: server.go's equivalent type-assertion cleanup is left in the working tree
(co-edited with concurrent P0-5/scheduler work) and is functionally optional —
HEAD server.go already compiles against the widened interface.

* test(store): parameterize store suites over {sqlite, postgres} (P3-2)

Add pkg/store/enttest: a backend-selecting Ent client factory for the store
test suites. Default is in-memory SQLite; built with -tags integration and
SCION_TEST_POSTGRES_URL set, it provisions a per-package ephemeral Postgres
database (created/dropped via TestMain) and isolates each test in its own
schema (search_path) so tests never observe each other's rows. Falls back to
SQLite when the env var is unset.

Route all entadapter and storetest helpers through enttest.NewClient so the
same CRUD-parity oracle runs unchanged against either backend.

Fix two real Postgres bugs surfaced by the new path:
- entadapter/dialect.go ancestryContains: emit the bind parameter via
  Builder.Arg ($n on Postgres) instead of a literal '?' through ExprP, which
  was not rebound and produced a syntax error; and use jsonb_array_elements_text
  (the column is jsonb on Postgres, not json).
- schedule_store_test ClaimPath: make the concurrent-claim assertion
  backend-aware. SQLite serializes (MaxOpenConns=1, no SKIP LOCKED) so every
  caller sees both due rows; Postgres uses FOR UPDATE SKIP LOCKED so concurrent
  callers may observe a disjoint subset (0..2); the assertion only requires
  that no caller errors or sees more than 2.

Verified: full SQLite suite green; storetest CRUD parity green on CloudSQL
Postgres; entadapter green on Postgres (schedule ClaimPath fix confirmed).

* fix(hub): start dispatcher/broker for any subscription-capable EventPublisher

Wave C integration: newEventPublisher can now return a PostgresEventPublisher
(LISTEN/NOTIFY) in addition to ChannelEventPublisher. The dispatcher/broker
startup previously hard-asserted *ChannelEventPublisher, which silently skipped
starting them under Postgres. Gate on (not noop and not nil) instead, matching
the existing pattern in handlers_messages.go.

* fix(hub): harden Postgres event publish + verify wiring; lower PG pool default

Task 1 — LISTEN/NOTIFY publish path:
- Add TestPostgresIntegration_HandlerCreateProjectEmitsNotify: drives the real
  POST /api/v1/projects handler with a PostgresEventPublisher and asserts a
  pg_notify lands on scion_ev_global via an independent raw LISTEN — the exact
  capability the multi-replica live test probed. Verified PASSING against live
  CloudSQL, proving the handler -> s.events -> pg_notify wiring is correct end
  to end (the four pre-existing SCION_TEST_POSTGRES_DSN integration tests also
  pass). The multi-hub 'no NOTIFY' symptom was not reproducible against the
  current tree.
- Bound the autocommit publish (Publish* methods) with publishTimeout (5s).
  These run synchronously on the caller's (request handler) goroutine and
  acquire from the event pool; on a connection-starved instance that acquire
  could block indefinitely, stalling CRUD and silently never emitting NOTIFY.
  The timeout converts that into a logged error + dropped event (publishing is
  fire-and-forget). PublishTx (transactional path) is unaffected.

Task 2 — connection budget:
- Lower the default Postgres MaxOpenConns 20 -> 10 so multiple replicas fit a
  modest connection budget (see CONNECTION-BUDGET.md). CloudSQL instance
  scion-postgres-test resized db-f1-micro -> db-g1-small and max_connections
  set to 100 (out of band).

* test(store): add Postgres stress/integration suite (contention, isolation, pool, NOTIFY, migration, schema, multi-process)

Add pkg/store/integrationtest/: a Postgres-only suite that exercises behavior
the SQLite parity suites cannot reach. Gated by //go:build integration and
SCION_TEST_POSTGRES_URL; skips cleanly otherwise.

Coverage:
- Contention: state_version CAS race (no lost updates, >=N-1 retries, final
  version==1+N), SKIP LOCKED / conditional-UPDATE event claim (single winner +
  disjoint drain), unique-key races (project slug, user email, agent slug).
- Isolation: SERIALIZABLE conflict + RunSerializable retry recovery, REPEATABLE
  READ no-phantom snapshot, READ COMMITTED dirty-read prevention.
- Pool: exhaustion + queued recovery, saturated pool honoring context deadline,
  long txn not starving short queries, healing after pg_terminate_backend.
- LISTEN/NOTIFY: ordered burst no-drop, 8000B payload limit, listener
  reconnect/resume, cross-channel isolation.
- Migration: 1000+ row counts + bounded-memory listing, idempotent re-migration.
- Schema: NULL semantics, unicode/emoji, nested JSON + special chars, large-text
  non-truncation, TIMESTAMPTZ microsecond precision.
- Multi-process: forks the test binary for cross-process advisory-lock
  exclusivity and cross-process NOTIFY delivery.

Configurable concurrency via SCION_TEST_CONCURRENCY (default 10).

Extend pkg/store/enttest with Active() and NewSchemaURL() so tests can open
custom-pool clients and share a DSN with forked child processes; non-integration
stubs keep the package API stable.

* fix(db): recycle stale conns + keepalives; skip singleton tick on lock error

Stale-connection pool stalls (CloudSQL drops idle conns after ~10m):
- Add ConnMaxIdleTime to DatabaseConfig/PoolConfig (default 5m pg, 0 sqlite)
  and apply SetConnMaxIdleTime on the database/sql pool.
- OpenPostgres now parses the DSN with pgx and opens via stdlib.OpenDB with
  TCP keepalive GUCs (idle 60s / interval 15s / count 4) and a 10s connect
  timeout, so a silently-dropped peer is detected instead of the first query
  after idle hanging on a dead socket.
- pgx event pool (events_postgres.go): set keepalives + connect timeout on
  both the pool's ConnConfig and the dedicated listener connection, plus
  MaxConnIdleTime 5m / MaxConnLifetime 30m.

Advisory-lock leader election (scheduler.go):
- A lock-acquisition error no longer falls open to running the handler
  unguarded (which would duplicate singleton work across replicas); the tick
  is skipped and retried next interval. Added regression tests.
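
The fail-closed behaviour can be sketched as below, assuming a lock function that returns (acquired, release, error). runSingletonTick, tryLockFunc and errLockUnavailable are hypothetical names for illustration, not the scheduler's API.

```go
package main

import (
	"errors"
	"fmt"
)

// errLockUnavailable simulates a failure to reach the lock backend.
var errLockUnavailable = errors.New("advisory lock unavailable")

// tryLockFunc attempts a non-blocking lock and returns a release func.
type tryLockFunc func() (acquired bool, release func(), err error)

// runSingletonTick runs handler only when the lock was definitely acquired.
// It reports whether the handler ran.
func runSingletonTick(tryLock tryLockFunc, handler func()) (bool, error) {
	acquired, release, err := tryLock()
	if err != nil {
		// Fail closed: the lock state is unknown, so skip this tick
		// instead of running unguarded on every replica.
		return false, err
	}
	if !acquired {
		return false, nil // another replica holds the lock
	}
	if release != nil {
		defer release()
	}
	handler()
	return true, nil
}

func main() {
	ran, err := runSingletonTick(
		func() (bool, func(), error) { return false, nil, errLockUnavailable },
		func() { fmt.Println("sweep ran") },
	)
	fmt.Println(ran, err)
}
```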

Test harness (enttest/integrationtest):
- Accept libpq keyword/value DSNs (not just URL form) when deriving the
  ephemeral db/schema/params; add WithConnParam helper.
- Fix migration idempotency test's per-pass row-count expectation.

* fix(store): bound advisory-lock conn checkout + unlock with short timeout

TryAdvisoryLock checked a connection out of the pool on the full 55s
scheduler-handler context (acquire) and ran the unlock on an unbounded
context.Background() (release). On a pool that could not promptly serve a
healthy connection, db.Conn() blocked for the entire 55s before failing
with 'context deadline exceeded' on every tick; with several singleton
handlers firing each 60s tick, those long-blocked goroutines and their
pending pool connection requests piled up across ticks and kept the pool
jammed (checked out client-side, idle server-side).

The unbounded unlock was a second leak vector: if the held connection
died mid critical-section, ExecContext could hang forever, so conn.Close()
never ran and the connection leaked out of the pool permanently.

Bind both the acquire (db.Conn + pg_try_advisory_lock) and the release
(pg_advisory_unlock) to a 5s timeout so a bad tick fails fast and retries
next tick instead of parking a goroutine for ~55s, and so a dead
connection can never block release from freeing the conn. Lock semantics
are unchanged: cancelling the acquire context tears down only that
context, not the checked-out session that holds the lock.

* feat(migrate): in-process migration α (legacy raw-SQL hub.db → Ent)

Upgrade a legacy raw-SQL Hub database (the ~53-migration, 30-table schema
from the removed pkg/store/sqlite store) to the consolidated Ent-backed
SQLite schema, in-process on first boot, behind an automatic backup.

pkg/ent/entc/migrate_alpha.go:
- IsLegacyRawSQLSchema: detect via the schema_migrations sentinel + the
  legacy-only agents.agent_id column (no-op for an Ent/empty/absent file).
- MigrateAlphaSQLite: backup (checkpoint WAL + copy to hub.db.bak.<ts>),
  AutoMigrate a fresh Ent schema, ATTACH the legacy file, copy every table
  with INSERT…SELECT (foreign_keys OFF), verify per-table row counts, then
  atomically swap the migrated file into place.
- Data-driven column mapping (created_at→created, updated_at→updated,
  agents.agent_id→slug, policies→access_policies); bespoke SQL for the
  group_members/policy_bindings polymorphic splits and surrogate ids;
  groups.parent_id→group_child_groups edge.
- Deterministic UUIDv5 remap for legacy non-UUID primary keys (internal
  signing-key secrets; plugin runtime-broker ids) with consistent rewrite
  of every foreign-key reference via a TEMP _id_remap table.
- Tolerates missing legacy tables (older schema versions).

cmd/server_foreground.go: detect + migrate in initStore's sqlite path,
with a --no-auto-migrate operator opt-out (cmd/server.go).

Validated end-to-end against four production hub.db files (scion-integration,
-integration2, -demo, -gteam): exact row-count parity (up to ~19k rows),
every entity reads back through the live Ent store, idempotent re-runs, and
broker FK references resolve post-remap. Pre-existing dangling agent
created_by/owner_id refs are faithfully preserved (loader runs FK-off).

* fix(config): apply real Postgres pool size (leaked SQLite default of 1 starved the pool)

The struct-level default for Database.MaxOpenConns/MaxIdleConns is 1 — the
value SQLite REQUIRES to serialize writes. applyDatabasePoolDefaults only
bumped postgres to a real pool when the value was <= 0, but a postgres
deployment configured via env/driver override inherits the embedded default
of 1, so the guard never fired and the Ent pool ran with a SINGLE connection.

Effect in production (both integration hubs): every singleton scheduler tick
checks out the lone pool connection to hold its advisory lock, then blocks
waiting for a second connection to do its work — a self-deadlock that resolves
only at the 55s handler context deadline. All API requests serialize behind
the one connection, so GET /api/v1/* served in ~55s across the board.

Note that env overrides could not paper over this: envKeyToConfigKey splits on
every underscore, so SCION_SERVER_DATABASE_MAX_OPEN_CONNS maps to
database.max.open.conns, not database.max_open_conns — silently ignored.
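
The pitfall can be reproduced with a small sketch. naiveEnvKeyToConfigKey is a hypothetical reconstruction that assumes the mapper strips a SCION_SERVER_ prefix and turns every underscore into a dot; the real envKeyToConfigKey is not shown in this log.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveEnvKeyToConfigKey treats every underscore as a nesting separator,
// so multi-word keys such as max_open_conns cannot be expressed.
func naiveEnvKeyToConfigKey(envKey string) string {
	key := strings.TrimPrefix(envKey, "SCION_SERVER_")
	return strings.ToLower(strings.ReplaceAll(key, "_", "."))
}

func main() {
	fmt.Println(naiveEnvKeyToConfigKey("SCION_SERVER_DATABASE_MAX_OPEN_CONNS"))
}
```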

Treat the leaked SQLite default (<= 1) as 'unset' for postgres so the pool
default (10) applies; explicit sizing of 2+ is still respected. SQLite remains
pinned to 1. Adds regression tests for all three cases.
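
The three cases can be sketched as a single function. effectiveMaxOpenConns is a hypothetical name and signature, not applyDatabasePoolDefaults itself; the default of 10 is the value stated above.

```go
package main

import "fmt"

// defaultPostgresMaxOpenConns is the pool default applied when unset.
const defaultPostgresMaxOpenConns = 10

// effectiveMaxOpenConns resolves the pool size for a driver.
func effectiveMaxOpenConns(driver string, configured int) int {
	switch driver {
	case "sqlite":
		return 1 // single writer: always pinned
	case "postgres":
		if configured <= 1 {
			// 0 means unset; 1 is treated as the leaked SQLite default.
			return defaultPostgresMaxOpenConns
		}
	}
	return configured
}

func main() {
	fmt.Println(effectiveMaxOpenConns("postgres", 1))
	fmt.Println(effectiveMaxOpenConns("postgres", 4))
	fmt.Println(effectiveMaxOpenConns("sqlite", 20))
}
```

One consequence of this rule is that an operator cannot deliberately run Postgres with a single connection; given the advisory-lock self-deadlock described above, that appears to be intended.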

* docs: add multi-node broker dispatch and NFS workspace designs

- broker-dispatch.md: DB-as-state-machine + LISTEN/NOTIFY pattern for
  cross-replica broker command routing and agent lifecycle dispatch
- nfs-workspace.md: NFS workspace coordination for VM (host bind-mount)
  and K8s/Cloud Run (per-pod mount) runtime models

* fix(store): address PR GoogleCloudPlatform#304 review — context leaks and DSN parsing

Thread the server's cancellable context into initStore and
initWebServer instead of using context.Background(), so that:

- DB migrations and the health-check ping cancel on Ctrl+C during
  startup (medium-priority review comment).
- The Postgres LISTEN/NOTIFY event publisher goroutine shuts down
  cleanly when the server exits, preventing connection leaks
  (high-priority review comment).

Also fix parseSQLiteSourceDSN to handle the file:// prefix before
the file: prefix, so that file:///var/lib/hub.db correctly resolves
to /var/lib/hub.db instead of ///var/lib/hub.db. Add test cases for
file:// and file:/// DSN forms.
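
The prefix-ordering fix can be illustrated with a short sketch. sqlitePathFromDSN is a hypothetical helper, not parseSQLiteSourceDSN; it ignores query parameters and only shows why the longer file:// prefix has to be tested before file:.

```go
package main

import (
	"fmt"
	"strings"
)

// sqlitePathFromDSN strips a known scheme prefix and returns the file path.
// Longer prefixes come first so "file://" is not mistaken for "file:".
func sqlitePathFromDSN(dsn string) string {
	for _, prefix := range []string{"sqlite://", "file://", "file:"} {
		if strings.HasPrefix(dsn, prefix) {
			return strings.TrimPrefix(dsn, prefix)
		}
	}
	return dsn // bare path
}

func main() {
	fmt.Println(sqlitePathFromDSN("file:///var/lib/hub.db"))
	fmt.Println(sqlitePathFromDSN("file:/var/lib/hub.db"))
	fmt.Println(sqlitePathFromDSN("hub.db"))
}
```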

* docs: add project log for PR GoogleCloudPlatform#304 review fixes

* fix(store): context leak in legacy migration & double file: prefix

1. Thread the server's cancellable context through
   maybeMigrateLegacySQLite → MigrateAlphaSQLite so that Ctrl+C
   during first-boot legacy migration aborts it instead of running
   with an uncancellable context.Background().

2. Guard against a double "file:" prefix when constructing the
   SQLite DSN. If the operator's database.url already starts with
   "file:", we no longer blindly prepend another "file:" prefix.
   Also correctly appends cache=shared with "&" when the DSN
   already contains query parameters.
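
Both guards can be sketched in one small function. buildSQLiteDSN is a hypothetical name; the actual DSN construction in the server is not shown in this log.

```go
package main

import (
	"fmt"
	"strings"
)

// buildSQLiteDSN prepends "file:" only when it is missing and appends
// cache=shared with the separator the DSN needs.
func buildSQLiteDSN(databaseURL string) string {
	dsn := databaseURL
	if !strings.HasPrefix(dsn, "file:") {
		dsn = "file:" + dsn
	}
	sep := "?"
	if strings.Contains(dsn, "?") {
		sep = "&" // the DSN already has query parameters
	}
	return dsn + sep + "cache=shared"
}

func main() {
	fmt.Println(buildSQLiteDSN("hub.db"))
	fmt.Println(buildSQLiteDSN("file:hub.db?mode=rwc"))
}
```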

* fix(store): rename ProjectTypeHubNative → ProjectTypeHubManaged (rebase fixup)

Upstream renamed hub-native to hub-managed while the PR was in
flight. Update the two remaining references that the rebase
conflict resolution missed.

---------

Co-authored-by: Scion <agent@scion.dev>
…t token

TestClient_StartTokenRefresh exercised RefreshToken -> WriteTokenFile
without isolating the token home, so running the suite inside a live
agent container overwrote the real ~/.scion/scion-token with the test
stub "refreshed-token". Every subsequent Hub call then 401'd with
"compact JWS format must have three parts" / "unrecognized token format".

- Add SetTokenHome(t.TempDir()) to the test, matching its siblings.
- Guard WriteTokenFile: panic under `go test` unless SetTokenHome was
  called, so a forgotten isolation can never corrupt live state again.
  Reads remain unguarded (harmless; return empty when absent).
…ecycle + message routing (GoogleCloudPlatform#305)

* Add canonical engineering glossary (GLOSSARY.md) (#102)

* Add engineering glossary (GLOSSARY.md) with canonical terms and cleanup tracker

Add a root-level GLOSSARY.md capturing canonical Scion terminology in the
ubiquitous-language format (preferred term + synonyms to avoid), grouped by
domain cluster, plus an Exceptions & Future Cleanup section tracking known
naming-convergence work. Link it from agents.md as the canonical engineering
glossary.

* Revise glossary: broker reframe, Event Bus, Hub-managed, and term refinements

Refine entries from review: redefine Message Broker as the pluggable
messaging-integration system (add Broker plugin, Built-in broker); add Event
Bus for the NATS real-time/event capability; collapse hub-native/Hub Workspace
into Hub-managed project/workspace; tighten Template (harness-agnostic, optional
default harness-config), Skill (template-only, Agent Skills link), Profile
(named runtime-broker settings bundle), Harness/Harness-config; reframe Hub as
the control plane in both modes; add Group and Message Group. Expand Exceptions
& Future Cleanup to nine tracked items.

* Glossary: restructure headings, add cross-refs, modes table, and new terms

- Retitle to "Scion Glossary"; drop the "Language" wrapper and promote
  the thematic categories to top-level sections
- Add an Operations section (Attach, Dispatch) and move Profile next to
  Runtime Broker
- Add a Local/Workstation/Hosted comparison table and "See also"
  cross-refs across the main confusable term clusters
- Reframe the intro around the three-way broker collision (incl. Event
  Bus) and defer to the disambiguation rule; sentence-case "Shared
  directory"
- Add canonical entries for Secret, Notification, and Schedule
- Add a "Potential Future Additions" section cataloguing candidate terms

* Glossary: remove Exceptions & Future Cleanup tracker

The cleanup items are now tracked by dedicated agents that open GitHub
issues and implementation PRs, so the staged tracker no longer lives in
the glossary. Reword the two intro/disambiguation references that pointed
at the removed section to point at GitHub issues instead.

---------

Co-authored-by: Preston Holmes <ptone@google.com>

* P0-1: switch Postgres driver from lib/pq to pgx/v5 stdlib

- Add github.com/jackc/pgx/v5/stdlib (registers as "pgx")
- driver_postgres.go: blank import pgx stdlib instead of lib/pq
- OpenPostgres: open via sql.Open("pgx", dsn) + entsql.OpenDB
- Introduce PoolConfig (applied to *sql.DB); thread through
  OpenSQLite/OpenPostgres and update all callers
- go mod tidy drops lib/pq

* P0-2: add connection pool config to DatabaseConfig

- DatabaseConfig gains MaxOpenConns / MaxIdleConns / ConnMaxLifetime
  plus ConnMaxLifetimeDuration() helper
- DefaultGlobalConfig sets sqlite pool defaults (MaxOpenConns=1,
  load-bearing for write serialization)
- applyDatabasePoolDefaults fills postgres defaults (20/5/30m) and
  forces sqlite MaxOpenConns=1; called in both load paths
- Mirror fields in V1DatabaseConfig + both conversion directions
- Wire pool settings into entc.OpenSQLite in initStore

* P0-3/P0-4: CRUD-parity test harness + spec-driven fixture generator

P0-3: pkg/store/storetest/ — backend-agnostic, table-driven CRUD oracle.
A Factory(t) -> store.Store is injected; generic Domain[T] descriptors drive
Create/Read/Update/Delete (+optional soft-delete)/List-paginate/List-filter.
Ships group + policy domains and runs green against today's CompositeStore
(SQLite base + Ent DB). Ready to accept a postgresFactory for P3-2.

P0-4: internal/fixturegen/ — Go-defined spec seeding >=1 row per table across
all 30 domain tables, with edge cases (NULL optionals, max-length strings,
nested/unicode JSON, soft-deleted agent, BLOB). Deterministic. 'go run
./internal/fixturegen' emits testdata/hub-v46-fixture.db, prints a 30-table
coverage report, and caches the blob to the scratchpad mount. CI gate fails if
any table has zero rows.

* feat(ent): add 23 new Ent schemas for full table parity (P1-2 + P1-3)

* P2: port notification + gcp/github/token domains to Ent entadapter

Add Ent-backed implementations of the notification, GCP service account,
GitHub App installation, and user access token store sub-interfaces:

- notification_store.go: NotificationStore (subscriptions, notifications,
  templates). Dispatch uses an atomic conditional update as the multi-replica
  claim primitive, and an optional NotificationPublisher designs in the
  LISTEN/NOTIFY fan-out for created/dispatched events.
- external_store.go: GCPServiceAccountStore + GitHubInstallationStore +
  UserAccessTokenStore. GitHub create is idempotent (INSERT OR IGNORE
  semantics), repositories/scopes are JSON, default_scopes is CSV, and tokens
  support key-hash lookup. Legacy api_keys is intentionally not surfaced.
- storetest: add GCPServiceAccount, SubscriptionTemplate, and
  NotificationSubscription CRUD-parity domains.

Does not modify composite.go.

* P2: port schedule, maintenance, message domains to Ent entadapter

- schedule_store.go: ScheduleStore + ScheduledEventStore sub-interfaces with
  dialect-aware SELECT FOR UPDATE SKIP LOCKED claim helper for the
  ListDueSchedules / ListPendingScheduledEvents job-claim paths (plain SELECT
  on SQLite, SKIP LOCKED on Postgres).
- maintenance_store.go: run-state RMW, AbortRunningMaintenanceOps, Go-side
  seed (uuid.New) replacing SQLite randomblob() UUID seeds.
- message_store.go: CRUD, read flags, PurgeOldMessages, design-in
  PublishUserMessage hook for Postgres LISTEN/NOTIFY.
- pkg/ent/client_driver.go: hand-written Client.Driver() accessor for
  dialect detection + raw locking queries.

* feat(entadapter): port user + allowlist/invite domains to Ent (P2)

Implements the Ent-backed store adapters for the user and
allowlist/invite domains, plus their CRUD-parity oracle descriptors.

pkg/store/entadapter/user_store.go (store.UserStore):
- CreateUser/GetUser/GetUserByEmail/UpdateUser/UpdateUserLastSeen/
  DeleteUser/ListUsers.
- Case-insensitive email: emails are normalized to lower case on write
  (so the plain unique index enforces case-insensitive uniqueness,
  equivalent to the legacy UNIQUE COLLATE NOCASE) and matched with
  EmailEqualFold (lower(email)=lower($1)) on read. ent codegen +
  AutoMigrate cannot emit a real lower(email) functional index across
  both SQLite (tests) and Postgres, so the invariant is enforced at the
  port layer.
- Offset-based pagination matching the legacy SQLite store.

pkg/store/entadapter/allowlist_store.go (store.AllowListStore +
store.InviteCodeStore):
- Full allow-list + invite-code CRUD.
- BulkAddAllowListEntries uses CreateBulk +
  OnConflictColumns(email).Ignore() for race-safe INSERT-OR-IGNORE;
  added/skipped counts mirror the legacy per-row semantics (existing +
  within-batch dups skipped).
- IncrementInviteUseCount is a single atomic conditional UPDATE
  (revoked=false AND not expired AND (max_uses=0 OR use_count<max_uses)),
  which is race-free on both backends without SELECT...FOR UPDATE. The
  sql/lock feature is enabled and ForUpdate is available for genuine
  multi-statement RMW paths.
- ListAllowListEntriesWithInvites batch-joins invite codes (invite_id is
  a plain column, not an Ent edge).

Schema:
- pkg/ent/schema/user.go: add nillable last_seen field (+ index) needed
  by UpdateUserLastSeen / lastSeen sort; document the case-insensitive
  email strategy.
- pkg/ent/generate.go: enable --feature sql/upsert,sql/lock (required for
  OnConflict and ForUpdate).

Tests (all passing):
- pkg/store/storetest/domains_user.go: UserDomain, AllowListDomain,
  InviteCodeDomain oracle descriptors (kept in a separate file to avoid
  contending on domains.go).
- entadapter oracle test runs the shared CRUD-parity suite directly
  against the new adapters; behavior tests cover case-insensitivity,
  bulk idempotency, conditional increment, stats, and the invite join.

NOTE: Generated Ent code under pkg/ent/** is intentionally NOT included.
This is a shared worktree where sibling port agents concurrently modify
schemas and the same feature flags; the generated code must be
regenerated at wave integration via:
    go generate ./pkg/ent/...
Verified locally that regeneration + full build + tests pass.

Per P2 scope: composite.go wiring and ensureEntUser shadow removal are
deferred to P2-collapse.

* P2: port secret/env_var + template/harness_config domains to Ent

Add Ent-backed store implementations for the secret/env and
template/harness domains, mirroring the legacy SQLite semantics:

- entadapter/secret_store.go: SecretStore implementing store.SecretStore
  + store.EnvVarStore. Polymorphic (scope, scope_id) addressing, COALESCE
  target->key projection, version bump on update, get-then-update upsert,
  and transitive ListProgenySecrets via a created_by IN-list over the
  ancestor set (user scope + allow_progeny only; encrypted value withheld).
- entadapter/template_store.go: TemplateStore implementing
  store.TemplateStore + store.HarnessConfigStore. base_template hierarchy,
  scope/project_id backwards-compat lookups, content_hash, JSON config/files
  columns, DeleteByScope. Subscription templates are owned by NotificationStore.
- Direct Ent unit tests incl. a progeny-inheritance parity test.
- storetest: Template/HarnessConfig/Secret/EnvVar domain descriptors wired
  into RunStoreSuite for cross-backend CRUD parity.

* P2: port project/broker + brokersecret domains to Ent

Port the project/broker domain (projects, runtime_brokers, project_contributors,
project_sync_state) and the broker-auth domain (broker_secrets,
broker_join_tokens) from raw SQL to Ent adapters.

- pkg/store/entadapter/project_store.go: implements ProjectStore,
  RuntimeBrokerStore, ProjectProviderStore and ProjectSyncStateStore.
  * provider + sync-state upserts use Ent OnConflict().UpdateNewValues()
    (sql/upsert) keyed on the (project_id, broker_id) unique index.
  * runtime broker heartbeat/update use an optimistic version-CAS loop on a
    new internal lock_version token, serializing concurrent writers portably
    across SQLite (tests) and Postgres without SELECT ... FOR UPDATE.
  * slug lookups support case-insensitive matching (EqualFold).
  * project computed fields (AgentCount, ActiveBrokerCount, ProjectType) are
    derived via Ent queries, matching the legacy SQLite store.
- pkg/store/entadapter/brokersecret_store.go: implements BrokerSecretStore
  (per-broker HMAC secrets + short-lived join tokens, expiry cleanup).
- Project Ent schema: add operational fields for full parity
  (default_runtime_broker_id, shared_dirs, github_*, git_identity).
- RuntimeBroker Ent schema: relax vestigial type column to Optional, add
  internal lock_version concurrency token.
- Regenerate Ent with sql/upsert,sql/lock features.
- storetest: add Project, RuntimeBroker, BrokerSecret and BrokerJoinToken
  CRUD-parity domains.
- Unit tests for both adapters.

Per the integration plan, composite.go wiring and ensureEntProject shadow
removal are deferred to P2-collapse.

* P2: port agent domain to Ent entadapter (XL)

* chore(ent): regenerate Ent code for all 30 entity schemas

Regenerated with --feature sql/upsert,sql/lock to support
OnConflict upserts and ForUpdate/SKIP LOCKED job claims.

* P2-collapse: collapse dual-DB into single Ent store

Wire all Ent-backed sub-stores into CompositeStore via embedding, removing
the raw-SQL base store and the User/Agent/Project shadow-sync machinery
(ensureEntUser/ensureEntAgent/ensureEntProject). CompositeStore now serves
every domain from a single Ent client and implements Close/Ping/Migrate
directly.

Collapse initStore() to open one Ent SQLite DB (no _ent shadow DSN, no
MigrateGroveToProjectData, no raw sqlite.New). Register the User, AllowList,
and InviteCode domains in the storetest CRUD-parity suite. Update entadapter
tests for the single-DB NewCompositeStore(client) signature.

go build ./... green; go test ./pkg/store/entadapter/... ./pkg/store/storetest/... green.

* P2-delete: remove raw-SQL store implementation

Delete the ~6k-LOC raw-SQL store (sqlite.go) and its per-domain sibling
files (brokersecret, gcp_service_account, github_installation, maintenance,
messages, notification, project_sync_state, schedule, scheduled_event) plus
their tests, including the inline schema-migration scaffold. Keep driver.go,
which registers the pure-Go SQLite driver used by Ent's SQLite backend.

Repoint the two non-test consumers to the Ent-backed store:
  - cmd/hub_secret_migrate.go now opens an Ent client + CompositeStore.
  - internal/fixturegen opens via entc and seeds the Ent schema's *sql.DB.

go build ./... green; no remaining production references to the raw store.

* test: compile-migrate downstream suites to Ent store + fix signing-key PK

Replace the removed raw-SQL store in downstream tests with an Ent-backed
newTestStore helper (pkg/hub, pkg/secret) and update cmd/server_test.go and
internal/fixturegen tests. Port the 8 raw-SQL DB() access sites in hub tests
via a new CompositeStore.DB() escape-hatch accessor.

Fix a production bug surfaced by the collapse: hub/server.go signingKeySecretID
generated a non-UUID secret primary key, which the Ent secret store rejects;
it now derives a deterministic UUIDv5. go build ./... green; entadapter and
storetest suites green.

NOTE: hub/secret/fixturegen suites now COMPILE but many tests still fail
because their fixtures seed non-UUID string IDs that the UUID-PK Ent schema
rejects; addressed in follow-up commits (tid() helper).

* test(hub): map non-UUID fixture IDs to UUIDs via tid() helper

Wrap human-readable test identifiers in tid() (deterministic UUIDv5) so the
UUID-PK Ent store accepts them while preserving cross-reference consistency and
ID-equality assertions. Reduces pkg/hub failures from 611 to 79; remaining
failures are behavioral, not ID-format, and are addressed separately.
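
A minimal sketch of what a tid()-style helper could look like, using only the
standard library. The namespace value below is an assumption (the RFC 4122 DNS
namespace); the commit does not say which namespace the real helper uses.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// testNamespace is an assumed fixed namespace (RFC 4122 DNS namespace).
// The namespace used by the real helper is not stated in the commit.
var testNamespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// tid maps a human-readable fixture name to a deterministic UUIDv5 string:
// SHA-1 over namespace+name, truncated to 16 bytes, with version and
// variant bits set.
func tid(name string) string {
	h := sha1.New()
	h.Write(testNamespace[:])
	h.Write([]byte(name))
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

func main() {
	// The same name always maps to the same UUID, so cross-references and
	// ID-equality assertions in fixtures keep working.
	fmt.Println(tid("agent-1") == tid("agent-1"))
	fmt.Println(tid("agent-1") == tid("agent-2"))
}
```

Because the mapping is a pure function of the name, two fixtures that refer to
the same readable ID still refer to the same stored row.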

* fix(store): seed maintenance ops in Migrate; initStore uses Migrate

Restore raw-SQL parity: CompositeStore.Migrate now runs AutoMigrate and seeds
built-in maintenance operations (the raw store seeded these in its migrations).
initStore and hub test helpers call s.Migrate() so production and tests seed
consistently. Fixes the maintenance-operation hub tests (404 'Operation not
found'). pkg/hub failures 79 -> 71.

* test(hub): satisfy Ent NotEmpty validators in fixtures

Add slugs/broker names to test fixtures that previously relied on the raw
store's lenient (no-validator) inserts: project/agent slugs in the logs test
helper, broker slugs in embedded/profile/authz fixtures, and BrokerName on
envgather ProjectProvider literals. pkg/hub failures 71 -> 57.

* fix(entadapter): Get-by-id returns ErrNotFound for non-UUID identifiers

Restore raw-SQL store parity: a malformed identifier cannot match any UUID
primary key, so get-by-id lookups now report store.ErrNotFound instead of
store.ErrInvalidInput. This matches the raw store (a lookup with a bad id simply
returned no row) and is what callers depend on — e.g. resolveTemplate passes a
template *name* to GetTemplate and relies on ErrNotFound to fall back to
slug-based resolution. New parseGetID helper applied across all 17 get-by-id
methods. pkg/hub failures 56 -> 40; entadapter/storetest stay green.
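
The parity rule above can be sketched as follows. This is an illustration
under assumptions, not the actual helper: ErrNotFound stands in for
store.ErrNotFound, and the UUID check is a hand-rolled stdlib-only validator.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// ErrNotFound is a stand-in for store.ErrNotFound.
var ErrNotFound = errors.New("not found")

// parseGetID parses id as a canonical 36-character UUID. A malformed id
// cannot match any UUID primary key, so it is reported as ErrNotFound
// rather than as invalid input.
func parseGetID(id string) ([16]byte, error) {
	var u [16]byte
	if len(id) != 36 || id[8] != '-' || id[13] != '-' || id[18] != '-' || id[23] != '-' {
		return u, ErrNotFound
	}
	raw, err := hex.DecodeString(strings.ReplaceAll(id, "-", ""))
	if err != nil || len(raw) != 16 {
		return u, ErrNotFound
	}
	copy(u[:], raw)
	return u, nil
}

func main() {
	// A template *name* passed to a get-by-id lookup falls through to
	// ErrNotFound, which lets the caller retry with slug-based resolution.
	_, err := parseGetID("my-template-name")
	fmt.Println(errors.Is(err, ErrNotFound))

	_, err = parseGetID("886313e1-3b8a-5372-9b90-0c9aee199e5d")
	fmt.Println(err == nil)
}
```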

* test(hub): fix store-less id wraps and project-route URL paths

- controlchannel_client_test: revert tid() wraps (store-less path-builder test;
  IDs must match the expected literal paths).
- github/envgather: project-scoped route handlers resolve the project by UUID id,
  so build paths with tid(rawID) via fmt.Sprintf instead of the old raw-id
  literal. pkg/hub failures 40 -> 32.

* test(hub): unwrap projectIDFromServiceAccountEmail expectation

The tid() sweep over-wrapped a non-ID expected value in a pure-function test;
restore the literal GCP project id.

* fix(ent): GCPServiceAccount.project_id is a string, not a UUID

The GCP service account project_id holds the GCP *cloud project* identifier
(e.g. 'my-project-123'), a free-form string — not a UUID. The schema declared
it field.UUID, so entadapter CreateGCPServiceAccount/Update did
parseUUID(sa.ProjectID) and rejected real GCP project ids, breaking SA
mint/create with a 400 in production (storetest masked it by passing a UUID).

Change the schema field to field.String, regenerate Ent, and store/read
project_id as a string in external_store.go. Fixes ~7 hub GCP tests; pkg/hub
31 -> 23.

* test(hub): fix GCP SA project-id assertion and project-settings id

Unwrap the over-wrapped 'my-project' expectation now that project_id is a
string, and wrap the dynamic project-settings project ID with tid().

* test(hub): revert tid() over-wraps in store-less events_test

events_test exercises the in-memory ChannelEventPublisher directly; its
ProjectID/IDs are subject-string components, not stored UUIDs. The tid() sweep
wrongly rewrote them so published subjects no longer matched the subscriptions
(timeouts). Restore the literal values. pkg/hub 19 -> 12.

* test(hub): fix maintenance-run path and notifications agentId queries

Use tid() UUIDs in the maintenance run-detail path and the notifications
agentId query params; guard list indexing with require.Len so a mismatch fails
cleanly instead of panicking (panics truncate the package run).

* test(hub): wrap remaining fixture IDs revealed after panic-cascade cleared

Panics ([0] on empty lists) had been truncating the package run, hiding many
failures and starving the tid() sweep. With those guarded, sweep the newly
reached tests: wrap dynamic rune-suffix IDs and the setupProjectWithBroker /
seedCreatedAgentForHarnessTest helper IDs, and convert raw query-param project
IDs to tid(). No UUID-parse errors remain in pkg/hub.

* test(hub): unwrap tid() in scheduler_test (mock store, raw ids)

scheduler_test uses an in-memory mockScheduledEventStore, not the Ent store, so
its ids need no UUIDs; the erroneous tid() wraps broke raw getEvent lookups and
caused a nil-pointer panic that truncated the package run.

* fix(ent): Template.harness may be empty (raw-store parity)

A template imported from a directory that declares no harness type has an empty
harness; the raw-SQL store stored it, but the Ent NotEmpty validator made
BootstrapTemplatesFromDir silently skip such templates. Drop NotEmpty and
regenerate. Removing the [0]-on-empty panics this caused un-truncates the hub
package run (true failure count now visible).

* test(hub): wrap dynamic fixture IDs in wake/workspace/signing-key tests

Wrap tid() around the wake_test, setupWorkspaceProject, and empty-value
signing-key secret IDs now reachable after panic removal. No panics in the hub
package run.

* test(hub): convert raw-id URL path segments to tid()

Build GET/PUT/DELETE paths for agents/projects/brokers/templates/harness-configs
and workspace sync routes from tid(rawID) so the by-id handlers resolve the
entity (raw ids no longer match the UUID PKs). pkg/hub 93 -> 80.

* fix(hub): seed creator users for agent-created agents; cascade-delete subscriptions on hard agent delete

* test(hub): seed broker slug/name in dispatcher and project_cache fixtures (Ent validators)

* fix(entadapter): cascade-delete agents on project delete (raw-store parity); test(hub): seed FK users, broker_name, deterministic UUIDs

* test(hub): MaxOpenConns=1 for SQLite test store (serialize writes); tid() URLs + FK user seeds in events/stopall

* test(hub): unwrap over-wrapped tid() in unit tests (workspace/logfilter/gcp/web); valid-UUID NotFound cases; tid() scheduled-event URLs

* fix(ent): allow empty display_name (raw-store NOT NULL parity, email fallback); test(hub): seed FK owner users, UUID policy/broker/agent IDs in authz remediation

* feat(migrate): add Migration β tool (Ent-SQLite → Ent-Postgres)

Implements 'scion server migrate --from sqlite://... --to postgres://...'
per postgres-strategy.md §7.3.

- entc.OpenSQLiteReadOnly: opens source with PRAGMA query_only=ON (no WAL
  write), MaxOpenConns=1 so the source is never mutated.
- entc.MigrateData: generic reflection-based, dependency-ordered copy of all
  30 Ent entities (FK-ordered core first), idempotent (skips rows whose PK
  already exists), atomic per entity (txn), chunked CreateBulk, source/dest
  row-count verification after each entity, plus the Group.child_groups M2M
  edge. FK columns are plain fields so edges are preserved via setters.
- cmd/server migrate: DSN parsing (sqlite://, file:, bare path; postgres URL
  or keyword form), --keep-source default / --drop-source cutover, progress
  logging.

Verified end-to-end against live CloudSQL Postgres 16 (integration test +
real CLI run): full copy, idempotent re-run, FK + M2M + value round-trips,
--drop-source removal.

* feat(concurrency): dialect-aware multi-replica primitives for Postgres (P3-3..6)

Add cluster-coordination primitives so N stateless hub processes can share one
Postgres, each degrading to a no-op on single-writer SQLite:

- store.AdvisoryLocker + entadapter TryAdvisoryLock (pg_try_advisory_lock on a
  dedicated conn); Scheduler.RegisterRecurringSingleton gates the heartbeat,
  stalled, purge, schedule-evaluator and github-health sweeps to one
  replica/tick.
- store.ScheduledEventClaimer + ClaimScheduledEvent atomic claim; fireEvent
  claims one-shot events before side effects (dedup across replica startup
  recovery).
- CompositeStore.RunSerializable: SERIALIZABLE + retry on 40001/40P01 (single
  run on SQLite) for future multi-row invariants.
- dbmetrics.StartPoolSampler feeds DB connection-pool gauges to the P0-5
  scaffold; wired into StartBackgroundServices via SetDBMetrics.
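
The retry behaviour described for RunSerializable might look roughly like the
sketch below. The sqlStateError type, the attempt limit and the function
signature are hypothetical; only the two SQLSTATE codes come from the commit
(40001 is serialization_failure, 40P01 is deadlock_detected).

```go
package main

import (
	"errors"
	"fmt"
)

// sqlStateError is a hypothetical error type carrying a Postgres SQLSTATE.
type sqlStateError struct{ Code string }

func (e *sqlStateError) Error() string { return "sqlstate " + e.Code }

// retryable reports whether err is a serialization failure or a deadlock.
func retryable(err error) bool {
	var se *sqlStateError
	if errors.As(err, &se) {
		return se.Code == "40001" || se.Code == "40P01"
	}
	return false
}

// runSerializable calls fn up to maxAttempts times. It retries only on
// retryable SQLSTATEs; any other error is returned immediately.
func runSerializable(maxAttempts int, fn func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil || !retryable(err) {
			return err
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := runSerializable(3, func() error {
		calls++
		if calls < 3 {
			return &sqlStateError{Code: "40001"} // simulated conflict
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

In a real store, fn would run inside a SERIALIZABLE transaction that is rolled
back before each retry; on SQLite the body would simply run once.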

Verified existing primitives correct (agent StateVersion CAS, FOR UPDATE sweeps,
notification atomic dispatch). Found and documented the schedule SKIP LOCKED
early-commit gap (lock released before the status transition), closed by the
singleton evaluator. Audit + budget docs in scratchpad.

Tests: locking_test.go (advisory no-op, serializable, claim exactly-once incl.
8-way concurrent), pool_sampler_test.go.

* feat(hub): widen events to EventPublisher interface + Postgres LISTEN/NOTIFY publisher

P3-7: Decouple call sites from the concrete *ChannelEventPublisher.
- Add Subscribe(patterns...) (<-chan Event, func()) to the EventPublisher
  interface; implement it on noopEventPublisher (nil channel) — *ChannelEventPublisher
  already had it.
- Factor the Publish* methods into a shared eventBuilder (sink func) so every
  backend emits identical subjects/payloads; ChannelEventPublisher embeds it.
- web.go (field + SetEventPublisher), messagebroker.go and notifications.go
  (field + constructor) now take EventPublisher; handlers_messages.go gates SSE
  on "not the no-op publisher" instead of a concrete type assertion.

P3-8: PostgresEventPublisher over pgx LISTEN/NOTIFY (cross-replica delivery).
- Per-grove channels plus a global channel (flat exact-match); event type in the
  JSON envelope. Grove-scoped subjects publish to both the grove channel and the
  global channel; subscriptions group their patterns by resolved channel so an
  event is matched only against patterns that opted into the arriving channel
  (no double delivery).
- 8 KB NOTIFY limit handled by reference-and-refetch via scion_event_payloads
  (TTL-swept so every replica can refetch).
- PublishTx enrolls the NOTIFY in a caller transaction (atomic write+publish;
  rollback => no delivery). Delivery flows exclusively through the listener.
- Listener goroutine reconnects with backoff and re-LISTENs (resubscribe);
  dynamic LISTEN/UNLISTEN applied on a poll (WaitForNotification timeout does
  not invalidate the pgconn connection).
- Emits pkg/observability/dbmetrics signals (published/delivered/dropped,
  payload size, publish->deliver latency, reconnects, pool stats).
- cmd: newEventPublisher selects the backend by database driver (postgres =>
  PostgresEventPublisher, else ChannelEventPublisher) with safe fallback.
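
The reference-and-refetch handling of oversized payloads can be sketched as
below. Everything here is illustrative: the envelope field names are
assumptions, and an in-memory map stands in for the scion_event_payloads
table. The limit is set to 8000 bytes, the point at which Postgres rejects a
NOTIFY payload in its default configuration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// notifyLimit: NOTIFY payloads must be shorter than this many bytes.
const notifyLimit = 8000

// envelope is a hypothetical wire format: either inline Data or a Ref.
type envelope struct {
	Type string          `json:"type"`
	Data json.RawMessage `json:"data,omitempty"`
	Ref  string          `json:"ref,omitempty"`
}

// payloadTable stands in for the scion_event_payloads table.
type payloadTable map[string][]byte

// encode inlines small payloads and stores oversized ones by reference.
func encode(tbl payloadTable, id, typ string, data []byte) ([]byte, error) {
	msg, err := json.Marshal(envelope{Type: typ, Data: data})
	if err != nil {
		return nil, err
	}
	if len(msg) < notifyLimit {
		return msg, nil
	}
	tbl[id] = data
	return json.Marshal(envelope{Type: typ, Ref: id})
}

// decode returns the event type and payload, refetching it when the
// notification only carried a reference.
func decode(tbl payloadTable, msg []byte) (string, []byte, error) {
	var env envelope
	if err := json.Unmarshal(msg, &env); err != nil {
		return "", nil, err
	}
	if env.Ref == "" {
		return env.Type, env.Data, nil
	}
	data, ok := tbl[env.Ref]
	if !ok {
		return "", nil, fmt.Errorf("payload %q expired or missing", env.Ref)
	}
	return env.Type, data, nil
}

func main() {
	tbl := payloadTable{}
	small, _ := encode(tbl, "e1", "agent.status", []byte(`{"phase":"running"}`))
	big, _ := encode(tbl, "e2", "agent.log", []byte(`"`+strings.Repeat("x", 9000)+`"`))
	fmt.Println(len(small) < notifyLimit, len(big) < notifyLimit, len(tbl))

	_, data, err := decode(tbl, big)
	fmt.Println(len(data), err)
}
```

The missing-row branch in decode is why the stored payloads need a TTL long
enough for every replica to refetch before the sweep removes them.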

Tests: routing/registry/payload-offload/metrics/transactional-executor unit
tests run without a DB; cross-replica delivery, oversized round-trip,
transactional rollback, and reconnect+resubscribe are gated behind
SCION_TEST_POSTGRES_DSN. go build ./... green; full pkg/hub suite green.

Note: server.go's equivalent type-assertion cleanup is left in the working tree
(co-edited with concurrent P0-5/scheduler work) and is functionally optional —
HEAD server.go already compiles against the widened interface.

* test(store): parameterize store suites over {sqlite, postgres} (P3-2)

Add pkg/store/enttest: a backend-selecting Ent client factory for the store
test suites. Default is in-memory SQLite; built with -tags integration and
SCION_TEST_POSTGRES_URL set, it provisions a per-package ephemeral Postgres
database (created/dropped via TestMain) and isolates each test in its own
schema (search_path) so tests never observe each other's rows. Falls back to
SQLite when the env var is unset.

Route all entadapter and storetest helpers through enttest.NewClient so the
same CRUD-parity oracle runs unchanged against either backend.

Fix two real Postgres bugs surfaced by the new path:
- entadapter/dialect.go ancestryContains: emit the bind parameter via
  Builder.Arg ($n on Postgres) instead of a literal '?' through ExprP, which
  was not rebound and produced a syntax error; and use jsonb_array_elements_text
  (the column is jsonb on Postgres, not json).
- schedule_store_test ClaimPath: make the concurrent-claim assertion
  backend-aware. SQLite serializes (MaxOpenConns=1, no SKIP LOCKED) so every
  caller sees both due rows; Postgres uses FOR UPDATE SKIP LOCKED so concurrent
  callers may observe a disjoint subset (0..2); the assertion only requires
  that no caller errors or sees more than 2 rows.

Verified: full SQLite suite green; storetest CRUD parity green on CloudSQL
Postgres; entadapter green on Postgres (schedule ClaimPath fix confirmed).

* fix(hub): harden Postgres event publish + verify wiring; lower PG pool default

Task 1 — LISTEN/NOTIFY publish path:
- Add TestPostgresIntegration_HandlerCreateProjectEmitsNotify: drives the real
  POST /api/v1/projects handler with a PostgresEventPublisher and asserts a
  pg_notify lands on scion_ev_global via an independent raw LISTEN — the exact
  capability the multi-replica live test probed. Verified PASSING against live
  CloudSQL, proving the handler -> s.events -> pg_notify wiring is correct end
  to end (the four pre-existing SCION_TEST_POSTGRES_DSN integration tests also
  pass). The multi-hub 'no NOTIFY' symptom was not reproducible against the
  current tree.
- Bound the autocommit publish (Publish* methods) with publishTimeout (5s).
  These run synchronously on the caller's (request handler) goroutine and
  acquire from the event pool; on a connection-starved instance that acquire
  could block indefinitely, stalling CRUD and silently never emitting NOTIFY.
  The timeout converts that into a logged error + dropped event (publishing is
  fire-and-forget). PublishTx (transactional path) is unaffected.

Task 2 — connection budget:
- Lower the default Postgres MaxOpenConns 20 -> 10 so multiple replicas fit a
  modest connection budget (see CONNECTION-BUDGET.md). CloudSQL instance
  scion-postgres-test resized db-f1-micro -> db-g1-small and max_connections
  set to 100 (out of band).

* test(store): add Postgres stress/integration suite (contention, isolation, pool, NOTIFY, migration, schema, multi-process)

Add pkg/store/integrationtest/: a Postgres-only suite that exercises behavior
the SQLite parity suites cannot reach. Gated by //go:build integration and
SCION_TEST_POSTGRES_URL; skips cleanly otherwise.

Coverage:
- Contention: state_version CAS race (no lost updates, >=N-1 retries, final
  version==1+N), SKIP LOCKED / conditional-UPDATE event claim (single winner +
  disjoint drain), unique-key races (project slug, user email, agent slug).
- Isolation: SERIALIZABLE conflict + RunSerializable retry recovery, REPEATABLE
  READ no-phantom snapshot, READ COMMITTED dirty-read prevention.
- Pool: exhaustion + queued recovery, saturated pool honoring context deadline,
  long txn not starving short queries, healing after pg_terminate_backend.
- LISTEN/NOTIFY: ordered burst no-drop, 8000B payload limit, listener
  reconnect/resume, cross-channel isolation.
- Migration: 1000+ row counts + bounded-memory listing, idempotent re-migration.
- Schema: NULL semantics, unicode/emoji, nested JSON + special chars, large-text
  non-truncation, TIMESTAMPTZ microsecond precision.
- Multi-process: forks the test binary for cross-process advisory-lock
  exclusivity and cross-process NOTIFY delivery.

Configurable concurrency via SCION_TEST_CONCURRENCY (default 10).

Extend pkg/store/enttest with Active() and NewSchemaURL() so tests can open
custom-pool clients and share a DSN with forked child processes; non-integration
stubs keep the package API stable.

* fix(db): recycle stale conns + keepalives; skip singleton tick on lock error

Stale-connection pool stalls (CloudSQL drops idle conns after ~10m):
- Add ConnMaxIdleTime to DatabaseConfig/PoolConfig (default 5m pg, 0 sqlite)
  and apply SetConnMaxIdleTime on the database/sql pool.
- OpenPostgres now parses the DSN with pgx and opens via stdlib.OpenDB with
  TCP keepalive GUCs (idle 60s / interval 15s / count 4) and a 10s connect
  timeout, so a silently-dropped peer is detected instead of the first query
  after idle hanging on a dead socket.
- pgx event pool (events_postgres.go): set keepalives + connect timeout on
  both the pool's ConnConfig and the dedicated listener connection, plus
  MaxConnIdleTime 5m / MaxConnLifetime 30m.

Advisory-lock leader election (scheduler.go):
- A lock-acquisition error no longer falls open to running the handler
  unguarded (which would duplicate singleton work across replicas); the tick
  is skipped and retried next interval. Added regression tests.
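
The fail-closed behaviour can be illustrated with a small sketch. The
tryLockFunc shape and the returned status strings are hypothetical; the point
is only that a lock error skips the handler instead of running it unguarded.

```go
package main

import (
	"errors"
	"fmt"
)

// errLockUnavailable simulates a failed lock acquisition (for example, no
// healthy connection could be checked out).
var errLockUnavailable = errors.New("advisory lock: connection unavailable")

// tryLockFunc is a hypothetical lock attempt: it reports whether the lock
// was acquired and returns a release function when it was.
type tryLockFunc func() (acquired bool, release func(), err error)

// runSingletonTick runs handler only when the lock is actually held.
func runSingletonTick(tryLock tryLockFunc, handler func()) string {
	acquired, release, err := tryLock()
	if err != nil {
		// Fail closed: running unguarded could duplicate singleton work
		// across replicas, so wait for the next interval instead.
		return "skipped: lock error"
	}
	if !acquired {
		return "skipped: held by another replica"
	}
	defer release()
	handler()
	return "ran"
}

func main() {
	runs := 0
	handler := func() { runs++ }
	failing := func() (bool, func(), error) { return false, nil, errLockUnavailable }
	winning := func() (bool, func(), error) { return true, func() {}, nil }

	fmt.Println(runSingletonTick(failing, handler), runs)
	fmt.Println(runSingletonTick(winning, handler), runs)
}
```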

Test harness (enttest/integrationtest):
- Accept libpq keyword/value DSNs (not just URL form) when deriving the
  ephemeral db/schema/params; add WithConnParam helper.
- Fix migration idempotency test's per-pass row-count expectation.

* fix(store): bound advisory-lock conn checkout + unlock with short timeout

TryAdvisoryLock checked a connection out of the pool on the full 55s
scheduler-handler context (acquire) and ran the unlock on an unbounded
context.Background() (release). On a pool that could not promptly serve a
healthy connection, db.Conn() blocked for the entire 55s before failing
with 'context deadline exceeded' on every tick; with several singleton
handlers firing each 60s tick, those long-blocked goroutines and their
pending pool connection requests piled up across ticks and kept the pool
jammed (checked out client-side, idle server-side).

The unbounded unlock was a second leak vector: if the held connection
died mid critical-section, ExecContext could hang forever, so conn.Close()
never ran and the connection leaked out of the pool permanently.

Bind both the acquire (db.Conn + pg_try_advisory_lock) and the release
(pg_advisory_unlock) to a 5s timeout so a bad tick fails fast and retries
next tick instead of parking a goroutine for ~55s, and so a dead
connection can never block release from freeing the conn. Lock semantics
are unchanged: cancelling the acquire context tears down only that
context, not the checked-out session that holds the lock.
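
A sketch of the bounding technique, with scaled-down durations so it runs
quickly: handlerTimeout stands in for the 55s handler context and
lockOpTimeout for the 5s bound. acquireConn is a hypothetical stand-in for
db.Conn(ctx) on a starved pool.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const (
	handlerTimeout = 2 * time.Second       // stands in for the 55s handler context
	lockOpTimeout  = 50 * time.Millisecond // stands in for the 5s bound
)

// acquireConn blocks until a connection is ready or ctx is done.
func acquireConn(ctx context.Context, ready <-chan struct{}) error {
	select {
	case <-ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// tryAcquire derives a short-lived context so a starved pool fails fast
// instead of parking the goroutine for the whole handler deadline.
func tryAcquire(handlerCtx context.Context, ready <-chan struct{}) error {
	ctx, cancel := context.WithTimeout(handlerCtx, lockOpTimeout)
	defer cancel()
	return acquireConn(ctx, ready)
}

// tick simulates one scheduler tick and reports how long acquisition took.
func tick(ready <-chan struct{}) (time.Duration, error) {
	handlerCtx, cancel := context.WithTimeout(context.Background(), handlerTimeout)
	defer cancel()
	start := time.Now()
	err := tryAcquire(handlerCtx, ready)
	return time.Since(start), err
}

func main() {
	starved := make(chan struct{}) // never becomes ready
	elapsed, err := tick(starved)
	fmt.Println(errors.Is(err, context.DeadlineExceeded), elapsed < handlerTimeout)
}
```

The release path would use the same pattern with a fresh context derived from
context.Background(), so the unlock is bounded even after the handler context
has expired.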

* feat(migrate): in-process migration α (legacy raw-SQL hub.db → Ent)

Upgrade a legacy raw-SQL Hub database (the ~53-migration, 30-table schema
from the removed pkg/store/sqlite store) to the consolidated Ent-backed
SQLite schema, in-process on first boot, behind an automatic backup.

pkg/ent/entc/migrate_alpha.go:
- IsLegacyRawSQLSchema: detect via the schema_migrations sentinel + the
  legacy-only agents.agent_id column (no-op for an Ent/empty/absent file).
- MigrateAlphaSQLite: backup (checkpoint WAL + copy to hub.db.bak.<ts>),
  AutoMigrate a fresh Ent schema, ATTACH the legacy file, copy every table
  with INSERT…SELECT (foreign_keys OFF), verify per-table row counts, then
  atomically swap the migrated file into place.
- Data-driven column mapping (created_at→created, updated_at→updated,
  agents.agent_id→slug, policies→access_policies); bespoke SQL for the
  group_members/policy_bindings polymorphic splits and surrogate ids;
  groups.parent_id→group_child_groups edge.
- Deterministic UUIDv5 remap for legacy non-UUID primary keys (internal
  signing-key secrets; plugin runtime-broker ids) with consistent rewrite
  of every foreign-key reference via a TEMP _id_remap table.
- Tolerates missing legacy tables (older schema versions).

cmd/server_foreground.go: detect + migrate in initStore's sqlite path,
with a --no-auto-migrate operator opt-out (cmd/server.go).

Validated end-to-end against four production hub.db files (scion-integration,
-integration2, -demo, -gteam): exact row-count parity (up to ~19k rows),
every entity reads back through the live Ent store, idempotent re-runs, and
broker FK references resolve post-remap. Pre-existing dangling agent
created_by/owner_id refs are faithfully preserved (loader runs FK-off).

* fix(config): apply real Postgres pool size (leaked SQLite default of 1 starved the pool)

The struct-level default for Database.MaxOpenConns/MaxIdleConns is 1 — the
value SQLite REQUIRES to serialize writes. applyDatabasePoolDefaults only
bumped postgres to a real pool when the value was <= 0, but a postgres
deployment configured via env/driver override inherits the embedded default
of 1, so the guard never fired and the Ent pool ran with a SINGLE connection.

Effect in production (both integration hubs): every singleton scheduler tick
checks out the lone pool connection to hold its advisory lock, then blocks
waiting for a second connection to do its work — a self-deadlock that resolves
only at the 55s handler context deadline. All API requests serialize behind
the one connection, so GET /api/v1/* served in ~55s across the board.

Note that env overrides could not paper over this: envKeyToConfigKey splits on
every underscore, so SCION_SERVER_DATABASE_MAX_OPEN_CONNS maps to
database.max.open.conns, not database.max_open_conns — silently ignored.
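
The mapping problem can be reproduced with a few lines. This is a
reconstruction of the described behaviour, not the real envKeyToConfigKey; the
function name and the prefix handling are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// envKeyToConfigKeySketch mimics the described behaviour: strip the prefix,
// lowercase, and turn EVERY underscore into a dot.
func envKeyToConfigKeySketch(env string) string {
	trimmed := strings.TrimPrefix(env, "SCION_SERVER_")
	return strings.ToLower(strings.ReplaceAll(trimmed, "_", "."))
}

func main() {
	got := envKeyToConfigKeySketch("SCION_SERVER_DATABASE_MAX_OPEN_CONNS")
	fmt.Println(got)
	// The config key that would actually take effect keeps its underscores.
	fmt.Println(got == "database.max_open_conns")
}
```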

Treat the leaked SQLite default (<= 1) as 'unset' for postgres so the pool
default (10) applies; explicit sizing of 2+ is still respected. SQLite remains
pinned to 1. Adds regression tests for all three cases.
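
The three cases can be summarized in a small sketch. The function name and
signature are hypothetical; the thresholds and the default of 10 are taken
from the description above.

```go
package main

import "fmt"

// postgresPoolDefault is the Postgres pool default named in the commit.
const postgresPoolDefault = 10

// effectiveMaxOpenConns resolves the pool size for a driver. For postgres,
// a configured value <= 1 is treated as the leaked SQLite default (unset).
func effectiveMaxOpenConns(driver string, configured int) int {
	if driver == "postgres" {
		if configured <= 1 {
			return postgresPoolDefault
		}
		return configured // explicit sizing of 2+ is respected
	}
	return 1 // SQLite stays pinned to a single connection
}

func main() {
	fmt.Println(effectiveMaxOpenConns("postgres", 1))  // leaked default
	fmt.Println(effectiveMaxOpenConns("postgres", 25)) // explicit sizing
	fmt.Println(effectiveMaxOpenConns("sqlite", 25))   // always pinned
}
```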

* feat(hub): per-process instanceID on Server (B1-1)

Add a unique per-process instanceID to Server, generated at construction
via uuid.NewString(). Optionally prefixed with POD_NAME env var for
log readability, but uniqueness is always guaranteed by the UUID.

This ID serves as the affinity key for broker dispatch (design §4.1)
and is intentionally distinct from config.ResolveHubID, which is
shareable across replicas.
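
A stdlib-only sketch of the ID construction. The commit uses
uuid.NewString(); here 16 random bytes rendered as hex stand in for the UUID,
and the "-" separator after POD_NAME is an assumption.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"os"
)

// newInstanceID returns a per-process identifier. The random part alone
// guarantees uniqueness; the POD_NAME prefix only aids log readability.
func newInstanceID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	id := hex.EncodeToString(b[:])
	if pod := os.Getenv("POD_NAME"); pod != "" {
		return pod + "-" + id, nil
	}
	return id, nil
}

func main() {
	a, _ := newInstanceID()
	b, _ := newInstanceID()
	// Two constructions differ even inside the same pod.
	fmt.Println(a != b)
}
```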

* feat(schema): affinity columns on runtime_brokers (B1-2)

Add 3 nullable fields to the runtime_brokers ent schema and store model
for tracking which hub instance holds the control-channel socket:

  - connected_hub_id     (TEXT, optional/nullable)
  - connected_session_id (TEXT, optional/nullable)
  - connected_at         (TIMESTAMPTZ, optional/nullable)

Dialect-neutral (no Postgres-only annotations) — AutoMigrate works on
both SQLite and CloudSQL Postgres per postgres-strategy.md §6.4.

Wire the fields through the ent<->store conversion code in both
directions (entBrokerToStore, CreateRuntimeBroker, UpdateRuntimeBroker).
Regenerated ent code included.

* feat(store): Claim/Release runtime-broker affinity CAS methods (B1-3)

Mirrors UpdateRuntimeBrokerHeartbeat's lock_version CAS loop.
- ClaimRuntimeBrokerConnection: newest-wins, sets affinity + status=online + heartbeat in one write
- ReleaseRuntimeBrokerConnection: compare-and-clear, returns cleared=false (no-op) if affinity moved (disconnect-race fix)
Tests cover claim/overwrite/clear/no-op + A->B flap (design §9.4).

* fix(hub): thread sessionID through connect + fix onDisconnect clobber race (B1-4, B1-5)

B1-4: HandleUpgrade returns sessionID; markBrokerOnline(brokerID, sessionID)
  now calls ClaimRuntimeBrokerConnection(brokerID, instanceID, sessionID),
  recording affinity + online + heartbeat in one CAS write.
B1-5: SetOnDisconnect callback gains sessionID; the handler compare-and-clears
  via ReleaseRuntimeBrokerConnection and skips the offline stamp when affinity
  has moved (flap). removeConnection now only removes/fires for the matching
  session, so an old connection's teardown can't drop a newer live socket.

* feat(schema): broker_dispatch intent table + messages dispatch-state (B2-1, B2-2)

B2-1: new BrokerDispatch ent entity (table broker_dispatch) — id, broker_id,
  agent_id(null), agent_slug, project_id(null), op, args(JSON), state, result,
  claimed_by, attempts, error, created_at/updated_at, deadline_at(null);
  index (broker_id,state). store.BrokerDispatch model + state constants.
B2-2: messages.dispatch_state (default 'pending') + dispatched_at; wired through
  store.Message + entadapter conversion/create. Dialect-neutral.

* feat(hub): PostgresCommandBus LISTEN/NOTIFY signal listener on scion_broker_cmd (B2-4)

Introduce a CommandBus interface and PostgresCommandBus implementation
that listens on the new global channel scion_broker_cmd for broker
dispatch wakeup signals. This is a sibling of PostgresEventPublisher,
reusing the same connect/reconnect/keepalive helpers but maintaining
its own independent pgx connection and pool (design §5.1).

Key components:
- PostgresCommandBus: LISTEN loop with backoff-reconnect on its own
  dedicated connection; filters signals by local broker ownership via
  an injected ownsLocally func (wired to ControlChannelManager.IsConnected);
  invokes an injected onSignal reconcile callback (to be wired to the
  reconcile drain in B2-5).
- NotifyBrokerCmd: issues NOTIFY inside the caller's transaction so the
  signal commits atomically with the durable intent row (mirrors PublishTx).
- NoopCommandBus: safe no-op for the SQLite backend (single-process,
  all brokers are local).
- Backend selection in newCommandBus mirrors newEventPublisher: Postgres
  driver → PostgresCommandBus; otherwise → NoopCommandBus.
- Server.SetCommandBus/CommandBus() setter/getter; cleanup in both
  Shutdown and CleanupResources paths.

* feat(store): BrokerDispatch store methods + message dispatch CAS (B2-3)

BrokerDispatchStore: Insert/Claim(CAS pending->in_progress)/Complete/Fail/
ListPendingDispatch + MarkMessageDispatched(CAS)/ListPendingMessages (via agent
runtime_broker_id). Wired into CompositeStore + store.Store. Tests: concurrent
claim single-winner (exactly-once), drain pending-only, message CAS dedupe,
complete/fail transitions, pending-messages-by-broker-agent.
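
The single-winner property of the pending -> in_progress claim can be
illustrated with an atomic compare-and-swap. This is an in-memory analogy
only: the real store performs the CAS in the database, and the names below are
hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const (
	statePending int32 = iota
	stateInProgress
)

// dispatchRow is an in-memory stand-in for one broker_dispatch row.
type dispatchRow struct{ state int32 }

// claim performs the pending -> in_progress transition atomically and
// reports whether this caller won.
func (d *dispatchRow) claim() bool {
	return atomic.CompareAndSwapInt32(&d.state, statePending, stateInProgress)
}

// raceClaims lets n goroutines claim the same row and counts the winners.
func raceClaims(n int) int32 {
	row := &dispatchRow{}
	var winners int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if row.claim() {
				atomic.AddInt32(&winners, 1)
			}
		}()
	}
	wg.Wait()
	return winners
}

func main() {
	fmt.Println("winners:", raceClaims(8))
}
```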

* feat(hub): reconcile-on-connect drain wired to bus + markBrokerOnline (B2-5)

Server.reconcileBroker drains pending broker_dispatch rows (CAS-claim -> exec ->
done/fail) and pending messages (CAS MarkMessageDispatched -> deliver) for a
broker this node owns. Exactly-once via store CAS; idempotent + concurrent-safe.
Wired as durability backstop into markBrokerOnline (async on reconnect) and as
the command-bus signal handler (SetOnSignal -> ReconcileBroker). Op executors are
seams (executeDispatch/deliverMessage) that Phase 3/4 fill with local tunnel ops.

* feat(hub): route() decision in HybridBrokerClient (B3-1)

routeLocal (IsConnected, unchanged fast path) | routeForward (affinity owner
alive) | routeHTTP (broker endpoint set) | routeUndeliverable. Affinity is a
hint only (StoreAffinityLookup over connected_hub_id + last_heartbeat freshness),
injectable for testing. Not yet wired into dispatch (B3-2 wires message path).
Table-driven tests over all branches incl. local-precedence + nil-affinity.
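
The decision can be sketched as a pure function. The precedence of
routeForward over routeHTTP is an assumption read off the order in which the
branches are listed; only local-precedence is stated explicitly. The function
name and parameters are hypothetical.

```go
package main

import "fmt"

type route int

const (
	routeLocal route = iota
	routeForward
	routeHTTP
	routeUndeliverable
)

var routeNames = [...]string{"routeLocal", "routeForward", "routeHTTP", "routeUndeliverable"}

func (r route) String() string { return routeNames[r] }

// decide picks a delivery path. ownerAlive is only a hint (another hub
// instance holds the affinity and its heartbeat is fresh).
func decide(connectedLocally, ownerAlive bool, endpoint string) route {
	switch {
	case connectedLocally:
		return routeLocal // unchanged fast path, always wins
	case ownerAlive:
		return routeForward
	case endpoint != "":
		return routeHTTP
	default:
		return routeUndeliverable
	}
}

func main() {
	fmt.Println(decide(true, true, "http://broker"))
	fmt.Println(decide(false, true, ""))
	fmt.Println(decide(false, false, "http://broker"))
	fmt.Println(decide(false, false, ""))
}
```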

* feat(hub): cross-node message dispatch via route()+intent+signal+owner drain (B3-2, B3-3)

Route-gate the message send path: HybridBrokerClient.MessageAgent now uses
route(brokerID, endpoint) to decide delivery. routeLocal and routeHTTP follow
existing paths unchanged. routeForward/routeUndeliverable return
ErrMessageDeferred — the message row (already persisted with
dispatch_state=pending) is the durable intent. All call sites
(handleAgentMessage, set[], broadcastDirect, messagebroker, notifications,
scheduler) catch the sentinel, emit a best-effort NOTIFY wakeup via
SignalBrokerCmd, and return 202 Accepted (or log as deferred).

Fill the deliverMessage seam in reconcile.go: resolves the agent from the
message's AgentID, obtains the dispatcher, and calls DispatchAgentMessage for
local tunnel delivery. reconcileBroker already CAS-marks dispatched before
calling this.

Wire SetAffinityLookup(StoreAffinityLookup(store, 0)) on the
HybridBrokerClient in CreateAuthenticatedDispatcher so route() can return
routeForward when another node owns the broker.

Add SignalBrokerCmd to the CommandBus interface — a best-effort NOTIFY using
the bus's own pool, used by the message path where the durable intent is the
message row itself and the NOTIFY is only a wakeup hint.

* feat(hub): lifecycle dispatch (rolling-timeout wait + cross-node start/stop/restart) (B4-1, B4-2)

B4-1: Rolling-timeout wait helper (dispatch_wait.go)
- waitForAgentTransition subscribes to agent.<id>.status events and loops
  with a rolling window (dispatchRollingTimeout=90s) that resets on ANY
  AgentStatusEvent (phase/activity/detail change).
- Terminal phase → return phase, nil. Window expiry → ErrDispatchFailed.
  Context cancellation → ctx.Err().
- Caller subscribes BEFORE writing intent, passes the channel + unsub.
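
A minimal sketch of the rolling-window wait, with a short window so it runs
quickly. The event type, the terminal set and the demo helper are
hypothetical; the three exits (terminal phase, window expiry, context
cancellation) follow the list above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var ErrDispatchFailed = errors.New("dispatch failed: no status event within window")

const demoWindow = 50 * time.Millisecond // stands in for dispatchRollingTimeout

type statusEvent struct{ Phase string }

// waitForTransition returns on the first terminal phase. Every other event
// restarts the window; silence for a full window yields ErrDispatchFailed.
func waitForTransition(ctx context.Context, events <-chan statusEvent,
	window time.Duration, terminal map[string]bool) (string, error) {
	timer := time.NewTimer(window)
	defer timer.Stop()
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				events = nil // closed: stop selecting on it, let the window expire
				continue
			}
			if terminal[ev.Phase] {
				return ev.Phase, nil
			}
			if !timer.Stop() {
				select { // drain a tick that fired concurrently
				case <-timer.C:
				default:
				}
			}
			timer.Reset(window)
		case <-timer.C:
			return "", ErrDispatchFailed
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// demo feeds a fixed sequence of phases and waits for "running" or "error".
func demo(phases []string) (string, error) {
	events := make(chan statusEvent, len(phases))
	for _, p := range phases {
		events <- statusEvent{Phase: p}
	}
	terminal := map[string]bool{"running": true, "error": true}
	return waitForTransition(context.Background(), events, demoWindow, terminal)
}

func main() {
	fmt.Println(demo([]string{"starting", "running"}))
	fmt.Println(demo([]string{"starting"}))
}
```

The events channel is passed in rather than created inside the helper, which
matches the requirement that the caller subscribes before writing the intent
so that no status event can be missed.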

B4-2: Cross-node start/stop/restart dispatch
- Route-gated HybridBrokerClient.StartAgent/StopAgent/RestartAgent exactly
  like MessageAgent: routeLocal → control-channel tunnel (unchanged fast
  path), routeHTTP → HTTP fallback, routeForward/routeUndeliverable →
  ErrLifecycleDeferred.
- Dispatch args structs (dispatch_args.go): StartDispatchArgs captures
  task, resolvedEnv, resolvedSecrets, inlineConfig, sharedDirs,
  sharedWorkspace, projectPath, projectSlug, harnessConfig.
  RestartDispatchArgs captures resolvedEnv. StopDispatchArgs is empty.
  All JSON-serializable for broker_dispatch.args column.
- Owner-side executeDispatch (reconcile.go): start/stop/restart cases
  deserialize args, load agent from store, call local
  DispatchAgentStart/Stop/Restart via the dispatcher. Unknown ops
  (delete, finalize_env, etc.) still fail cleanly for B4-3/B4-4.

Tests: waitForAgentTransition (terminal, error, rolling reset, silence
expiry, context cancel, unsub); route-gating of Start/Stop/Restart
returns ErrLifecycleDeferred when non-local; executeDispatch lifecycle
cases invoke the local dispatcher; args round-trip (serialize→deserialize)
is lossless; reconcile end-to-end lifecycle path.

* feat(hub): wire originator-side cross-node lifecycle dispatch (B4-2 complete)

The originator-side orchestration was missing: ErrLifecycleDeferred was
returned by HybridBrokerClient but nothing caught it. Now the full
cross-node start/stop/restart flow works transparently to all handler
call sites.

Originator side (HTTPAgentDispatcher):
- DispatchAgentStart/Stop/Restart catch ErrLifecycleDeferred after
  env/secret resolution and invoke deferredLifecycle:
  1. Subscribe("agent.<id>.status") BEFORE writing intent
  2. InsertBrokerDispatch{op, agent_id, broker_id, args}
  3. Best-effort SignalBrokerCmd (row is durable backstop)
  4. waitForAgentTransition with terminal set per op
  5. Return nil on success, error on error-phase/timeout
- SetCrossNodeDeps(events, commandBus) wired in server.go's
  getOrCreateDispatcher, so all handler call sites get cross-node
  for free with synchronous semantics preserved.
- Local path (routeLocal) is unchanged at zero added latency — no
  subscribe, no intent row, no wait.

Args decision: owner RE-RESOLVES env/secrets via DispatchAgentStart
(all hub instances share the same store + secret backend), so
StartDispatchArgs carries only {Task}. RestartDispatchArgs and
StopDispatchArgs are empty. This avoids serializing potentially large
env/secrets into the DB while remaining correct because all hubs read
from the same shared store.

waitForAgentTransition refactored to a standalone function (no Server
receiver) so the dispatcher can call it directly.

Tests:
- TestDeferredStart_WritesIntentAndWaits: deferred start writes a
  broker_dispatch row, waits, returns success on "running" event
- TestDeferredStart_ReturnsErrorOnErrorPhase: error phase → error
- TestLocalStart_SkipsIntentRow: local path calls tunnel directly,
  no intent row written
- All existing tests pass (no regressions)

* fix(hub): make web session replica-portable to fix OAuth state_mismatch

OAuth login behind the load balancer intermittently failed with
state_mismatch: the CSRF state token (and the entire web session) was
stored in a gorilla FilesystemStore on the handling replica's local
disk, while the browser only carried a session-ID cookie. When the LB
routed /auth/login and /auth/callback to different replicas, the
callback replica had no matching session file -> empty state ->
state_mismatch. It only "worked" when both hops happened to hit the
same backend.

The same flaw affected the post-login session: sessionToBearerMiddleware
reads the Hub access/refresh JWTs from that disk-local store on every API
request, so sessions silently dropped whenever a follow-up request
landed on a different replica.

Replace the FilesystemStore with an encrypted, signed gorilla
CookieStore so the whole session lives in the client's cookie and any
replica sharing SESSION_SECRET can read it. Keys are derived
deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte
AES-256 encryption key, domain-separated). No DB, no migration; works
with N replicas.
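
One way to derive two domain-separated keys from a single secret is
HMAC-SHA256 with a distinct label per key, sketched below. The labels and the
choice of HMAC as the derivation function are assumptions; the commit only
states that the derivation is deterministic and domain-separated.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// deriveKey returns a 32-byte key bound to both the secret and the label.
func deriveKey(secret, label string) []byte {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(label))
	return mac.Sum(nil)
}

// sessionKeys derives the authentication key and the encryption key.
// The label strings are hypothetical.
func sessionKeys(secret string) (authKey, encKey []byte) {
	return deriveKey(secret, "session-auth-key"),
		deriveKey(secret, "session-encryption-key")
}

func main() {
	authA, encA := sessionKeys("shared-session-secret")
	authB, encB := sessionKeys("shared-session-secret")

	// Replicas sharing the secret derive identical keys.
	fmt.Println(string(authA) == string(authB), string(encA) == string(encB))
	fmt.Println(len(authA), len(encA))
}
```

With gorilla/sessions, such a pair would typically be passed as
sessions.NewCookieStore(authKey, encKey), where a 32-byte encryption key
selects AES-256.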

The original switch to disk was motivated by a "JWT tokens exceed 4096
bytes" concern. Measured against the current compact HS256 tokens the
full session (identity + access + refresh) encodes to ~2.6 KB, well
under the browser's ~4 KB per-cookie cap, so the securecookie length
limit is left in force (oversize would now error+log, not silently drop).

Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica
round-trip regression test (cookie minted by replica A decodes on
replica B with the same secret; carries OAuth state + post-login tokens)
plus a negative test (a different secret cannot decode the cookie).

* feat(hub): cross-node delete + create-time data ops dispatch (B4-3, B4-4)

Route-gate HybridBrokerClient.DeleteAgent, CheckAgentPrompt,
CreateAgentWithGather, and FinalizeEnv through route() so
routeForward/routeUndeliverable return ErrLifecycleDeferred (matching
start/stop/restart pattern from B4-2).

B4-3 (delete dispatch):
- deferredDelete on ErrLifecycleDeferred: subscribe
  broker.dispatch.<id>.done → InsertBrokerDispatch{op:delete} →
  SignalBrokerCmd → waitForDispatchDone (reads DB row, authoritative).
- Owner executeDispatch case "delete": deserializes DeleteDispatchArgs →
  local DispatchAgentDelete (idempotent, 404 ok).
- DeleteDispatchArgs struct + UnmarshalDeleteArgs for args round-trip.

B4-4 (create-time data ops):
- deferredDataOp/deferredDataOpResult: common originator flow for ops
  that return results via the dispatch row (design §6.3). Subscribe to
  broker.dispatch.<id>.done BEFORE writing intent, insert dispatch,
  signal, waitForDispatchDone, read result from GetBrokerDispatch.
- deferredCheckPrompt: returns bool from CheckPromptResult in row.
- deferredFinalizeEnv: fire-and-forget via deferredDataOp.
- deferredCreateWithGather: returns envRequirements from row result.
- Owner executeDispatch cases: check_prompt, finalize_env, create —
  run local op, marshal result JSON, return it.
- PublishDispatchDone on EventPublisher: slim completion event
  broker.dispatch.<id>.done emitted by reconcile loop on complete/fail.
- waitForDispatchDone: event-driven wait with bounded re-read at
  rolling timeout (missed event recovery, design §6.3).
- GetBrokerDispatch added to BrokerDispatchStore interface + entadapter.

Local fast path unchanged (routeLocal → zero added latency).
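
The wait step of the B4-4 originator flow can be sketched as follows. This is an illustration under assumptions, not the shipped waitForDispatchDone: the signature, the `isTerminal` callback (standing in for a GetBrokerDispatch read), and the `runDemo` helper are hypothetical. The completion event is only a wake-up hint, and the row read stays authoritative, so a lost event costs at most one re-read interval.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForDone blocks until isTerminal reports true. The done channel is the
// fast path; the ticker re-reads the row in case the event was missed.
func waitForDone(ctx context.Context, done <-chan struct{}, isTerminal func() (bool, error), reread time.Duration) error {
	ticker := time.NewTicker(reread)
	defer ticker.Stop()
	for {
		// Read the row first, so a dispatch that finished before we started
		// waiting is noticed without any event.
		finished, err := isTerminal()
		if err != nil {
			return err
		}
		if finished {
			return nil
		}
		select {
		case <-done: // fast path: completion event arrived
		case <-ticker.C: // recovery path: the event may have been missed
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// runDemo simulates a lost completion event: nothing is ever sent on the
// channel, and the row turns terminal after terminalAfterMs.
func runDemo(terminalAfterMs, timeoutMs int) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	start := time.Now()
	isTerminal := func() (bool, error) {
		return time.Since(start) >= time.Duration(terminalAfterMs)*time.Millisecond, nil
	}
	events := make(chan struct{}) // subscribed before the intent is written; never fires here
	return waitForDone(ctx, events, isTerminal, 5*time.Millisecond)
}

func main() {
	fmt.Println(runDemo(20, 1000)) // the re-read recovers the missed event
	fmt.Println(runDemo(1000, 50)) // the caller's deadline still bounds the wait
}
```

Subscribing before the intent row is written closes the window in which the owner could finish and publish before the originator is listening.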

* feat(hub): stale-affinity + stuck-dispatch reaper singleton (B5-1)

* feat(hub): pending-message sweep + dispatch metrics (B5-2)

Add observability for the multi-node broker dispatch pipeline:

Sweep:
- CountStuckPendingMessages store method (messages pending > threshold)
- brokerMessageSweepHandler registered as RecurringSingleton with
  LockBrokerMessageSweep (0x5C100007), runs every 1m

Metrics (pkg/observability/dispatchmetrics):
- Counters: dispatch published/claimed/done/failed, message dispatched
- Gauge: message stuck (pending beyond 5m threshold)
- Histograms: intent-to-done latency, reconcile drain duration
- Counter: command bus reconnects

Emit sites:
- InsertBrokerDispatch → IncPublished (httpdispatcher.go)
- ClaimBrokerDispatch → IncClaimed (reconcile.go)
- CompleteBrokerDispatch → IncDone + RecordDispatchLatency (reconcile.go)
- FailBrokerDispatch → IncFailed (reconcile.go)
- MarkMessageDispatched → IncMessageDispatched (reconcile.go)
- reconcileBroker → RecordReconcileDrainDuration (reconcile.go)
- command bus reconnect → IncCmdBusReconnects (command_bus.go)
- sweep handler → ObserveMessageStuck (sweep.go)

* fix(hub): derive JWT signing keys from shared SESSION_SECRET to fix cross-replica login loop

The cookie-store fix (0515e2a8) made the web session replica-portable, but
the Hub JWT *inside* the cookie is still signed with a per-replica key:
ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and
hubID = sha256(hostname)[:12]. The integration env runs two replicas of one
logical hub behind a single LB, sharing one Postgres DB and one
SESSION_SECRET but with different hostnames -> different hubIDs -> different
HS256 signing keys.

So a user JWT minted on replica A failed signature verification on replica B
(go-jose: error in cryptographic primitive); refresh failed too (refresh
token signed with the same foreign key), so sessionToBearerMiddleware
declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1)
and returned session_expired. The cookie deletion turns it into a redirect
loop: dashboard flashes, then /login?error=session_expired.

Fix: extend the 0515e2a8 approach (replica-portable via the shared secret)
from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret;
when set, ensureSigningKey derives the agent and user signing keys
deterministically from it (domain-separated by key name) and bypasses
per-host secret-backend storage. cmd feeds the same --session-secret /
SESSION_SECRET value into both the web cookie store and the hub config via a
new resolveSessionSecret() helper. Empty secret keeps the existing per-hub
behavior (no regression for single-node/local dev).

Tests: cross-replica round trip (different hubID + same secret -> identical
keys, token minted on A validates on B; different secret cannot) plus
pre-configured-key precedence.

Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so
existing web/CLI tokens are invalidated once and users re-login.

* docs: project log for B5-3 chaos gate — GB5 PASSED (GA gate for broker dispatch)

* fix(hub): align fakeHTTPClient.CleanupProject with interface (3 params, not 4)

* fix(hub): address PR #305 review feedback

- server_migrate.go: use nil-checked deferred close for src DB, and
  explicitly close src before dropSQLiteFile to prevent Windows sharing
  violations
- server_migrate.go: handle file:// prefix before file: to correctly
  parse file:///path/to/db URLs
- server_foreground.go: evaluate GetControlChannelManager() inside the
  ownsLocally closure to avoid capturing a stale nil value
- server_migrate_test.go: add test case for file:/// URL format
- server_test.go: sanitize t.Name() slashes in newTestStore to prevent
  SQLite path errors in subtests
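
The prefix-ordering fix in server_migrate.go can be illustrated with a small sketch. The function name `sqlitePath` is hypothetical, and query parameters are ignored here; the point is only that the longer prefix must be tested first.

```go
package main

import (
	"fmt"
	"strings"
)

// sqlitePath strips a file URL scheme from a SQLite DSN.
func sqlitePath(dsn string) string {
	// Check the longer prefix first: "file:" also matches "file://...", and
	// trimming only "file:" from "file:///data/hub.db" would leave
	// "///data/hub.db".
	if strings.HasPrefix(dsn, "file://") {
		return strings.TrimPrefix(dsn, "file://")
	}
	if strings.HasPrefix(dsn, "file:") {
		return strings.TrimPrefix(dsn, "file:")
	}
	return dsn
}

func main() {
	fmt.Println(sqlitePath("file:///data/hub.db")) // /data/hub.db
	fmt.Println(sqlitePath("file:hub.db"))         // hub.db
	fmt.Println(sqlitePath("hub.db"))              // hub.db
}
```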

* docs: add project log for PR #305 review feedback fixes

* fix(hub): prevent duplicate message delivery, guard dispatch state transitions

C1: Call MarkMessageDispatched after successful local dispatch in
messagebroker.go and handlers.go (single-recipient, set[], broadcast).
Without this, successfully dispatched messages remained
dispatch_state=pending and were re-delivered on every broker reconnect
via reconcileBroker.

C2: Return immediately in messagebroker.go deliverToAgent when
CreateMessage fails — without a durable row, a deferred signal has
nothing for the owning node to reconcile.

C3: Guard CompleteBrokerDispatch and FailBrokerDispatch with
state=in_progress CAS predicate so a done dispatch cannot be flipped
to failed or vice versa. Update tests to claim before completing/failing
to match the new CAS guard.
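
The C3 guard amounts to a compare-and-set on the state column. The sketch below uses an in-memory map in place of the real store, so the type and method names are hypothetical; in SQL the same idea is an UPDATE whose WHERE clause includes the expected state.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type state string

const (
	pending    state = "pending"
	inProgress state = "in_progress"
	done       state = "done"
	failed     state = "failed"
)

var errStateConflict = errors.New("dispatch not in expected state")

// dispatchStore is an in-memory stand-in for the dispatch table.
type dispatchStore struct {
	mu   sync.Mutex
	rows map[string]state
}

// transition moves a row from `from` to `to` only if it is currently in
// `from`, which is the compare-and-set predicate.
func (s *dispatchStore) transition(id string, from, to state) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.rows[id] != from {
		return errStateConflict
	}
	s.rows[id] = to
	return nil
}

func (s *dispatchStore) Claim(id string) error    { return s.transition(id, pending, inProgress) }
func (s *dispatchStore) Complete(id string) error { return s.transition(id, inProgress, done) }
func (s *dispatchStore) Fail(id string) error     { return s.transition(id, inProgress, failed) }

func main() {
	s := &dispatchStore{rows: map[string]state{"d1": pending}}
	fmt.Println(s.Complete("d1")) // rejected: not claimed yet
	fmt.Println(s.Claim("d1"))    // <nil>
	fmt.Println(s.Complete("d1")) // <nil>
	fmt.Println(s.Fail("d1"))     // rejected: already done
}
```

This is also why the tests now claim a dispatch before completing or failing it.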

* fix(hub): reconcile broker→eventbus and hub-native→hub-managed renames after rebase

Post-rebase fixups to align the feature branch with main's refactoring:
- broker package → eventbus package rename (types, imports, methods)
- SetRecipient → GroupRecipient, SetMessageResponse → GroupMessageResponse
- hubNativeProjectPath → hubManagedProjectPath
- ProjectTypeHubNative → ProjectTypeHubManaged
- populateAgentConfig gains ctx parameter
- Add missing handleResourcesImport and handleMessageChannels handlers
- Add ListChannels method to MessageBrokerProxy
- Wire newCommandBus in server_foreground.go
- Restore main's test fixtures for renamed APIs

---------

Co-authored-by: scion-gteam[bot] <271067763+scion-gteam[bot]@users.noreply.github.com>
Co-authored-by: Scion <agent@scion.dev>
…GoogleCloudPlatform#303)

* fix: atomic session-guarded broker disconnect to prevent reconnect race (GoogleCloudPlatform#131)

The onDisconnect callback previously used separate ReleaseRuntimeBrokerConnection
and UpdateRuntimeBrokerHeartbeat calls. When a broker disconnects and reconnects
rapidly, the stale disconnect's offline stamp can clobber the new connection's
online status because UpdateRuntimeBrokerHeartbeat has no session guard — it
unconditionally overwrites status. Provider statuses are also clobbered and never
restored by heartbeats, leaving the broker permanently invisible until hub restart.

Add ReleaseAndMarkBrokerOffline which atomically clears affinity AND stamps
status=offline in a single CAS write. If a concurrent reconnect has already
claimed the broker with a new session, the compare fails and the callback is
a no-op. Also add a re-check guard before updating provider statuses.
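
A sketch of the session guard, using an in-memory registry in place of the store. The struct, field, and method names here are hypothetical; only the behavior (clear affinity and stamp offline in one guarded write, no-op when a newer session owns the broker) comes from the commit.

```go
package main

import (
	"fmt"
	"sync"
)

type brokerRow struct {
	SessionID string
	Status    string
}

type registry struct {
	mu      sync.Mutex
	brokers map[string]*brokerRow
}

// connect claims the broker for a new session and marks it online.
func (r *registry) connect(brokerID, sessionID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.brokers[brokerID] = &brokerRow{SessionID: sessionID, Status: "online"}
}

// releaseAndMarkOffline clears the session and stamps offline in one step,
// but only if the row still belongs to the disconnecting session.
// It reports whether the write was applied.
func (r *registry) releaseAndMarkOffline(brokerID, sessionID string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	row, ok := r.brokers[brokerID]
	if !ok || row.SessionID != sessionID {
		return false // a newer session owns the broker; the stale disconnect is a no-op
	}
	row.SessionID = ""
	row.Status = "offline"
	return true
}

// status returns the broker's current status, or "" if it is unknown.
func (r *registry) status(brokerID string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if row, ok := r.brokers[brokerID]; ok {
		return row.Status
	}
	return ""
}

func main() {
	r := &registry{brokers: map[string]*brokerRow{}}
	r.connect("b1", "session-1")
	r.connect("b1", "session-2") // rapid reconnect wins the race

	fmt.Println(r.releaseAndMarkOffline("b1", "session-1"), r.status("b1")) // false online
	fmt.Println(r.releaseAndMarkOffline("b1", "session-2"), r.status("b1")) // true offline
}
```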

* docs: add project log for broker disconnect race fix unification
…rm#301)

* docs(design): reduced resource clone/delete design (resolved review)

* refactor: remove dead Locked field from Template and HarnessConfig models

Remove the Locked bool field, all 16 enforcement sites across 6 handler
files, the force query parameter from delete endpoints, and 3
locked-template tests; add a DB migration to drop the column. No production
code ever set Locked=true, so this simplifies the handlers for the upcoming
clone/delete feature.

* feat: add harness-config clone endpoint, authz hardening, and slug uniqueness

- Add handleHarnessConfigClone mirroring template clone
- Add CheckAccess authz to deleteTemplateV2, handleTemplateClone, deleteHarnessConfig, handleHarnessConfigClone
- Add DB migration V55: UNIQUE constraint on (slug, scope, scope_id)
- Return 409 Conflict on slug collision during clone
- Add clone failure cleanup
- Add tests for clone, authz, and slug collision

* feat(web): add Clone/Delete row actions and clone-from-global to resource list

- Add Clone and Delete action menu to shared resource-list component
- Add delete confirmation dialog with deleteFiles checkbox (default on)
- Add clone dialog with name input and 409 collision handling
- Add clone-from-global picker in project settings view
- Unify on resource-changed event (migrate resource-imported)
- Gate actions on capabilities (canClone, canDelete properties)

* fix: address PR review — cleanup orphaned files on DB create failure, remove redundant clone method

- Add stor.DeletePrefix cleanup when CreateTemplate/CreateHarnessConfig fails
  after files were already copied (prevents orphaned storage files)
- Remove redundant confirmCloneFromGlobal method — confirmClone already
  handles cross-scope clone via the component's scope/scopeId properties

* fix: adapt Locked removal and slug constraint to Ent-based schema

Remove Locked references from entadapter, remove stale sqlite.go
(replaced by Ent ORM upstream), add UNIQUE(slug, scope, scope_id)
to Ent schema indexes, and regenerate Ent code.

* fix: adapt tests and entadapter for Ent-based store (UUID IDs, no Locked)

- Use api.NewUUID() for all test entity IDs (Ent enforces UUID format)
- Remove Locked field from entadapter create/update calls
- Remove stale sqlite.go (replaced by Ent ORM upstream)
- Add UNIQUE(slug, scope, scope_id) to Ent schema indexes
…form#309)

* fix(hub): make web session replica-portable to fix OAuth state_mismatch

OAuth login behind the load balancer intermittently failed with
state_mismatch: the CSRF state token (and the entire web session) was
stored in a gorilla FilesystemStore on the handling replica's local
disk, while the browser only carried a session-ID cookie. When the LB
routed /auth/login and /auth/callback to different replicas, the
callback replica had no matching session file -> empty state ->
state_mismatch. It only "worked" when both hops happened to hit the
same backend.

The same flaw affected the post-login session: sessionToBearerMiddleware
reads the Hub access/refresh JWTs from that disk-local store on every API
request, so sessions silently dropped whenever a follow-up request
landed on a different replica.

Replace the FilesystemStore with an encrypted, signed gorilla
CookieStore so the whole session lives in the client's cookie and any
replica sharing SESSION_SECRET can read it. Keys are derived
deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte
AES-256 encryption key, domain-separated). No DB, no migration; works
with N replicas.

The original switch to disk was motivated by a "JWT tokens exceed 4096
bytes" concern. Measured against the current compact HS256 tokens the
full session (identity + access + refresh) encodes to ~2.6 KB, well
under the browser's ~4 KB per-cookie cap, so the securecookie length
limit is left in force (oversize would now error+log, not silently drop).

Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica
round-trip regression test (cookie minted by replica A decodes on
replica B with the same secret; carries OAuth state + post-login tokens)
plus a negative test (a different secret cannot decode the cookie).

* fix(hub): derive JWT signing keys from shared SESSION_SECRET to fix cross-replica login loop

The cookie-store fix (0515e2a) made the web session replica-portable, but
the Hub JWT *inside* the cookie is still signed with a per-replica key:
ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and
hubID = sha256(hostname)[:12]. The integration env runs two replicas of one
logical hub behind a single LB, sharing one Postgres DB and one
SESSION_SECRET but with different hostnames -> different hubIDs -> different
HS256 signing keys.

So a user JWT minted on replica A failed signature verification on replica B
(go-jose: error in cryptographic primitive); refresh failed too (refresh
token signed with the same foreign key), so sessionToBearerMiddleware
declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1)
and returned session_expired. The cookie deletion turns it into a redirect
loop: dashboard flashes, then /login?error=session_expired.

Fix: extend the 0515e2a approach (replica-portable via the shared secret)
from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret;
when set, ensureSigningKey derives the agent and user signing keys
deterministically from it (domain-separated by key name) and bypasses
per-host secret-backend storage. cmd feeds the same --session-secret /
SESSION_SECRET value into both the web cookie store and the hub config via a
new resolveSessionSecret() helper. Empty secret keeps the existing per-hub
behavior (no regression for single-node/local dev).

Tests: cross-replica round trip (different hubID + same secret -> identical
keys, token minted on A validates on B; different secret cannot) plus
pre-configured-key precedence.

Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so
existing web/CLI tokens are invalidated once and users re-login.

---------

Co-authored-by: Scion <agent@scion.dev>
…events (GoogleCloudPlatform#312)

A rapid session.start → session.end sequence from a spurious sciontool
could permanently reset an agent's phase even while the agent works
normally. This adds two guards:

1. Phase regression guard: rejects transitions that would move an agent
   backward in its forward-progress lifecycle (e.g. running → starting)
   in both the status update handler and broker heartbeat handler.

2. Activity-driven phase auto-correction: when an activity that implies
   the agent is running (working, thinking, executing, etc.) arrives but
   the phase is pre-running, auto-promotes the phase to running.
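
The two guards can be sketched as pure functions. The phase names, their ordering, and the activity set below are assumptions chosen for illustration; terminal phases and explicit restarts are out of scope for this sketch.

```go
package main

import "fmt"

// phaseRank orders phases along the forward-progress lifecycle.
// Names and order are hypothetical.
var phaseRank = map[string]int{"created": 0, "starting": 1, "running": 2}

// runningActivities lists activities that imply the agent is running.
var runningActivities = map[string]bool{"working": true, "thinking": true, "executing": true}

// nextPhase applies the regression guard: a requested phase that ranks
// below the current one (or is unknown) is rejected and the current phase kept.
func nextPhase(current, requested string) string {
	cur, okCur := phaseRank[current]
	req, okReq := phaseRank[requested]
	if !okCur || !okReq || req < cur {
		return current
	}
	return requested
}

// correctForActivity promotes a pre-running phase to running when the
// reported activity implies the agent is already doing work.
func correctForActivity(current, activity string) string {
	cur, ok := phaseRank[current]
	if ok && runningActivities[activity] && cur < phaseRank["running"] {
		return "running"
	}
	return current
}

func main() {
	fmt.Println(nextPhase("running", "starting"))           // running (regression rejected)
	fmt.Println(nextPhase("starting", "running"))           // running (forward move allowed)
	fmt.Println(correctForActivity("starting", "thinking")) // running (auto-promoted)
	fmt.Println(correctForActivity("starting", "idle"))     // starting (no implication)
}
```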

Fixes GoogleCloudPlatform#124
…GoogleCloudPlatform#313)

Also unset SCION_PROJECT_ID when clearing hub context env vars, since
IsHubContext() checks all four env vars and a leftover SCION_PROJECT_ID
causes FindProjectRoot() to return a synthetic path instead of failing.
…tform#311)

* Fix agent list task overflow and unify action buttons

Task cell in list view used inline span styling that silently ignored
max-width/overflow constraints, allowing long task text to push action
buttons off-screen. Switch to display:-webkit-box with line-clamp:2
so text wraps to at most two lines with ellipsis.

Card view action buttons now render icon-only (matching list view),
with sl-tooltip and aria-label for accessibility. Both views share a
single renderActionButtons helper, eliminating the duplicated button
logic. Color-coded hover effects added to action buttons in both
views: red for stop/delete, amber for suspend, green for resume/start.

Closes GoogleCloudPlatform#134
Closes GoogleCloudPlatform#135

* Fix agent list task overflow and unify action buttons

Task cell in list view used inline span styling that silently ignored
max-width/overflow constraints, allowing long task text to push action
buttons off-screen. Switch to display:-webkit-box with line-clamp:2
so text wraps to at most two lines with ellipsis.

Card view action buttons now render icon-only (matching list view),
with sl-tooltip and aria-label for accessibility. Both views share a
single renderActionButtons helper, eliminating the duplicated button
logic. Color-coded hover effects use translucent rgba backgrounds
that work in both light and dark mode: red for stop/delete, amber
for suspend, green for resume/start.

Closes GoogleCloudPlatform#134
Closes GoogleCloudPlatform#135

* Add before/after screenshots for PR review

Screenshots captured from the real running app (Vite dev server +
fetch mock for agent data). Shows before/after for both issues in
light mode and dark mode.

* Fix hover on disabled buttons and tooltip on disabled terminal

Add :not([disabled]) to hover CSS selectors so color-coded hover
effects don't apply to disabled action buttons. Wrap the Terminal
button in an inline-flex span inside sl-tooltip so the tooltip
remains accessible even when the button has pointer-events:none.
* docs(design): auth proxy mode (Google IAP) architecture

Add design for an exclusive proxy human-auth mode that derives the user from
a verified Google IAP signed header (X-Goog-IAP-JWT-Assertion), reusing the
existing domain/allowlist/admin provisioning controls. Also specifies a
hub-minted transport-auth layer (dedicated SA, generalizing PR GoogleCloudPlatform#307) so agents
can traverse the IAP / Cloud Run-invoker front door, with a generalized
array-based token refresh.

* refactor(hub): extract provisionUser, dedupe OAuth find-or-create

Extract the duplicated find-or-create-user block from four OAuth
handlers (handleAuthLogin, handleAuthToken, handleCLIAuthToken,
completeOAuthLogin) into a single provisionUser method on Server.

The new method encapsulates:
1. Authorization check (isUserAuthorized) with audit logging
2. GetUserByEmail / CreateUser (find-or-create)
3. Profile backfill (DisplayName, AvatarURL when empty)
4. Admin promotion (when admin list changes)
5. Hub membership enrollment (ensureHubMembership)

Introduces ExternalUserInfo struct (decoupled from OAuthUserInfo) and
ErrAccessDenied sentinel error for caller-side HTTP response mapping.

This is Phase 0 of the auth-proxy-mode feature — pure refactor with
no behavior change. The proxy middleware (Phase 1) will call the same
provisionUser method.

NOTE: No suspended-user check is added. The existing OAuth flow does
not check user.Status == "suspended" either; adding it here would
change behavior. This gap is documented for Phase 1.

* docs(project-log): record provisionUser extraction findings

* feat(auth): implement proxy auth mode with IAP JWT verification (Phase 1)

Add exclusive proxy auth mode for Google IAP signed-header authentication:

- pkg/hub/proxyauth.go (NEW): ProxyAuthenticator interface, IAPAuthenticator
  with ES256 JWT verification via go-jose/v4, JWKS lazy-fetch cache with
  periodic refresh + on-miss refresh for unknown kids + transient failure
  tolerance (last-good keys).

- pkg/config: auth.mode selector (oauth|proxy|dev), auth.proxy section with
  provider/iap.audience/overrides in both DevAuthConfig (GlobalConfig) and
  V1AuthConfig (settings.yaml). Wire conversion in both directions.

- pkg/hub/auth.go: Replace IP-only extractProxyUser branch with
  ProxyAuthenticator path. Add 60s resolution cache (ProxyUserCache) wrapping
  provisionUser — signature verification runs every request, only the store
  lookup is cached. Legacy extractProxyUser preserved when no authenticator
  is configured.

- pkg/hub/handlers_auth.go: Add suspended-user gate to provisionUser —
  rejects Status=="suspended" with ErrUserSuspended. This is an intentional
  behavior change sanctioned by the design doc, closing the pre-existing
  OAuth suspended-login gap documented in Phase 0.

- pkg/hub/web.go: In proxy mode, handleAuthProviders returns no OAuth
  providers; handleLogout redirects to IAP's clear_login_cookie endpoint.

- cmd/server_foreground.go: Construct IAPAuthenticator when mode==proxy &&
  provider==iap, wire into ServerConfig.ProxyAuth.

Security: audience binding is mandatory; only the signed JWT assertion is
authoritative (X-Goog-Authenticated-User-* headers ignored); clock skew
±30s; JWKS cache handles key rotation and transient fetch failures.
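
The 60s resolution cache can be pictured as a small TTL map in front of the store lookup. This is a sketch under assumptions: keying by email, the type names, and the `lookupsFor` demo helper are hypothetical. Signature verification is deliberately outside the cache, matching the note that it runs on every request.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type cacheEntry struct {
	userID  string
	expires time.Time
}

// resolutionCache memoizes the user lookup for a short TTL.
type resolutionCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time // injectable clock for tests
	entries map[string]cacheEntry
}

// resolve returns the cached user ID while the entry is fresh, otherwise it
// calls lookup and caches a successful result. Failures are not cached.
func (c *resolutionCache) resolve(email string, lookup func(string) (string, error)) (string, error) {
	c.mu.Lock()
	if e, ok := c.entries[email]; ok && c.now().Before(e.expires) {
		c.mu.Unlock()
		return e.userID, nil
	}
	c.mu.Unlock()

	id, err := lookup(email) // store access happens outside the lock
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.entries[email] = cacheEntry{userID: id, expires: c.now().Add(c.ttl)}
	c.mu.Unlock()
	return id, nil
}

// lookupsFor replays requests at the given second offsets and counts how
// many reached the store.
func lookupsFor(offsetsSec []int) int {
	base := time.Unix(1_700_000_000, 0)
	current := base
	lookups := 0
	c := &resolutionCache{
		ttl:     60 * time.Second,
		now:     func() time.Time { return current },
		entries: map[string]cacheEntry{},
	}
	for _, off := range offsetsSec {
		current = base.Add(time.Duration(off) * time.Second)
		c.resolve("user@example.com", func(string) (string, error) {
			lookups++
			return "user-1", nil
		})
	}
	return lookups
}

func main() {
	fmt.Println(lookupsFor([]int{0, 10, 59})) // 1: all within the TTL
	fmt.Println(lookupsFor([]int{0, 61}))     // 2: second request is past the TTL
}
```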

* test(auth): add comprehensive IAPAuthenticator unit tests

Tests using self-generated ES256 key pair + httptest JWKS server:
- Valid assertion -> correct ProxyUserInfo (subject/email stripped, lowercased)
- Bad signature -> error
- Wrong audience -> error (mandatory binding)
- Wrong issuer -> error
- Expired token (past 30s skew) -> error
- Missing header -> (nil, nil) fall-through
- Unknown kid triggers JWKS refresh and succeeds
- Custom issuer override for testing
- HD (hosted domain) claim extraction
- Email lowercasing
- JWKS cache transient failure tolerance (serves last-good keys)

* style: fix gofmt formatting in proxyauth_test.go and settings_v1.go

* docs(project-log): record auth-proxy-mode Phase 1 implementation

* config: add auth.transport config for outbound transport auth

Add TransportAuthConfig (hub_config.go) and V1TransportConfig
(settings_v1.go) for the transport-layer auth that lets agents
traverse IAP / Cloud Run invoker front doors. Config supports
mode (none|cloudrun_invoker|iap), oidcAudience, and
platformAuthSA fields. Wire into V1↔GlobalConfig conversion
and env key mapping.

Phase 2 item 6 of auth-proxy-mode.

* hub: add TransportTokenMinter interface and implementations

Introduce the TransportTokenMinter interface for minting Google OIDC
ID tokens that let agents traverse platform guards (IAP / Cloud Run
invoker). Three implementations:

- gcpTransportMinter: production impl using IAM Credentials API
  (generateIdToken) to impersonate a dedicated platform-auth SA.
  Uses already-vendored google.golang.org/api/iamcredentials/v1.
- noopTransportMinter: returns error when transport auth is disabled.
- FakeTransportMinter: exported test double for other packages.

Also adds RefreshTokenEntry type for the generalized tokens[] array
and parseJWTExpiry for extracting expiry from ID tokens.

All tests pass with no live GCP dependency (httptest fakes).

Phase 2 item 6 of auth-proxy-mode.

* hub: wire transport token minter into ServerConfig and dispatch

Add TransportMode, TransportAudience, TransportMinter fields to
ServerConfig and wire them through to the Server struct and
HTTPAgentDispatcher. Transport tokens are injected as env vars
(SCION_TRANSPORT_TOKEN, SCION_TRANSPORT_AUDIENCE,
SCION_TRANSPORT_TOKEN_EXPIRY) into agent dispatch payloads in
all three dispatch paths (Create, Start, Restart).

server_foreground.go constructs a gcpTransportMinter from
auth.transport config, deriving audience from hubEndpoint
for cloudrun_invoker mode.

When transport mode is "none" or unset, no minter is created
and no transport tokens are injected — zero impact on existing
deployments.

Phase 2 item 6 of auth-proxy-mode.

* hub: extend token refresh response with generalized tokens[] array

The agent token refresh handler now returns a tokens[] array
alongside the existing token/expires_at fields for backward
compatibility. Old clients ignore tokens[]; new clients use it
to apply both app-layer and transport-layer tokens.

When transport auth is configured (transportMinter != nil), the
response includes a google_oidc transport token entry with the
configured audience. When disabled, only the app scion_access
entry appears.

Transport token minting errors are logged but don't fail the
refresh — the app token is always returned.

Phase 2 item 7 of auth-proxy-mode.

* sciontool: add pluggable OIDC transport for agent outbound auth

Implement the agent-side transport-layer auth with two pluggable
token sources:

- injectedTokenSource: uses the hub-provided SCION_TRANSPORT_TOKEN
  env var (cold start), then refreshed via the tokens[] array on
  subsequent refresh calls.
- metadataTokenSource: fetches OIDC from the GCE metadata server
  (passthrough/on-GCE mode, the PR GoogleCloudPlatform#307 pattern).

Selection logic: SCION_TRANSPORT_TOKEN env → injected mode;
else if on GCE → metadata mode; else → no OIDC transport.

The oidcTransport RoundTripper injects Authorization: Bearer on
outbound hub requests. Graceful degradation: if token fetch fails,
the request proceeds without the header (the hub can still auth
via X-Scion-Agent-Token).
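
A sketch of the graceful-degradation behavior using the standard http.RoundTripper interface. The `tokenSource` interface, `staticSource`, `recorder`, and `headerSent` are hypothetical helpers for the demo; only the behavior (attach a Bearer token when one is available, otherwise send the request unchanged) is taken from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var errNoToken = errors.New("no token available")

// tokenSource is a hypothetical interface for the pluggable token sources.
type tokenSource interface {
	Token() (string, error)
}

type staticSource struct {
	tok string
	err error
}

func (s staticSource) Token() (string, error) { return s.tok, s.err }

// oidcTransport adds an Authorization header when a token is available.
type oidcTransport struct {
	base   http.RoundTripper
	source tokenSource
}

func (t *oidcTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	tok, err := t.source.Token()
	if err != nil || tok == "" {
		// Degrade gracefully: send the request without the header.
		return t.base.RoundTrip(req)
	}
	// A RoundTripper must not mutate the caller's request, so clone first.
	clone := req.Clone(req.Context())
	clone.Header.Set("Authorization", "Bearer "+tok)
	return t.base.RoundTrip(clone)
}

// recorder is a fake base transport that captures the Authorization header.
type recorder struct{ seen string }

func (r *recorder) RoundTrip(req *http.Request) (*http.Response, error) {
	r.seen = req.Header.Get("Authorization")
	return &http.Response{StatusCode: http.StatusOK, Header: http.Header{}, Body: http.NoBody, Request: req}, nil
}

// headerSent reports the Authorization header the base transport saw.
func headerSent(src tokenSource) string {
	rec := &recorder{}
	tr := &oidcTransport{base: rec, source: src}
	req, err := http.NewRequest(http.MethodGet, "http://hub.invalid/healthz", nil)
	if err != nil {
		return "error: " + err.Error()
	}
	if _, err := tr.RoundTrip(req); err != nil {
		return "error: " + err.Error()
	}
	return rec.seen
}

func main() {
	fmt.Printf("%q\n", headerSent(staticSource{tok: "abc"}))      // "Bearer abc"
	fmt.Printf("%q\n", headerSent(staticSource{err: errNoToken})) // ""
}
```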

Client changes:
- Add oidcSource field and configureOIDCTransport() in NewClient()
- Update RefreshTokenResponse with tokens[] array (backward compat)
- RefreshToken() applies transport tokens via applyRefreshTokens()
- Refresh scheduling uses shortest-lived entry (5-min margin for
  transport tokens vs 2h for scion tokens)
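
One way to read the refresh-scheduling rule, sketched under assumptions: each entry is refreshed a fixed margin before its expiry, and the client wakes at the earliest such time. The struct fields and helper names are hypothetical; the kind names `scion_access` and `google_oidc` and the 5-minute and 2-hour margins come from the commit messages.

```go
package main

import (
	"fmt"
	"time"
)

// tokenEntry is a hypothetical shape for one tokens[] element.
type tokenEntry struct {
	Kind      string
	ExpiresAt time.Time
}

// marginFor returns how long before expiry a token of this kind is refreshed.
func marginFor(kind string) time.Duration {
	if kind == "google_oidc" {
		return 5 * time.Minute
	}
	return 2 * time.Hour
}

// nextRefresh returns the earliest refresh time across all entries.
// The bool is false when there are no entries.
func nextRefresh(entries []tokenEntry) (time.Time, bool) {
	var earliest time.Time
	found := false
	for _, e := range entries {
		at := e.ExpiresAt.Add(-marginFor(e.Kind))
		if !found || at.Before(earliest) {
			earliest, found = at, true
		}
	}
	return earliest, found
}

// refreshInMinutes reports how many minutes from now the next refresh fires
// for the given token lifetimes.
func refreshInMinutes(scionTTLMin, transportTTLMin int) int {
	base := time.Unix(1_700_000_000, 0)
	at, _ := nextRefresh([]tokenEntry{
		{Kind: "scion_access", ExpiresAt: base.Add(time.Duration(scionTTLMin) * time.Minute)},
		{Kind: "google_oidc", ExpiresAt: base.Add(time.Duration(transportTTLMin) * time.Minute)},
	})
	return int(at.Sub(base).Minutes())
}

func main() {
	fmt.Println(refreshInMinutes(600, 60)) // 55: the transport token drives the schedule
	fmt.Println(refreshInMinutes(130, 60)) // 10: the scion token is due first
}
```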

23 new tests covering both sources, transport, configuration,
end-to-end dual-header, and refresh token application.

Phase 2 item 8 of auth-proxy-mode.

* docs(project-log): record auth-proxy-mode Phase 2 implementation

* docs: add IAP proxy auth deployment guide (Phase 3)

Add comprehensive deployment documentation for the IAP + Cloud Run
invoker topology, covering inbound human IAP authentication,
outbound agent transport auth (dual-layer OIDC + scion token),
security considerations, and an end-to-end GCP setup checklist.
All config keys and env vars verified against shipped code.

* fix: prevent JWKS cache stampede and add HTTP client timeout

- resolveHTTPClient() now returns a client with a 10s timeout instead of
  http.DefaultClient (which has no timeout), preventing hangs on JWKS fetches.
  Tests that inject their own HTTPClient are unaffected.

- JWKS cache refresh now debounces on lastAttempted (set at the start of
  every attempt, success or failure) instead of lastFetched (success only).
  This prevents stampedes during persistent JWKS outages where every
  cache-miss would trigger an unbounded refresh.

- Added a refreshing guard to prevent concurrent in-flight refreshes
  (proactive background refresh + synchronous miss-refresh could race).

- Network I/O is now performed outside the write lock to avoid holding
  the mutex across HTTP requests.

- Added TestJWKSCache_StampedePreventionDuringOutage to verify that
  repeated misses during an outage do not cause repeated fetches within
  the debounce window.
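
The debounce and in-flight guard can be sketched with a simplified cache. Everything here except the field ideas `lastAttempted` and `refreshing` is hypothetical: keys are plain strings, the fetch function is injected, and `fetchesDuringOutage` exists only to demonstrate the behavior.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type keyCache struct {
	mu            sync.Mutex
	keys          map[string]string // kid -> key material (placeholder type)
	lastAttempted time.Time         // set at the start of every attempt
	refreshing    bool              // true while a fetch is in flight
	minInterval   time.Duration
	now           func() time.Time
	fetch         func() (map[string]string, error)
}

// lookup returns the key for kid, refreshing on a miss at most once per
// minInterval and never concurrently.
func (c *keyCache) lookup(kid string) (string, bool) {
	c.mu.Lock()
	if k, ok := c.keys[kid]; ok {
		c.mu.Unlock()
		return k, true
	}
	// Debounce on the last attempt, not the last success, so a persistent
	// outage cannot trigger one fetch per miss.
	recent := !c.lastAttempted.IsZero() && c.now().Sub(c.lastAttempted) < c.minInterval
	if c.refreshing || recent {
		c.mu.Unlock()
		return "", false
	}
	c.refreshing = true
	c.lastAttempted = c.now()
	c.mu.Unlock()

	fresh, err := c.fetch() // network I/O happens outside the lock

	c.mu.Lock()
	defer c.mu.Unlock()
	c.refreshing = false
	if err == nil {
		c.keys = fresh
	} // on failure, keep serving the last-good keys
	k, ok := c.keys[kid]
	return k, ok
}

// fetchesDuringOutage counts fetch attempts for repeated misses inside one
// debounce window while the endpoint is down.
func fetchesDuringOutage(misses int) int {
	calls := 0
	fixed := time.Unix(1_700_000_000, 0)
	c := &keyCache{
		minInterval: time.Minute,
		now:         func() time.Time { return fixed },
		fetch: func() (map[string]string, error) {
			calls++
			return nil, errors.New("jwks endpoint unreachable")
		},
	}
	for i := 0; i < misses; i++ {
		c.lookup("unknown-kid")
	}
	return calls
}

func main() {
	fmt.Println(fetchesDuringOutage(100)) // 1
}
```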

* fix: replace custom splitJWT with strings.Split and cache IAM service

- Replace the hand-rolled splitJWT function with strings.Split(token, ".").
  Behavior is identical for well-formed JWTs; the custom function is deleted.

- Cache the IAM credentials service client in gcpTransportMinter using
  sync.Once so it is created once and reused across MintIDToken calls
  instead of creating a new HTTP client/service on every invocation.
  Uses context.Background() for the long-lived client construction;
  per-call ctx continues to be passed to .Context(ctx).Do().
  FakeTransportMinter is unaffected.
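
For illustration, expiry extraction built on strings.Split might look like the sketch below. It is not the project's parseJWTExpiry: the function names are hypothetical, and it reads the `exp` claim without verifying the signature, which is only appropriate for scheduling a refresh of a token the process already trusts.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// jwtExpiry extracts the exp claim from a compact JWT.
// It does NOT verify the signature.
func jwtExpiry(token string) (time.Time, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return time.Time{}, errors.New("malformed JWT: expected 3 segments")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return time.Time{}, fmt.Errorf("decode payload: %w", err)
	}
	var claims struct {
		Exp *float64 `json:"exp"` // NumericDate may be fractional
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return time.Time{}, fmt.Errorf("parse claims: %w", err)
	}
	if claims.Exp == nil {
		return time.Time{}, errors.New("missing exp claim")
	}
	return time.Unix(int64(*claims.Exp), 0), nil
}

// fakeToken builds an unsigned token for the demo only.
func fakeToken(exp int64) string {
	enc := base64.RawURLEncoding.EncodeToString
	header := enc([]byte(`{"alg":"none"}`))
	payload := enc([]byte(fmt.Sprintf(`{"exp":%d}`, exp)))
	return header + "." + payload + ".sig"
}

func main() {
	exp, err := jwtExpiry(fakeToken(1700000000))
	fmt.Println(exp.Unix(), err) // 1700000000 <nil>

	_, err = jwtExpiry("only.two")
	fmt.Println(err) // malformed JWT: expected 3 segments
}
```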
…oogleCloudPlatform#302)

* fix: resolve workspace file browser to groves/ instead of projects/

The Hub UI file browser was showing the wrong directory contents. The
hubManagedProjectPath() function resolved workspace paths to
~/.scion/projects/<slug>/ (project metadata) instead of
~/.scion/groves/<slug>/ (the actual git checkout mounted as /workspace
in agents).

Reverse the lookup priority: check groves/ first, fall back to
projects/, and default to groves/ when neither has content.

Fixes GoogleCloudPlatform#130

* docs: add project log for issue GoogleCloudPlatform#130 workspace path fix

* fix: guard hubManagedProjectPath against empty slug

Prevent hubManagedProjectPath from resolving to the parent directory
when called with an empty slug. Add unit test for this case.
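
The empty-slug hazard follows from how filepath.Join treats empty elements. The sketch below is a simplified stand-in, not hubManagedProjectPath itself: `workspacePath` is a hypothetical name, and the groves/ versus projects/ fallback is omitted.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
)

// workspacePath resolves the checkout directory for a project slug.
func workspacePath(root, slug string) (string, error) {
	if slug == "" {
		// filepath.Join drops empty elements, so an empty slug would
		// silently resolve to the parent groves directory.
		return "", errors.New("empty project slug")
	}
	return filepath.Join(root, "groves", slug), nil
}

func main() {
	// Without the guard, the join yields the parent directory.
	fmt.Println(filepath.Join("root", "groves", "") == filepath.Join("root", "groves")) // true

	p, err := workspacePath("root", "demo")
	fmt.Println(p, err)

	_, err = workspacePath("root", "")
	fmt.Println(err) // empty project slug
}
```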
…by/owner_id)

The Agent Ent schema modeled created_by/owner_id as foreign keys to the
users table. When an agent creates a sub-agent, those columns hold the
*creating agent's* ID, which has no users-table row, so Postgres rejected
the insert with a foreign-key violation. mapError maps that to
ErrInvalidInput, surfacing as a detail-free "validation_error: Invalid
input (status: 400)" on every agent-initiated `scion start`. User-created
agents were unaffected, masking the regression (introduced when GoogleCloudPlatform#304
ported the agent store onto Ent).

created_by/owner_id are polymorphic principal references (user OR agent),
like ancestry. Drop the User-typed edges and keep them as plain principal
UUID fields; resolve the delegation creator by ID and tolerate "no such
user". Atlas AutoMigrate drops the two FK constraints on existing DBs at
next boot.

Tests: the sole sub-agent creation test only passed because it seeded a
fake user row sharing the agent's ID — an impossible production state.
Remove that workaround so it exercises the real path, and add store/ent
regression tests asserting a non-user principal ID is accepted.
…o agent containers (GoogleCloudPlatform#322)

* Add sciontool doctor and agent auth reset infrastructure

When an agent's hub JWT expires and the refresh loop fails (e.g. hub
signing key rotation), the agent becomes a zombie: running locally but
invisible to the hub. This adds two features to diagnose and recover:

1. `sciontool doctor` command — runs inside the agent container to check
   env vars, token validity/expiry, hub connectivity, auth status, and
   GCP metadata/GitHub token health. Prints actionable remediation.

2. Auth reset mechanism — allows pushing a fresh token into a running
   agent without restarting. The flow is:
   - Hub generates a new agent JWT via DispatchAgentResetAuth
   - Broker's /reset-auth endpoint writes the token file via exec
   - Broker sends SIGUSR2 to sciontool init (PID 1)
   - Init re-reads the token, updates the hub client, restarts the
     token refresh loop, and sends an immediate heartbeat

Also adds Client.SetToken() for in-memory token updates.

* Add scion reset-auth CLI command and hub API endpoint

Adds the user-facing `scion reset-auth <agent>` command that triggers
an auth reset on a running agent via the Hub. Also adds:
- Hub handler for POST /api/v1/agents/{id}/reset-auth
- hubclient AgentService.ResetAuth() method

---------

Co-authored-by: Scion Agent (eng-manager) <agent@scion.dev>
Adds a "Reset Auth" button in the agent detail header actions area,
visible when the agent is running. Clicking it calls the hub's
POST /api/v1/agents/{id}/reset-auth endpoint, which generates a
fresh JWT and pushes it into the running container without restart.
GoogleCloudPlatform#323)

* Make SIGUSR2 signal best-effort in reset-auth handler

The kill -USR2 step can fail (e.g. PID 1 is not sciontool init, or
the process doesn't handle the signal). Since the token file write
already succeeded and the refresh loop will pick up the new token
without the signal, treat signal failure as a warning rather than
returning a 500 error.

* Add admin bulk reset-auth endpoint

POST /api/v1/admin/agents/reset-auth-all lists all running agents and
dispatches an auth reset for each, returning a per-agent success/failure
summary. Admin role required.

* Add Reset Auth All button to admin maintenance page

Adds a Quick Actions section with a "Reset Auth — All Running Agents"
button that calls POST /api/v1/admin/agents/reset-auth-all and displays
a per-agent success/failure summary inline.

---------

Co-authored-by: Scion Agent (eng-manager) <agent@scion.dev>
ptone and others added 27 commits June 7, 2026 06:30
…gleCloudPlatform#340)

The _linkedDiscordId @State() property was declared and assigned but
never read in the template, causing TS6133 and failing the Verify Web
Types CI step.
…udPlatform#342)

* feat(agent-viz): add Agent Communications transcript panel

Render a scrolling, human-readable transcript of inter-agent messages
alongside the force-graph. The panel consumes the same `message` playback
events that already drive the on-graph pulse lines, so it stays in sync
with playback and is rebuilt from the snapshot on seek — no extra data
source or backend change required.

Broadcasts are highlighted and de-duplicated (by event time), timestamps
are shown relative to playback start, and the panel is collapsible.

- web/src/comms.ts:  new CommsPanel component
- web/src/main.ts:   wire panel into the message / snapshot / reset paths
- web/index.html:    panel styling
- README.md:         document the panel

* fix(agent-viz): avoid layout thrashing in CommsPanel during snapshot replay

On seek, addMessage runs in a tight loop with animate=false. Reading
scrollTop/clientHeight/scrollHeight before each appendChild forced a
synchronous reflow per message — layout thrashing that can freeze the UI
on large logs. The measurement is now skipped when not animating, and a
single scroll-to-bottom is deferred to requestAnimationFrame after the
replay loop (tracked and cancelled on reset).
Move the entire Provision body into a vendor-agnostic free function
provisionShared(in ProvisionInput) error in workspace_provision.go.

Relocate helpers as free functions (drop nfsBackend receiver):
- acquireProvisionLock, gitCloneWorkspace, ensureWorktree,
  chownProjectTree, resolveUID, resolveGID
- Constants: provisionSentinelFile, provisionLockRetries,
  provisionLockRetryDelay
- Free helpers: writeSentinel, sanitizeBranchName (already free,
  just relocated for cohesion)

resolveUID/resolveGID now fall back to 1000 when NFSUID/NFSGID is unset,
without reading cfg. Callers that need the cfg fallback should apply
cfg defaults to ProvisionInput before calling provisionShared.

Part of GoogleCloudPlatform#169: Storage provisioning Phase 0.
- Delete Provision method from WorkspaceBackend interface
- Delete localBackend.Provision (was a no-op)
- Delete nfsBackend.Provision and all moved method definitions
  (acquireProvisionLock, gitCloneWorkspace, ensureWorktree,
  chownProjectTree, resolveUID, resolveGID, writeSentinel,
  sanitizeBranchName, constants) — now live in workspace_provision.go
- Update interface doc comment to reflect Resolve+Realize only,
  with a note that provisioning now lives in the standalone
  provisionShared function

Provision had NO production callers (only tests called it).

Part of GoogleCloudPlatform#169: Storage provisioning Phase 0.
- Change all b.Provision(...) calls to provisionShared(...) in
  workspace_provision_test.go
- Replace TestLocalBackendProvision_NoOp (deleted method) and
  TestNFSBackendProvision_NonGit with TestProvisionShared_NonGit
  in workspace_backend_test.go
- Update TestAcquireProvisionLock_ContextCancellation in
  k8s_nfs_test.go to call free function acquireProvisionLock
- All assertions remain identical; tests pass green

Part of GoogleCloudPlatform#169: Storage provisioning Phase 0.
Add comments near MountDescriptor.Type and api.VolumeMount.Type
noting that future vendor mount types (e.g. "cloudrun-volume",
"gke-shared-volume") will be added as new Type values, while "nfs"
remains the literal NFS protocol mount. No new enum values or
behavior change.

Part of GoogleCloudPlatform#169: Storage provisioning Phase 0.
Provision was removed from the WorkspaceBackend interface in this PR; the
comment still referenced it. Review finding from PR GoogleCloudPlatform#170 (cosmetic).
Rename the unexported provisionShared function to ProvisionShared so that
the new sciontool provision subcommand (in cmd/) can call it. Update all
in-package callers and tests. No behavior change.

Part of GoogleCloudPlatform#169 storage provisioning PR2.
In k8s init containers only the workspace dir is mounted (PVC subPath),
not its parent. The sentinel file must therefore be written inside the
workspace dir rather than its parent. Add ProvisionInput.SentinelDir:
when empty, defaults to filepath.Dir(HostPath) preserving existing
broker-side behavior; when set, the sentinel is placed there instead.

Add tests proving default placement is unchanged and custom SentinelDir
is honored with idempotency.
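
The defaulting rule can be sketched as below. This is an illustration under assumptions: the `ProvisionInput` here is trimmed to the two fields that matter, and the sentinel file name and `sentinelPath` helper are hypothetical. Paths assume a Unix-style filesystem.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// ProvisionInput is a trimmed-down, hypothetical version of the real struct.
type ProvisionInput struct {
	HostPath    string // workspace directory
	SentinelDir string // optional override; empty means "parent of HostPath"
}

// sentinelFile is a hypothetical file name used only for this sketch.
const sentinelFile = ".scion-provisioned"

// sentinelPath applies the default: an empty SentinelDir falls back to
// filepath.Dir(HostPath), which keeps the broker-side placement unchanged.
func sentinelPath(in ProvisionInput) string {
	dir := in.SentinelDir
	if dir == "" {
		dir = filepath.Dir(in.HostPath)
	}
	return filepath.Join(dir, sentinelFile)
}

func main() {
	// Broker side: parent of the workspace is available.
	fmt.Println(sentinelPath(ProvisionInput{HostPath: "/srv/projects/demo/workspace"}))
	// Init container: only the workspace subPath is mounted.
	fmt.Println(sentinelPath(ProvisionInput{HostPath: "/workspace", SentinelDir: "/workspace"}))
}
```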

Part of GoogleCloudPlatform#169 storage provisioning PR2.
Add a new `sciontool provision` subcommand that replaces the bespoke
shell scripts previously used in NFS workspace init containers.

Clone mode (default): reads SCION_CLONE_URL and SCION_CLONE_BRANCH from
env vars (injection-safe), builds a ProvisionInput with SentinelDir set
to the workspace path (required because only the workspace subPath is
mounted), and calls runtime.ProvisionShared. Idempotent via sentinel.

Wait mode (--wait-for-sentinel): polls for the sentinel file at 2s
intervals with a 300s timeout, replicating nfsWaitForSentinelScript.

Also exports ProvisionSentinelFile constant for use by the CLI.

Part of GoogleCloudPlatform#169 storage provisioning PR2.
Replace the sh -c initScript command with a direct sciontool provision
invocation:
- Lock winner: sciontool provision --depth <N>
- Lock loser: sciontool provision --wait-for-sentinel

URL and branch continue to be passed via env vars (SCION_CLONE_URL,
SCION_CLONE_BRANCH) for injection safety. Depth is a numeric flag.
Init container name, image, securityContext, and volumeMounts unchanged.
Broker-side lock winner/loser selection unchanged.
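
As a rough sketch of the command construction described above: the `--depth` and `--wait-for-sentinel` flags and the two env var names come from this commit, while the function names and signatures below are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// provisionCommand builds the init container command. The lock winner
// clones; the lock loser only waits for the winner's sentinel.
func provisionCommand(lockWinner bool, depth int) []string {
	if lockWinner {
		return []string{"sciontool", "provision", "--depth", strconv.Itoa(depth)}
	}
	return []string{"sciontool", "provision", "--wait-for-sentinel"}
}

// provisionEnv passes the URL and branch as env vars so user-controlled
// strings never appear in the command arguments.
func provisionEnv(cloneURL, branch string) []string {
	return []string{
		"SCION_CLONE_URL=" + cloneURL,
		"SCION_CLONE_BRANCH=" + branch,
	}
}

func main() {
	fmt.Println(provisionCommand(true, 1))
	fmt.Println(provisionCommand(false, 1))
	fmt.Println(provisionEnv("https://example.com/repo.git", "main"))
}
```

Because the command is an argv slice rather than an `sh -c` string, nothing in it is interpreted by a shell.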

Tests will be updated in the next commit to assert the new command format.

Part of GoogleCloudPlatform#169 storage provisioning PR2.
Remove nfsInitProvisionScript, nfsWaitForSentinelScript, and
nfsInitProvisionEnv — replaced by nfsProvisionCommand and nfsProvisionEnv
which build sciontool provision invocations instead of shell scripts.

Update all k8s_nfs_test.go assertions: instead of checking shell script
text for git/sentinel/env-var references, tests now verify that:
- Winner init container runs: sciontool provision --depth <N>
- Loser init container runs: sciontool provision --wait-for-sentinel
- URL/branch are passed via env vars, never in command args
- Winner/loser selection, env injection safety, and no-clone-config
  cases are all preserved with equivalent coverage.

Part of GoogleCloudPlatform#169 storage provisioning PR2.
Move ProvisionShared, its helpers, and the ProvisionInput/ResolvedWorkspace/
ResolvedSharedDir types from pkg/runtime to a new pkg/provision package
that depends only on stdlib + pkg/api + pkg/store (no pkg/config).

This fixes TestInitProjectDataIsolation: the lean sciontool binary can now
import pkg/provision for its `provision` subcommand without transitively
pulling in pkg/config (filesystem-based project path resolution).

Backward-compatible type aliases and a thin ProvisionShared wrapper remain
in pkg/runtime so existing callers (nfsBackend, broker) keep compiling.

Tests for moved symbols (sanitizeBranchName, writeSentinel, validation,
acquireProvisionLock context cancellation) moved to pkg/provision.
Integration tests using nfsTestBackend stay in pkg/runtime.
…ntainer

In a k8s init container only the workspace dir is mounted (subPath), so
filepath.Dir(HostPath) resolves to "/" for HostPath=/workspace — the chown
would recurse over the entire container root (chown -R 1000:1000 /). Today it's
masked by dropped capabilities + a non-root security context (chown fails
silently, provisioning still succeeds), but it's a correctness defect and a
latent security hazard if that security context is ever relaxed.

Extract the target computation into a testable chownTarget() helper that falls
back to the workspace dir itself when the parent is "/" or ".", preserving
broker-side behavior (chown the project root). Add unit coverage.
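
A minimal sketch of the fallback, assuming `chownTarget` takes only the host path; the real helper's signature may differ. Paths assume a Unix-style filesystem.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// chownTarget returns the directory to chown recursively. Normally that is
// the parent of the workspace (the project root), but when the parent is
// "/" or "." the workspace dir itself is used so the chown can never
// recurse over the whole container root.
func chownTarget(hostPath string) string {
	parent := filepath.Dir(hostPath)
	if parent == "/" || parent == "." {
		return hostPath
	}
	return parent
}

func main() {
	fmt.Println(chownTarget("/srv/projects/demo/workspace")) // broker side
	fmt.Println(chownTarget("/workspace"))                   // k8s init container
}
```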

Review finding from PR GoogleCloudPlatform#172 (medium severity, required fix).
…o 'ProvisionShared:'

The Tier-1 body moved out of nfsBackend into the standalone ProvisionShared
function, but its log/error message prefixes still said 'nfsBackend.Provision:',
which is misleading now that it has real (non-NFS) callers like sciontool.
Pure string change, no logic change.

Review finding from PR GoogleCloudPlatform#170 (cosmetic), applied in PR GoogleCloudPlatform#172 where the code lives.
Address gemini-code-assist review on GoogleCloudPlatform#344:

- Add optional ProvisionInput.Ctx (defaults to context.Background() when nil,
  so ProvisionShared's signature is unchanged for existing callers).
- Thread ctx through gitCloneWorkspace, ensureWorktree, and chownProjectTree,
  switching their git/chown invocations to exec.CommandContext so a cancelled
  or timed-out context kills the child process instead of orphaning it.
- gitCloneWorkspace now self-heals an incomplete prior clone: when the target
  dir is non-empty with no .git, clear its contents (removeDirContents keeps
  the dir, which may be a k8s mount point) and retry the clone once.
- sciontool 'provision': import context, pass cmd.Context() into runProvision
  and runWaitForSentinel; set ProvisionInput.Ctx; make the sentinel poll loop
  select on ctx.Done()/time.After (was time.Sleep) and use time.Since.
- Add TestProvisionCmd_WaitForSentinel_ContextCancel covering prompt
  cancellation of the poll loop.
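
The cancellable poll loop might look roughly like this. It is a sketch under assumptions: `waitForSentinel` and `runCancelledWait` are hypothetical names, and the 2s/300s values are taken from the wait-mode description in an earlier commit.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"
)

// waitForSentinel polls for path until it exists, the timeout elapses, or
// ctx is cancelled. Selecting on ctx.Done() alongside time.After means a
// cancellation interrupts the wait instead of sleeping out the interval.
func waitForSentinel(ctx context.Context, path string, interval, timeout time.Duration) error {
	start := time.Now()
	for {
		if _, err := os.Stat(path); err == nil {
			return nil
		}
		if time.Since(start) >= timeout {
			return fmt.Errorf("timed out after %s waiting for %s", timeout, path)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}

// runCancelledWait shows that an already-cancelled context returns promptly
// even with a long poll interval.
func runCancelledWait() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return waitForSentinel(ctx, "/nonexistent/.sentinel", 2*time.Second, 300*time.Second)
}

func main() {
	fmt.Println("wait result:", runCancelledWait())
}
```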
…udPlatform#346)

* fix(ci): handle unchecked error returns flagged by errcheck

Explicitly discard error returns from os.Remove (cleanup in
error paths) and os.WriteFile (test helper) so golangci-lint's
errcheck linter passes.

* style(ci): apply gofmt to unformatted source files

Run gofmt on discord plugin and hub client files that were
failing the CI format check.
…tform#169)

Add Tier-3 vendor mount types to both MountDescriptor (runtime-level)
and api.VolumeMount (config-level) discriminators. Each new type has
its own validation requirements:
- cloudrun-volume: requires volume_name (Cloud Run managed volume)
- gke-shared-volume: requires volume_name (GKE Filestore CSI PVC)

"nfs" remains the literal NFS protocol mount (server + export only).

MountDescriptor gains a VolumeName field for the new types.
api.VolumeMount gains a VolumeName field and updated Validate() switch.
Unit tests cover valid + missing-required-field cases for both new types.
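
A sketch of what the per-type validation could look like. Only `Type` and `VolumeName` are named in this commit; the `Server` and `Export` field names, the error messages, and the rejection of unknown types are assumptions for illustration (the real `api.VolumeMount` likely supports more types).

```go
package main

import (
	"errors"
	"fmt"
)

// VolumeMount is a trimmed-down, hypothetical version of api.VolumeMount.
type VolumeMount struct {
	Type       string
	Server     string // nfs only (assumed field name)
	Export     string // nfs only (assumed field name)
	VolumeName string // cloudrun-volume and gke-shared-volume
}

// Validate checks the fields each mount type requires.
func (m VolumeMount) Validate() error {
	switch m.Type {
	case "nfs":
		if m.Server == "" || m.Export == "" {
			return errors.New("nfs mount requires server and export")
		}
	case "cloudrun-volume", "gke-shared-volume":
		if m.VolumeName == "" {
			return fmt.Errorf("%s mount requires volume_name", m.Type)
		}
	default:
		// Assumption: this sketch rejects anything it does not know.
		return fmt.Errorf("unknown mount type %q", m.Type)
	}
	return nil
}

func main() {
	fmt.Println(VolumeMount{Type: "cloudrun-volume", VolumeName: "shared"}.Validate())
	fmt.Println(VolumeMount{Type: "gke-shared-volume"}.Validate())
}
```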
…volume (GoogleCloudPlatform#169)

Introduce two new WorkspaceBackend implementations:
- cloudrunVolumeBackend: emits MountDescriptor Type "cloudrun-volume"
  with VolumeName and SubPath for Cloud Run managed volumes.
- gkeSharedVolumeBackend: emits MountDescriptor Type "gke-shared-volume"
  with VolumeName, PVClaimName, and SubPath for GKE Filestore CSI PVCs.

Add config types V1CloudRunVolumeConfig and V1GKESharedVolumeConfig to
V1WorkspaceStorageConfig. Extend SelectWorkspaceBackend to route
"cloudrun-volume" and "gke-shared-volume" backend values (SharedPlain
and WorktreePerAgent modes; ClonePerAgent still escapes to local).

Unit tests cover Resolve, Realize, error cases, backend selection,
and default target for all new backends.
…udPlatform#169)

Implement CloudRunRuntime satisfying the Runtime interface, registered
as "cloudrun" in factory.go GetRuntime. Key design decisions:

Broker-side direct provisioning: Cloud Run with a host-mounted share
calls provisionShared (Tier 1) DIRECTLY broker-side — no init container
needed. The runtime's Run method provisions the workspace before the
(deferred) service deployment step. For cloudrun-volume backends the
platform provisions the volume, so provisionShared is skipped.

Lifecycle methods (deploy/exec/logs via Cloud Run Admin API) return
descriptive "not yet implemented" errors — the full container lifecycle
is deferred to a follow-up PR. The provisioning and mount-realization
wiring is complete and tested.

Config: add V1CloudRunConfig (project, region) nested under
V1RuntimeConfig.CloudRun. Factory wires WorkspaceStorage from server
config for backend selection.

Tests cover: runtime name/user, config plumbing, factory selection,
broker-side provisioning with NFS (verifies directory + sentinel),
cloudrun-volume skip path, missing ProjectID, and all lifecycle
methods returning not-implemented errors.
…ionWorkspace

Document that the hardcoded shared-plain mode is intentional for the initial
Cloud Run runtime scope; per-agent worktrees are a follow-up. No logic change.

Review finding from PR GoogleCloudPlatform#171 (low severity, observation 2).
Agents that run local web servers (dev servers, preview apps, test
harnesses) need guaranteed unique host ports. This adds a port pool
managed by the runtime broker that allocates unique ports from a
configurable range at agent startup and releases them on stop/delete.

Each agent receives environment variables:
- AVAILABLE_LOCALHOST_PORT_A, _B (port numbers)
- AVAILABLE_LOCALHOST_URL_A, _B (full URLs, when host_url is configured)

Ports are published via docker -p so they're reachable from the host.

Configuration in settings.yaml:
  server:
    broker:
      port_pool:
        range: "8000-9000"
        ports_per_agent: 2
        host_url: "http://broker-host-ip"

Defaults: range 8000-9000, 2 ports per agent, disabled unless configured.
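
How the allocated ports turn into environment variables can be sketched as follows. The variable names come from this commit; the `portEnv` helper, its signature, and the ordering of the result are hypothetical, and `hostURL` is assumed to have no trailing slash.

```go
package main

import (
	"fmt"
	"strconv"
)

// portEnv maps allocated ports to AVAILABLE_LOCALHOST_PORT_A, _B, ...
// and, when hostURL is non-empty, the matching AVAILABLE_LOCALHOST_URL_*.
func portEnv(ports []int, hostURL string) []string {
	env := make([]string, 0, 2*len(ports))
	for i, port := range ports {
		if i >= 26 {
			break // this sketch only has suffixes A through Z
		}
		suffix := string(rune('A' + i))
		env = append(env, "AVAILABLE_LOCALHOST_PORT_"+suffix+"="+strconv.Itoa(port))
		if hostURL != "" {
			env = append(env, fmt.Sprintf("AVAILABLE_LOCALHOST_URL_%s=%s:%d", suffix, hostURL, port))
		}
	}
	return env
}

func main() {
	for _, kv := range portEnv([]int{8000, 8001}, "http://broker-host-ip") {
		fmt.Println(kv)
	}
}
```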
Fixes from code review:

1. Port leak on agent restart: when an agent is stopped and restarted,
   Start() allocated new ports without releasing the old ones (which
   weren't freed because the previous run didn't go through Delete).
   Fix: call Release(name) before Allocate — idempotent no-op if no
   prior allocation exists.

2. Invalid URL with trailing slash: if host_url was configured as
   "http://example.com/", the constructed URL became
   "http://example.com/:8042". Fix: trim trailing slashes on init.
… review

Debug/refactor findings:
- NewPortPool now returns an error on invalid inputs (bounds, min > max, perAgent <= 0)
- Allocate now validates the agent name and port count
- Added ParsePortRange unit tests
- Added validation error tests for PortPool constructor
- Updated callers to handle new error return
- 267 lines of improvements, all tests pass
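
The kind of input validation listed above might be sketched as follows. The function names, signatures, error messages, and the 1 to 65535 bound are assumptions for illustration; the real ParsePortRange and NewPortPool may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePortRange parses a "low-high" string such as "8000-9000".
func parsePortRange(s string) (lo, hi int, err error) {
	left, right, ok := strings.Cut(s, "-")
	if !ok {
		return 0, 0, fmt.Errorf("port range %q must have the form low-high", s)
	}
	if lo, err = strconv.Atoi(strings.TrimSpace(left)); err != nil {
		return 0, 0, fmt.Errorf("invalid low port in %q: %w", s, err)
	}
	if hi, err = strconv.Atoi(strings.TrimSpace(right)); err != nil {
		return 0, 0, fmt.Errorf("invalid high port in %q: %w", s, err)
	}
	return lo, hi, nil
}

// validatePoolConfig rejects the invalid inputs a constructor should refuse.
func validatePoolConfig(lo, hi, perAgent int) error {
	switch {
	case lo < 1 || hi > 65535:
		return fmt.Errorf("ports %d-%d are outside 1-65535", lo, hi)
	case lo > hi:
		return fmt.Errorf("range start %d is greater than end %d", lo, hi)
	case perAgent <= 0:
		return fmt.Errorf("ports per agent must be positive, got %d", perAgent)
	}
	return nil
}

func main() {
	lo, hi, err := parsePortRange("8000-9000")
	fmt.Println(lo, hi, err)
	fmt.Println(validatePoolConfig(9000, 8000, 2))
}
```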
@zeroasterisk zeroasterisk force-pushed the feature/port-assignment branch from 2cb45e9 to a0d14ea Compare June 7, 2026 22:55
@zeroasterisk

Owner Author

Closing — upstream PR GoogleCloudPlatform#298 is open.

3 participants