
Sync #11

Open
Swiftyos wants to merge 57 commits into Swiftyos:main from Significant-Gravitas:main

Conversation

Owner

@Swiftyos Swiftyos commented May 5, 2026

No description provided.

Swiftyos and others added 30 commits April 17, 2026 12:18
…Phase 0

Turns the approved AgentProbe server design
(docs/design-docs/agent-probe-server.md) into a binding product contract
before any runtime code lands, per the Phase 0 exec plan
(docs/exec-plans/active/agent-probe-server-phase-0-contract-2026-04.md).

- platform.md: add "Server control plane" scenario group with 9 scenarios
  covering default boot, exposure safety, read-only HTTP/UI history, live
  SSE, run control, cancellation, presets, comparisons, and Docker SQLite
  persistence.
- current-state.md: mirror all 9 new scenarios as unchecked (planned) and
  refresh "Last validated against" to 2026-04-17.
- e2e-checklist.md: add a planned test-owner row per scenario covering
  tests/e2e/server-e2e.test.ts, tests/integration/server/,
  tests/unit/server/, and dashboard component tests.

scripts/check-behaviour-docs.ts reports zero drift across all 24 scenarios.
No runtime files modified; PR targets dev.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs(product-specs): Server control plane contract (SYM-20 Phase 0)
Summary:
- Add start-server config, auth, routing, static dashboard, SQLite read
  routes, suite discovery, report rendering, and SSE replay support.
- Add a dual-mode dashboard that preserves /api/state live polling and
  adds read-only server views for overview, runs, scenarios, suites, and
  settings.
- Add unit, integration, and e2e coverage for server config, auth,
  streams, HTTP read paths, token protection, and CLI lifecycle.

Rationale:
- Phase 1 needs a stable read-only HTTP surface before later write paths
  add run control, presets, cancellation, and other orchestration.
- Binding safety and token auth are enforced at boot and request edges so
  non-loopback exposure cannot accidentally start unauthenticated.
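
A minimal sketch of the boot-time binding check described in the rationale above. The function name, the loopback host list, and the error text are assumptions for illustration; the commit only states that non-loopback exposure must not start unauthenticated.

```typescript
// Hypothetical boot guard: refuse to start on a non-loopback host without a token.
// The host list below is an assumption, not the project's actual definition.
const LOOPBACK_HOSTS = new Set(["127.0.0.1", "::1", "localhost"]);

function assertSafeBinding(host: string, token: string | undefined): void {
  const exposed = !LOOPBACK_HOSTS.has(host);
  const hasToken = token !== undefined && token.trim() !== "";
  if (exposed && !hasToken) {
    throw new Error(
      `refusing to bind ${host} without an auth token; set a token or bind to loopback`,
    );
  }
}
```

Running the check once at boot, before the listener opens, means a misconfigured deployment fails immediately instead of serving unauthenticated traffic.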

Tests:
- bun run docs:validate
- bun run test tests/unit/server
- bun run test tests/integration/server
- bun run test:e2e
- bun run dashboard:build
- bun run typecheck
- bun run lint
- manual start-server smoke with /healthz, /readyz, /api/scenarios,
  and /api/runs

Co-authored-by: Codex <codex@openai.com>
Summary:
- Add authenticated server write routes for starting and cancelling runs,
  preset CRUD, preset launch history, and frozen preset snapshots.
- Extend SQLite run history to schema v4 with preset tables, server-run
  metadata, cancellation timestamps, and WAL-enabled server connections.
- Teach runSuite to accept prepared file/id scenario selections and a
  cooperative AbortSignal while preserving existing CLI and dashboard mode.
- Add dashboard start/preset/cancel views plus Docker and Compose packaging
  with a token-protected SQLite volume deployment path.

Rationale:
- Phase 2 needs an operator workflow that can configure a run, save it as a
  reusable preset, launch it from the server UI, and observe/cancel progress.
- Scenario references now use file plus id so duplicate ids across scenario
  files remain deterministic, and write paths validate boundaries at data-root
  and bearer-auth edges.

Tests:
- bun run docs:validate
- bun run test tests/unit/server
- bun run test tests/integration/server
- bun run test:e2e
- bun run dashboard:build
- bun run typecheck
- bun run fast-feedback
- COMPOSE_PROGRESS=plain AGENTPROBE_SERVER_TOKEN=... OPEN_ROUTER_API_KEY=...
  docker compose -p agentprobe-sym22 up --build --detach
- curl -fsS http://127.0.0.1:7878/healthz
- curl -fsS -X POST http://127.0.0.1:7878/api/runs ... dry-run payload
- OPEN_ROUTER_API_KEY=... docker compose -p agentprobe-sym22-missing config
  exited 1 for missing AGENTPROBE_SERVER_TOKEN

Co-authored-by: Codex <codex@openai.com>
AgentProbe server run control and presets
…SYM-23)

Phase 3 of the AgentProbe server: ship historical-run comparison as an API
and dashboard workspace, and introduce the persistence abstraction needed to
back the server with Postgres behind `AGENTPROBE_DB_URL`.

- Persistence contract: new PersistenceRepository interface, URL parser /
  redactor, migration dispatcher (per-backend versioned) and a backend factory
  that returns SqliteRepository or PostgresRepository. SQLite free-function
  exports stay as compat wrappers; all new consumers go through the interface.
- Postgres backend: full DDL with jsonb columns and indexes for server
  filters + comparison lookups, migration runner, boot-time schema-version
  check that refuses to start when behind, preset/run/comparison reads, and
  preset CRUD. Run recorder (writes) is deferred to SYM-25.
- CLI `agentprobe db:migrate`: accepts --db or AGENTPROBE_DB_URL, prints
  backend / current / target / applied, fails clearly on unsupported schemes.
- Server config + /readyz + new /api/session expose the backend kind and a
  redacted db_url. Postgres URLs are now accepted via --db and env.
- Comparison controller: loads 2–10 runs, chooses alignment (preset snapshot
  → preset id → scenario id → file::id), emits runs, scenarios with per-run
  status/score/reason, delta_score, status_change, present_in, and summary
  buckets. GET /api/comparisons rejects out-of-range counts with the common
  error envelope and sets cache-control: no-store.
- Dashboard /compare workspace: ad-hoc multi-run picker, sticky summary bar,
  per-run columns, missing-scenario rows, "only changes" toggle,
  ?run_ids=…&only=changes deep links, and a "Compare last two runs" CTA on
  the preset detail view.
- Docker Compose example documents Postgres as an opt-in service; SQLite on
  a named volume remains the default.
- Playbook expansion: Postgres setup/migration/rollback/backup, comparison
  semantics, connection errors, and duplicate-scenario-id behaviour.
- Tests: unit coverage for url redaction + parse, migration runner (fresh,
  idempotent, upgrade-in-place, unsupported schemes), factory dispatch, and
  the comparison controller (alignment, delta/status_change, duplicate ids,
  range enforcement, 404 routing). Integration coverage starts a real server
  against seeded SQLite runs and exercises the /api/comparisons payload.

Follow-up SYM-25 tracks the buffered async Postgres recorder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AgentProbe Server Phase 3: comparison API, /compare dashboard, Postgres scaffold (SYM-23)
## Intent

Fix SYM-33 by making the Postgres run-recording limitation visible in the repository type system and by failing fast when the write-enabled server is configured with Postgres.

## Behavior changes

- `PostgresRepository` no longer exposes `createRecorder()`; only `RecordingRepository` implementations can create run recorders.
- `agentprobe start-server` rejects Postgres URLs while run write routes such as `POST /api/runs` are enabled, with a clear SQLite guidance error before schema probing or first write traffic.
- Server docs now state the current Postgres posture: migrations, preset CRUD, and historical reads are supported; run recording remains SQLite-only.
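
The type-level split described above might look roughly like the following sketch. Only the names `PostgresRepository`, `RecordingRepository`, `PersistenceRepository`, and `createRecorder()` come from the PR text; the `RunRecorder` shape, the type guard, and the error message are assumptions.

```typescript
// Hypothetical recorder shape; the real interface is not shown in this PR.
interface RunRecorder {
  recordRunStarted(runId: string): Promise<void>;
}

interface PersistenceRepository {
  readonly kind: "sqlite" | "postgres";
}

// Only repositories that can record runs expose createRecorder().
interface RecordingRepository extends PersistenceRepository {
  createRecorder(): RunRecorder;
}

function isRecordingRepository(
  repo: PersistenceRepository,
): repo is RecordingRepository {
  return typeof (repo as Partial<RecordingRepository>).createRecorder === "function";
}

// Fail fast at boot instead of on the first POST /api/runs.
function assertWriteCapable(repo: PersistenceRepository, writeRoutesEnabled: boolean): void {
  if (writeRoutesEnabled && !isRecordingRepository(repo)) {
    throw new Error(
      `${repo.kind} backend cannot record runs; use a SQLite database for write-enabled mode`,
    );
  }
}
```

Because `createRecorder()` is absent from the base interface, a caller that holds a plain `PersistenceRepository` cannot compile a call to it without first narrowing the type.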

## Validation

- [x] `./scripts/fast-feedback.sh` passed
- [x] Behavior docs updated (if behavior changed)
- [x] `bun test --timeout 20000 tests/e2e/start-server.e2e.test.ts` passed
- [x] `bun test --timeout 20000 tests/e2e/cli.e2e.test.ts` passed
- [x] `git diff --check` passed

## Screenshots / video

N/A for CLI-only changes.
## Intent

Fix the inline dashboard run-detail renderer so malformed scenario ordinals cannot break out of the scenario link href or body text. This is defense-in-depth for run data consumed from `/api/runs/:id`.

## Behavior changes

No intended user-visible behavior changes for valid run data. The inline dashboard now escapes malformed ordinal/count/score values before inserting them into generated HTML.
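
As an illustration of the escaping step, here is a minimal sketch. The helper names and the link path are hypothetical; the PR only states that ordinal, count, and score values are escaped before being inserted into generated HTML.

```typescript
// Escape the five characters that can break out of HTML text or a quoted attribute.
function escapeHtml(value: unknown): string {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical renderer: the route shape is an assumption for illustration.
function scenarioLink(runId: string, ordinal: unknown): string {
  const safeRunId = escapeHtml(runId);
  const safeOrdinal = escapeHtml(ordinal);
  return `<a href="/runs/${safeRunId}/scenarios/${safeOrdinal}">#${safeOrdinal}</a>`;
}
```

With this in place, an ordinal such as `1"><script>` renders as inert text instead of closing the `href` attribute.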

## Validation

- [x] `./scripts/fast-feedback.sh` passed
- [x] `bun test tests/unit/server/inline-dashboard.test.ts` passed
- [x] `bun test --timeout 15000` passed
- [x] Behavior docs updated (if behavior changed; N/A for this security hardening)
## Intent

Fix the Postgres preset-listing 2N+1 query pattern by batching the related scenario-selection and latest-run reads.

## Behavior changes

No user-visible behavior changes. Preset listing keeps the same ordering and returned fields while reducing Postgres round trips from 2N+1 to a constant query count.
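
The move from 2N+1 round trips to a constant count can be sketched with an in-memory stand-in. Everything here is hypothetical (the `PresetStore` interface, field names, and synchronous calls); a real Postgres client would be async and would batch with a single `IN`/`ANY` style lookup.

```typescript
type Preset = { id: string; name: string };
type LatestRun = { presetId: string; runId: string };

// Hypothetical store: each method stands for exactly one database round trip.
interface PresetStore {
  listPresets(): Preset[];
  latestRunsFor(presetIds: string[]): LatestRun[];
}

// Two round trips total, regardless of how many presets exist.
function listPresetsWithLatestRun(
  store: PresetStore,
): Array<Preset & { latestRunId: string | null }> {
  const presets = store.listPresets();
  const latestByPreset = new Map(
    store.latestRunsFor(presets.map((p) => p.id)).map((r) => [r.presetId, r.runId] as const),
  );
  // Map over the original list so ordering and returned fields stay unchanged.
  return presets.map((p) => ({ ...p, latestRunId: latestByPreset.get(p.id) ?? null }));
}
```

The naive version would call a per-preset lookup inside the `map`, which is where the per-row queries come from.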

## Validation

- [x] bun test v1.3.12 (700fc117)
- [x] bun test v1.3.12 (700fc117)
- [x] error TS5025: Unknown compiler option '--noEmit'. Did you mean 'noEmit'?
- [x] Behavior docs updated (if behavior changed): no behavior docs needed for this non-user-visible persistence optimization

## Screenshots / video

N/A for persistence-only changes.
## Intent

Fix SYM-31 by adding explicit CORS handling for the AgentProbe control-plane API. The server now answers `/api/*` OPTIONS preflights centrally, attaches CORS response headers for allowed origins, and fails closed when operators expose the server externally without an explicit origin allow-list.

## Behavior changes

- `/api/*` OPTIONS preflights return `204` for allow-listed origins with allow-methods, allow-headers, allow-credentials, and max-age headers.
- `/api/*` OPTIONS preflights return `403` for unlisted origins.
- Non-preflight `/api/*` responses echo `Access-Control-Allow-Origin` for allowed origins.
- `AGENTPROBE_SERVER_CORS_ORIGINS` configures comma-separated exact `http://` or `https://` origins.
- `--unsafe-expose` / `AGENTPROBE_SERVER_UNSAFE_EXPOSE=true` now requires `AGENTPROBE_SERVER_CORS_ORIGINS` in addition to an auth token.
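
The preflight rules listed above can be sketched as a pure function. The function names, the specific allowed methods and headers, and the max-age value are assumptions; only the 204/403 split, the header kinds, and the exact `http://` or `https://` origin requirement come from the PR description.

```typescript
// Parse a comma-separated allow-list of exact http(s) origins.
function parseAllowedOrigins(raw: string | undefined): Set<string> {
  const origins = new Set<string>();
  for (const entry of (raw ?? "").split(",").map((s) => s.trim()).filter((s) => s !== "")) {
    const url = new URL(entry); // throws on malformed input
    const isHttp = url.protocol === "http:" || url.protocol === "https:";
    if (!isHttp || url.origin !== entry) {
      throw new Error(`CORS origin must be an exact http(s) origin: ${entry}`);
    }
    origins.add(entry);
  }
  return origins;
}

type PreflightResult = { status: number; headers: Record<string, string> };

// Decide the response to an OPTIONS preflight on /api/*.
function preflightResponse(origin: string | null, allowed: Set<string>): PreflightResult {
  if (origin === null || !allowed.has(origin)) {
    return { status: 403, headers: {} };
  }
  return {
    status: 204,
    headers: {
      "Access-Control-Allow-Origin": origin,
      "Access-Control-Allow-Methods": "GET, POST, PUT, DELETE, OPTIONS", // assumed list
      "Access-Control-Allow-Headers": "Authorization, Content-Type", // assumed list
      "Access-Control-Allow-Credentials": "true",
      "Access-Control-Max-Age": "600", // assumed value
    },
  };
}
```

Rejecting entries such as `https://app.example.com/` (trailing slash) at parse time keeps the allow-list comparable to the browser's `Origin` header by plain string equality.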

## Validation

- [x] `./scripts/fast-feedback.sh` passed on merge head `3774c52`
- [x] Behavior docs updated (if behavior changed)
- `bun test tests/unit/server/config.test.ts tests/integration/server/read-only.test.ts`
- `bun test tests/unit/server tests/integration/server`
- `bun run docs:validate`
- `bun test --timeout 15000` (pre-merge validation on original CORS head)

## Screenshots / video

N/A for CLI/server changes.
Add an `agentprobe` Docker Compose healthcheck so Compose can distinguish process start from server readiness. The probe calls `/readyz` from inside the container, and the server playbook documents the command plus failure debugging steps.

Validation:
- docker compose config
- docker compose up --build -d agentprobe; waited for healthy; curl /readyz; docker compose down -v
- bun run docs:validate
- bun test tests/integration/server/read-only.test.ts tests/unit/server/auth.test.ts tests/unit/server/config.test.ts
- git diff --check
- bun run fast-feedback
## Intent

Fix SYM-28 by removing the length-mismatch branch from bearer-token
comparison. `constantTimeEquals` now pads both UTF-8 byte arrays to a fixed
width, performs one timing-safe comparison, and only then gates the return value
on byte-length equality and compare-size bounds.
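
A minimal sketch of that comparison shape, using Node's `timingSafeEqual` (also available in Bun). The 256-byte width is an assumption, and this is not the project's actual implementation.

```typescript
import { Buffer } from "node:buffer";
import { timingSafeEqual } from "node:crypto";

// Assumed fixed comparison width in bytes; the real bound is not stated in the PR.
const COMPARE_WIDTH = 256;

function constantTimeEquals(expected: string, provided: string): boolean {
  const a = Buffer.from(expected, "utf8");
  const b = Buffer.from(provided, "utf8");

  // Pad (or truncate) both inputs to the same width so the comparison
  // always covers COMPARE_WIDTH bytes, whatever the input lengths are.
  const paddedA = Buffer.alloc(COMPARE_WIDTH);
  const paddedB = Buffer.alloc(COMPARE_WIDTH);
  a.subarray(0, COMPARE_WIDTH).copy(paddedA);
  b.subarray(0, COMPARE_WIDTH).copy(paddedB);

  const sameBytes = timingSafeEqual(paddedA, paddedB);

  // Gate on length only after the timing-safe comparison has run.
  const sameLength = a.length === b.length;
  const withinBounds = a.length <= COMPARE_WIDTH && b.length <= COMPARE_WIDTH;
  return sameBytes && sameLength && withinBounds;
}
```

The length gate is still needed for correctness: zero padding alone would make `"a"` and `"a\0"` compare equal, and inputs longer than the width would be compared only on their truncated prefix.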

## Behavior changes

No user-visible behavior changes. Valid configured bearer tokens are still
accepted, invalid tokens are still rejected, and API auth coverage is unchanged.
The internal comparison path no longer has a distinct length-mismatch branch.

## Validation

- [x] `./scripts/fast-feedback.sh` passed
- [x] Behavior docs updated (if behavior changed)
  - Not applicable; this is an internal security hardening with unchanged
    public behavior.
- [x] `bun test tests/unit/server/auth.test.ts`
- [x] `git diff --check`
- [x] `bun test --timeout 20000 tests/e2e/start-server.e2e.test.ts`
- [x] `bun test --timeout 20000 tests/e2e/cli.e2e.test.ts --test-name-pattern "run records|openclaw commands"`

## Screenshots / video

N/A for CLI-only changes.
## Intent

Slim the Docker runtime image for SYM-30 by replacing the runtime-stage full-tree copy with a production dependency install and explicit runtime asset copies.

This removes tests, docs, scripts, agent metadata, and dev dependencies from the final image while preserving the Bun TypeScript server entrypoint, runtime data, and built dashboard bundle.

## Behavior changes

No CLI or API behavior changes. The published Docker image contents are narrower:

- Runtime stage now runs `bun install --production --frozen-lockfile`.
- Final image copies only `src`, `data`, and `dashboard/dist` from the build stage.
- `.dockerignore` excludes docs, scripts, tests, and agent metadata from the build context.

Image evidence:

- Before: `126398737` bytes; `/app` was `191M`.
- After: `71911094` bytes; `/app` is `15M`.
- Reduction: `54487643` bytes, about 43.1%.
- Final image inspection confirms `tests`, `dashboard/src`, `docs`, `scripts`, `.git`, `.agents`, `.claude`, `node_modules/@biomejs`, `node_modules/typescript`, `node_modules/bun-types`, and `node_modules/@types` are absent.

## Validation

- [ ] `./scripts/fast-feedback.sh` passed
  - Not used as the final authority because its `bun run test` step hits existing 5s e2e timeout failures in this workspace; filed SYM-40.
- [x] Behavior docs updated (if behavior changed)
  - No product behavior docs changed; Docker packaging only.

Validation run:

- `docker build -t agentprobe:sym-30-before .` on the original `origin/dev` Dockerfile for reproduction.
- `docker run --rm --entrypoint sh agentprobe:sym-30-before -c '...'` confirmed broad runtime tree and dev deps present.
- `docker build -t agentprobe:sym-30-after .`
- `docker run --rm --entrypoint sh agentprobe:sym-30-after -c '...'` confirmed excluded paths/dev deps absent and runtime paths present.
- `bun run docs:validate` passed.
- `bun run test` failed only on existing 5s e2e timeouts; see SYM-40.
- `bun run test:e2e` reproduced the same 5s e2e timeout failures; see SYM-40.
- `bun test --timeout 30000 tests/e2e` passed: 20 pass, 0 fail.
- `bun run test tests/integration/server` passed: 8 pass, 0 fail.
- `bun run lint` passed.
- `bun run typecheck` passed.
- `bun run dashboard:build` passed.
- `AGENTPROBE_SERVER_TOKEN=sym30-test-token OPEN_ROUTER_API_KEY=dummy-key docker compose -p sym30 up --build -d` passed.
- Compose smoke: `/healthz` returned 200, authenticated `/api/presets` returned 200, authenticated dry-run `/api/runs` completed with `passed: true` and `exitCode: 0`.
- `docker compose -p sym30 down -v` cleaned up the stack.

## Screenshots / video

N/A for CLI-only changes.
## Intent

Fix SYM-27 by making database URL redaction handle passwords that contain reserved characters such as `@`, `:`, `/`, and `%`, then apply redaction consistently to operator-visible output.

## Behavior changes

Database URL passwords are now redacted using URL parsing for all schemes with userinfo credentials. Config errors, SQLite unsupported-URL errors, health/readiness output, migration output, and the start-server banner avoid exposing raw configured passwords.
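
Parsing-based redaction could look like the sketch below, which relies on the standard `URL` class. The function name, the `***` mask, and the fail-closed placeholder are assumptions, not the project's actual output format.

```typescript
// Redact the password component of a database URL using URL parsing
// instead of pattern matching on ':' and '@'.
function redactDbUrl(raw: string): string {
  try {
    const url = new URL(raw);
    if (url.password !== "") {
      url.password = "***";
    }
    return url.toString();
  } catch {
    // Fail closed: never echo a string that could not be parsed,
    // because it may still contain a credential.
    return "<unparseable database url>";
  }
}
```

Letting the URL parser locate the userinfo section avoids guessing where the password ends when it contains percent-encoded `@`, `:`, or `/` characters.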

The branch has also been synced with current `dev`; conflict resolution preserved the `dev` Postgres write-mode restriction, CORS/readiness updates, and Postgres batching tests while keeping the SYM-27 redaction behavior.

## Validation

- [ ] `./scripts/fast-feedback.sh` passed locally
  - Latest landing run passed repo validation, lint, and typecheck, then hit known local 5s e2e timeout cases tracked separately as SYM-41: `agentprobe start-server > boots without OPEN_ROUTER_API_KEY and shuts down on SIGTERM` and `bun e2e baseline for the typescript cli > run records the suite in sqlite and report renders both explicit and discovered outputs`.
- [x] GitHub CI passed for commit `987af8fb5`.
- [x] Behavior docs updated (if behavior changed)
- [x] `bun test tests/unit/persistence/url.test.ts tests/unit/server/config.test.ts`
- [x] `bun run lint` via fast-feedback
- [x] `bun run docs:validate` via fast-feedback
- [x] `bunx tsc --noEmit && bun run --cwd dashboard typecheck` via fast-feedback

## Screenshots / video

N/A for CLI-only changes.
## Summary
- `bun run test`, `bun run test:coverage`, and `bun run test:e2e` now pass `--timeout 30000` so the CLI-spawning e2e cases don't trip Bun's 5s default once `bunfig.toml` enables coverage instrumentation.
- Documented the choice and intent in `docs/HARNESS.md`.

## Test plan
- [x] `bun run test` (88 pass)
- [x] `bun run test:e2e` (19 pass)
- [x] `bun run docs:validate`

Linear: https://linear.app/autogpt/issue/SYM-40

🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Intent

Fix SYM-34 by making server-managed run executor failures observable outside
SSE subscribers. The controller now writes a structured `run_executor` stderr
line, persists the failure on the run record when a run ID exists, and still
publishes the terminal `run_error` stream event.

## Behavior changes

Failed `agentprobe start-server` runs that reach the executor catch path now
remain visible through `/api/runs/:runId` after the SSE session closes, with
`finalError` populated for the historical run detail. Operators also get a JSON
stderr log line with the run ID, error type, message, and stack.

## Validation

- [x] `./scripts/fast-feedback.sh` passed
- [x] Behavior docs updated (if behavior changed)
- [x] `bun test tests/integration/server/write-control.test.ts` passed
- [x] `bun run typecheck` passed
- [x] `bun run lint` passed

## Screenshots / video

N/A for CLI/server behavior changes.
Summary:
- Add Happy DOM CompareView unit tests for the only-changes filter,
  picker apply state, empty aligned rows, and null versus zero score
  formatting.
- Extend comparison integration and controller tests for three-run
  file/id alignment, empty comparison rows, malformed run IDs,
  duplicate run IDs, and structured bad_request responses.
- Reject malformed or duplicate compare run IDs before repository
  lookup, and document the run UUID validation contract.
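
The pre-lookup validation described in the last bullet might be sketched as follows. The function name, result shape, and messages are hypothetical; the 2 to 10 run range comes from the earlier comparison-controller commit, and the UUID and duplicate checks from this one.

```typescript
// Generic UUID shape check; the project's exact pattern is not shown in the PR.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

type RunIdValidation =
  | { ok: true; runIds: string[] }
  | { ok: false; code: "bad_request"; message: string };

// Validate the raw ?run_ids= value before any repository lookup happens.
function validateCompareRunIds(raw: string): RunIdValidation {
  const runIds = raw.split(",").map((id) => id.trim()).filter((id) => id !== "");
  if (runIds.length < 2 || runIds.length > 10) {
    return { ok: false, code: "bad_request", message: "expected between 2 and 10 run ids" };
  }
  const malformed = runIds.find((id) => !UUID_RE.test(id));
  if (malformed !== undefined) {
    return { ok: false, code: "bad_request", message: `malformed run id: ${malformed}` };
  }
  if (new Set(runIds).size !== runIds.length) {
    return { ok: false, code: "bad_request", message: "duplicate run ids are not allowed" };
  }
  return { ok: true, runIds };
}
```

Rejecting duplicates explicitly, instead of deduplicating, surfaces bad input to the caller, which is the behavior change the rationale below calls out.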

Rationale:
- SYM-37 called out compare coverage gaps across both the dashboard UI
  and /api/comparisons endpoint behavior.
- Duplicate run IDs previously deduped silently, which hid bad input and
  made the endpoint validation contract weaker than the ticket requires.

Tests:
- bun test tests/unit/dashboard/compare-view.test.tsx
- bun test tests/integration/server/comparisons.test.ts
- bun test tests/unit/server/comparison.test.ts
- bun run docs:validate
- bun run lint
- bun run typecheck
- bun run fast-feedback

Co-authored-by: Symphony Agent <swifty@symphony.ai>
Co-authored-by: Codex <codex@openai.com>
## Intent

Remove server-layer imports of concrete SQLite persistence helpers by routing controllers and routes through typed persistence repository interfaces. This addresses SYM-35 and keeps backend-specific initialization inside persistence implementations.

## Behavior changes

No behavior changes. The server still uses SQLite for write-enabled start-server mode, while read/preset repository methods remain available through the typed SQLite and Postgres backends.

## Validation

- [x] `./scripts/fast-feedback.sh` passed
- [x] Behavior docs updated (if behavior changed): no behavior changes

Additional evidence:

- `rtk rg -n "providers/persistence/(sqlite|postgres)-|sqlite-run-history|sqlite-connection|postgres-backend|SqliteRunRecorder" src/runtime/server` returns no matches.
- `rtk bun run typecheck` passed.
- `rtk bun run test` passed: 139 tests before latest dev merge; fast feedback passed after the final `origin/dev` merge with 152 tests.
- GitHub CI passed on head `73a110c` after resolving the merge conflict against `dev`.

## Screenshots / video

N/A for CLI-only changes.
Summary:
- Add the repo-local update-harness skill and Claude skill symlink.
- Add a Bun-owned ci command and route GitHub CI through it.
- Tighten generated-doc freshness and workspace inventory generation.

Rationale:
- Keeps the harness upgrade workflow available in-repo for future agents.
- Gives local and hosted CI one shared command instead of duplicated YAML.
- Makes generated inventory depend on tracked files and catches real drift.

Tests:
- bun install --frozen-lockfile
- bun run ci

Co-authored-by: Codex <codex@openai.com>
Summary:
- Make run recording async and add a Postgres recorder for full run
  lifecycle writes.
- Add Postgres storage for encrypted settings and endpoint overrides,
  including schema version 3 migrations.
- Reuse a long-lived Bun SQL client per repository and close it during
  server shutdown.
- Allow start-server to boot with postgres URLs and document the
  required migration and encryption-key workflow.
- Add env-gated Postgres recorder, secret, and migration tests.

Rationale:
- Production server deploys need durable networked persistence while
  preserving SQLite as the local default.
- Postgres writes are async and scenario ids come from BIGSERIAL, so the
  recorder contract now reflects the backend reality.
- Secrets remain encrypted app-side; Postgres deployments require an
  explicit AGENTPROBE_ENCRYPTION_KEY instead of a local sidecar file.

Tests:
- bun run lint
- bunx tsc --noEmit
- bun run test
- bun run docs:validate
- bun run ci

Co-authored-by: Codex <codex@openai.com>
Summary:
- Add the new persistence docs, Postgres recorder, and Postgres tests to
  the generated workspace inventory.

Rationale:
- The files were added to the previous commit, but the inventory was
  generated before they were tracked, causing fast-feedback and CI to
  fail the generated-doc freshness check.

Tests:
- ./scripts/fast-feedback.sh

Co-authored-by: Codex <codex@openai.com>
Swiftyos and others added 26 commits April 30, 2026 09:36
…31)

## Summary
- `assertProcessCompletes` now throws a descriptive error when the per-test
  watchdog fires, instead of silently returning the SIGTERM exit code (143).
  This prevents a timed-out child from masking the real failure with a
  downstream assertion mismatch.
- Adds a 2s SIGKILL escalation after SIGTERM so cleanup is deterministic
  even if the CLI ignores SIGTERM.

## Why
SYM-39 third acceptance criterion: test process cleanup must be deterministic
and a timed-out child must not mask the real failure cause. SYM-40 already
bumped the per-test timeout to 30s so the primary timeout flakiness is gone;
this change makes the remaining timeout path self-explanatory.

## Test plan
- [x] bun run fast-feedback (88 pass, 0 fail)

Linear: https://linear.app/autogpt/issue/SYM-39
* Added deployment workflow and helm charts

* updated workspace docs

* fix lint issues
Move Perf/PerfTracker into src/shared/observability with an
AsyncLocalStorage so persistence and route layers can record spans
without plumbing the tracker through every call site. Wire withPerf
into the response-budget middleware so the breakdown logged on
budget breaches now names the actual culprit (per-table queries in
loadScenarioRecords, repo.getRun, JSON serialization) instead of
showing 'unaccounted'.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
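
The AsyncLocalStorage approach in the commit above can be sketched as follows. `PerfTracker` and `withPerf` are names from the commit message, but their shapes here, along with the `timed` helper, are assumptions for illustration.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

type Span = { name: string; ms: number };

// Hypothetical tracker shape: collects named spans for one request.
class PerfTracker {
  readonly spans: Span[] = [];
  record(name: string, ms: number): void {
    this.spans.push({ name, ms });
  }
}

const perfStorage = new AsyncLocalStorage<PerfTracker>();

// Middleware-side: run the handler with a tracker bound to the async context.
function withPerf<T>(tracker: PerfTracker, fn: () => T): T {
  return perfStorage.run(tracker, fn);
}

// Call-site side: record a span on whichever tracker is active, if any.
// No tracker parameter has to be threaded through the call chain.
function timed<T>(name: string, fn: () => T): T {
  const start = performance.now();
  try {
    return fn();
  } finally {
    perfStorage.getStore()?.record(name, performance.now() - start);
  }
}
```

The same store is visible across `await` boundaries inside `withPerf`, which is what lets persistence code several layers down attribute its time to the current request.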
The run-detail polling endpoint loaded all scenario children (turns,
target events, tool calls, checkpoints, judge dimension scores) even
though the dashboard overview only renders per-scenario summary fields.
The per-scenario route loaded the same full payload then kept one
ordinal. Add GetRunOptions { summary?, ordinal? } and have the routes
request only what each page needs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
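
A rough sketch of how `GetRunOptions { summary?, ordinal? }` can drive what gets loaded. The `ScenarioSource` interface and its methods are hypothetical stand-ins for the repository; only the option names come from the commit message.

```typescript
type GetRunOptions = { summary?: boolean; ordinal?: number };

type ScenarioSummary = { ordinal: number; status: string; score: number | null };
type ScenarioDetail = ScenarioSummary & { turns: string[] };

// Hypothetical data source: loadTurns stands for the expensive child queries.
interface ScenarioSource {
  listSummaries(): ScenarioSummary[];
  loadTurns(ordinal: number): string[];
}

function getRunScenarios(
  source: ScenarioSource,
  options: GetRunOptions = {},
): Array<ScenarioSummary | ScenarioDetail> {
  const all = source.listSummaries();
  // Narrow to one scenario first so children load only for that ordinal.
  const scoped =
    options.ordinal === undefined ? all : all.filter((s) => s.ordinal === options.ordinal);
  if (options.summary) {
    return scoped; // overview pages never touch the child tables
  }
  return scoped.map((s) => ({ ...s, turns: source.loadTurns(s.ordinal) }));
}
```

Filtering before loading, instead of loading everything and keeping one ordinal, is the difference the commit describes for the per-scenario route.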
The HTTP run-detail and per-scenario routes always strip
scenarioSnapshot/personaSnapshot/rubricSnapshot/expectations/tags before
responding, but the postgres reader was loading those wide JSONB columns
anyway via select *. Add SCENARIO_RUN_HTTP_COLUMNS and use it whenever
getRun is invoked with summary or a specific ordinal — internal callers
that need the snapshots still receive them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6aa0226dd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +35 to +39
pathname.startsWith("/runs") ||
pathname.startsWith("/suites") ||
pathname.startsWith("/presets") ||
pathname === "/start" ||
pathname === "/compare"

P1: Add /endpoints to dashboard SPA fallback paths

The static-route allowlist for SPA fallback omits "/endpoints", even though the dashboard router and nav include that page (dashboard/src/App.tsx routes to /endpoints). When serving dashboardDist, a direct browser load or refresh on /endpoints will return 404 instead of index.html, breaking that view outside in-app navigation.
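
One way the suggested fix could look, as a sketch. Only the five conditions quoted in the review are from the diff; the function name is hypothetical, any earlier conditions in the real allowlist are not visible here, and using `startsWith` for `/endpoints` is an assumption.

```typescript
// Hypothetical helper: decide whether a path should fall back to index.html.
function isDashboardRoute(pathname: string): boolean {
  return (
    pathname.startsWith("/runs") ||
    pathname.startsWith("/suites") ||
    pathname.startsWith("/presets") ||
    pathname.startsWith("/endpoints") || // added so direct loads and refreshes work
    pathname === "/start" ||
    pathname === "/compare"
  );
}
```

API paths such as `/api/runs` still fall through, because the checks match on the start of the pathname.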


Comment on lines +118 to +122
const snapshot = context.streamHub.publish({
runId,
kind: "snapshot",
payload: snapshotPayloadForRun(historicalRun),
});

P2: Stop mutating the stream buffer when sending snapshots

When there is no replay buffer but a historical run exists, this code calls streamHub.publish to build a one-off snapshot response. publish appends into the per-run ring and creates a run buffer if missing, so every SSE read of an old run adds retained events/state in memory rather than just serializing a transient payload. This can steadily grow StreamHub state for viewed historical runs because those buffers are never cleared in the normal request path.
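
The distinction the review draws can be illustrated with a toy hub. This is not the project's `StreamHub`; the class below and its `buildTransient` method are hypothetical, and only the append-on-publish behavior is taken from the comment.

```typescript
type StreamEvent = { runId: string; kind: string; payload: unknown; seq: number };
type EventInput = Omit<StreamEvent, "seq">;

// Toy stand-in for a per-run replay buffer.
class ToyStreamHub {
  private readonly buffers = new Map<string, StreamEvent[]>();

  // Appends to the run's buffer, creating it if missing (retains state).
  publish(input: EventInput): StreamEvent {
    const buffer = this.buffers.get(input.runId) ?? [];
    const event: StreamEvent = { ...input, seq: buffer.length + 1 };
    buffer.push(event);
    this.buffers.set(input.runId, buffer);
    return event;
  }

  // Builds an event for one response without touching any buffer.
  buildTransient(input: EventInput): StreamEvent {
    return { ...input, seq: 0 };
  }

  bufferedRunCount(): number {
    return this.buffers.size;
  }
}
```

Serving historical snapshots through something like `buildTransient` keeps memory flat no matter how many old runs are viewed.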


* update rubric

* add human scoring feature with rubric correlation tracking

Adds an end-to-end "Score" surface for human review of completed runs:
a new persisted human_dimension_scores table mirroring judge_dimension_scores,
HTTP routes that drain an unscored backlog one chat at a time, and a React
dashboard view with rubric/objective/tool-call sidebars and Pearson-correlation
pills against the LLM judge scores. Replaces the legacy inline dashboard with
the built React bundle as the only frontend, and adds a one-shot
seed-test-scores script for retargeting old data onto the new product rubric.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix postgres-backend test mock for sql.unsafe column fragments

The PostgresRepository.listPresets path uses `sql.unsafe(RUN_SUMMARY_COLUMNS)`
inline inside a tagged template to interpolate the column list. JS evaluates
that call eagerly before the tagged template runs, so the mock's `sql.unsafe`
was being invoked with just the column list and throwing because the text did
not match any "from <table>" branch. Make `sql.unsafe` return an inert empty
result for fragment-style calls instead of throwing; the parent template still
records the real query string so the existing query-count assertions hold.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refresh generated docs (quality score + workspace inventory)

Re-run of `docs:quality` and `docs:workspace` after the test fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
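
The eager-evaluation behavior behind the `sql.unsafe` mock fix above can be reproduced in isolation. The `sql` and `unsafe` functions below are simplified stand-ins that return strings, not Bun's SQL client; they exist only to show the call order.

```typescript
const calls: string[] = [];

// Stand-in for a fragment helper: records that it was called.
function unsafe(fragment: string): string {
  calls.push(`unsafe:${fragment}`);
  return fragment;
}

// Stand-in tag function: records when the tagged template itself runs.
function sql(strings: TemplateStringsArray, ...values: unknown[]): string {
  calls.push("tag");
  return strings.reduce(
    (query, part, i) => query + part + (i < values.length ? String(values[i]) : ""),
    "",
  );
}

// JavaScript evaluates the interpolated expression before invoking the tag,
// so unsafe() runs first and sees only the column list.
const query = sql`select ${unsafe("id, name")} from presets`;
```

After this runs, `calls` is `["unsafe:id, name", "tag"]`, so a mock that inspects the text passed to `unsafe` never sees the surrounding `from presets` clause.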