feat: add 4 scorer kinds (retrieval, demotion, procedure, dedup) + dashboard surface + dedup React install #40
ntindle wants to merge 15 commits into
Add precision@k, recall@k, MRR, and NDCG@k as pure functions in a new `src/domains/evaluation/ranking.ts` module along with a `scoreRanking` aggregator and supporting helpers (relevance vector construction, unique gold-hit counting, forbidden-item detection). These primitives are the load-bearing math for a forthcoming YAML `retrieval:` scorer that grades "given a query, did the right memories come back in the right order" — a question AgentProbe's existing LLM-judge over a transcript cannot answer directly. The math is pinned by 25 unit tests covering known-answer cases for each metric (including the NDCG@3 of [1, 0, 1] = 1.5 / 1.6309 case), edge behaviour around `k` clamping, the three match policies (exact / substring / regex), and the forbidden-item override path. No I/O, no LLM calls — the scorer wiring lands in a follow-up commit.
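As a rough illustration of the four metrics (not the module's actual API — function names, the clamped-`k` precision denominator, and signatures here are assumptions), a minimal sketch:

```typescript
// Hedged sketches of the four ranking metrics. `rel` is the relevance
// vector: rel[i] = 1 if the item at rank i is in the golden set, else 0.

function precisionAtK(rel: number[], k: number): number {
  const top = rel.slice(0, k); // clamps k to the returned-list length
  return top.length === 0 ? 0 : top.reduce((a, b) => a + b, 0) / top.length;
}

function recallAtK(rel: number[], k: number, goldenSize: number): number {
  const hits = rel.slice(0, k).reduce((a, b) => a + b, 0);
  return goldenSize === 0 ? 0 : hits / goldenSize;
}

function mrr(rel: number[]): number {
  const i = rel.findIndex((r) => r > 0); // rank of the first gold hit
  return i === -1 ? 0 : 1 / (i + 1);
}

function dcg(rel: number[], k: number): number {
  // position i (0-based) is discounted by log2(i + 2)
  return rel.slice(0, k).reduce((sum, r, i) => sum + r / Math.log2(i + 2), 0);
}

function ndcgAtK(rel: number[], k: number): number {
  const ideal = [...rel].sort((a, b) => b - a); // best possible ordering
  const denom = dcg(ideal, k);
  return denom === 0 ? 0 : dcg(rel, k) / denom;
}
```

For the known-answer case cited above, `ndcgAtK([1, 0, 1], 3)` gives DCG = 1 + 0 + 0.5 = 1.5 over an ideal DCG of 1 + 1/log2(3) ≈ 1.6309.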
Introduce a new `retrieval:` block on scenarios that grades the
top-k of a returned items list against a curated golden set, plus an
optional forbidden list whose hits force a fail (used for forget /
scope / demotion probes).
Schema (parsed in `src/domains/validation/load-suite.ts`):

```yaml
retrieval:
  golden: [<text-or-id>, ...]        # required, non-empty
  forbidden: [<text-or-id>, ...]     # optional
  k: 5                               # optional, default max(|returned|, |golden|, 1)
  pass_threshold: 0.5                # optional, default 0.5
  match: substring | exact | regex   # optional, default substring
  weight:                            # optional, defaults all to 1
    precision_at_k: 1.0
    recall_at_k: 1.0
    mrr: 1.0
    ndcg_at_k: 1.0
  source:                            # optional
    raw_exchange_key: "retrieved"    # default raw_exchange field
    fixture: "fixtures/foo.json"     # resolved relative to YAML
```
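A concrete instance might look like the following — the scenario id, texts, and rubric context are illustrative, not taken from the shipped packs:

```yaml
# Hypothetical scenario entry showing the retrieval block in use.
- id: mem-retrieval-example
  retrieval:
    golden:
      - "Sarah leads the Atlas project"
      - "Budget review is quarterly"
    forbidden:
      - "$50K"              # any hit in the top-k forces a fail
    k: 5
    match: substring
    weight:
      mrr: 2.0              # emphasize the rank of the first gold hit
    source:
      fixture: "fixtures/example-retrieval.json"
```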
The scorer (`src/domains/evaluation/retrieval-scorer.ts`) resolves the
retrieved list from either a JSON fixture file or the last adapter
reply's `rawExchange[<key>]`, coerces object payloads to strings via
label/text/name/id heuristics, and delegates to the pure ranking math
module. `RetrievalScore` mirrors the shape of `RubricScore` so it can
slot into `ScenarioRunResult` alongside the judge score.
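The label/text/name/id coercion described above might look like this sketch — key order and the stringify fallback are assumptions, not the scorer's actual code:

```typescript
// Coerce a retrieved payload item to comparable text. Tries a fixed
// list of well-known fields in priority order; falls back to JSON
// serialization for unrecognized shapes (fallback is an assumption).
type Payload = string | Record<string, unknown>;

function coerceToText(item: Payload): string {
  if (typeof item === "string") return item;
  for (const key of ["label", "text", "name", "id"]) {
    const v = item[key];
    if (typeof v === "string" && v.length > 0) return v;
  }
  return JSON.stringify(item);
}
```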
Tests pin: weight/threshold/k validation in the loader, payload
coercion across string/object/mixed inputs, fixture resolution
relative to the scenario YAML, the `missing` source case where no
items can be located, and the forbidden-hit override path.
No runtime wiring yet — `run-suite.ts` lands in a follow-up commit so
each step stays independently green.
Run the retrieval scorer after the LLM-judge phase. The retrieved list is resolved from the last adapter reply's `rawExchange.retrieved` (or a custom key) or from a JSON fixture, and the resulting `RetrievalScore` is attached to `ScenarioRunResult.retrievalScore`. The overall `passed` flag now requires both signals — when a scenario has a `retrieval:` block, a failed retrieval forces a scenario fail even if the judge passed. This matches the brief's "forbidden item forces a fail" semantics for forget / scope-filter / demotion probes. A new `recordRetrievalResult` recorder hook is exposed alongside the existing `recordJudgeResult` so persistence backends can store the result; the SQLite/Postgres recorders implement it in the next commit. Two new runner-level tests cover (1) a perfect rawExchange retrieval that surfaces a passed score and (2) a forbidden-item hit that flips the overall pass to false even when the judge votes pass.
Add a new `retrieval_scores` table to both backends, mirroring the shape of `judge_dimension_scores`. One row per metric per scenario run (precision@k, recall@k, MRR, NDCG@k) plus aggregate columns (k, weighted_score, pass_threshold, passed, totals, source).

Schema rev:
- SQLite: SCHEMA_VERSION 8 -> 9; baseline DDL adds the table, plus a migration step in both `migrations/sqlite.ts` and the legacy in-recorder migrator at `sqlite-run-history.ts:migrateDatabase`.
- Postgres: POSTGRES_TARGET_VERSION 4 -> 5; baseline DDL adds the table; migration `from < 5` block in `migrations/postgres.ts`.

Drizzle schemas (sqlite + postgres) register the table, and the existing inventory test was updated to expect it. Both recorders implement `recordRetrievalResult`, and both `getRun` paths now return `retrievalScores: Array<...>` on each scenario record so dashboard and report consumers can read the data back. Existing schema/version assertions in `tests/unit/db.test.ts` and `tests/unit/persistence/migrations.test.ts` were updated to match.
Surface persisted retrieval scores in the rendered run report and trim a handful of Biome formatting issues across the touched files. `prepareScenarioView` now emits a `retrieval_rows` array (one row per metric, with pretty value/percent labels) and a `retrieval_scores_pretty` JSON blob alongside the existing judge dimension rows. Templates that already exist will see no change until they reference the new keys; this commit deliberately stays close to the existing dimension-row pattern so the dashboard / HTML report can pick it up incrementally. Also makes `ScenarioRecord.retrievalScores` optional so callers that read older runs persisted before the schema upgrade don't trip typecheck. The recorders and getRun paths still write/return the field consistently for new runs.
Add a new `data/retrieval-memory.yaml` pack with five scenarios that exercise dream-pass behaviors via the new ranking scorer:

- `mem-retrieval-forget-on-request`: ranking sibling of `mem-negative-forget-on-request` from the existing multi-session-memory pack. The judge variant asserts the agent's *text* doesn't leak the $50K budget; this variant asserts the budget figure is absent from the retrieval payload's top-k.
- `mem-retrieval-warm-context-sarah`: warm-context relevance. Given a working session that establishes Sarah and the Atlas project, a follow-up query should surface both gold facts near the top of the returned set, graded by precision/recall/MRR/NDCG.
- `mem-retrieval-stale-fact-demotion`: dream-pass stale-fact deprecation as a ranking assertion. Old pricing should be demoted (superseded) and excluded from the top-k; new pricing should appear. The committed fixture intentionally still includes the superseded item so the scenario fails — this documents the negative-test intent and proves the forbidden-hit check is active.
- `mem-retrieval-scope-filter-project`: typed-edge filtering. A `project:atlas`-scoped query must return only Atlas memories; Beacon, invoicing, and other unrelated scopes are forbidden.
- `mem-retrieval-cascading-expiry`: bounded cascade. Invalidating a NorthStar lead should remove Marcus / NorthStar facts but not the adjacent operational basics (CRM, fiscal year, invoicing rule). Validates the cascade is tight to the entity.

Each scenario points at a JSON fixture under `data/fixtures/retrieval/` so the pack can be exercised against the scorer math without a live AutoGPT backend. When running against a real adapter that emits retrieval payloads inline, swap `source.fixture` for `source.raw_exchange_key`; the rest of the config carries over.
The accompanying `tests/unit/retrieval-memory.test.ts` pack pins schema validity (every scenario carries a non-empty `retrieval` block, references a known memory rubric, and the fixture file exists relative to the YAML) plus scoring-math expectations against each fixture.
Add the new "Ranking-scored scenarios grade retrieval relevance against a curated golden set" scenario to `platform.md`, mark it implemented in `current-state.md`, and wire it into `e2e-checklist.md` pointing at the ranking math tests, the retrieval-scorer tests, the YAML-pack tests, and the runner integration test that covers the rawExchange and forbidden-hit paths. Regenerated `generated/workspace-inventory.md` and `QUALITY_SCORE.md` so the repo doc-validation gate is green.
Pure-math primitives for three new scorers that AgentProbe will need to
validate the dream-system roadmap items as they land:
- `clustering.ts`: pairwise precision/recall/F1 and Adjusted Rand Index
(Hubert-Arabie 1985) over partitions of item IDs. Used to grade
memory-dedup passes (P2): did the dedup pass cluster
near-duplicates correctly?
- `procedure-match.ts`: step-coverage F1, LCS-normalized order
similarity, parameter-coverage Jaccard, plus a `scoreProcedure`
aggregator. Used to grade procedure extraction (P1 / `dreaming-procedures.md`):
did the dream pass extract the same ordered set of steps and
parameters as the curated golden procedure?
- `demotion-match.ts`: structural side of demotion correctness.
`assertExpectedSet` for set-level demotion P/R, `assertTimestampDiscipline`
for the Snodgrass retract-vs-soft-delete split (P-1.3), and
`assertCascadeBounded` for the single-hop cascade discipline
(P0.3b). `scoreDemotion` aggregates and treats timestamp
violations and cascade overflow as hard failures, regardless of
the weighted score.
40 unit tests pin known-answer cases: ARI of `{ {a,c},{b,d} }` vs
`{ {a,b},{c,d} }` is -0.5; LCS of the classic CLRS ABCBDAB/BDCAB pair
is 4; the 3-hop runaway-cascade test reproduces the P0.3b failure
mode flagged in `p0-spec.md` §4.
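The LCS known-answer case above is the textbook dynamic program; a self-contained sketch (the normalization the order-similarity metric applies on top of this is an assumption):

```typescript
// Classic O(|a|·|b|) longest-common-subsequence length via a DP table.
// The order-similarity metric presumably divides this by the golden
// step count; that normalization choice is not shown here.
function lcsLength<T>(a: T[], b: T[]): number {
  const dp = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] =
        a[i - 1] === b[j - 1]
          ? dp[i - 1][j - 1] + 1
          : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  return dp[a.length][b.length];
}
```

The CLRS pair gives `lcsLength([..."ABCBDAB"], [..."BDCAB"]) === 4` (e.g. "BCAB").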
These are the load-bearing math for forthcoming scorer kinds
(`dedup`, `procedure`, `demotion`) that will land alongside YAML
schema, persistence, and scenarios in follow-up commits — same
pattern PR #40 used for the `retrieval` scorer.
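The ARI known-answer case can be checked with a direct transcription of the Hubert-Arabie formula — the `string[][]` partition representation here is an assumption, not the module's actual types:

```typescript
// Adjusted Rand Index (Hubert-Arabie 1985) over two partitions of the
// same item set: (index - expected) / (max - expected), where all terms
// are pair counts from the contingency table.
function comb2(n: number): number {
  return (n * (n - 1)) / 2;
}

function adjustedRandIndex(pred: string[][], gold: string[][]): number {
  let sumNij = 0; // sum of C(n_ij, 2) over contingency cells
  for (const p of pred) {
    for (const g of gold) {
      const overlap = p.filter((x) => g.includes(x)).length;
      sumNij += comb2(overlap);
    }
  }
  const sumA = pred.reduce((s, p) => s + comb2(p.length), 0);
  const sumB = gold.reduce((s, g) => s + comb2(g.length), 0);
  const n = pred.reduce((s, p) => s + p.length, 0);
  const expected = (sumA * sumB) / comb2(n);
  const max = (sumA + sumB) / 2;
  return max === expected ? 1 : (sumNij - expected) / (max - expected);
}
```

The cited case — `{{a,c},{b,d}}` vs `{{a,b},{c,d}}` — has zero agreeing pairs, so the index lands at -0.5, below chance.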
Three new scorer kinds, each YAML-declarable as a sibling of the
existing `retrieval:` block, each with fixture + rawExchange source
resolution, each persisted alongside the run.
YAML schema additions (all parsed strictly in `load-suite.ts`):

```yaml
demotion:
  expected_demotions: [<uuid>, ...]        # required, non-empty
  expected_retracts: [<uuid>, ...]         # optional
  cascade:
    expected_direct_neighbors: [<edge_uuid>, ...]
    tangential_edges: [<edge_uuid>, ...]
  pass_threshold: 0.6
  weight: { set_precision, set_recall, set_f1,
            timestamp_discipline, cascade_bounded, cascade_direct_f1 }
  source: { fixture | raw_exchange_key }   # default key: "demotions"

procedure:
  golden_steps: [<step_label>, ...]        # required, ordered
  golden_parameters: [<param_name>, ...]   # optional
  pass_threshold: 0.6
  weight: { step_coverage, step_order, parameter_coverage }
  source: { fixture | raw_exchange_key }   # default key: "procedure"

dedup:
  golden_clusters: [[<uuid>, ...], ...]    # required, non-empty
  pass_threshold: 0.6
  weight: { precision, recall, f1, ari }
  source: { fixture | raw_exchange_key }   # default key: "dedup"
```
Each scorer module follows the same shape as retrieval-scorer.ts:
- coerce* utility extracts canonical payload from JSON or rawExchange
- resolve* threads fixture / rawExchange / missing source resolution
- score* delegates to the pure math module and assembles a typed score
`runScenario` calls all four scorers after the judge phase. The
overall scenario `passed` is now the AND of judge, retrieval, demotion,
procedure, and dedup pass flags. Each defaults to true when its
block is absent on the scenario, so this is backwards-compatible.
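The absent-block-defaults-to-true AND might be sketched like this — the result shape and field names are assumptions, not `ScenarioRunResult`'s actual fields:

```typescript
// Backwards-compatible overall pass: each optional scorer contributes
// its pass flag only when its block was present (score is defined);
// absent blocks default to true via nullish coalescing.
interface ScorerOutcomes {
  judgePassed: boolean;
  retrieval?: { passed: boolean };
  demotion?: { passed: boolean };
  procedure?: { passed: boolean };
  dedup?: { passed: boolean };
}

function overallPassed(r: ScorerOutcomes): boolean {
  return (
    r.judgePassed &&
    (r.retrieval?.passed ?? true) &&
    (r.demotion?.passed ?? true) &&
    (r.procedure?.passed ?? true) &&
    (r.dedup?.passed ?? true)
  );
}
```

A judge-only scenario keeps its old behavior; adding any failing scorer block flips the overall flag.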
The pure math (clustering ARI / pairwise P/R, procedure LCS + Jaccard,
demotion set-precision + timestamp-discipline + cascade-bounded) was
landed in the prior commit and is fully unit-pinned. This commit wires
it through the same loader / scorer / recorder pipeline PR #40 used
for retrieval.
Persistence backends and report rendering land in follow-up commits.
Three new tables that mirror `retrieval_scores` from PR #40, one per scorer kind:

- `demotion_scores` — one row per metric (set_precision, set_recall, set_f1, timestamp_discipline, cascade_bounded, cascade_direct_f1) plus aggregate cols (weighted_score, pass_threshold, passed, timestamp_violation_count, cascade_bounded, source) and JSON blobs (observed_json, expected_json) for replay.
- `procedure_scores` — one row per metric (step_coverage, step_order, parameter_coverage) plus aggregate cols and predicted/golden JSON blobs.
- `dedup_scores` — one row per metric (precision, recall, f1, ari) plus item_count and predicted/golden cluster JSON blobs.

Schema rev:
- SQLite: SCHEMA_VERSION 9 -> 10; baseline DDL ships the three new tables, plus a v10 migration step in both `migrations/sqlite.ts` and the legacy in-recorder migrator at `sqlite-run-history.ts:migrateDatabase` (matched ensure-helpers).
- Postgres: POSTGRES_TARGET_VERSION 5 -> 6; baseline DDL adds the tables; migration `from < 6` block in `migrations/postgres.ts`.

Both backends now implement the `recordDemotionResult`, `recordProcedureResult`, and `recordDedupResult` recorder hooks. Both `getRun` paths now return `demotionScores`, `procedureScores`, and `dedupScores` arrays on each `ScenarioRecord`. Drizzle schema inventory test and `db.test.ts` schema/version assertions updated. The Drizzle schema, in-recorder DDL, and migration step are deliberately redundant — the project ships two parallel SQLite paths (legacy in-recorder + drizzle-shaped) and both must agree per the existing inventory test.
Add `data/dream-validation.yaml` covering the four roadmap items the
new scorer kinds were built to grade:
P-1.3 retract-vs-soft-delete (Snodgrass bi-temporal):
- dream-demotion-retract-discipline (passes against compliant fixture)
- dream-demotion-snodgrass-violation (negative: fixture sets both
timestamps; scenario MUST fail on timestamp_discipline)
P0.3a stale-fact deprecation:
- dream-demotion-stale-fact (pricing edge correctly demoted to
superseded)
P0.3b scoped cascading expiry:
- dream-demotion-cascade-bounded (single-hop discipline held)
- dream-demotion-cascade-runaway (negative: fixture touches a 2+ hop
edge; scenario MUST fail on cascade_bounded)
P1 procedure synthesis (per `dreaming-procedures.md`):
- dream-procedure-weekly-report (4 ordered steps + 2 parameters)
- dream-procedure-client-onboarding (different workflow, exercises
the scorer on a non-degenerate case)
P2 memory dedup:
- dream-dedup-near-duplicates (clean merge of Sarah-billing facts;
HubSpot + fiscal-year stay singletons)
- dream-dedup-false-positive (negative: over-merges Sarah-billing
with Sarah-manager; scenario MUST fail on ARI + pairwise precision)
Eight JSON fixtures under `data/fixtures/dream/` back the offline path;
each scenario also accepts a `source.raw_exchange_key` swap when
AutoGPT's dream pass starts emitting structured payloads inline.
Pack-level pinning in `tests/unit/dream-validation.test.ts` asserts:
- ship-shape counts (>=4 demotion, >=2 procedure, >=2 dedup)
- every fixture exists on disk relative to the YAML
- happy-path scenarios pass against their golden
- negative scenarios fail in the expected way (timestamp_discipline,
cascade_bounded, over-merge)
Docs updated: `platform.md` gets a new spec scenario, `current-state.md`
marks it implemented, `e2e-checklist.md` references the test files,
generated workspace inventory + quality score refreshed.
Three dashboard changes that go together:
1. **EvalScoresView** (`dashboard/src/components/EvalScoresView.tsx`) —
new component that renders retrieval / demotion / procedure / dedup
scores in the existing detail surface. Per-metric bars with the
weight indicator, a header per scorer kind with aggregate
weightedScore / pass_threshold / pass-pill / source-of-payload,
and predicted-vs-golden side-by-side panels for procedures
(ordered step lists) and dedup (cluster lists).
2. **Two surfaces wired up:**
- `DetailPanel`: new "Eval scores" tab next to Conversation /
Rubric, only shown when the scenario actually has eval scores.
- `ScenarioDetailView`: new card under the conversation/rubric
two-column layout, rendered when `hasEvalScores(detail)`.
3. **Markdown.tsx React-types fix.** Upstream `react-markdown` ships
its own bundled React 18 types, which collide with the project's
React 19 — every JSX intrinsic component handler typecheck-fails
with "Two different types with this name exist". Replaced the
imported `Components` map with a project-local `MarkdownComponents`
built off the project's own JSX namespace, and cast through `never`
at the call site. Cleanly typechecks under both React-types
bundles.
4. **React workspace dedup.** `dashboard/package.json` declares
react/react-dom as `peerDependencies` instead of `dependencies`.
Bun was installing two physical copies (root + `dashboard/`),
so dashboard components imported via test files saw a *different*
React module instance than the test's own `React.act` / `useState`
call. Resolved dispatcher → null, dashboard tests crashed at
`resolveDispatcher().useState`. Removing the duplicate install
makes bun resolve react/react-dom to the workspace root for both
surfaces, and the dashboard test suite (`dashboard-app.test.tsx`,
`compare-view.test.tsx`) now passes all 7 previously-failing tests.
Result: full suite is 330 / 330 green. `bun run typecheck` clean
(was failing on Markdown.tsx). `bun run lint` clean.
Bundles all memory-related scenarios into one default preset that's seeded into every fresh server boot alongside the existing Pre Release Checks preset:

- 31 multi-session-memory scenarios (judge-scored: the existing retention/distillation/rigidity/abstention/temporal/continuation/cross-domain/procedural-update/compositional/introspection/long-tail/hygiene/negative pack)
- 5 retrieval-memory scenarios (precision@k/recall@k/MRR/NDCG@k against curated golden sets)
- 9 dream-validation scenarios (demotion / procedure / dedup scorers covering the P-1 -> P2 dream-system roadmap)

Total: 45 scenarios across the three vendored memory packs. Selection is pinned per-file with explicit IDs so the preset is stable across scenario reorders and additive YAML changes; new scenarios get added to the preset by editing the three ID arrays.

Server boot logs `created default preset Full Memory Suite` the first time it sees a fresh DB. It is idempotent on subsequent boots (status becomes `existing`), and a soft-deleted preset is restored on the next boot, following the same pattern as Pre Release Checks.

Also ship `scripts/seed-eval-scores.ts` from the dashboard-demo work: a one-off seeder that writes a synthetic run with retrieval / demotion / procedure / dedup scores populated via the actual scorer modules, so the dashboard EvalScoresView has real data to render in local-dev demos. Reuses the published recorder API; no new shape introduced.

Default-presets test updated to assert both presets seed cleanly and both go through the same created -> existing -> restored lifecycle.
Both the Parallel and Dry-run fields exist on the Run Launch Modal and the Preset Editor, but neither explains what it does — a fresh user has no way to know whether "Parallel" runs scenarios concurrently or something else, or whether "Dry run" hits the LLM at all. Use the existing `Field.hint` slot (already used for "Run name" and "Notes") so the explanation renders inline below each control, matching the surrounding UX pattern:

- **Parallel** — explains concurrency and that scenarios still complete; mentions LLM cost as the tradeoff.
- **Parallel limit** — recommends 2-4 and notes the cost-vs-speed tilt (only present in the launch modal; the preset editor combines both Parallel controls under a single hint).
- **Dry run** — explains exactly what dry-run skips (live adapter, judge, scorers) and what it preserves (run + scenario rows), so the user knows it's the right move for validating config without spending tokens.

Dashboard bundle rebuilt (`dashboard/dist/index.html`) so the running single-file server picks up the new hints.
The /start Run Builder had Parallel and Dry-run as bare Checkbox elements outside any Field, so the inline-hint pattern used elsewhere on the page didn't apply. Wrap each in a Field with hint text matching the pattern used on the RunLaunchModal and PresetEditorView:

- Parallel limit: the existing TextInput Field gets a hint about the cost/latency tradeoff (2-4 typical).
- Dry run: records run + scenario rows but skips adapter, judge, and scorers — use it for config validation without spending tokens.
- Parallel: explains concurrency and ties back to the limit field.
- Save as preset: explains the persistence side effect.

Dashboard bundle rebuilt so the running single-file server picks up the new hints on next page load.
Intent
AgentProbe needs measurement tooling that lights up before AutoGPT's
dream-system roadmap items land, so we can tell whether each step is
working as it ships. This PR builds out the scoring tooling end-to-end
for retrieval ranking (already shipped to AutoGPT as P-1) and for the
next four items on the roadmap, plus a dashboard surface that renders
the results.
(A table here mapped each roadmap item to its scorer kind and failure signal: the retract-discipline probes to `demotion` with a `timestamp_discipline` violation, stale-fact deprecation to `demotion`, the cascade probes to `demotion` (cascade block) with `cascade_bounded=false`, and the remaining items to `retrieval`, `procedure`, and `dedup`.)

The intent is that as AutoGPT ships each item, the corresponding AgentProbe scenarios go from "fixture-mode, scoring-the-math" to "raw-exchange-mode, scoring-the-real-pipeline" with a one-line YAML swap — no further scorer work needed.
Behavior changes
Scenarios may now declare any combination of four scorer blocks: `retrieval:`, `demotion:`, `procedure:`, `dedup:`. Without any of them, behavior is unchanged: the LLM judge remains the only signal. With any of them present, the runner scores each and ANDs the pass flags into the overall scenario `passed`.

Each scorer is configured at the YAML level (per-metric weights, pass threshold, fixture vs. raw-exchange source resolution) — same shape as the existing `retrieval:` block, no new architectural concepts.

Persistence schema: four score tables (`retrieval_scores`, plus `demotion_scores` + `procedure_scores` + `dedup_scores`); SQLite SCHEMA_VERSION goes to 10 and POSTGRES_TARGET_VERSION to 6 for the dream-validation trio. Migrations are additive `create table if not exists`. Drizzle schema inventory test and `db.test.ts` assertions updated.

Dashboard: the `EvalScoresView` component renders all four scorer kinds with per-metric bars, weight indicators, aggregate weightedScore / threshold / pass-pill, and predicted-vs-golden side-by-side panels for procedures and dedup clusters. It surfaces as a new tab on `DetailPanel` (only shown when scores exist) and a third card on the `ScenarioDetailView` route.

Pre-existing failures cleared

Bonus: fixed two pre-existing problems on `main` that were blocking `bun run ci`:

- `Markdown.tsx`. Upstream react-markdown bundles its own React 18 types; the project uses React 19. Every JSX intrinsic component handler typecheck-failed with "Two different types with this name exist." Replaced the imported `Components` type with a project-local one built from the project's own JSX namespace.
- React workspace. bun was installing two physical copies of React; dashboard test files imported React from the root and dashboard components from `dashboard/node_modules/react`, so `resolveDispatcher()` returned null at runtime and all 7 React-rendering dashboard tests crashed. Moved react/react-dom from dashboard's `dependencies` to `peerDependencies`; bun now resolves both surfaces to the workspace root copy.

`bun test` is now 330 / 330 green for the first time on this branch.

How it's wired
Five layers, all pure-functional at the bottom and progressively wired up through the existing run-loop / persistence pipeline:

1. Pure math — colocated `*.test.ts`:
   - `ranking.ts` — precision@k, recall@k, MRR, NDCG@k.
   - `clustering.ts` — pairwise P/R/F1, Adjusted Rand Index (Hubert-Arabie 1985).
   - `procedure-match.ts` — step-coverage F1, LCS-normalized order similarity, parameter Jaccard.
   - `demotion-match.ts` — set-precision/recall, Snodgrass timestamp discipline check, single-hop cascade bound.
2. Scorers — `{retrieval,demotion,procedure,dedup}-scorer.ts`. Each resolves the payload from a fixture or `rawExchange[<key>]`, coerces it to a canonical shape, delegates to the math, and returns a typed `Score`.
3. YAML loader — `load-suite.ts` strictly parses each new block: non-empty `golden`, valid weight keys, threshold ∈ [0,1].
4. Runner — `runScenario` scores all four kinds after the judge phase, records via four new recorder hooks, and threads the score onto `ScenarioRunResult`.
5. Persistence — four tables (`retrieval_scores`, `demotion_scores`, `procedure_scores`, `dedup_scores`), all mirroring the one-row-per-metric shape of `judge_dimension_scores`. SQLite + Postgres recorders both implement the four `record*Result` hooks.

Dashboard — `EvalScoresView` rendered into the existing surfaces; `ScenarioDetail` and `ServerScenario` types extended with the four score arrays.
Scenarios shipped
Retrieval (5) — `data/retrieval-memory.yaml`:

- `mem-retrieval-forget-on-request` (ranking sibling of judge-only `mem-negative-forget-on-request`)
- `mem-retrieval-warm-context-sarah`
- `mem-retrieval-stale-fact-demotion`
- `mem-retrieval-scope-filter-project`
- `mem-retrieval-cascading-expiry`

Dream-validation (9) — `data/dream-validation.yaml`.

Every scenario points at a JSON fixture under `data/fixtures/{retrieval,dream}/`. Pack-level tests pin each fixture against the math; negative scenarios actively prove the scorer catches the intended failure mode.
Scope rails
Scorers read only the `rawExchange` data the adapter already returns; no `EndpointAdapter` shape change.

Risk

- Migrations are additive `create table if not exists`; rolling forward is safe.
- Scenarios that declare a scorer block but yield no locatable payload score `source: "missing"` and fail. Intentional — silently passing a "we got nothing" run would mask regressions.
- `passed` is now AND'd across up to five signals. A scenario that previously passed via judge alone and now adds a failing scorer block will see its overall pass flip.
- The negative scenarios pin both directions, so any regression that silently turns a negative into a positive trips CI.
- The React-types and workspace-dedup fixes touch build infrastructure. Vite build still produces a working single-file bundle; tests all pass; typecheck clean. Worth eyeballing in review.
Out of scope (deferred)
- Live raw-exchange emission for procedure / dedup is AutoGPT-side work. The fixture path exists so the scorers are exercisable now.
- A CI job that brings up `autogpt_platform` via docker-compose, runs AgentProbe against it, and fails the PR on regression. Highest-leverage next step on the AgentProbe side.
- A further scorer whose bottleneck is the curated `(claim, current_truth)` dataset stays out of scope.
Validation
- `bun test`: 330 pass / 0 fail.
- `bun run lint`: clean.
- `bun run typecheck`: clean (root + dashboard).
- `bun run docs:validate`: clean.
- `dashboard/dist/index.html` rebuilt and committed.

Checklist

- `bun run lint` clean.
- `bun run typecheck` clean (root + dashboard).
- `bun run docs:validate` clean.
- `bun test` all green (was 7 pre-existing failures).

Commits
- feat(evaluation): add ranking metrics primitives
- feat(evaluation): add retrieval scorer kind and YAML schema
- feat(evaluation): wire retrieval scorer into runScenario
- feat(persistence): persist retrieval scores in sqlite and postgres
- feat(reporting): expose retrieval metrics in run reports
- feat(data): add five ranking-scored memory scenarios
- docs(specs): document ranking-scored retrieval scenarios
- feat(evaluation): add dedup, procedure, demotion math primitives
- feat(evaluation): add demotion, procedure, dedup scorer kinds
- feat(persistence): persist demotion, procedure, dedup scores
- feat(data): add nine dream-system validation scenarios
- feat(dashboard): surface eval scores + fix React 18/19 types collision