
feat: add 4 scorer kinds (retrieval, demotion, procedure, dedup) + dashboard surface + dedup React install#40

Open
ntindle wants to merge 15 commits into main from feat/ranking-metrics

Conversation


@ntindle ntindle commented May 12, 2026

Intent

AgentProbe needs measurement tooling that lights up before AutoGPT's
dream-system roadmap items land, so we can tell whether each step is
working as it ships. This PR builds out the scoring tooling end-to-end
for retrieval ranking (already shipped to AutoGPT as P-1) and for the
next four items on the roadmap, plus a dashboard surface that renders
the results.

| Roadmap | Scorer added | Hard-fail signal |
| --- | --- | --- |
| P-1.3 retract-vs-soft-delete (Snodgrass bi-temporal) | demotion | timestamp_discipline violation |
| P0.3a stale-fact deprecation | demotion | set precision/recall under threshold |
| P0.3b scoped cascading expiry | demotion (cascade block) | cascade_bounded=false |
| P0.6 retrieval-relevance benchmark | retrieval | forbidden-item hit; weighted score under threshold |
| P1 procedure synthesis + Save-as-Skill | procedure | weighted score under threshold |
| P2 memory dedup / near-duplicate merge | dedup | weighted score under threshold |
The intent is that as AutoGPT ships each item, the corresponding
AgentProbe scenarios go from "fixture-mode, scoring-the-math" to
"raw-exchange-mode, scoring-the-real-pipeline" with a one-line YAML
swap — no further scorer work needed.

Behavior changes

Scenarios may now declare any combination of four scorer blocks:
retrieval:, demotion:, procedure:, dedup:. Without any of them,
behavior is unchanged: the LLM judge remains the only signal. With any
of them present, the runner scores each and ANDs the pass flags into
the overall scenario passed:

passed = judge.passed
       && (retrieval.passed ?? true)
       && (demotion.passed  ?? true)
       && (procedure.passed ?? true)
       && (dedup.passed     ?? true)

Each scorer is configured at the YAML level (per-metric weights, pass
threshold, fixture vs. raw-exchange source resolution) — same shape as
the existing retrieval: block, no new architectural concepts.
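As a rough sketch — function and parameter names here are illustrative, not the project's actual API — the per-scorer aggregation each YAML block drives is a weighted mean of metric values checked against the pass threshold:

```typescript
// Hedged sketch: weighted mean of per-metric values in [0,1], compared
// against the YAML-level pass_threshold. Unspecified weights default to 1,
// matching the documented defaults.
function weightedScore(
  metrics: Record<string, number>,
  weights: Record<string, number>,
  passThreshold: number,
): { weightedScore: number; passed: boolean } {
  let sum = 0;
  let wSum = 0;
  for (const [name, value] of Object.entries(metrics)) {
    const w = weights[name] ?? 1; // default weight 1
    sum += w * value;
    wSum += w;
  }
  const score = wSum === 0 ? 0 : sum / wSum;
  return { weightedScore: score, passed: score >= passThreshold };
}
```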

Persistence schema:

  • SQLite: SCHEMA_VERSION 8 → 10 (two bumps: 9 for retrieval_scores,
    10 for demotion_scores + procedure_scores + dedup_scores).
  • Postgres: POSTGRES_TARGET_VERSION 4 → 6 (two bumps: 5 for retrieval,
    6 for the dream-validation trio).
  • All migrations are additive create table if not exists. Drizzle
    schema inventory test and db.test.ts assertions updated.

Dashboard:

  • New EvalScoresView component renders all four scorer kinds with
    per-metric bars, weight indicators, aggregate weightedScore /
    threshold / pass-pill, and predicted-vs-golden side-by-side panels
    for procedures and dedup clusters.
  • Surfaced in two places: an "Eval scores" tab in DetailPanel
    (only shown when scores exist), and a third card on the
    ScenarioDetailView route.

Pre-existing failures cleared

Bonus: fixed two pre-existing problems on main that were blocking
bun run ci:

  1. React 18 vs 19 types collision in Markdown.tsx. Upstream
    react-markdown bundles its own React 18 types; project uses React
    19. Every JSX intrinsic component handler typecheck-failed with
    "Two different types with this name exist." Replaced the imported
    Components type with a project-local one built from the project's
    own JSX namespace.
  2. Duplicate React module instances between root and dashboard
    workspace. bun was installing two physical copies; dashboard test
    files imported React from root and dashboard components from
    dashboard/node_modules/react, so resolveDispatcher() returned
    null at runtime. All 7 React-rendering dashboard tests crashed.
    Moved react/react-dom from dashboard's dependencies to
    peerDependencies; bun now resolves both surfaces to the workspace
    root copy.
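The dashboard/package.json change amounts to a fragment like this (version ranges illustrative):

```json
{
  "peerDependencies": {
    "react": "^19.0.0",
    "react-dom": "^19.0.0"
  }
}
```

With react/react-dom as peerDependencies rather than dependencies, bun resolves both the root and dashboard surfaces to the single workspace-root copy.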

bun test is now 330 / 330 green for the first time on this branch.

How it's wired

Six layers, all pure-functional at the bottom and progressively
wired up through the existing run-loop / persistence pipeline:

  1. Pure math — pure modules, each with colocated *.test.ts:

    • ranking.ts — precision@k, recall@k, MRR, NDCG@k.
    • clustering.ts — pairwise P/R/F1, Adjusted Rand Index
      (Hubert-Arabie 1985).
    • procedure-match.ts — step-coverage F1, LCS-normalized order
      similarity, parameter Jaccard.
    • demotion-match.ts — set-precision/recall, Snodgrass timestamp
      discipline check, single-hop cascade bound.
  2. Scorers — {retrieval,demotion,procedure,dedup}-scorer.ts. Each
    resolves payload from fixture or rawExchange[<key>], coerces to a
    canonical shape, delegates to the math, returns a typed Score.

  3. YAML loader — load-suite.ts strictly parses each new block:
    non-empty golden, valid weight keys, threshold ∈ [0,1].

  4. Runner — runScenario scores all four kinds after the judge
    phase, records via four new recorder hooks, and threads the score
    onto ScenarioRunResult.

  5. Persistence — four tables (retrieval_scores, demotion_scores,
    procedure_scores, dedup_scores), all mirroring the
    one-row-per-metric shape of judge_dimension_scores. SQLite +
    Postgres recorders both implement the four record*Result hooks.

  6. Dashboard — EvalScoresView rendered into existing surfaces;
    ScenarioDetail and ServerScenario types extended with the four
    score arrays.
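The fixture-vs-rawExchange source resolution the scorers share can be sketched as follows; `resolvePayload` and its parameter names are mine, not the module's, and the real coercion logic is richer:

```typescript
import { readFileSync } from "node:fs";
import { dirname, resolve } from "node:path";

type Resolved =
  | { source: "fixture" | "raw_exchange"; items: unknown }
  | { source: "missing" };

// Hedged sketch: prefer an explicit fixture (resolved relative to the
// scenario YAML), then the adapter's rawExchange under the configured or
// default key, else report "missing" — which scores as a fail, never a
// silent pass.
function resolvePayload(
  scenarioYamlPath: string,
  cfg: { fixture?: string; raw_exchange_key?: string },
  rawExchange: Record<string, unknown> | undefined,
  defaultKey: string,
): Resolved {
  if (cfg.fixture) {
    const path = resolve(dirname(scenarioYamlPath), cfg.fixture);
    return { source: "fixture", items: JSON.parse(readFileSync(path, "utf8")) };
  }
  const key = cfg.raw_exchange_key ?? defaultKey;
  if (rawExchange && key in rawExchange) {
    return { source: "raw_exchange", items: rawExchange[key] };
  }
  return { source: "missing" };
}
```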

Scenarios shipped

Retrieval (5) — data/retrieval-memory.yaml:

  • mem-retrieval-forget-on-request (ranking sibling of judge-only
    mem-negative-forget-on-request)
  • mem-retrieval-warm-context-sarah
  • mem-retrieval-stale-fact-demotion
  • mem-retrieval-scope-filter-project
  • mem-retrieval-cascading-expiry

Dream-validation (9) — data/dream-validation.yaml:

  • Snodgrass retract discipline (positive + negative)
  • Stale-fact demotion
  • Single-hop cascade (positive + negative)
  • Two procedure-extraction workflows
  • Dedup near-duplicate merge (positive + over-merge negative)

Every scenario points at a JSON fixture under
data/fixtures/{retrieval,dream}/. Pack-level tests pin each fixture
against the math; negative scenarios actively prove the scorer catches
the intended failure mode.

Scope rails

  • No new env vars. No new top-level deps.
  • Adapter abstractions untouched. The scorers read rawExchange data
    the adapter already returns; no EndpointAdapter shape change.
  • No vendoring of additional AutoGPT-side code.

Risk

  • Persistence migration touches both backends. All migrations are
    additive create table if not exists; forward is safe.
  • Scenarios with a scorer block but no fixture and no adapter
    payload score source: "missing" and fail. This is intentional —
    silently passing a "we got nothing" run would mask regressions.
  • The overall passed is now AND'd across up to five signals.
    A scenario that previously passed via judge alone and now adds a
    failing scorer block will see its overall pass flip.
  • Negative-test scenarios are designed to fail. Pack-level tests
    pin both directions, so any regression that silently turns a negative
    into a positive trips CI.
  • The React peer-dep + Markdown type changes affect dashboard
    build infrastructure. Vite build still produces a working single-file
    bundle; tests all pass; typecheck clean. Worth eyeballing in review.

Out of scope (deferred)

  • Adapter-side payload emission for retrieval / demotion /
    procedure / dedup is AutoGPT-side work. The fixture path exists so
    the scorers are exercisable now.
  • CI lane that boots autogpt_platform via docker-compose, runs
    AgentProbe against it, and fails the PR on regression. Highest
    leverage next step on the AgentProbe side.
  • Web-fact-check P/R scorer (P0.5). Same shape as the demotion
    scorer; bottleneck is the curated (claim, current_truth) dataset.
  • Multi-judge aggregation. Pre-existing config-only feature; out
    of scope.

Validation

  • bun test: 330 pass / 0 fail.
  • bun run lint: clean.
  • bun run typecheck: clean (root + dashboard).
  • bun run docs:validate: clean.
  • dashboard/dist/index.html rebuilt and committed.

Checklist

  • bun run lint clean.
  • bun run typecheck clean (root + dashboard).
  • bun run docs:validate clean.
  • bun test all green (was 7 pre-existing failures).
  • No new env vars.
  • No secrets committed.
  • Behavior docs updated.
  • Dashboard bundle rebuilt.

Commits

  • feat(evaluation): add ranking metrics primitives
  • feat(evaluation): add retrieval scorer kind and YAML schema
  • feat(evaluation): wire retrieval scorer into runScenario
  • feat(persistence): persist retrieval scores in sqlite and postgres
  • feat(reporting): expose retrieval metrics in run reports
  • feat(data): add five ranking-scored memory scenarios
  • docs(specs): document ranking-scored retrieval scenarios
  • feat(evaluation): add dedup, procedure, demotion math primitives
  • feat(evaluation): add demotion, procedure, dedup scorer kinds
  • feat(persistence): persist demotion, procedure, dedup scores
  • feat(data): add nine dream-system validation scenarios
  • feat(dashboard): surface eval scores + fix React 18/19 types collision

ntindle added 11 commits May 12, 2026 14:27
Add precision@k, recall@k, MRR, and NDCG@k as pure functions in a new
`src/domains/evaluation/ranking.ts` module along with a `scoreRanking`
aggregator and supporting helpers (relevance vector construction, unique
gold-hit counting, forbidden-item detection).

These primitives are the load-bearing math for a forthcoming YAML
`retrieval:` scorer that grades "given a query, did the right memories
come back in the right order" — a question AgentProbe's existing
LLM-judge over a transcript cannot answer directly.

The math is pinned by 25 unit tests covering known-answer cases for
each metric (including the NDCG@3 of [1, 0, 1] = 1.5 / 1.6309 case),
edge behaviour around `k` clamping, the three match policies
(exact / substring / regex), and the forbidden-item override path.
No I/O, no LLM calls — the scorer wiring lands in a follow-up commit.
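A hedged re-derivation of three of the primitives named above, over a binary relevance vector (the real module's signatures and k-clamping policy may differ; dividing precision by the clamped window is an assumption here):

```typescript
// precision@k over a 0/1 relevance vector, dividing by the clamped window.
function precisionAtK(rel: number[], k: number): number {
  const top = rel.slice(0, k);
  return top.length === 0 ? 0 : top.reduce((a, b) => a + b, 0) / top.length;
}

// Mean reciprocal rank of the first relevant hit (0 if none).
function mrr(rel: number[]): number {
  const i = rel.findIndex((r) => r > 0);
  return i === -1 ? 0 : 1 / (i + 1);
}

// Discounted cumulative gain at k with log2 position discount.
function dcg(rel: number[], k: number): number {
  return rel.slice(0, k).reduce((s, r, i) => s + r / Math.log2(i + 2), 0);
}

// NDCG@k: DCG normalized by the ideal (descending-sorted) ordering.
function ndcgAtK(rel: number[], k: number): number {
  const ideal = dcg([...rel].sort((a, b) => b - a), k);
  return ideal === 0 ? 0 : dcg(rel, k) / ideal;
}
```

The pinned case above follows directly: DCG of [1, 0, 1] at k=3 is 1 + 0 + 0.5 = 1.5, and the ideal ordering [1, 1, 0] gives 1 + 1/log2(3) ≈ 1.6309.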
Introduce a new `retrieval:` block on scenarios that grades the
top-k of a returned items list against a curated golden set, plus an
optional forbidden list whose hits force a fail (used for forget /
scope / demotion probes).

Schema (parsed in `src/domains/validation/load-suite.ts`):

    retrieval:
      golden:    [<text-or-id>, ...]            # required, non-empty
      forbidden: [<text-or-id>, ...]            # optional
      k: 5                                       # optional, default max(|returned|,|golden|,1)
      pass_threshold: 0.5                        # optional, default 0.5
      match: substring | exact | regex           # optional, default substring
      weight:                                    # optional, defaults all to 1
        precision_at_k: 1.0
        recall_at_k:    1.0
        mrr:            1.0
        ndcg_at_k:      1.0
      source:                                    # optional
        raw_exchange_key: "retrieved"            #   default raw_exchange field
        fixture: "fixtures/foo.json"             #   resolved relative to YAML

The scorer (`src/domains/evaluation/retrieval-scorer.ts`) resolves the
retrieved list from either a JSON fixture file or the last adapter
reply's `rawExchange[<key>]`, coerces object payloads to strings via
label/text/name/id heuristics, and delegates to the pure ranking math
module. `RetrievalScore` mirrors the shape of `RubricScore` so it can
slot into `ScenarioRunResult` alongside the judge score.

Tests pin: weight/threshold/k validation in the loader, payload
coercion across string/object/mixed inputs, fixture resolution
relative to the scenario YAML, the `missing` source case where no
items can be located, and the forbidden-hit override path.

No runtime wiring yet — `run-suite.ts` lands in a follow-up commit so
each step stays independently green.
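The three match policies can be sketched as below; the function name and the case-insensitivity choice are assumptions, not the scorer's confirmed behavior:

```typescript
type MatchPolicy = "exact" | "substring" | "regex";

// Hedged sketch: does a returned item match a golden/forbidden entry under
// the configured policy? Case-insensitivity for substring/regex is assumed.
function matches(item: string, gold: string, policy: MatchPolicy): boolean {
  switch (policy) {
    case "exact":
      return item === gold;
    case "substring":
      return item.toLowerCase().includes(gold.toLowerCase());
    case "regex":
      return new RegExp(gold, "i").test(item);
  }
}
```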
Run the retrieval scorer after the LLM-judge phase. The retrieved
list is resolved from the last adapter reply's `rawExchange.retrieved`
(or a custom key) or from a JSON fixture, and the resulting
`RetrievalScore` is attached to `ScenarioRunResult.retrievalScore`.

The overall `passed` flag now requires both signals — when a scenario
has a `retrieval:` block, a failed retrieval forces a scenario fail
even if the judge passed. This matches the brief's "forbidden item
forces a fail" semantics for forget / scope-filter / demotion probes.

A new `recordRetrievalResult` recorder hook is exposed alongside the
existing `recordJudgeResult` so persistence backends can store the
result; the SQLite/Postgres recorders implement it in the next commit.

Two new runner-level tests cover (1) a perfect rawExchange retrieval
that surfaces a passed score and (2) a forbidden-item hit that flips
the overall pass to false even when the judge votes pass.
Add a new `retrieval_scores` table to both backends, mirroring the
shape of `judge_dimension_scores`. One row per metric per scenario
run (precision@k, recall@k, MRR, NDCG@k) plus aggregate columns
(k, weighted_score, pass_threshold, passed, totals, source).

Schema rev:
 - SQLite: SCHEMA_VERSION 8 -> 9; baseline DDL adds the table, plus a
   migration step in both `migrations/sqlite.ts` and the legacy
   in-recorder migrator at `sqlite-run-history.ts:migrateDatabase`.
 - Postgres: POSTGRES_TARGET_VERSION 4 -> 5; baseline DDL adds the
   table; migration `from < 5` block in `migrations/postgres.ts`.

Drizzle schemas (sqlite + postgres) register the table and the
existing inventory test was updated to expect it. Both recorders
implement `recordRetrievalResult` and both `getRun` paths now return
`retrievalScores: Array<...>` on each scenario record so dashboard
and report consumers can read the data back.

Existing schema/version assertions in `tests/unit/db.test.ts` and
`tests/unit/persistence/migrations.test.ts` were updated to match.
Surface persisted retrieval scores in the rendered run report and
trim a handful of Biome formatting issues across the touched files.

`prepareScenarioView` now emits a `retrieval_rows` array (one row per
metric, with pretty value/percent labels) and a `retrieval_scores_pretty`
JSON blob alongside the existing judge dimension rows. Templates that
already exist will see no change until they reference the new keys;
this commit deliberately stays close to the existing dimension-row
pattern so the dashboard / HTML report can pick it up incrementally.

Also makes `ScenarioRecord.retrievalScores` optional so callers that
read older runs persisted before the schema upgrade don't trip
typecheck. The recorders and getRun paths still write/return the
field consistently for new runs.
Add a new `data/retrieval-memory.yaml` pack with five scenarios that
exercise dream-pass behaviors via the new ranking scorer:

- `mem-retrieval-forget-on-request`: ranking sibling of
  `mem-negative-forget-on-request` from the existing
  multi-session-memory pack. The judge variant asserts the agent's
  *text* doesn't leak the $50K budget; this variant asserts the
  budget figure is absent from the retrieval payload's top-k.

- `mem-retrieval-warm-context-sarah`: warm-context relevance.
  Given a working session that establishes Sarah and the Atlas
  project, a follow-up query should surface both gold facts near
  the top of the returned set, graded by precision/recall/MRR/NDCG.

- `mem-retrieval-stale-fact-demotion`: dream-pass stale-fact
  deprecation as a ranking assertion. Old pricing should be
  demoted (superseded) and excluded from the top-k; new pricing
  should appear. The committed fixture intentionally still
  includes the superseded item so the scenario fails — this
  documents the negative-test intent and proves the forbidden-hit
  check is active.

- `mem-retrieval-scope-filter-project`: typed-edge filtering.
  A `project:atlas`-scoped query must return only Atlas memories;
  Beacon, invoicing, and other unrelated scopes are forbidden.

- `mem-retrieval-cascading-expiry`: bounded cascade. Invalidating
  a NorthStar lead should remove Marcus / NorthStar facts but not
  the adjacent operational basics (CRM, fiscal year, invoicing
  rule). Validates the cascade is tight to the entity.

Each scenario points at a JSON fixture under
`data/fixtures/retrieval/` so the pack can be exercised against the
scorer math without a live AutoGPT backend. When running against a
real adapter that emits retrieval payloads inline, swap
`source.fixture` for `source.raw_exchange_key`; the rest of the
config carries over.
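The fixture-to-live swap is just the `source` block changing (fixture path illustrative):

```yaml
# offline: score the committed fixture
source:
  fixture: "fixtures/retrieval/warm-context.json"

# live: score the adapter's inline payload
source:
  raw_exchange_key: "retrieved"
```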

The accompanying `tests/unit/retrieval-memory.test.ts` pack pins
schema validity (every scenario carries a non-empty `retrieval`
block, references a known memory rubric, and the fixture file
exists relative to the YAML) plus scoring-math expectations against
each fixture.
Add the new "Ranking-scored scenarios grade retrieval relevance against
a curated golden set" scenario to `platform.md`, mark it implemented in
`current-state.md`, and wire it into `e2e-checklist.md` pointing at the
ranking math tests, the retrieval-scorer tests, the YAML-pack tests,
and the runner integration test that covers the rawExchange and
forbidden-hit paths.

Regenerated `generated/workspace-inventory.md` and `QUALITY_SCORE.md`
so the repo doc-validation gate is green.
Pure-math primitives for three new scorers that AgentProbe will need to
validate the dream-system roadmap items as they land:

- `clustering.ts`: pairwise precision/recall/F1 and Adjusted Rand Index
  (Hubert-Arabie 1985) over partitions of item IDs. Used to grade
  memory-dedup passes (P2): did the dedup pass cluster
  near-duplicates correctly?

- `procedure-match.ts`: step-coverage F1, LCS-normalized order
  similarity, parameter-coverage Jaccard, plus a `scoreProcedure`
  aggregator. Used to grade procedure extraction (P1 / `dreaming-procedures.md`):
  did the dream pass extract the same ordered set of steps and
  parameters as the curated golden procedure?

- `demotion-match.ts`: structural side of demotion correctness.
  `assertExpectedSet` for set-level demotion P/R, `assertTimestampDiscipline`
  for the Snodgrass retract-vs-soft-delete split (P-1.3), and
  `assertCascadeBounded` for the single-hop cascade discipline
  (P0.3b). `scoreDemotion` aggregates and treats timestamp
  violations and cascade overflow as hard failures, regardless of
  the weighted score.

40 unit tests pin known-answer cases: ARI of `{ {a,c},{b,d} }` vs
`{ {a,b},{c,d} }` is -0.5; LCS of the classic CLRS ABCBDAB/BDCAB pair
is 4; the 3-hop runaway-cascade test reproduces the P0.3b failure
mode flagged in `p0-spec.md` §4.

These are the load-bearing math for forthcoming scorer kinds
(`dedup`, `procedure`, `demotion`) that will land alongside YAML
schema, persistence, and scenarios in follow-up commits — same
pattern PR #40 used for the `retrieval` scorer.
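The pair-counting ARI pinned above can be re-derived as a sketch (type and function names are mine, not the module's):

```typescript
type Partition = string[][];

function clusterOf(p: Partition): Map<string, number> {
  const m = new Map<string, number>();
  p.forEach((cluster, i) => cluster.forEach((id) => m.set(id, i)));
  return m;
}

// Hedged sketch of the Hubert-Arabie Adjusted Rand Index in its
// pair-counting form: classify every item pair as same/same, same/diff,
// diff/same, or diff/diff across the two partitions, then apply
// 2(ss*dd - sd*ds) / ((ss+sd)(sd+dd) + (ss+ds)(ds+dd)).
function adjustedRandIndex(a: Partition, b: Partition): number {
  const ca = clusterOf(a);
  const cb = clusterOf(b);
  const ids = [...ca.keys()];
  let ss = 0, sd = 0, ds = 0, dd = 0;
  for (let i = 0; i < ids.length; i++) {
    for (let j = i + 1; j < ids.length; j++) {
      const sameA = ca.get(ids[i]) === ca.get(ids[j]);
      const sameB = cb.get(ids[i]) === cb.get(ids[j]);
      if (sameA && sameB) ss++;
      else if (sameA) sd++;
      else if (sameB) ds++;
      else dd++;
    }
  }
  const num = 2 * (ss * dd - sd * ds);
  const den = (ss + sd) * (sd + dd) + (ss + ds) * (ds + dd);
  return den === 0 ? 1 : num / den;
}
```

On the known-answer case, { {a,c},{b,d} } vs { {a,b},{c,d} } yields ss=0, sd=2, ds=2, dd=2, giving -8/16 = -0.5.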
Three new scorer kinds, each YAML-declarable as a sibling of the
existing `retrieval:` block, each with fixture + rawExchange source
resolution, each persisted alongside the run.

YAML schema additions (all parsed strictly in load-suite.ts):

    demotion:
      expected_demotions: [<uuid>, ...]            # required, non-empty
      expected_retracts:  [<uuid>, ...]            # optional
      cascade:
        expected_direct_neighbors: [<edge_uuid>, ...]
        tangential_edges:          [<edge_uuid>, ...]
      pass_threshold: 0.6
      weight: { set_precision, set_recall, set_f1,
                timestamp_discipline, cascade_bounded, cascade_direct_f1 }
      source: { fixture | raw_exchange_key }       # default key: "demotions"

    procedure:
      golden_steps: [<step_label>, ...]            # required, ordered
      golden_parameters: [<param_name>, ...]       # optional
      pass_threshold: 0.6
      weight: { step_coverage, step_order, parameter_coverage }
      source: { fixture | raw_exchange_key }       # default key: "procedure"

    dedup:
      golden_clusters: [[<uuid>, ...], ...]        # required, non-empty
      pass_threshold: 0.6
      weight: { precision, recall, f1, ari }
      source: { fixture | raw_exchange_key }       # default key: "dedup"

Each scorer module follows the same shape as retrieval-scorer.ts:
- coerce* utility extracts canonical payload from JSON or rawExchange
- resolve* threads fixture / rawExchange / missing source resolution
- score* delegates to the pure math module and assembles a typed score

`runScenario` calls all four scorers after the judge phase. The
overall scenario `passed` is now the AND of judge, retrieval, demotion,
procedure, and dedup pass flags. Each defaults to true when its
block is absent on the scenario, so this is backwards-compatible.

The pure math (clustering ARI / pairwise P/R, procedure LCS + Jaccard,
demotion set-precision + timestamp-discipline + cascade-bounded) was
landed in the prior commit and is fully unit-pinned. This commit wires
it through the same loader / scorer / recorder pipeline PR #40 used
for retrieval.

Persistence backends and report rendering land in follow-up commits.
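The LCS-based step-order similarity can be sketched as follows; normalizing by the golden length is an assumption about the module's choice:

```typescript
// Classic dynamic-programming longest common subsequence length.
function lcsLength<T>(a: T[], b: T[]): number {
  const dp = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] =
        a[i - 1] === b[j - 1]
          ? dp[i - 1][j - 1] + 1
          : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  return dp[a.length][b.length];
}

// Hedged sketch: order similarity as LCS of predicted vs. golden step
// sequences, normalized by the golden length.
function stepOrderSimilarity(predicted: string[], golden: string[]): number {
  return golden.length === 0 ? 0 : lcsLength(predicted, golden) / golden.length;
}
```

The CLRS pair pinned in the math tests falls out directly: the LCS of ABCBDAB and BDCAB has length 4.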
Three new tables that mirror retrieval_scores from PR #40, one per
scorer kind:

- `demotion_scores` — one row per metric (set_precision, set_recall,
  set_f1, timestamp_discipline, cascade_bounded, cascade_direct_f1)
  plus aggregate cols (weighted_score, pass_threshold, passed,
  timestamp_violation_count, cascade_bounded, source) and JSON
  blobs (observed_json, expected_json) for replay.

- `procedure_scores` — one row per metric (step_coverage, step_order,
  parameter_coverage) plus aggregate cols and predicted/golden JSON
  blobs.

- `dedup_scores` — one row per metric (precision, recall, f1, ari)
  plus item_count, predicted/golden cluster JSON blobs.

Schema rev:
 - SQLite: SCHEMA_VERSION 9 -> 10; baseline DDL ships the three new
   tables, plus a v10 migration step in both `migrations/sqlite.ts`
   and the legacy in-recorder migrator at
   `sqlite-run-history.ts:migrateDatabase` (matched ensure-helpers).
 - Postgres: POSTGRES_TARGET_VERSION 5 -> 6; baseline DDL adds the
   tables; migration `from < 6` block in `migrations/postgres.ts`.

Both backends now implement `recordDemotionResult`,
`recordProcedureResult`, `recordDedupResult` recorder hooks. Both
`getRun` paths now return `demotionScores`, `procedureScores`, and
`dedupScores` arrays on each `ScenarioRecord`. Drizzle schema
inventory test and `db.test.ts` schema/version assertions updated.

The Drizzle schema, in-recorder DDL, and migration step are
deliberately redundant — the project ships two parallel SQLite paths
(legacy in-recorder + drizzle-shaped) and both must agree per the
existing inventory test.
Add `data/dream-validation.yaml` covering the four roadmap items the
new scorer kinds were built to grade:

P-1.3 retract-vs-soft-delete (Snodgrass bi-temporal):
  - dream-demotion-retract-discipline (passes against compliant fixture)
  - dream-demotion-snodgrass-violation (negative: fixture sets both
    timestamps; scenario MUST fail on timestamp_discipline)

P0.3a stale-fact deprecation:
  - dream-demotion-stale-fact (pricing edge correctly demoted to
    superseded)

P0.3b scoped cascading expiry:
  - dream-demotion-cascade-bounded (single-hop discipline held)
  - dream-demotion-cascade-runaway (negative: fixture touches a 2+ hop
    edge; scenario MUST fail on cascade_bounded)

P1 procedure synthesis (per `dreaming-procedures.md`):
  - dream-procedure-weekly-report (4 ordered steps + 2 parameters)
  - dream-procedure-client-onboarding (different workflow, exercises
    the scorer on a non-degenerate case)

P2 memory dedup:
  - dream-dedup-near-duplicates (clean merge of Sarah-billing facts;
    HubSpot + fiscal-year stay singletons)
  - dream-dedup-false-positive (negative: over-merges Sarah-billing
    with Sarah-manager; scenario MUST fail on ARI + pairwise precision)

Eight JSON fixtures under `data/fixtures/dream/` back the offline path;
each scenario also accepts a `source.raw_exchange_key` swap when
AutoGPT's dream pass starts emitting structured payloads inline.

Pack-level pinning in `tests/unit/dream-validation.test.ts` asserts:
- ship-shape counts (>=4 demotion, >=2 procedure, >=2 dedup)
- every fixture exists on disk relative to the YAML
- happy-path scenarios pass against their golden
- negative scenarios fail in the expected way (timestamp_discipline,
  cascade_bounded, over-merge)

Docs updated: `platform.md` gets a new spec scenario, `current-state.md`
marks it implemented, `e2e-checklist.md` references the test files,
generated workspace inventory + quality score refreshed.
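The timestamp-discipline check the negative scenario pins can be sketched as below — an assumption-laden reading of the Snodgrass split, where a retract closes transaction time (the record was wrong) and must not also close valid time; all names and the record shape are hypothetical:

```typescript
interface DemotionRecord {
  id: string;
  transactionEnd?: string; // ISO timestamp, set when the fact is retracted
  validEnd?: string;       // ISO timestamp, set when the fact is superseded
}

// Hedged sketch: a retract that sets BOTH timestamps is treated as the
// timestamp_discipline violation the negative fixture reproduces.
function timestampDisciplineViolations(retracts: DemotionRecord[]): string[] {
  return retracts
    .filter((r) => r.transactionEnd !== undefined && r.validEnd !== undefined)
    .map((r) => r.id);
}
```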
@ntindle ntindle changed the title feat: add retrieval ranking scorer + five memory scenarios feat: add ranking + demotion + procedure + dedup scorer kinds for dream-system validation May 13, 2026
Four dashboard changes that go together:

1. **EvalScoresView** (`dashboard/src/components/EvalScoresView.tsx`) —
   new component that renders retrieval / demotion / procedure / dedup
   scores in the existing detail surface. Per-metric bars with the
   weight indicator, a header per scorer kind with aggregate
   weightedScore / pass_threshold / pass-pill / source-of-payload,
   and predicted-vs-golden side-by-side panels for procedures
   (ordered step lists) and dedup (cluster lists).

2. **Two surfaces wired up:**
   - `DetailPanel`: new "Eval scores" tab next to Conversation /
     Rubric, only shown when the scenario actually has eval scores.
   - `ScenarioDetailView`: new card under the conversation/rubric
     two-column layout, rendered when `hasEvalScores(detail)`.

3. **Markdown.tsx React-types fix.** Upstream `react-markdown` ships
   its own bundled React 18 types, which collide with the project's
   React 19 — every JSX intrinsic component handler typecheck-fails
   with "Two different types with this name exist". Replaced the
   imported `Components` map with a project-local `MarkdownComponents`
   built off the project's own JSX namespace, and cast through `never`
   at the call site. Cleanly typechecks under both React-types
   bundles.

4. **React workspace dedup.** `dashboard/package.json` declares
   react/react-dom as `peerDependencies` instead of `dependencies`.
   Bun was installing two physical copies (root + `dashboard/`),
   so dashboard components imported via test files saw a *different*
   React module instance than the test's own `React.act` / `useState`
   call. Resolved dispatcher → null, dashboard tests crashed at
   `resolveDispatcher().useState`. Removing the duplicate install
   makes bun resolve react/react-dom to the workspace root for both
   surfaces, and the dashboard test suite (`dashboard-app.test.tsx`,
   `compare-view.test.tsx`) now passes all 7 previously-failing tests.

Result: full suite is 330 / 330 green. `bun run typecheck` clean
(was failing on Markdown.tsx). `bun run lint` clean.
@ntindle ntindle changed the title feat: add ranking + demotion + procedure + dedup scorer kinds for dream-system validation feat: add 4 scorer kinds (retrieval, demotion, procedure, dedup) + dashboard surface + dedup React install May 13, 2026
ntindle added 3 commits May 13, 2026 15:49
Bundles all memory-related scenarios into one default preset that's
seeded into every fresh server boot alongside the existing Pre Release
Checks preset:

- 31 multi-session-memory scenarios (judge-scored, the existing
  retention/distillation/rigidity/abstention/temporal/continuation/
  cross-domain/procedural-update/compositional/introspection/
  long-tail/hygiene/negative pack)
- 5 retrieval-memory scenarios (precision@k/recall@k/MRR/NDCG@k
  against curated golden sets)
- 9 dream-validation scenarios (demotion / procedure / dedup scorers
  covering the P-1 -> P2 dream-system roadmap)

Total: 45 scenarios across the three vendored memory packs. Selection
is pinned per-file with explicit IDs so the preset is stable across
scenario reorders and additive YAML changes; new scenarios get added
to the preset by editing the three ID arrays.

Server boot logs `created default preset Full Memory Suite` the first
time it sees a fresh DB. Idempotent on subsequent boots (status
becomes `existing`), and a soft-delete is restored on the next boot
following the same pattern as Pre Release Checks.

Also ship `scripts/seed-eval-scores.ts` from the dashboard-demo work:
a one-off seeder that writes a synthetic run with retrieval / demotion
/ procedure / dedup scores populated via the actual scorer modules,
so the dashboard EvalScoresView has real data to render in local-dev
demos. Reuses the published recorder API; no new shape introduced.

Default-presets test updated to assert both presets seed cleanly and
both go through the same created -> existing -> restored lifecycle.
Both Parallel and Dry-run fields exist on the Run Launch Modal and the
Preset Editor but neither explains what they do — a fresh user has no
way to know whether "Parallel" runs scenarios concurrently or
something else, or whether "Dry run" hits the LLM at all.

Use the existing `Field.hint` slot (already used for "Run name" and
"Notes") so the explanation renders inline below each control,
matching the surrounding UX pattern:

- **Parallel** — explains concurrency and that scenarios still
  complete; mentions LLM cost as the tradeoff.
- **Parallel limit** — recommends 2-4 and notes the cost-vs-speed tilt
  (only present in the launch modal; the preset editor combines both
  Parallel controls under a single hint).
- **Dry run** — explains exactly what dry-run skips (live adapter,
  judge, scorers) and what it preserves (run + scenario rows), so the
  user knows it's the right move for validating config without
  spending tokens.

Dashboard bundle rebuilt (`dashboard/dist/index.html`) so the running
single-file server picks up the new hints.
The /start Run Builder had Parallel and Dry-run as bare Checkbox
elements outside any Field, so the inline-hint pattern used elsewhere
on the page didn't apply. Wrap each in a Field with hint text matching
the pattern used on the RunLaunchModal and PresetEditorView:

- Parallel limit: existing TextInput Field gets a hint about
  cost/latency tradeoff (2-4 typical).
- Dry run: records run + scenario rows but skips adapter, judge,
  scorers — use for config validation without spending tokens.
- Parallel: explains concurrency and ties back to the limit field.
- Save as preset: explains the persistence side effect.

Dashboard bundle rebuilt so the running single-file server picks up
the new hints on next page load.