feat: add 4 scorer kinds (retrieval, demotion, procedure, dedup) + dashboard surface + dedup React install #40
ntindle wants to merge 15 commits into
Add precision@k, recall@k, MRR, and NDCG@k as pure functions in a new `src/domains/evaluation/ranking.ts` module along with a `scoreRanking` aggregator and supporting helpers (relevance vector construction, unique gold-hit counting, forbidden-item detection). These primitives are the load-bearing math for a forthcoming YAML `retrieval:` scorer that grades "given a query, did the right memories come back in the right order" — a question AgentProbe's existing LLM-judge over a transcript cannot answer directly. The math is pinned by 25 unit tests covering known-answer cases for each metric (including the NDCG@3 of [1, 0, 1] = 1.5 / 1.6309 case), edge behaviour around `k` clamping, the three match policies (exact / substring / regex), and the forbidden-item override path. No I/O, no LLM calls — the scorer wiring lands in a follow-up commit.
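As a rough illustration of the four metrics (not the module's actual API — function names, the clamped-`k` precision denominator, and signatures here are assumptions), a minimal sketch:

```typescript
// Hedged sketches of the four ranking metrics. `rel` is the relevance
// vector: rel[i] = 1 if the item at rank i is in the golden set, else 0.

function precisionAtK(rel: number[], k: number): number {
  const top = rel.slice(0, k); // clamps k to the returned-list length
  return top.length === 0 ? 0 : top.reduce((a, b) => a + b, 0) / top.length;
}

function recallAtK(rel: number[], k: number, goldenSize: number): number {
  const hits = rel.slice(0, k).reduce((a, b) => a + b, 0);
  return goldenSize === 0 ? 0 : hits / goldenSize;
}

function mrr(rel: number[]): number {
  const i = rel.findIndex((r) => r > 0); // rank of the first gold hit
  return i === -1 ? 0 : 1 / (i + 1);
}

function dcg(rel: number[], k: number): number {
  // position i (0-based) is discounted by log2(i + 2)
  return rel.slice(0, k).reduce((sum, r, i) => sum + r / Math.log2(i + 2), 0);
}

function ndcgAtK(rel: number[], k: number): number {
  const ideal = [...rel].sort((a, b) => b - a); // best possible ordering
  const denom = dcg(ideal, k);
  return denom === 0 ? 0 : dcg(rel, k) / denom;
}
```

For the known-answer case cited above, `ndcgAtK([1, 0, 1], 3)` gives DCG = 1 + 0 + 0.5 = 1.5 over an ideal DCG of 1 + 1/log2(3) ≈ 1.6309.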
Introduce a new `retrieval:` block on scenarios that grades the
top-k of a returned items list against a curated golden set, plus an
optional forbidden list whose hits force a fail (used for forget /
scope / demotion probes).
Schema (parsed in `src/domains/validation/load-suite.ts`):

```yaml
retrieval:
  golden: [<text-or-id>, ...]        # required, non-empty
  forbidden: [<text-or-id>, ...]     # optional
  k: 5                               # optional, default max(|returned|, |golden|, 1)
  pass_threshold: 0.5                # optional, default 0.5
  match: substring | exact | regex   # optional, default substring
  weight:                            # optional, defaults all to 1
    precision_at_k: 1.0
    recall_at_k: 1.0
    mrr: 1.0
    ndcg_at_k: 1.0
  source:                            # optional
    raw_exchange_key: "retrieved"    # default raw_exchange field
    fixture: "fixtures/foo.json"     # resolved relative to YAML
```
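A concrete instance might look like the following — the scenario id, texts, and rubric context are illustrative, not taken from the shipped packs:

```yaml
# Hypothetical scenario entry showing the retrieval block in use.
- id: mem-retrieval-example
  retrieval:
    golden:
      - "Sarah leads the Atlas project"
      - "Budget review is quarterly"
    forbidden:
      - "$50K"              # any hit in the top-k forces a fail
    k: 5
    match: substring
    weight:
      mrr: 2.0              # emphasize the rank of the first gold hit
    source:
      fixture: "fixtures/example-retrieval.json"
```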
The scorer (`src/domains/evaluation/retrieval-scorer.ts`) resolves the
retrieved list from either a JSON fixture file or the last adapter
reply's `rawExchange[<key>]`, coerces object payloads to strings via
label/text/name/id heuristics, and delegates to the pure ranking math
module. `RetrievalScore` mirrors the shape of `RubricScore` so it can
slot into `ScenarioRunResult` alongside the judge score.
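The label/text/name/id coercion described above might look like this sketch — key order and the stringify fallback are assumptions, not the scorer's actual code:

```typescript
// Coerce a retrieved payload item to comparable text. Tries a fixed
// list of well-known fields in priority order; falls back to JSON
// serialization for unrecognized shapes (fallback is an assumption).
type Payload = string | Record<string, unknown>;

function coerceToText(item: Payload): string {
  if (typeof item === "string") return item;
  for (const key of ["label", "text", "name", "id"]) {
    const v = item[key];
    if (typeof v === "string" && v.length > 0) return v;
  }
  return JSON.stringify(item);
}
```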
Tests pin: weight/threshold/k validation in the loader, payload
coercion across string/object/mixed inputs, fixture resolution
relative to the scenario YAML, the `missing` source case where no
items can be located, and the forbidden-hit override path.
No runtime wiring yet — `run-suite.ts` lands in a follow-up commit so
each step stays independently green.
Run the retrieval scorer after the LLM-judge phase. The retrieved list is resolved from the last adapter reply's `rawExchange.retrieved` (or a custom key) or from a JSON fixture, and the resulting `RetrievalScore` is attached to `ScenarioRunResult.retrievalScore`. The overall `passed` flag now requires both signals — when a scenario has a `retrieval:` block, a failed retrieval forces a scenario fail even if the judge passed. This matches the brief's "forbidden item forces a fail" semantics for forget / scope-filter / demotion probes. A new `recordRetrievalResult` recorder hook is exposed alongside the existing `recordJudgeResult` so persistence backends can store the result; the SQLite/Postgres recorders implement it in the next commit. Two new runner-level tests cover (1) a perfect rawExchange retrieval that surfaces a passed score and (2) a forbidden-item hit that flips the overall pass to false even when the judge votes pass.
Add a new `retrieval_scores` table to both backends, mirroring the shape of `judge_dimension_scores`. One row per metric per scenario run (precision@k, recall@k, MRR, NDCG@k) plus aggregate columns (k, weighted_score, pass_threshold, passed, totals, source).

Schema rev:
- SQLite: SCHEMA_VERSION 8 -> 9; baseline DDL adds the table, plus a migration step in both `migrations/sqlite.ts` and the legacy in-recorder migrator at `sqlite-run-history.ts:migrateDatabase`.
- Postgres: POSTGRES_TARGET_VERSION 4 -> 5; baseline DDL adds the table; migration `from < 5` block in `migrations/postgres.ts`.

Drizzle schemas (sqlite + postgres) register the table, and the existing inventory test was updated to expect it. Both recorders implement `recordRetrievalResult`, and both `getRun` paths now return `retrievalScores: Array<...>` on each scenario record so dashboard and report consumers can read the data back. Existing schema/version assertions in `tests/unit/db.test.ts` and `tests/unit/persistence/migrations.test.ts` were updated to match.
Surface persisted retrieval scores in the rendered run report and trim a handful of Biome formatting issues across the touched files. `prepareScenarioView` now emits a `retrieval_rows` array (one row per metric, with pretty value/percent labels) and a `retrieval_scores_pretty` JSON blob alongside the existing judge dimension rows. Templates that already exist will see no change until they reference the new keys; this commit deliberately stays close to the existing dimension-row pattern so the dashboard / HTML report can pick it up incrementally. Also makes `ScenarioRecord.retrievalScores` optional so callers that read older runs persisted before the schema upgrade don't trip typecheck. The recorders and getRun paths still write/return the field consistently for new runs.
Add a new `data/retrieval-memory.yaml` pack with five scenarios that exercise dream-pass behaviors via the new ranking scorer:

- `mem-retrieval-forget-on-request`: ranking sibling of `mem-negative-forget-on-request` from the existing multi-session-memory pack. The judge variant asserts the agent's *text* doesn't leak the $50K budget; this variant asserts the budget figure is absent from the retrieval payload's top-k.
- `mem-retrieval-warm-context-sarah`: warm-context relevance. Given a working session that establishes Sarah and the Atlas project, a follow-up query should surface both gold facts near the top of the returned set, graded by precision/recall/MRR/NDCG.
- `mem-retrieval-stale-fact-demotion`: dream-pass stale-fact deprecation as a ranking assertion. Old pricing should be demoted (superseded) and excluded from the top-k; new pricing should appear. The committed fixture intentionally still includes the superseded item so the scenario fails — this documents the negative-test intent and proves the forbidden-hit check is active.
- `mem-retrieval-scope-filter-project`: typed-edge filtering. A `project:atlas`-scoped query must return only Atlas memories; Beacon, invoicing, and other unrelated scopes are forbidden.
- `mem-retrieval-cascading-expiry`: bounded cascade. Invalidating a NorthStar lead should remove Marcus / NorthStar facts but not the adjacent operational basics (CRM, fiscal year, invoicing rule). Validates the cascade is tight to the entity.

Each scenario points at a JSON fixture under `data/fixtures/retrieval/` so the pack can be exercised against the scorer math without a live AutoGPT backend. When running against a real adapter that emits retrieval payloads inline, swap `source.fixture` for `source.raw_exchange_key`; the rest of the config carries over.
The accompanying `tests/unit/retrieval-memory.test.ts` pack pins schema validity (every scenario carries a non-empty `retrieval` block, references a known memory rubric, and the fixture file exists relative to the YAML) plus scoring-math expectations against each fixture.
Add the new "Ranking-scored scenarios grade retrieval relevance against a curated golden set" scenario to `platform.md`, mark it implemented in `current-state.md`, and wire it into `e2e-checklist.md` pointing at the ranking math tests, the retrieval-scorer tests, the YAML-pack tests, and the runner integration test that covers the rawExchange and forbidden-hit paths. Regenerated `generated/workspace-inventory.md` and `QUALITY_SCORE.md` so the repo doc-validation gate is green.
Pure-math primitives for three new scorers that AgentProbe will need to
validate the dream-system roadmap items as they land:
- `clustering.ts`: pairwise precision/recall/F1 and Adjusted Rand Index
(Hubert-Arabie 1985) over partitions of item IDs. Used to grade
memory-dedup passes (P2): did the dedup pass cluster
near-duplicates correctly?
- `procedure-match.ts`: step-coverage F1, LCS-normalized order
similarity, parameter-coverage Jaccard, plus a `scoreProcedure`
aggregator. Used to grade procedure extraction (P1 / `dreaming-procedures.md`):
did the dream pass extract the same ordered set of steps and
parameters as the curated golden procedure?
- `demotion-match.ts`: structural side of demotion correctness.
`assertExpectedSet` for set-level demotion P/R, `assertTimestampDiscipline`
for the Snodgrass retract-vs-soft-delete split (P-1.3), and
`assertCascadeBounded` for the single-hop cascade discipline
(P0.3b). `scoreDemotion` aggregates and treats timestamp
violations and cascade overflow as hard failures, regardless of
the weighted score.
40 unit tests pin known-answer cases: ARI of `{ {a,c},{b,d} }` vs
`{ {a,b},{c,d} }` is -0.5; LCS of the classic CLRS ABCBDAB/BDCAB pair
is 4; the 3-hop runaway-cascade test reproduces the P0.3b failure
mode flagged in `p0-spec.md` §4.
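The LCS known-answer case above is the textbook dynamic program; a self-contained sketch (the normalization the order-similarity metric applies on top of this is an assumption):

```typescript
// Classic O(|a|·|b|) longest-common-subsequence length via a DP table.
// The order-similarity metric presumably divides this by the golden
// step count; that normalization choice is not shown here.
function lcsLength<T>(a: T[], b: T[]): number {
  const dp = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] =
        a[i - 1] === b[j - 1]
          ? dp[i - 1][j - 1] + 1
          : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  return dp[a.length][b.length];
}
```

The CLRS pair gives `lcsLength([..."ABCBDAB"], [..."BDCAB"]) === 4` (e.g. "BCAB").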
These are the load-bearing math for forthcoming scorer kinds
(`dedup`, `procedure`, `demotion`) that will land alongside YAML
schema, persistence, and scenarios in follow-up commits — same
pattern PR #40 used for the `retrieval` scorer.
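The ARI known-answer case can be checked with a direct transcription of the Hubert-Arabie formula — the `string[][]` partition representation here is an assumption, not the module's actual types:

```typescript
// Adjusted Rand Index (Hubert-Arabie 1985) over two partitions of the
// same item set: (index - expected) / (max - expected), where all terms
// are pair counts from the contingency table.
function comb2(n: number): number {
  return (n * (n - 1)) / 2;
}

function adjustedRandIndex(pred: string[][], gold: string[][]): number {
  let sumNij = 0; // sum of C(n_ij, 2) over contingency cells
  for (const p of pred) {
    for (const g of gold) {
      const overlap = p.filter((x) => g.includes(x)).length;
      sumNij += comb2(overlap);
    }
  }
  const sumA = pred.reduce((s, p) => s + comb2(p.length), 0);
  const sumB = gold.reduce((s, g) => s + comb2(g.length), 0);
  const n = pred.reduce((s, p) => s + p.length, 0);
  const expected = (sumA * sumB) / comb2(n);
  const max = (sumA + sumB) / 2;
  return max === expected ? 1 : (sumNij - expected) / (max - expected);
}
```

The cited case — `{{a,c},{b,d}}` vs `{{a,b},{c,d}}` — has zero agreeing pairs, so the index lands at -0.5, below chance.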
Three new scorer kinds, each YAML-declarable as a sibling of the
existing `retrieval:` block, each with fixture + rawExchange source
resolution, each persisted alongside the run.
YAML schema additions (all parsed strictly in `load-suite.ts`):

```yaml
demotion:
  expected_demotions: [<uuid>, ...]        # required, non-empty
  expected_retracts: [<uuid>, ...]         # optional
  cascade:
    expected_direct_neighbors: [<edge_uuid>, ...]
    tangential_edges: [<edge_uuid>, ...]
  pass_threshold: 0.6
  weight: { set_precision, set_recall, set_f1,
            timestamp_discipline, cascade_bounded, cascade_direct_f1 }
  source: { fixture | raw_exchange_key }   # default key: "demotions"

procedure:
  golden_steps: [<step_label>, ...]        # required, ordered
  golden_parameters: [<param_name>, ...]   # optional
  pass_threshold: 0.6
  weight: { step_coverage, step_order, parameter_coverage }
  source: { fixture | raw_exchange_key }   # default key: "procedure"

dedup:
  golden_clusters: [[<uuid>, ...], ...]    # required, non-empty
  pass_threshold: 0.6
  weight: { precision, recall, f1, ari }
  source: { fixture | raw_exchange_key }   # default key: "dedup"
```
Each scorer module follows the same shape as retrieval-scorer.ts:
- coerce* utility extracts canonical payload from JSON or rawExchange
- resolve* threads fixture / rawExchange / missing source resolution
- score* delegates to the pure math module and assembles a typed score
`runScenario` calls all four scorers after the judge phase. The
overall scenario `passed` is now the AND of judge, retrieval, demotion,
procedure, and dedup pass flags. Each defaults to true when its
block is absent on the scenario, so this is backwards-compatible.
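The absent-block-defaults-to-true AND might be sketched like this — the result shape and field names are assumptions, not `ScenarioRunResult`'s actual fields:

```typescript
// Backwards-compatible overall pass: each optional scorer contributes
// its pass flag only when its block was present (score is defined);
// absent blocks default to true via nullish coalescing.
interface ScorerOutcomes {
  judgePassed: boolean;
  retrieval?: { passed: boolean };
  demotion?: { passed: boolean };
  procedure?: { passed: boolean };
  dedup?: { passed: boolean };
}

function overallPassed(r: ScorerOutcomes): boolean {
  return (
    r.judgePassed &&
    (r.retrieval?.passed ?? true) &&
    (r.demotion?.passed ?? true) &&
    (r.procedure?.passed ?? true) &&
    (r.dedup?.passed ?? true)
  );
}
```

A judge-only scenario keeps its old behavior; adding any failing scorer block flips the overall flag.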
The pure math (clustering ARI / pairwise P/R, procedure LCS + Jaccard,
demotion set-precision + timestamp-discipline + cascade-bounded) was
landed in the prior commit and is fully unit-pinned. This commit wires
it through the same loader / scorer / recorder pipeline PR #40 used
for retrieval.
Persistence backends and report rendering land in follow-up commits.
Three new tables that mirror `retrieval_scores` from PR #40, one per scorer kind:

- `demotion_scores` — one row per metric (set_precision, set_recall, set_f1, timestamp_discipline, cascade_bounded, cascade_direct_f1) plus aggregate cols (weighted_score, pass_threshold, passed, timestamp_violation_count, cascade_bounded, source) and JSON blobs (observed_json, expected_json) for replay.
- `procedure_scores` — one row per metric (step_coverage, step_order, parameter_coverage) plus aggregate cols and predicted/golden JSON blobs.
- `dedup_scores` — one row per metric (precision, recall, f1, ari) plus item_count and predicted/golden cluster JSON blobs.

Schema rev:
- SQLite: SCHEMA_VERSION 9 -> 10; baseline DDL ships the three new tables, plus a v10 migration step in both `migrations/sqlite.ts` and the legacy in-recorder migrator at `sqlite-run-history.ts:migrateDatabase` (matched ensure-helpers).
- Postgres: POSTGRES_TARGET_VERSION 5 -> 6; baseline DDL adds the tables; migration `from < 6` block in `migrations/postgres.ts`.

Both backends now implement the `recordDemotionResult`, `recordProcedureResult`, and `recordDedupResult` recorder hooks. Both `getRun` paths now return `demotionScores`, `procedureScores`, and `dedupScores` arrays on each `ScenarioRecord`. Drizzle schema inventory test and `db.test.ts` schema/version assertions updated. The Drizzle schema, in-recorder DDL, and migration step are deliberately redundant — the project ships two parallel SQLite paths (legacy in-recorder + drizzle-shaped) and both must agree per the existing inventory test.
Add `data/dream-validation.yaml` covering the four roadmap items the
new scorer kinds were built to grade:
P-1.3 retract-vs-soft-delete (Snodgrass bi-temporal):
- dream-demotion-retract-discipline (passes against compliant fixture)
- dream-demotion-snodgrass-violation (negative: fixture sets both
timestamps; scenario MUST fail on timestamp_discipline)
P0.3a stale-fact deprecation:
- dream-demotion-stale-fact (pricing edge correctly demoted to
superseded)
P0.3b scoped cascading expiry:
- dream-demotion-cascade-bounded (single-hop discipline held)
- dream-demotion-cascade-runaway (negative: fixture touches a 2+ hop
edge; scenario MUST fail on cascade_bounded)
P1 procedure synthesis (per `dreaming-procedures.md`):
- dream-procedure-weekly-report (4 ordered steps + 2 parameters)
- dream-procedure-client-onboarding (different workflow, exercises
the scorer on a non-degenerate case)
P2 memory dedup:
- dream-dedup-near-duplicates (clean merge of Sarah-billing facts;
HubSpot + fiscal-year stay singletons)
- dream-dedup-false-positive (negative: over-merges Sarah-billing
with Sarah-manager; scenario MUST fail on ARI + pairwise precision)
Eight JSON fixtures under `data/fixtures/dream/` back the offline path;
each scenario also accepts a `source.raw_exchange_key` swap when
AutoGPT's dream pass starts emitting structured payloads inline.
Pack-level pinning in `tests/unit/dream-validation.test.ts` asserts:
- ship-shape counts (>=4 demotion, >=2 procedure, >=2 dedup)
- every fixture exists on disk relative to the YAML
- happy-path scenarios pass against their golden
- negative scenarios fail in the expected way (timestamp_discipline,
cascade_bounded, over-merge)
Docs updated: `platform.md` gets a new spec scenario, `current-state.md`
marks it implemented, `e2e-checklist.md` references the test files,
generated workspace inventory + quality score refreshed.
Three dashboard changes that go together:
1. **EvalScoresView** (`dashboard/src/components/EvalScoresView.tsx`) —
new component that renders retrieval / demotion / procedure / dedup
scores in the existing detail surface. Per-metric bars with the
weight indicator, a header per scorer kind with aggregate
weightedScore / pass_threshold / pass-pill / source-of-payload,
and predicted-vs-golden side-by-side panels for procedures
(ordered step lists) and dedup (cluster lists).
2. **Two surfaces wired up:**
- `DetailPanel`: new "Eval scores" tab next to Conversation /
Rubric, only shown when the scenario actually has eval scores.
- `ScenarioDetailView`: new card under the conversation/rubric
two-column layout, rendered when `hasEvalScores(detail)`.
3. **Markdown.tsx React-types fix.** Upstream `react-markdown` ships
its own bundled React 18 types, which collide with the project's
React 19 — every JSX intrinsic component handler typecheck-fails
with "Two different types with this name exist". Replaced the
imported `Components` map with a project-local `MarkdownComponents`
built off the project's own JSX namespace, and cast through `never`
at the call site. Cleanly typechecks under both React-types
bundles.
4. **React workspace dedup.** `dashboard/package.json` declares
react/react-dom as `peerDependencies` instead of `dependencies`.
Bun was installing two physical copies (root + `dashboard/`),
so dashboard components imported via test files saw a *different*
React module instance than the test's own `React.act` / `useState`
call. Resolved dispatcher → null, dashboard tests crashed at
`resolveDispatcher().useState`. Removing the duplicate install
makes bun resolve react/react-dom to the workspace root for both
surfaces, and the dashboard test suite (`dashboard-app.test.tsx`,
`compare-view.test.tsx`) now passes all 7 previously-failing tests.
Result: full suite is 330 / 330 green. `bun run typecheck` clean
(was failing on Markdown.tsx). `bun run lint` clean.
Bundles all memory-related scenarios into one default preset that's seeded into every fresh server boot alongside the existing Pre Release Checks preset:

- 31 multi-session-memory scenarios (judge-scored: the existing retention/distillation/rigidity/abstention/temporal/continuation/cross-domain/procedural-update/compositional/introspection/long-tail/hygiene/negative pack)
- 5 retrieval-memory scenarios (precision@k/recall@k/MRR/NDCG@k against curated golden sets)
- 9 dream-validation scenarios (demotion / procedure / dedup scorers covering the P-1 -> P2 dream-system roadmap)

Total: 45 scenarios across the three vendored memory packs. Selection is pinned per-file with explicit IDs so the preset is stable across scenario reorders and additive YAML changes; new scenarios get added to the preset by editing the three ID arrays.

Server boot logs `created default preset Full Memory Suite` the first time it sees a fresh DB. It is idempotent on subsequent boots (status becomes `existing`), and a soft-deleted preset is restored on the next boot, following the same pattern as Pre Release Checks.

Also ship `scripts/seed-eval-scores.ts` from the dashboard-demo work: a one-off seeder that writes a synthetic run with retrieval / demotion / procedure / dedup scores populated via the actual scorer modules, so the dashboard EvalScoresView has real data to render in local-dev demos. Reuses the published recorder API; no new shape introduced.

Default-presets test updated to assert both presets seed cleanly and both go through the same created -> existing -> restored lifecycle.
Both the Parallel and Dry-run fields exist on the Run Launch Modal and the Preset Editor, but neither explains what it does — a fresh user has no way to know whether "Parallel" runs scenarios concurrently or something else, or whether "Dry run" hits the LLM at all. Use the existing `Field.hint` slot (already used for "Run name" and "Notes") so the explanation renders inline below each control, matching the surrounding UX pattern:

- **Parallel** — explains concurrency and that scenarios still complete; mentions LLM cost as the tradeoff.
- **Parallel limit** — recommends 2-4 and notes the cost-vs-speed tilt (only present in the launch modal; the preset editor combines both Parallel controls under a single hint).
- **Dry run** — explains exactly what dry-run skips (live adapter, judge, scorers) and what it preserves (run + scenario rows), so the user knows it's the right move for validating config without spending tokens.

Dashboard bundle rebuilt (`dashboard/dist/index.html`) so the running single-file server picks up the new hints.
The /start Run Builder had Parallel and Dry-run as bare Checkbox elements outside any Field, so the inline-hint pattern used elsewhere on the page didn't apply. Wrap each in a Field with hint text matching the pattern used on the RunLaunchModal and PresetEditorView:

- Parallel limit: the existing TextInput Field gets a hint about the cost/latency tradeoff (2-4 typical).
- Dry run: records run + scenario rows but skips adapter, judge, and scorers — use it for config validation without spending tokens.
- Parallel: explains concurrency and ties back to the limit field.
- Save as preset: explains the persistence side effect.

Dashboard bundle rebuilt so the running single-file server picks up the new hints on next page load.
Intent
AgentProbe needs measurement tooling that lights up before AutoGPT's
dream-system roadmap items land, so we can tell whether each step is
working as it ships. This PR builds out the scoring tooling end-to-end
for retrieval ranking (already shipped to AutoGPT as P-1) and for the
next four items on the roadmap, plus a dashboard surface that renders
the results.
(A table here mapped each roadmap item to its scorer kind and failure signal: the retract-discipline probes to `demotion` with a `timestamp_discipline` violation, stale-fact deprecation to `demotion`, the cascade probes to `demotion` (cascade block) with `cascade_bounded=false`, and the remaining items to `retrieval`, `procedure`, and `dedup`.)

The intent is that as AutoGPT ships each item, the corresponding AgentProbe scenarios go from "fixture-mode, scoring-the-math" to "raw-exchange-mode, scoring-the-real-pipeline" with a one-line YAML swap — no further scorer work needed.
Behavior changes
Scenarios may now declare any combination of four scorer blocks: `retrieval:`, `demotion:`, `procedure:`, `dedup:`. Without any of them, behavior is unchanged: the LLM judge remains the only signal. With any of them present, the runner scores each and ANDs the pass flags into the overall scenario `passed`.

Each scorer is configured at the YAML level (per-metric weights, pass threshold, fixture vs. raw-exchange source resolution) — same shape as the existing `retrieval:` block, no new architectural concepts.

Persistence schema: four score tables (`retrieval_scores`, plus `demotion_scores` + `procedure_scores` + `dedup_scores`); SQLite SCHEMA_VERSION goes to 10 and POSTGRES_TARGET_VERSION to 6 for the dream-validation trio. Migrations are additive `create table if not exists`. Drizzle schema inventory test and `db.test.ts` assertions updated.

Dashboard: the `EvalScoresView` component renders all four scorer kinds with per-metric bars, weight indicators, aggregate weightedScore / threshold / pass-pill, and predicted-vs-golden side-by-side panels for procedures and dedup clusters. It surfaces as a new tab on `DetailPanel` (only shown when scores exist) and a third card on the `ScenarioDetailView` route.

Pre-existing failures cleared

Bonus: fixed two pre-existing problems on `main` that were blocking `bun run ci`:

- `Markdown.tsx`. Upstream react-markdown bundles its own React 18 types; the project uses React 19. Every JSX intrinsic component handler typecheck-failed with "Two different types with this name exist." Replaced the imported `Components` type with a project-local one built from the project's own JSX namespace.
- React workspace. bun was installing two physical copies of React; dashboard test files imported React from the root and dashboard components from `dashboard/node_modules/react`, so `resolveDispatcher()` returned null at runtime and all 7 React-rendering dashboard tests crashed. Moved react/react-dom from dashboard's `dependencies` to `peerDependencies`; bun now resolves both surfaces to the workspace root copy.

`bun test` is now 330 / 330 green for the first time on this branch.

How it's wired
Five layers, all pure-functional at the bottom and progressively wired up through the existing run-loop / persistence pipeline:

1. Pure math — colocated `*.test.ts`:
   - `ranking.ts` — precision@k, recall@k, MRR, NDCG@k.
   - `clustering.ts` — pairwise P/R/F1, Adjusted Rand Index (Hubert-Arabie 1985).
   - `procedure-match.ts` — step-coverage F1, LCS-normalized order similarity, parameter Jaccard.
   - `demotion-match.ts` — set-precision/recall, Snodgrass timestamp discipline check, single-hop cascade bound.
2. Scorers — `{retrieval,demotion,procedure,dedup}-scorer.ts`. Each resolves the payload from a fixture or `rawExchange[<key>]`, coerces it to a canonical shape, delegates to the math, and returns a typed `Score`.
3. YAML loader — `load-suite.ts` strictly parses each new block: non-empty `golden`, valid weight keys, threshold ∈ [0,1].
4. Runner — `runScenario` scores all four kinds after the judge phase, records via four new recorder hooks, and threads the score onto `ScenarioRunResult`.
5. Persistence — four tables (`retrieval_scores`, `demotion_scores`, `procedure_scores`, `dedup_scores`), all mirroring the one-row-per-metric shape of `judge_dimension_scores`. SQLite + Postgres recorders both implement the four `record*Result` hooks.

Dashboard — `EvalScoresView` rendered into the existing surfaces; `ScenarioDetail` and `ServerScenario` types extended with the four score arrays.
Scenarios shipped
Retrieval (5) — `data/retrieval-memory.yaml`:

- `mem-retrieval-forget-on-request` (ranking sibling of judge-only `mem-negative-forget-on-request`)
- `mem-retrieval-warm-context-sarah`
- `mem-retrieval-stale-fact-demotion`
- `mem-retrieval-scope-filter-project`
- `mem-retrieval-cascading-expiry`

Dream-validation (9) — `data/dream-validation.yaml`.

Every scenario points at a JSON fixture under `data/fixtures/{retrieval,dream}/`. Pack-level tests pin each fixture against the math; negative scenarios actively prove the scorer catches the intended failure mode.
Scope rails
Scorers read only the `rawExchange` data the adapter already returns; no `EndpointAdapter` shape change.

Risk

- Migrations are additive `create table if not exists`; rolling forward is safe.
- Scenarios that declare a scorer block but yield no locatable payload score `source: "missing"` and fail. Intentional — silently passing a "we got nothing" run would mask regressions.
- `passed` is now AND'd across up to five signals. A scenario that previously passed via judge alone and now adds a failing scorer block will see its overall pass flip.
- The negative scenarios pin both directions, so any regression that silently turns a negative into a positive trips CI.
- The React-types and workspace-dedup fixes touch build infrastructure. Vite build still produces a working single-file bundle; tests all pass; typecheck clean. Worth eyeballing in review.
Out of scope (deferred)
- Live raw-exchange emission for procedure / dedup is AutoGPT-side work. The fixture path exists so the scorers are exercisable now.
- A CI job that brings up `autogpt_platform` via docker-compose, runs AgentProbe against it, and fails the PR on regression. Highest-leverage next step on the AgentProbe side.
- A further scorer whose bottleneck is the curated `(claim, current_truth)` dataset stays out of scope.
Validation
- `bun test`: 330 pass / 0 fail.
- `bun run lint`: clean.
- `bun run typecheck`: clean (root + dashboard).
- `bun run docs:validate`: clean.
- `dashboard/dist/index.html` rebuilt and committed.

Checklist

- `bun run lint` clean.
- `bun run typecheck` clean (root + dashboard).
- `bun run docs:validate` clean.
- `bun test` all green (was 7 pre-existing failures).

Commits
- feat(evaluation): add ranking metrics primitives
- feat(evaluation): add retrieval scorer kind and YAML schema
- feat(evaluation): wire retrieval scorer into runScenario
- feat(persistence): persist retrieval scores in sqlite and postgres
- feat(reporting): expose retrieval metrics in run reports
- feat(data): add five ranking-scored memory scenarios
- docs(specs): document ranking-scored retrieval scenarios
- feat(evaluation): add dedup, procedure, demotion math primitives
- feat(evaluation): add demotion, procedure, dedup scorer kinds
- feat(persistence): persist demotion, procedure, dedup scores
- feat(data): add nine dream-system validation scenarios
- feat(dashboard): surface eval scores + fix React 18/19 types collision