Add human scoring workflow by Swiftyos · Pull Request #39 · Significant-Gravitas/AgentProbe

Swiftyos · 2026-05-11T10:53:56Z

Intent

Adds the human scoring workflow so reviewers can manually score completed scenario runs against rubric dimensions and compare those scores with automated judge output. The branch includes dashboard UI, HTTP API routes, SQLite/Postgres persistence support, migrations, and a seed script for test scores.

Behavior changes

Adds a dashboard human scoring view for selecting rubric dimensions, reviewing conversations, recording scores, and tracking rubric coverage.
Adds /api/human-scoring/rubrics, /api/human-scoring/next, and /api/human-scoring/scores server routes.
Persists human scoring data in SQLite and Postgres backends, including migration/schema updates.
Refreshes rubric/scenario data and generated docs for the new scoring workflow.
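
As a rough illustration of how a reviewer client might work through the routes listed above, here is a minimal TypeScript sketch. Everything in it is an assumption made for illustration: the `PendingChat` and `HumanScore` shapes, the `ScoringApi` interface, and the idea that `next` returns `null` once nothing is left to score. The PR does not document HTTP methods or payloads, so the sketch stays transport-agnostic and uses an in-memory fake instead of real requests.

```typescript
// Hypothetical shapes; the PR's real request/response payloads are not shown.
interface PendingChat {
  runId: string;
  dimensionId: string;
}

interface HumanScore extends PendingChat {
  score: number;
}

// Hypothetical client interface. `next` might map onto
// /api/human-scoring/next and `submit` onto /api/human-scoring/scores.
interface ScoringApi {
  next(): Promise<PendingChat | null>;
  submit(score: HumanScore): Promise<void>;
}

// Pull one unscored chat at a time until the backlog is empty.
// Returns how many chats were scored.
async function drainBacklog(
  api: ScoringApi,
  review: (chat: PendingChat) => number,
): Promise<number> {
  let scored = 0;
  for (;;) {
    const chat = await api.next();
    if (chat === null) return scored;
    await api.submit({ ...chat, score: review(chat) });
    scored += 1;
  }
}

// In-memory fake: `next` returns the first chat that has no saved score yet.
function makeFakeApi(backlog: PendingChat[]): ScoringApi & { saved: HumanScore[] } {
  const saved: HumanScore[] = [];
  return {
    saved,
    async next() {
      const pending = backlog.find(
        (c) => !saved.some((s) => s.runId === c.runId && s.dimensionId === c.dimensionId),
      );
      return pending ?? null;
    },
    async submit(score) {
      saved.push(score);
    },
  };
}
```

A real client would replace the fake with calls to the server routes; the loop itself would stay the same as long as the server signals an empty backlog in some detectable way.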

Validation

./scripts/fast-feedback.sh: not run for PR creation.
bun run ci: not run for PR creation.
Behavior docs (required if behavior changed): updated product/platform and related docs in this branch.

Targeted validation run:

bun test tests/unit/persistence/human-scoring.test.ts tests/integration/server/human-scoring.test.ts

Result: passed, 7 tests.

Screenshots / video

N/A for this PR creation pass.

Adds an end-to-end "Score" surface for human review of completed runs: a new persisted human_dimension_scores table mirroring judge_dimension_scores, HTTP routes that drain an unscored backlog one chat at a time, and a React dashboard view with rubric/objective/tool-call sidebars and Pearson-correlation pills against the LLM judge scores. Replaces the legacy inline dashboard with the built React bundle as the only frontend, and adds a one-shot seed-test-scores script for retargeting old data onto the new product rubric. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
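
For readers unfamiliar with the statistic behind the correlation pills, the following is a minimal sketch of a Pearson correlation over paired human and judge scores. The function name `pearson` and the choice to return `null` for degenerate input are assumptions for illustration, not the dashboard's actual code.

```typescript
// Pearson correlation coefficient for two equally long score series.
// Returns null when the inputs cannot produce a meaningful value:
// mismatched lengths, fewer than two pairs, or zero variance in either series.
function pearson(xs: number[], ys: number[]): number | null {
  if (xs.length !== ys.length || xs.length < 2) return null;

  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;

  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }

  if (varX === 0 || varY === 0) return null;
  return cov / Math.sqrt(varX * varY);
}
```

A value close to 1 would suggest that human reviewers and the LLM judge rank runs similarly on a dimension, while a value near 0 or below would flag a dimension where the two disagree.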

The PostgresRepository.listPresets path uses `sql.unsafe(RUN_SUMMARY_COLUMNS)` inline inside a tagged template to interpolate the column list. JS evaluates that call eagerly before the tagged template runs, so the mock's `sql.unsafe` was being invoked with just the column list and throwing because the text did not match any "from <table>" branch. Make `sql.unsafe` return an inert empty result for fragment-style calls instead of throwing; the parent template still records the real query string so the existing query-count assertions hold. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
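
The evaluation order described above can be reproduced in isolation. The sketch below uses a hypothetical stand-in for the mocked `sql` tag (it is not the real Postgres client or the project's test mock), and the column list and `runs` table name are made up. It records the order in which the tag and `sql.unsafe` are invoked.

```typescript
// Records each call in the order it happens.
const calls: string[] = [];

// Hypothetical mock tag: joins the template strings with the interpolated values.
function sql(strings: TemplateStringsArray, ...values: unknown[]): string {
  const text = strings.reduce(
    (acc, part, i) => acc + part + (i < values.length ? String(values[i]) : ""),
    "",
  );
  calls.push("tag:" + text);
  return text;
}

// Hypothetical fragment helper attached to the tag, like `sql.unsafe` in the mock.
sql.unsafe = (fragment: string): string => {
  calls.push("unsafe:" + fragment);
  return fragment;
};

// The interpolated expression is evaluated first, then the tag function runs.
const query = sql`select ${sql.unsafe("id, name")} from runs`;
```

After this runs, `calls` holds the `unsafe:` entry before the `tag:` entry, which is why a mock that throws on any `sql.unsafe` text lacking a "from <table>" clause fails before the parent template is ever seen.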

Re-run of `docs:quality` and `docs:workspace` after the test fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Swiftyos and others added 4 commits May 8, 2026 15:36

update rubric

984168b

refresh generated docs (quality score + workspace inventory)

c9ffccb


Swiftyos wants to merge 4 commits into main from claude/sad-jennings-7a1c49
