
Add human scoring workflow #39

Open
Swiftyos wants to merge 4 commits into main from claude/sad-jennings-7a1c49

@Swiftyos

Intent

Adds the human scoring workflow so reviewers can manually score completed scenario runs against rubric dimensions and compare those scores with automated judge output. The branch includes dashboard UI, HTTP API routes, SQLite/Postgres persistence support, migrations, and a seed script for test scores.

Behavior changes

  • Adds a dashboard human scoring view for selecting rubric dimensions, reviewing conversations, recording scores, and tracking rubric coverage.
  • Adds /api/human-scoring/rubrics, /api/human-scoring/next, and /api/human-scoring/scores server routes.
  • Persists human scoring data in SQLite and Postgres backends, including migration/schema updates.
  • Refreshes rubric/scenario data and generated docs for the new scoring workflow.
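The PR does not spell out request or response shapes for these routes. As a rough sketch only, a "next" endpoint could pick the oldest completed run that has no human score yet for the selected rubric dimension. The type names, field names, and oldest-first ordering below are all assumptions for illustration, not the branch's actual implementation.

```typescript
// Hypothetical shapes; the real tables and payloads may differ.
interface CompletedRun {
  runId: string;
  completedAt: number; // epoch milliseconds
}

interface HumanScore {
  runId: string;
  dimensionId: string;
  score: number;
}

// Return the oldest completed run with no human score for the dimension,
// or null when the backlog for that dimension is empty.
function nextUnscored(
  runs: CompletedRun[],
  scores: HumanScore[],
  dimensionId: string,
): CompletedRun | null {
  const scoredRunIds = new Set(
    scores.filter((s) => s.dimensionId === dimensionId).map((s) => s.runId),
  );
  const pending = runs
    .filter((r) => !scoredRunIds.has(r.runId))
    .sort((a, b) => a.completedAt - b.completedAt);
  return pending[0] ?? null;
}

// Usage: scoring run "a" on one dimension moves the queue on to run "b".
const demoRuns: CompletedRun[] = [
  { runId: "b", completedAt: 2 },
  { runId: "a", completedAt: 1 },
];
const demoScores: HumanScore[] = [];
const firstPick = nextUnscored(demoRuns, demoScores, "clarity");
demoScores.push({ runId: "a", dimensionId: "clarity", score: 4 });
const secondPick = nextUnscored(demoRuns, demoScores, "clarity");
```

Keying the "already scored" set on the dimension means the same run can still come up for other dimensions, which fits a workflow that tracks coverage per rubric dimension.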

Validation

  • ./scripts/fast-feedback.sh: not run for PR creation.
  • bun run ci: not run for PR creation.
  • Behavior docs: updated product/platform and related docs in this branch.

Targeted validation run:

bun test tests/unit/persistence/human-scoring.test.ts tests/integration/server/human-scoring.test.ts

Result: passed, 7 tests.

Screenshots / video

N/A for this PR creation pass.

Swiftyos and others added 4 commits May 8, 2026 15:36
Adds an end-to-end "Score" surface for human review of completed runs:
a new persisted human_dimension_scores table mirroring judge_dimension_scores,
HTTP routes that drain an unscored backlog one chat at a time, and a React
dashboard view with rubric/objective/tool-call sidebars and Pearson-correlation
pills against the LLM judge scores. Replaces the legacy inline dashboard with
the built React bundle as the only frontend, and adds a one-shot
seed-test-scores script for retargeting old data onto the new product rubric.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

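For reference, the Pearson correlation mentioned in the commit above can be computed from paired human and judge scores as follows. This is a minimal sketch of the standard formula; the function name and the choice to return null for undefined cases are assumptions, and the dashboard's own code may handle edge cases differently.

```typescript
// Pearson correlation coefficient between two equal-length score series.
// Returns null when the inputs are mismatched, too short, or one series
// has zero variance (the coefficient is undefined in those cases).
function pearsonCorrelation(human: number[], judge: number[]): number | null {
  const n = human.length;
  if (n !== judge.length || n < 2) return null;

  const meanHuman = human.reduce((sum, v) => sum + v, 0) / n;
  const meanJudge = judge.reduce((sum, v) => sum + v, 0) / n;

  let covariance = 0;
  let varianceHuman = 0;
  let varianceJudge = 0;
  for (let i = 0; i < n; i++) {
    const dh = human[i] - meanHuman;
    const dj = judge[i] - meanJudge;
    covariance += dh * dj;
    varianceHuman += dh * dh;
    varianceJudge += dj * dj;
  }

  const denominator = Math.sqrt(varianceHuman * varianceJudge);
  return denominator === 0 ? null : covariance / denominator;
}

// Usage with made-up scores: perfectly aligned, then perfectly opposed.
const aligned = pearsonCorrelation([1, 2, 3, 4], [2, 4, 6, 8]);
const opposed = pearsonCorrelation([1, 2, 3], [3, 2, 1]);
const flat = pearsonCorrelation([1, 2, 3], [5, 5, 5]);
```

Returning null rather than NaN makes it straightforward for a UI to show a "not enough data" state when a dimension has too few scored runs or a reviewer gave every run the same score.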
The PostgresRepository.listPresets path uses `sql.unsafe(RUN_SUMMARY_COLUMNS)`
inline inside a tagged template to interpolate the column list. JS evaluates
that call eagerly before the tagged template runs, so the mock's `sql.unsafe`
was being invoked with just the column list and throwing because the text did
not match any "from <table>" branch. Make `sql.unsafe` return an inert empty
result for fragment-style calls instead of throwing; the parent template still
records the real query string so the existing query-count assertions hold.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

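The eager-evaluation issue described in the commit above can be reproduced with a small stand-in mock. This is a sketch under stated assumptions, not the repository's actual test helper: `createMockSql`, the `fragmentText` marker, the `runs` table, and the column list are all hypothetical. It shows that `sql.unsafe(...)` runs before the tag function, and that returning an inert result for fragment-style calls still lets the parent template record one complete query string.

```typescript
type Row = Record<string, unknown>;
// Hypothetical marker so the parent template can splice fragment text back in.
type InertResult = Row[] & { fragmentText?: string };

function createMockSql() {
  const queries: string[] = [];

  // Called eagerly, before the tagged template below ever runs.
  // Full queries (containing "from <table>") are recorded; bare fragments
  // get an inert empty result instead of an exception.
  const unsafe = (text: string): InertResult => {
    const result = [] as InertResult;
    if (/\bfrom\s+\w+/i.test(text)) {
      queries.push(text);
    } else {
      result.fragmentText = text;
    }
    return result;
  };

  // Tag function: rebuilds the query text, inlining fragments and
  // replacing ordinary values with numbered placeholders.
  const sql = (strings: TemplateStringsArray, ...values: unknown[]): Row[] => {
    let text = strings[0] ?? "";
    let paramIndex = 0;
    values.forEach((value, i) => {
      const fragment = Array.isArray(value)
        ? (value as InertResult).fragmentText
        : undefined;
      text += (fragment ?? `$${++paramIndex}`) + (strings[i + 1] ?? "");
    });
    queries.push(text);
    return [];
  };

  return Object.assign(sql, { unsafe, queries });
}

// Usage: one recorded query, even though unsafe() was invoked first.
const RUN_SUMMARY_COLUMNS = "id, scenario_id, status"; // hypothetical column list
const mockSql = createMockSql();
mockSql`select ${mockSql.unsafe(RUN_SUMMARY_COLUMNS)} from runs limit ${10}`;
```

Because the fragment call records nothing on its own, assertions that count recorded queries see exactly one entry for the statement.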
Re-run of `docs:quality` and `docs:workspace` after the test fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>