feat: capture JudgeRubric input/output for platform Judges View by cdreetz · Pull Request #1275 · PrimeIntellect-ai/verifiers

cdreetz · 2026-05-01T21:51:29Z

Summary

JudgeRubric.judge() now records each call (input prompt, raw output, rubric name, model, score, timestamp) onto state["judges"]. Every call is recorded — including cache hits — so two JudgeRubric instances in a RubricGroup are both attributed.
state_to_output auto-emits state["judges"] onto the RolloutOutput when non-empty (no state_columns opt-in needed).
Adds an optional name= constructor arg to JudgeRubric so multiple instances can self-identify.
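
The recording behaviour summarized above can be sketched roughly as follows. This is an illustrative stand-in, not the code from this PR: `SketchJudgeRubric`, its `_call_model` stub, the prompt-keyed cache layout, and the `score=None` placeholder are assumptions made for the example. Only the record fields, the `state["judges"]` and `state["judge_response"]` keys, and the `name=` default are taken from the description and diff.

```python
import time
from typing import Any, Optional


class SketchJudgeRubric:
    """Hypothetical stand-in for JudgeRubric; not the verifiers implementation."""

    def __init__(self, judge_model: str = "judge-model", name: Optional[str] = None):
        self.judge_model = judge_model
        # Mirrors the diff: fall back to the class name when no name is given.
        self.name = name or self.__class__.__name__

    def _call_model(self, prompt: str) -> str:
        # Placeholder for the real LLM call (assumption for this sketch).
        return "yes"

    def judge(self, prompt: str, state: dict) -> str:
        # Assumed cache layout: prompt -> raw judge output.
        cache = state.setdefault("judge_response", {})
        if prompt not in cache:
            cache[prompt] = self._call_model(prompt)
        output = cache[prompt]
        # Record every call, cache hit or not, so each rubric is attributed.
        record: dict[str, Any] = {
            "judge_input": prompt,
            "judge_output": output,
            "rubric": self.name,
            "model": self.judge_model,
            "score": None,  # assumed unknown at judge time in this sketch
            "timestamp": time.time(),
        }
        state.setdefault("judges", []).append(record)
        return output


state: dict = {}
first = SketchJudgeRubric(name="correctness")
second = SketchJudgeRubric()  # name defaults to the class name
first.judge("Is the answer right?", state)
second.judge("Is the answer right?", state)  # cache hit, still recorded
```

In this sketch the second call reuses the cached output but still appends its own record, which is what lets two rubrics in a group be told apart by their `rubric` field.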

The companion platform PR (https://linear.app/primeintellect/issue/ENG-3552) reads the new judges field on each rollout sample and renders it in a "Judges" section in the rollout view's reward panel, with a fullscreen popup for long judge prompts.

The existing state["judge_response"] cache is preserved unchanged for backwards-compatibility.

Test plan

tests/test_judges_view.py covers: judge call → record on state, two rubrics in a group both recorded, state_to_output propagates, absence yields no field
uv run pytest tests/test_judges_view.py tests/test_save_utils.py tests/test_rubric.py tests/test_rubric_group.py tests/test_math_rubric.py — all green
uv run ruff check — clean

🤖 Generated with Claude Code

Note

Medium Risk
Adds a new judges payload to rollout outputs and records judge calls on every JudgeRubric.judge() invocation (including cache hits), which could affect output size/compatibility for downstream consumers expecting a stable schema.

Overview
Adds per-invocation capture of LLM-as-judge calls: JudgeRubric now appends a JudgeRecord (prompt, raw output, rubric identifier, model, timestamp) into state["judges"], with an optional name= to distinguish multiple rubric instances and recording even on cache hits.

Updates serialization so state_to_output/states_to_outputs auto-propagate non-empty state["judges"] into RolloutOutput as judges (no state_columns opt-in), and introduces the JudgeRecord type plus tests covering recording, naming, propagation, and omission when absent.
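
A minimal sketch of the propagation rule described above, assuming a plain-dict output. `sketch_state_to_output` and the `reward` field are hypothetical simplifications, not the real `state_to_output` signature; the only behaviour taken from the PR is that `judges` is emitted when `state["judges"]` is non-empty and omitted otherwise.

```python
from typing import Any


def sketch_state_to_output(state: dict) -> dict:
    """Hypothetical, simplified stand-in for state_to_output."""
    output: dict[str, Any] = {"reward": state.get("reward")}
    judges = state.get("judges")
    # Emit the field only when at least one judge call was recorded,
    # so rollouts without a judge keep their previous shape.
    if judges:
        output["judges"] = list(judges)
    return output


with_judges = sketch_state_to_output(
    {"reward": 1.0, "judges": [{"judge_input": "q", "judge_output": "yes"}]}
)
without_judges = sketch_state_to_output({"reward": 0.0})
empty_judges = sketch_state_to_output({"reward": 0.0, "judges": []})
```

Omitting the key rather than emitting an empty list is one way to limit the schema change for downstream consumers that do not use judges.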

Reviewed by Cursor Bugbot for commit 8b58330. Bugbot is set up for automated code reviews on this repo.

JudgeRubric.judge() now appends a JudgeRecord (judge_input, judge_output, rubric, model, score, timestamp) to state["judges"] on every call, including cache hits, so two rubrics in a RubricGroup are both attributed. state_to_output auto-emits state["judges"] onto the RolloutOutput when non-empty, so env authors don't need to register a state column. The existing state["judge_response"] cache is preserved for back-compat.

Adds a name= constructor arg to JudgeRubric so multiple instances can self-identify, and a JudgeRecord TypedDict on RolloutOutput.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cursor · 2026-05-01T21:56:12Z

+    rubric: str
+    model: str
+    score: float | None
+    timestamp: float


JudgeRecord total=False makes all fields optional unexpectedly

Low Severity

JudgeRecord uses total=False which makes every field optional at the type level, but the docstring explicitly states that judge_input and judge_output are required. This mismatch means type checkers won't enforce the presence of these two critical fields, silently allowing incomplete records. Using Required[] from typing on those two fields (or splitting into a base TypedDict with total=True) would correctly express the intended contract.
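
One way the suggested fix could look, as a sketch rather than the PR's actual type: `JudgeRecordSketch` is a hypothetical name, and the field names are taken from the commit message. The fallback import assumes `typing_extensions` is installed on interpreters older than Python 3.11.

```python
from typing import Optional

try:
    from typing import Required, TypedDict  # Python 3.11+
except ImportError:  # older interpreters; assumes typing_extensions is installed
    from typing_extensions import Required, TypedDict


class JudgeRecordSketch(TypedDict, total=False):
    # Required[] overrides total=False for these two keys only.
    judge_input: Required[str]
    judge_output: Required[str]
    # The remaining keys stay optional.
    rubric: str
    model: str
    score: Optional[float]
    timestamp: float


# A record carrying only the two required keys is valid for a type checker.
minimal: JudgeRecordSketch = {"judge_input": "prompt", "judge_output": "yes"}
```

With this shape, a type checker would be expected to flag a record that omits `judge_input` or `judge_output`, while still allowing the other fields to be left out.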

cursor · 2026-05-01T21:56:12Z

        self.judge_model = judge_model
        self.judge_prompt = judge_prompt
        self.judge_sampling_args = judge_sampling_args or {}
+        self.name = name or self.__class__.__name__


Missing documentation update for new user-facing features

Low Severity

This PR adds the name= constructor parameter to JudgeRubric and a new judges field on RolloutOutput, both user-facing. The existing docs in docs/environments.md (which shows JudgeRubric usage) and docs/reference.md (which lists RolloutOutput fields and mentions JudgeRubric) are not updated to reflect these additions, violating the documentation update rule.

Additional Locations (1)

verifiers/types.py#L375-L376

Triggered by project rule: BugBot Instructions

cdreetz wants to merge 1 commit into main from feature/judges-view
