feat: capture JudgeRubric input/output for platform Judges View #1275
JudgeRubric.judge() now appends a JudgeRecord (judge_input, judge_output, rubric, model, score, timestamp) to state["judges"] on every call, including cache hits, so two rubrics in a RubricGroup are both attributed. state_to_output auto-emits state["judges"] onto the RolloutOutput when non-empty, so env authors don't need to register a state column. The existing state["judge_response"] cache is preserved for back-compat. Adds a name= constructor arg to JudgeRubric so multiple instances can self-identify, and a JudgeRecord TypedDict on RolloutOutput.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
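As a rough illustration of the recording behaviour described in the commit message, here is a minimal sketch. `SketchJudgeRubric`, `_call_model`, the `"stub-model"` default, and the prompt-keyed cache layout are all hypothetical stand-ins; only the record keys, `state["judges"]`, `state["judge_response"]`, and the `name or self.__class__.__name__` fallback come from the PR itself.

```python
import time
from typing import Any, Dict, Optional


class SketchJudgeRubric:
    """Stand-in for JudgeRubric; not the library's implementation."""

    def __init__(self, judge_model: str = "stub-model", name: Optional[str] = None):
        self.judge_model = judge_model
        # Mirrors the diff: fall back to the class name when no name is given.
        self.name = name or self.__class__.__name__

    def _call_model(self, prompt: str) -> str:
        # Hypothetical stub standing in for the real LLM request.
        return "yes"

    def judge(self, prompt: str, state: Dict[str, Any]) -> str:
        # Assumed cache layout: prompt -> response under state["judge_response"].
        cache = state.setdefault("judge_response", {})
        if prompt not in cache:
            cache[prompt] = self._call_model(prompt)
        output = cache[prompt]
        # Record every call, cache hit or not, so each rubric is attributed.
        record = {
            "judge_input": prompt,
            "judge_output": output,
            "rubric": self.name,
            "model": self.judge_model,
            "score": None,  # assumed to be filled in elsewhere
            "timestamp": time.time(),
        }
        state.setdefault("judges", []).append(record)
        return output


state: Dict[str, Any] = {}
SketchJudgeRubric(name="correctness").judge("Is 2+2=4?", state)
SketchJudgeRubric(name="style").judge("Is 2+2=4?", state)  # served from cache
```

In this sketch the second call hits the cache, yet `state["judges"]` still ends up with one record per rubric, which is the attribution property the commit message is after.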
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 8b58330.
    rubric: str
    model: str
    score: float | None
    timestamp: float
JudgeRecord total=False makes all fields optional unexpectedly
Low Severity
JudgeRecord uses total=False which makes every field optional at the type level, but the docstring explicitly states that judge_input and judge_output are required. This mismatch means type checkers won't enforce the presence of these two critical fields, silently allowing incomplete records. Using Required[] from typing on those two fields (or splitting into a base TypedDict with total=True) would correctly express the intended contract.
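One way the suggested fix could look, using the base-class split so it also runs on Python versions without `typing.Required`. The `_JudgeRecordRequired` name and the value types of `judge_input` and `judge_output` are assumptions; the excerpt above only shows the types of the other four fields.

```python
from typing import Any, Optional, TypedDict


class _JudgeRecordRequired(TypedDict):
    # Keys the docstring says must always be present.
    # Value types are assumed; the diff excerpt does not show them.
    judge_input: Any
    judge_output: str


class JudgeRecord(_JudgeRecordRequired, total=False):
    # Optional metadata, with types taken from the diff excerpt.
    rubric: str
    model: str
    score: Optional[float]
    timestamp: float


# A record with only the two required keys is still well-typed.
record: JudgeRecord = {"judge_input": "prompt text", "judge_output": "yes"}
```

With this shape a type checker should flag a record that omits `judge_input` or `judge_output`, while the four metadata keys stay optional.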
    self.judge_model = judge_model
    self.judge_prompt = judge_prompt
    self.judge_sampling_args = judge_sampling_args or {}
    self.name = name or self.__class__.__name__
Missing documentation update for new user-facing features
Low Severity
This PR adds the name= constructor parameter to JudgeRubric and a new judges field on RolloutOutput, both user-facing. The existing docs in docs/environments.md (which shows JudgeRubric usage) and docs/reference.md (which lists RolloutOutput fields and mentions JudgeRubric) are not updated to reflect these additions, violating the documentation update rule.
Triggered by project rule: BugBot Instructions


Summary
- JudgeRubric.judge() now records each call (input prompt, raw output, rubric name, model, score, timestamp) onto state["judges"]. Every call is recorded, including cache hits, so two JudgeRubric instances in a RubricGroup are both attributed.
- state_to_output auto-emits state["judges"] onto the RolloutOutput when non-empty (no state_columns opt-in needed).
- Adds a name= constructor arg to JudgeRubric so multiple instances can self-identify.

The companion platform PR (https://linear.app/primeintellect/issue/ENG-3552) reads the new judges field on each rollout sample and renders it in a "Judges" section in the rollout view's reward panel, with a fullscreen popup for long judge prompts.

The existing state["judge_response"] cache is preserved unchanged for backwards-compatibility.

Test plan
- tests/test_judges_view.py covers: judge call → record on state, two rubrics in a group both recorded, state_to_output propagates, absence yields no field
- uv run pytest tests/test_judges_view.py tests/test_save_utils.py tests/test_rubric.py tests/test_rubric_group.py tests/test_math_rubric.py: all green
- uv run ruff check: clean

🤖 Generated with Claude Code
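The propagation rule in the summary (emit judges only when non-empty, with no opt-in) can be sketched as follows. `sketch_state_to_output` is a hypothetical reduction written for illustration, and the `reward` field is only a placeholder; the real state_to_output handles many more fields and may differ in detail.

```python
from typing import Any, Dict


def sketch_state_to_output(state: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for state_to_output, reduced to the judges rule."""
    output: Dict[str, Any] = {"reward": state.get("reward")}  # placeholder field
    judges = state.get("judges")
    if judges:
        # Copy so later mutation of state does not alter the emitted output.
        output["judges"] = list(judges)
    return output


with_judges = sketch_state_to_output(
    {"reward": 1.0, "judges": [{"judge_input": "p", "judge_output": "yes"}]}
)
without_judges = sketch_state_to_output({"reward": 0.0})
empty_judges = sketch_state_to_output({"reward": 0.0, "judges": []})
```

Both a missing key and an empty list leave the output without a judges field, matching the "absence yields no field" case in the test plan.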
Note
Medium Risk
Adds a new judges payload to rollout outputs and records judge calls on every JudgeRubric.judge() invocation (including cache hits), which could affect output size/compatibility for downstream consumers expecting a stable schema.

Overview
Adds per-invocation capture of LLM-as-judge calls: JudgeRubric now appends a JudgeRecord (prompt, raw output, rubric identifier, model, timestamp) into state["judges"], with an optional name= to distinguish multiple rubric instances and recording even on cache hits.

Updates serialization so state_to_output/states_to_outputs auto-propagate non-empty state["judges"] into RolloutOutput as judges (no state_columns opt-in), and introduces the JudgeRecord type plus tests covering recording, naming, propagation, and omission when absent.

Reviewed by Cursor Bugbot for commit 8b58330. Bugbot is set up for automated code reviews on this repo.
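For the compatibility risk noted above, a downstream consumer could treat judges as optional and tolerate partial records. This is a sketch under assumptions: `judges_by_rubric` is a hypothetical helper, the rollout is modelled as a plain dict, and the `"unknown"` fallback label is illustrative.

```python
from collections import defaultdict
from typing import Any, Dict, List


def judges_by_rubric(rollout: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Group judge records by rubric name, tolerating rollouts without judges."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    # Older rollouts have no "judges" key at all, so default to an empty list.
    for record in rollout.get("judges", []):
        # The rubric key may be absent on partial records.
        grouped[record.get("rubric", "unknown")].append(record)
    return dict(grouped)


old_rollout = {"reward": 1.0}
new_rollout = {
    "reward": 1.0,
    "judges": [
        {"judge_input": "p1", "judge_output": "yes", "rubric": "correctness"},
        {"judge_input": "p2", "judge_output": "no", "rubric": "style"},
        {"judge_input": "p3", "judge_output": "yes"},
    ],
}
grouped_old = judges_by_rubric(old_rollout)
grouped_new = judges_by_rubric(new_rollout)
```

Reading the field with a default keeps the same code path working for rollouts produced before and after this change.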