feat: capture JudgeRubric input/output for platform Judges View #1275

Open
cdreetz wants to merge 1 commit into main from feature/judges-view


@cdreetz commented May 1, 2026

Summary

  • JudgeRubric.judge() now records each call (input prompt, raw output, rubric name, model, score, timestamp) onto state["judges"]. Every call is recorded — including cache hits — so two JudgeRubric instances in a RubricGroup are both attributed.
  • state_to_output auto-emits state["judges"] onto the RolloutOutput when non-empty (no state_columns opt-in needed).
  • Adds an optional name= constructor arg to JudgeRubric so multiple instances can self-identify.
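A minimal sketch of the recording behaviour described in the bullets above, assuming state is a plain dict. `record_judge_call` is a hypothetical helper written for illustration, not the code in this PR; the field names follow the JudgeRecord fields listed in the commit message.

```python
import time
from typing import Any, Optional


def record_judge_call(
    state: dict[str, Any],
    *,
    judge_input: str,
    judge_output: str,
    rubric: str,
    model: str,
    score: Optional[float] = None,
) -> dict[str, Any]:
    """Append one judge record to state["judges"] (hypothetical helper)."""
    record = {
        "judge_input": judge_input,
        "judge_output": judge_output,
        "rubric": rubric,
        "model": model,
        "score": score,
        "timestamp": time.time(),
    }
    # setdefault creates the list on first use; later calls append to it.
    state.setdefault("judges", []).append(record)
    return record


# Two rubrics judge the same prompt. The second response could come from a
# cache, but the call is still recorded so both rubrics are attributed.
state: dict[str, Any] = {}
record_judge_call(state, judge_input="Is 2+2=4?", judge_output="yes",
                  rubric="correctness", model="judge-model")
record_judge_call(state, judge_input="Is 2+2=4?", judge_output="yes",
                  rubric="style", model="judge-model")
rubric_names = [r["rubric"] for r in state["judges"]]
```

Recording at the call site rather than at the cache-miss site is what keeps the second rubric visible; the values "correctness", "style", and "judge-model" are placeholders.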

The companion platform PR (https://linear.app/primeintellect/issue/ENG-3552) reads the new judges field on each rollout sample and renders it in a "Judges" section in the rollout view's reward panel, with a fullscreen popup for long judge prompts.

The existing state["judge_response"] cache is preserved unchanged for backwards-compatibility.

Test plan

  • tests/test_judges_view.py covers: judge call → record on state, two rubrics in a group both recorded, state_to_output propagates, absence yields no field
  • uv run pytest tests/test_judges_view.py tests/test_save_utils.py tests/test_rubric.py tests/test_rubric_group.py tests/test_math_rubric.py — all green
  • uv run ruff check — clean

🤖 Generated with Claude Code


Note

Medium Risk
Adds a new judges payload to rollout outputs and records judge calls on every JudgeRubric.judge() invocation (including cache hits), which could affect output size/compatibility for downstream consumers expecting a stable schema.

Overview
Adds per-invocation capture of LLM-as-judge calls: JudgeRubric now appends a JudgeRecord (prompt, raw output, rubric identifier, model, score, timestamp) to state["judges"] on every call, including cache hits. An optional name= argument distinguishes multiple rubric instances.

Updates serialization so state_to_output/states_to_outputs auto-propagate non-empty state["judges"] into RolloutOutput as judges (no state_columns opt-in), and introduces the JudgeRecord type plus tests covering recording, naming, propagation, and omission when absent.
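The propagation rule can be sketched as follows, assuming both the state and the output are plain dicts. `attach_judges` is a hypothetical stand-in for the relevant step inside state_to_output, not the actual implementation.

```python
from typing import Any


def attach_judges(state: dict[str, Any], output: dict[str, Any]) -> dict[str, Any]:
    """Copy state["judges"] onto the output only when non-empty (hypothetical helper)."""
    judges = state.get("judges")
    if judges:  # None and [] are both skipped, so the field is omitted
        output["judges"] = list(judges)
    return output


with_judges = attach_judges({"judges": [{"rubric": "correctness"}]}, {"reward": 1.0})
without_judges = attach_judges({}, {"reward": 1.0})
```

Omitting the key when there is nothing to report leaves outputs from environments without a JudgeRubric unchanged, which is the case the "absence yields no field" test appears to cover.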

Reviewed by Cursor Bugbot for commit 8b58330. Bugbot is set up for automated code reviews on this repo. Configure here.

JudgeRubric.judge() now appends a JudgeRecord (judge_input, judge_output,
rubric, model, score, timestamp) to state["judges"] on every call,
including cache hits, so two rubrics in a RubricGroup are both
attributed.

state_to_output auto-emits state["judges"] onto the RolloutOutput when
non-empty, so env authors don't need to register a state column. The
existing state["judge_response"] cache is preserved for back-compat.

Adds a name= constructor arg to JudgeRubric so multiple instances can
self-identify, and a JudgeRecord TypedDict on RolloutOutput.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment thread verifiers/types.py
rubric: str
model: str
score: float | None
timestamp: float

JudgeRecord total=False makes all fields optional unexpectedly

Low Severity

JudgeRecord uses total=False which makes every field optional at the type level, but the docstring explicitly states that judge_input and judge_output are required. This mismatch means type checkers won't enforce the presence of these two critical fields, silently allowing incomplete records. Using Required[] from typing on those two fields (or splitting into a base TypedDict with total=True) would correctly express the intended contract.
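One way to express the contract suggested above is the base-class split, sketched here under the assumption that only judge_input and judge_output are meant to be required. The `_JudgeRecordRequired` name is hypothetical, and `Optional[float]` stands in for `float | None` so the sketch also runs on older Python versions.

```python
from typing import Optional, TypedDict


class _JudgeRecordRequired(TypedDict):
    # total=True by default, so type checkers require both keys.
    judge_input: str
    judge_output: str


class JudgeRecord(_JudgeRecordRequired, total=False):
    # Metadata keys that may be omitted.
    rubric: str
    model: str
    score: Optional[float]
    timestamp: float


# A record with only the two required keys type-checks.
record: JudgeRecord = {"judge_input": "prompt", "judge_output": "yes"}
required = JudgeRecord.__required_keys__
optional = JudgeRecord.__optional_keys__
```

The alternative mentioned in the comment, wrapping the two fields in Required[], keeps a single class but needs Python 3.11 or typing_extensions.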


self.judge_model = judge_model
self.judge_prompt = judge_prompt
self.judge_sampling_args = judge_sampling_args or {}
self.name = name or self.__class__.__name__

Missing documentation update for new user-facing features

Low Severity

This PR adds the name= constructor parameter to JudgeRubric and a new judges field on RolloutOutput, both user-facing. The existing docs in docs/environments.md (which shows JudgeRubric usage) and docs/reference.md (which lists RolloutOutput fields and mentions JudgeRubric) are not updated to reflect these additions, violating the documentation update rule.
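As a starting point for the docs, the naming fallback from the diff could be illustrated like this. `JudgeRubricSketch` is a stand-in class for illustration only; it reproduces the `name or self.__class__.__name__` line and nothing else from the real JudgeRubric.

```python
from typing import Optional


class JudgeRubricSketch:
    """Stand-in for illustration only; not the real JudgeRubric."""

    def __init__(self, name: Optional[str] = None) -> None:
        # Same fallback as the diff: use the class name when no name is given.
        self.name = name or self.__class__.__name__


default_name = JudgeRubricSketch().name
custom_name = JudgeRubricSketch(name="factuality").name
```

With this fallback, two unnamed instances of the same class share a name, so passing name= explicitly is what lets records from multiple rubrics be told apart.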


Triggered by project rule: BugBot Instructions
