
Fix eval summary for heterogeneous metrics #1295

Merged
willccbb merged 1 commit into main from fix-heterogeneous-eval-metrics on May 8, 2026
Conversation

xeophon (Member) commented May 6, 2026

Summary

  • build metric summary columns from the union of rollout metric keys
  • keep sparse metrics sparse: missing metric keys are skipped in print_rewards averages and rollout lists instead of being zero-filled
  • add a regression test for heterogeneous rollout metrics

Tests

  • uv run pytest tests/test_eval_utils.py tests/test_save_utils.py -k 'eval_utils or EnvMetrics'
  • uv run pre-commit run --files verifiers/utils/eval_utils.py tests/test_eval_utils.py

Note

Low risk: changes are limited to evaluation summary formatting/aggregation plus new tests; the main risk is altering printed-output expectations for downstream consumers.

Overview
Printed eval metric summaries now support heterogeneous rollout metrics by building columns from the union of metric keys and treating missing values as absent (excluded from avg/std and per-rollout lists) instead of raising errors or keying columns off the first rollout's metrics.
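
As a rough illustration of that behavior, a minimal sketch is below; summarize_metrics is a hypothetical name, not the actual helper in verifiers/utils/eval_utils.py.

```python
# Minimal sketch: columns from the union of metric keys, with missing
# values skipped rather than zero-filled. Names here are illustrative.
import statistics


def summarize_metrics(
    rollout_metrics: list[dict[str, float]],
) -> dict[str, dict[str, float]]:
    # Union of keys across all rollouts, not just the first rollout's.
    keys = sorted({key for metrics in rollout_metrics for key in metrics})
    summary = {}
    for key in keys:
        # Only rollouts that actually report this key contribute.
        values = [m[key] for m in rollout_metrics if key in m]
        summary[key] = {
            "avg": statistics.fmean(values),
            "std": statistics.pstdev(values),
        }
    return summary


# e.g. summarize_metrics([{"a": 1.0}, {}]) -> {"a": {"avg": 1.0, "std": 0.0}}
```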

Adds a regression test ensuring print_results correctly prints averages and per-rollout values when some outputs omit certain metric keys.
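
A hypothetical shape for that test, reusing the summarize_metrics sketch above (the real test goes through print_results and asserts on captured stdout):

```python
# Sketch of a regression test for heterogeneous rollout metrics; the
# helper and assertion style are assumptions, not the repository's code.
def test_summary_skips_missing_metric_keys():
    metrics_per_rollout = [
        {"rlm_compactions_count": 1.0},  # first rollout reports the metric
        {},  # second rollout omits the key entirely
    ]
    summary = summarize_metrics(metrics_per_rollout)
    # The average is taken over present values only, so it is 1.0, not 0.5.
    assert summary["rlm_compactions_count"]["avg"] == 1.0
```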

Reviewed by Cursor Bugbot for commit ed7e06e.


chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b63b84aa8a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread on verifiers/utils/eval_utils.py (outdated):

```diff
-    return {k: [m[k] for m in list_of_dicts] for k in list_of_dicts[0].keys()}
+    """Convert a list of mappings to a dictionary of lists."""
+    keys = sorted({key for mapping in list_of_dicts for key in mapping})
+    return {key: [mapping.get(key, 0.0) for mapping in list_of_dicts] for key in keys}
```

P2: Align sparse metric averages with metadata

When only some rollouts include a metric, this zero-fills the missing entries for print_rewards, but the saved metadata/live display are built by EnvMetrics, which only counts outputs where the key is present (verifiers/utils/metric_utils.py:117-126). In that sparse-metric scenario the same run can print rlm_compactions_count as averaged over all rollouts while metadata["avg_metrics"] stores the average over present values, making the CLI summary disagree with saved evaluation results; either the accumulator needs the same zero-fill semantics or printing should skip missing values.
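
A worked toy example of that disagreement, with illustrative numbers:

```python
# One rollout reports the metric, the other omits it.
rollout_metrics = [{"rlm_compactions_count": 1.0}, {}]

# Zero-fill semantics (what print_rewards would average at this commit):
zero_filled = [m.get("rlm_compactions_count", 0.0) for m in rollout_metrics]
print(sum(zero_filled) / len(zero_filled))  # 0.5

# Present-only semantics (how EnvMetrics builds metadata["avg_metrics"]):
present = [m["rlm_compactions_count"] for m in rollout_metrics if "rlm_compactions_count" in m]
print(sum(present) / len(present))  # 1.0
```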


willccbb (Member) commented May 7, 2026

@xeophon hmm not sure if we wanna default to 0 or smooth out? could be confusing to pull lower

xeophon force-pushed the fix-heterogeneous-eval-metrics branch from b63b84a to ab0a9b4 on May 7, 2026 at 12:28

chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab0a9b4928


Comment thread on tests/test_eval_utils.py (outdated):

```python
captured = capsys.readouterr()

assert "rlm_compactions_count: avg - 0.500" in captured.out
assert "r1: [0.0, 1.0]" in captured.out
```

P2: Fix the reversed reward assertion

In this test setup the first output has reward=1.0 and the second has reward=0.0, and print_results() delegates to print_rewards() without reordering outputs, so the printed single-rollout row is r1: [1.0, 0.0]. This assertion expects the reverse order, causing the new regression test to fail instead of validating the heterogeneous metric behavior.


xeophon force-pushed the fix-heterogeneous-eval-metrics branch from ab0a9b4 to ed7e06e on May 7, 2026 at 12:32
xeophon (Member, Author) commented May 7, 2026

@willccbb changed; missing metrics aren't set to 0 anymore and are skipped instead

willccbb merged commit 558aa8a into main on May 8, 2026
8 checks passed