
Fix eval summary for heterogeneous metrics #1295

Merged
willccbb merged 1 commit into main from fix-heterogeneous-eval-metrics on May 8, 2026
Conversation

xeophon (Member) commented May 6, 2026

Summary

  • build metric summary columns from the union of rollout metric keys
  • keep sparse metrics sparse: missing metric keys are skipped in print_rewards averages and rollout lists instead of being zero-filled
  • add a regression test for heterogeneous rollout metrics

Tests

  • uv run pytest tests/test_eval_utils.py tests/test_save_utils.py -k 'eval_utils or EnvMetrics'
  • uv run pre-commit run --files verifiers/utils/eval_utils.py tests/test_eval_utils.py

Note

Low risk: changes are limited to evaluation summary formatting/aggregation plus new tests; the main risk is altering printed-output expectations for downstream consumers.

Overview
Printed eval metric summaries now support heterogeneous rollout metrics by building columns from the union of metric keys and treating missing values as absent (excluded from avg/std and per-rollout lists) instead of raising errors or keying columns off the first rollout's metrics.
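
As a rough illustration of that behavior, a minimal sketch is below; summarize_metrics is a hypothetical name, not the actual helper in verifiers/utils/eval_utils.py.

```python
# Minimal sketch: columns from the union of metric keys, with missing
# values skipped rather than zero-filled. Names here are illustrative.
import statistics


def summarize_metrics(
    rollout_metrics: list[dict[str, float]],
) -> dict[str, dict[str, float]]:
    # Union of keys across all rollouts, not just the first rollout's.
    keys = sorted({key for metrics in rollout_metrics for key in metrics})
    summary = {}
    for key in keys:
        # Only rollouts that actually report this key contribute.
        values = [m[key] for m in rollout_metrics if key in m]
        summary[key] = {
            "avg": statistics.fmean(values),
            "std": statistics.pstdev(values),
        }
    return summary


# e.g. summarize_metrics([{"a": 1.0}, {}]) -> {"a": {"avg": 1.0, "std": 0.0}}
```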

Adds a regression test ensuring print_results correctly prints averages and per-rollout values when some outputs omit certain metric keys.
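
A hypothetical shape for that test, reusing the summarize_metrics sketch above (the real test goes through print_results and asserts on captured stdout):

```python
# Sketch of a regression test for heterogeneous rollout metrics; the
# helper and assertion style are assumptions, not the repository's code.
def test_summary_skips_missing_metric_keys():
    metrics_per_rollout = [
        {"rlm_compactions_count": 1.0},  # first rollout reports the metric
        {},  # second rollout omits the key entirely
    ]
    summary = summarize_metrics(metrics_per_rollout)
    # The average is taken over present values only, so it is 1.0, not 0.5.
    assert summary["rlm_compactions_count"]["avg"] == 1.0
```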

Reviewed by Cursor Bugbot for commit ed7e06e.


chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b63b84aa8a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread on verifiers/utils/eval_utils.py (outdated):

```diff
-    return {k: [m[k] for m in list_of_dicts] for k in list_of_dicts[0].keys()}
+    """Convert a list of mappings to a dictionary of lists."""
+    keys = sorted({key for mapping in list_of_dicts for key in mapping})
+    return {key: [mapping.get(key, 0.0) for mapping in list_of_dicts] for key in keys}
```

P2: Align sparse metric averages with metadata

When only some rollouts include a metric, this zero-fills the missing entries for print_rewards, but the saved metadata/live display are built by EnvMetrics, which only counts outputs where the key is present (verifiers/utils/metric_utils.py:117-126). In that sparse-metric scenario the same run can print rlm_compactions_count as averaged over all rollouts while metadata["avg_metrics"] stores the average over present values, making the CLI summary disagree with saved evaluation results; either the accumulator needs the same zero-fill semantics or printing should skip missing values.
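
A worked toy example of that disagreement, with illustrative numbers:

```python
# One rollout reports the metric, the other omits it.
rollout_metrics = [{"rlm_compactions_count": 1.0}, {}]

# Zero-fill semantics (what print_rewards would average at this commit):
zero_filled = [m.get("rlm_compactions_count", 0.0) for m in rollout_metrics]
print(sum(zero_filled) / len(zero_filled))  # 0.5

# Present-only semantics (how EnvMetrics builds metadata["avg_metrics"]):
present = [m["rlm_compactions_count"] for m in rollout_metrics if "rlm_compactions_count" in m]
print(sum(present) / len(present))  # 1.0
```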


willccbb (Member) commented May 7, 2026

@xeophon hmm not sure if we wanna default to 0 or smooth out? could be confusing to pull lower

xeophon force-pushed the fix-heterogeneous-eval-metrics branch from b63b84a to ab0a9b4 on May 7, 2026 at 12:28

chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab0a9b4928


Comment thread on tests/test_eval_utils.py (outdated):

```python
captured = capsys.readouterr()

assert "rlm_compactions_count: avg - 0.500" in captured.out
assert "r1: [0.0, 1.0]" in captured.out
```

P2: Fix the reversed reward assertion

In this test setup the first output has reward=1.0 and the second has reward=0.0, and print_results() delegates to print_rewards() without reordering outputs, so the printed single-rollout row is r1: [1.0, 0.0]. This assertion expects the reverse order, causing the new regression test to fail instead of validating the heterogeneous metric behavior.


xeophon force-pushed the fix-heterogeneous-eval-metrics branch from ab0a9b4 to ed7e06e on May 7, 2026 at 12:32
xeophon (Member, Author) commented May 7, 2026

@willccbb changed; missing metrics aren't set to 0 anymore and are skipped instead

willccbb merged commit 558aa8a into main on May 8, 2026
8 checks passed