Fix eval summary for heterogeneous metrics #1295
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b63b84aa8a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```diff
-    return {k: [m[k] for m in list_of_dicts] for k in list_of_dicts[0].keys()}
+    """Convert a list of mappings to a dictionary of lists."""
+    keys = sorted({key for mapping in list_of_dicts for key in mapping})
+    return {key: [mapping.get(key, 0.0) for mapping in list_of_dicts] for key in keys}
```
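For context, a minimal sketch of what the rewritten helper does at this stage of the PR, building columns from the union of keys and zero-filling missing metrics (the standalone function name below is made up for illustration; the PR changes an inline expression):

```python
def transpose_metrics(list_of_dicts: list[dict[str, float]]) -> dict[str, list[float]]:
    # Columns come from the union of keys across all rollouts, sorted for stable order.
    keys = sorted({key for mapping in list_of_dicts for key in mapping})
    # Rollouts that do not report a metric are zero-filled at this point in the PR.
    return {key: [mapping.get(key, 0.0) for mapping in list_of_dicts] for key in keys}

rollouts = [{"reward": 1.0, "rlm_compactions_count": 1.0}, {"reward": 0.0}]
print(transpose_metrics(rollouts))
# {'reward': [1.0, 0.0], 'rlm_compactions_count': [1.0, 0.0]}
```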
Align sparse metric averages with metadata
When only some rollouts include a metric, this zero-fills the missing entries for print_rewards, but the saved metadata/live display are built by EnvMetrics, which only counts outputs where the key is present (verifiers/utils/metric_utils.py:117-126). In that sparse-metric scenario the same run can print rlm_compactions_count as averaged over all rollouts while metadata["avg_metrics"] stores the average over present values, making the CLI summary disagree with saved evaluation results; either the accumulator needs the same zero-fill semantics or printing should skip missing values.
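To make the mismatch concrete, a minimal standalone sketch (not the actual verifiers code) of the two averaging semantics for a metric reported by only one of two rollouts:

```python
rollouts = [{"rlm_compactions_count": 1.0}, {}]  # metric present in only one rollout
key = "rlm_compactions_count"

# Zero-fill semantics (what the print path would show with this change):
zero_filled = [r.get(key, 0.0) for r in rollouts]
print(sum(zero_filled) / len(zero_filled))  # 0.5

# Present-only semantics (how the metadata accumulator counts it):
present = [r[key] for r in rollouts if key in r]
print(sum(present) / len(present))  # 1.0
```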
@xeophon hmm not sure if we wanna default to 0 or smooth out? could be confusing to pull lower
Force-pushed from b63b84a to ab0a9b4.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab0a9b4928
```python
captured = capsys.readouterr()

assert "rlm_compactions_count: avg - 0.500" in captured.out
assert "r1: [0.0, 1.0]" in captured.out
```
Fix the reversed reward assertion
In this test setup the first output has reward=1.0 and the second has reward=0.0, and print_results() delegates to print_rewards() without reordering outputs, so the printed single-rollout row is r1: [1.0, 0.0]. This assertion expects the reverse order, causing the new regression test to fail instead of validating the heterogeneous metric behavior.
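Assuming the fixture order stays as described (first output reward=1.0, second reward=0.0), the suggested correction is simply to assert the actual output order:

```python
# print_rewards() keeps output order, so the single-rollout row for rewards [1.0, 0.0] is:
assert "r1: [1.0, 0.0]" in captured.out
```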
Force-pushed from ab0a9b4 to ed7e06e.
@willccbb changed, doesn't set it to 0 anymore and instead skips them
Summary
Tests
Note: Low risk. Changes are limited to evaluation summary formatting/aggregation and add tests; the main risk is altering printed output expectations for downstream consumers.
Overview
Printed eval metric summaries now support heterogeneous rollout metrics by building columns from the union of metric keys and treating missing values as absent (excluded from avg/std and per-rollout lists) instead of raising or keying off the first rollout. Adds a regression test ensuring print_results correctly prints averages and per-rollout values when some outputs omit certain metric keys.

Reviewed by Cursor Bugbot for commit ed7e06e.
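As a rough illustration of the final behavior described above, a standalone sketch (not the actual print_rewards/print_results implementation; the helper name and return shape are made up):

```python
import statistics

def summarize(rollout_metrics: list[dict[str, float]]) -> dict[str, dict]:
    # Columns come from the union of metric keys across all rollouts.
    keys = sorted({key for metrics in rollout_metrics for key in metrics})
    summary = {}
    for key in keys:
        # Missing values are skipped rather than zero-filled, so sparse metrics
        # are averaged only over the rollouts that actually report them.
        present = [m[key] for m in rollout_metrics if key in m]
        summary[key] = {
            "values": present,
            "avg": statistics.mean(present),
            "std": statistics.pstdev(present) if len(present) > 1 else 0.0,
        }
    return summary

print(summarize([{"reward": 1.0, "rlm_compactions_count": 2.0}, {"reward": 0.0}]))
# {'reward': {'values': [1.0, 0.0], 'avg': 0.5, 'std': 0.5},
#  'rlm_compactions_count': {'values': [2.0], 'avg': 2.0, 'std': 0.0}}
```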