fix(math_group): replace raw == with MathRubric for math-aware answer comparison#1315

Open

jfeldstein wants to merge 5 commits into PrimeIntellect-ai:main from jfeldstein:fix/math-group-use-math-rubric
Conversation

@jfeldstein jfeldstein commented May 8, 2026

Currently

Both sub-env rubrics in math_group use raw Python string equality to score answers:

return 1.0 if response == answer else 0.0

Semantically equivalent LaTeX forms receive 0 reward instead of 1 — e.g. \frac{1}{2} vs 0.5, \frac{3}{4} vs \dfrac{3}{4}, 1 + x vs x + 1. math-verify was already declared as a dependency in environments/math_group/pyproject.toml but never used.
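The failure mode can be reproduced with nothing but the standard library. The sketch below is only an illustration of the problem, not the PR's actual code: it uses `fractions.Fraction` as a stand-in for the full symbolic comparison that math-verify performs on LaTeX, and the function names are hypothetical.

```python
from fractions import Fraction

def string_equality_reward(response: str, answer: str) -> float:
    # The pre-fix behavior: exact string match only.
    return 1.0 if response == answer else 0.0

def numeric_equivalence_reward(response: str, answer: str) -> float:
    # Stand-in for a math-aware check: compare parsed values,
    # falling back to string equality for non-numeric answers.
    try:
        return 1.0 if Fraction(response) == Fraction(answer) else 0.0
    except ValueError:
        return string_equality_reward(response, answer)

print(string_equality_reward("1/2", "0.5"))      # 0.0 — penalized despite being equal
print(numeric_equivalence_reward("1/2", "0.5"))  # 1.0
```

math-verify generalizes this idea to LaTeX: it parses both strings into symbolic expressions before comparing, which is why `\frac{1}{2}` vs `0.5` and `1 + x` vs `x + 1` can score as equal.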

Change Summary

Two independent TDD cycles, four commits:

Reward function fix (environments/math_group/math_group.py)

  • Replace both vf.Rubric(...) calls with vf.MathRubric(parser=parser), consistent with gsm8k, math_python, and doublecheck environments
  • Format-reward weights unchanged: gsm8k 0.0 (metric), math 0.2 (weighted)
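The shape of the change is roughly the following (illustrative diff; the exact constructor arguments of the old `vf.Rubric` call are assumptions, not copied from the file):

```diff
-rubric = vf.Rubric(funcs=..., weights=...)   # answer scored via: response == answer
+rubric = vf.MathRubric(parser=parser)        # answer scored via math-verify equivalence
```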

Teardown propagation fix (verifiers/envs/env_group.py)

  • Add EnvGroupRubric.teardown() override to propagate teardown to child rubrics
  • MathRubric spawns a ProcessPoolExecutor; previously those workers leaked when the EnvGroup was torn down (vf.Rubric held no resources so this was harmless before this PR)
  • Chain: EnvGroupRubric.teardown() → RubricGroup.teardown() → MathRubric.teardown() → executor shutdown
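The propagation pattern can be sketched with stdlib-only stubs. This is a minimal sketch of the structure the PR describes, not the real verifiers classes: the actual methods may be async and carry more state, and the `Env` stub here is hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor

class Rubric:
    """Base rubric: holds no resources, so teardown is a no-op."""
    def teardown(self) -> None:
        pass

class MathRubric(Rubric):
    """Owns a process pool that must be shut down explicitly."""
    def __init__(self):
        self.executor = ProcessPoolExecutor(max_workers=2)

    def teardown(self) -> None:
        self.executor.shutdown(wait=True)

class RubricGroup(Rubric):
    """Fans teardown out to every rubric it wraps."""
    def __init__(self, rubrics):
        self.rubrics = rubrics

    def teardown(self) -> None:
        for rubric in self.rubrics:
            rubric.teardown()

class Env:
    """Hypothetical stand-in for a child environment holding a rubric."""
    def __init__(self, rubric):
        self.rubric = rubric

class EnvGroupRubric(Rubric):
    """The fix: propagate teardown to each child environment's rubric."""
    def __init__(self, envs):
        self.envs = envs

    def teardown(self) -> None:
        for env in self.envs:
            env.rubric.teardown()

# Tearing down the group now reaches the executor instead of leaking it.
math_rubric = MathRubric()
EnvGroupRubric([Env(RubricGroup([math_rubric]))]).teardown()
```

Without the `EnvGroupRubric.teardown()` override, the base no-op runs and the pool's worker processes outlive the group.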

Acceptance Criteria

  • pytest tests/test_math_group.py — 13 tests pass
  • pytest tests/test_math_rubric.py tests/test_rubric.py tests/test_rubric_group.py tests/test_env_group.py — existing tests still pass
  • \frac{1}{2} vs 0.5, \frac{3}{4} vs \dfrac{3}{4}, 1 + x vs x + 1 all score 1.0
  • Wrong answers still score 0.0
  • MathRubric executors are shut down when EnvGroup is torn down

Notes

  • No new dependencies — math-verify was already in the environment's pyproject.toml
  • Metric key changes from math_answer_reward_func / gsm8k_answer_reward_func to correct_answer, consistent with every other math environment in the repo. Both old names are confirmed unused anywhere else in the codebase (ripgrep: 0 matches).

Note

Medium Risk
Changes reward scoring semantics for math_group from string equality to math-aware equivalence, which can affect training/eval outcomes. Also adds teardown propagation in EnvGroupRubric, touching lifecycle management and async cleanup paths.

Overview
math_group now uses vf.MathRubric for both gsm8k and math sub-environments, replacing raw string-equality answer checks so symbolically equivalent LaTeX/decimal forms score as correct; the math env still applies a 0.2 weighted format bonus while gsm8k tracks format as a metric.

Adds an EnvGroupRubric.teardown() override to propagate teardown to child environment rubrics (preventing MathRubric executor leaks), and introduces tests/test_math_group.py covering equivalence scoring, format bonus contribution, wrong-answer regression, and teardown propagation.

Reviewed by Cursor Bugbot for commit d1bef95.


@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 679975e.

Comment thread environments/math_group/math_group.py Outdated
@jfeldstein jfeldstein force-pushed the fix/math-group-use-math-rubric branch 2 times, most recently from e163b12 to be0f944 on May 8, 2026 at 13:39
@jfeldstein jfeldstein marked this pull request as draft on May 8, 2026 at 13:48
@jfeldstein jfeldstein force-pushed the fix/math-group-use-math-rubric branch from be0f944 to 634dd86 on May 8, 2026 at 13:56
@jfeldstein jfeldstein marked this pull request as ready for review on May 8, 2026 at 13:57
jfeldstein and others added 4 commits May 8, 2026 10:01
MathRubric uses math_verify for symbolic comparison; raw string equality
misses cases like \frac{1}{2} vs 0.5 or \frac{3}{4} vs \dfrac{3}{4}.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both sub-env rubrics now use vf.MathRubric, consistent with gsm8k,
math_python, and doublecheck environments. Format-reward weights are
unchanged (gsm8k: 0.0 metric, math: 0.2 weighted).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MathRubric spawns a ProcessPoolExecutor; without teardown propagation
those workers leak when the EnvGroup is torn down.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
EnvGroupRubric.teardown() now iterates child envs and calls
rubric.teardown() on each, which propagates through RubricGroup to
MathRubric.teardown() and shuts down the ProcessPoolExecutor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jfeldstein jfeldstein force-pushed the fix/math-group-use-math-rubric branch from 634dd86 to 9b12cb0 on May 8, 2026 at 14:04
CI Style workflow runs ruff format --check.
