fix(math_group): replace raw == with MathRubric for math-aware answer comparison#1315

Open

jfeldstein wants to merge 5 commits into PrimeIntellect-ai:main from jfeldstein:fix/math-group-use-math-rubric
Conversation

@jfeldstein jfeldstein commented May 8, 2026

Currently

Both sub-env rubrics in math_group use raw Python string equality to score answers:

return 1.0 if response == answer else 0.0

Semantically equivalent LaTeX forms receive 0 reward instead of 1 — e.g. \frac{1}{2} vs 0.5, \frac{3}{4} vs \dfrac{3}{4}, 1 + x vs x + 1. math-verify was already declared as a dependency in environments/math_group/pyproject.toml but never used.
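The failure mode can be reproduced with nothing but the standard library. The sketch below is only an illustration of the problem, not the PR's actual code: it uses `fractions.Fraction` as a stand-in for the full symbolic comparison that math-verify performs on LaTeX, and the function names are hypothetical.

```python
from fractions import Fraction

def string_equality_reward(response: str, answer: str) -> float:
    # The pre-fix behavior: exact string match only.
    return 1.0 if response == answer else 0.0

def numeric_equivalence_reward(response: str, answer: str) -> float:
    # Stand-in for a math-aware check: compare parsed values,
    # falling back to string equality for non-numeric answers.
    try:
        return 1.0 if Fraction(response) == Fraction(answer) else 0.0
    except ValueError:
        return string_equality_reward(response, answer)

print(string_equality_reward("1/2", "0.5"))      # 0.0 — penalized despite being equal
print(numeric_equivalence_reward("1/2", "0.5"))  # 1.0
```

math-verify generalizes this idea to LaTeX: it parses both strings into symbolic expressions before comparing, which is why `\frac{1}{2}` vs `0.5` and `1 + x` vs `x + 1` can score as equal.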

Change Summary

Two independent TDD cycles, four commits:

Reward function fix (environments/math_group/math_group.py)

  • Replace both vf.Rubric(...) calls with vf.MathRubric(parser=parser), consistent with gsm8k, math_python, and doublecheck environments
  • Format-reward weights unchanged: gsm8k 0.0 (metric), math 0.2 (weighted)
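The shape of the change is roughly the following (illustrative diff; the exact constructor arguments of the old `vf.Rubric` call are assumptions, not copied from the file):

```diff
-rubric = vf.Rubric(funcs=..., weights=...)   # answer scored via: response == answer
+rubric = vf.MathRubric(parser=parser)        # answer scored via math-verify equivalence
```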

Teardown propagation fix (verifiers/envs/env_group.py)

  • Add EnvGroupRubric.teardown() override to propagate teardown to child rubrics
  • MathRubric spawns a ProcessPoolExecutor; previously those workers leaked when the EnvGroup was torn down (vf.Rubric held no resources so this was harmless before this PR)
  • Chain: EnvGroupRubric.teardown() → RubricGroup.teardown() → MathRubric.teardown() → executor shutdown
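The propagation pattern can be sketched with stdlib-only stubs. This is a minimal sketch of the structure the PR describes, not the real verifiers classes: the actual methods may be async and carry more state, and the `Env` stub here is hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor

class Rubric:
    """Base rubric: holds no resources, so teardown is a no-op."""
    def teardown(self) -> None:
        pass

class MathRubric(Rubric):
    """Owns a process pool that must be shut down explicitly."""
    def __init__(self):
        self.executor = ProcessPoolExecutor(max_workers=2)

    def teardown(self) -> None:
        self.executor.shutdown(wait=True)

class RubricGroup(Rubric):
    """Fans teardown out to every rubric it wraps."""
    def __init__(self, rubrics):
        self.rubrics = rubrics

    def teardown(self) -> None:
        for rubric in self.rubrics:
            rubric.teardown()

class Env:
    """Hypothetical stand-in for a child environment holding a rubric."""
    def __init__(self, rubric):
        self.rubric = rubric

class EnvGroupRubric(Rubric):
    """The fix: propagate teardown to each child environment's rubric."""
    def __init__(self, envs):
        self.envs = envs

    def teardown(self) -> None:
        for env in self.envs:
            env.rubric.teardown()

# Tearing down the group now reaches the executor instead of leaking it.
math_rubric = MathRubric()
EnvGroupRubric([Env(RubricGroup([math_rubric]))]).teardown()
```

Without the `EnvGroupRubric.teardown()` override, the base no-op runs and the pool's worker processes outlive the group.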

Acceptance Criteria

  • pytest tests/test_math_group.py — 13 tests pass
  • pytest tests/test_math_rubric.py tests/test_rubric.py tests/test_rubric_group.py tests/test_env_group.py — existing tests still pass
  • \frac{1}{2} vs 0.5, \frac{3}{4} vs \dfrac{3}{4}, 1 + x vs x + 1 all score 1.0
  • Wrong answers still score 0.0
  • MathRubric executors are shut down when EnvGroup is torn down

Notes

  • No new dependencies — math-verify was already in the environment's pyproject.toml
  • Metric key changes from math_answer_reward_func / gsm8k_answer_reward_func to correct_answer, consistent with every other math environment in the repo. Both old names are confirmed unused anywhere else in the codebase (ripgrep: 0 matches).

Note

Medium Risk
Changes reward scoring semantics for math_group from string equality to math-aware equivalence, which can affect training/eval outcomes. Also adds teardown propagation in EnvGroupRubric, touching lifecycle management and async cleanup paths.

Overview
math_group now uses vf.MathRubric for both gsm8k and math sub-environments, replacing raw string-equality answer checks so symbolically equivalent LaTeX/decimal forms score as correct; the math env still applies a 0.2 weighted format bonus while gsm8k tracks format as a metric.

Adds an EnvGroupRubric.teardown() override to propagate teardown to child environment rubrics (preventing MathRubric executor leaks), and introduces tests/test_math_group.py covering equivalence scoring, format bonus contribution, wrong-answer regression, and teardown propagation.

Reviewed by Cursor Bugbot for commit d1bef95.


@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 679975e.

Comment thread environments/math_group/math_group.py Outdated
@jfeldstein jfeldstein force-pushed the fix/math-group-use-math-rubric branch 2 times, most recently from e163b12 to be0f944 on May 8, 2026 at 13:39
@jfeldstein jfeldstein marked this pull request as draft on May 8, 2026 at 13:48
@jfeldstein jfeldstein force-pushed the fix/math-group-use-math-rubric branch from be0f944 to 634dd86 on May 8, 2026 at 13:56
@jfeldstein jfeldstein marked this pull request as ready for review on May 8, 2026 at 13:57
jfeldstein and others added 4 commits May 8, 2026 10:01
MathRubric uses math_verify for symbolic comparison; raw string equality
misses cases like \frac{1}{2} vs 0.5 or \frac{3}{4} vs \dfrac{3}{4}.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both sub-env rubrics now use vf.MathRubric, consistent with gsm8k,
math_python, and doublecheck environments. Format-reward weights are
unchanged (gsm8k: 0.0 metric, math: 0.2 weighted).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MathRubric spawns a ProcessPoolExecutor; without teardown propagation
those workers leak when the EnvGroup is torn down.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
EnvGroupRubric.teardown() now iterates child envs and calls
rubric.teardown() on each, which propagates through RubricGroup to
MathRubric.teardown() and shuts down the ProcessPoolExecutor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jfeldstein jfeldstein force-pushed the fix/math-group-use-math-rubric branch from 634dd86 to 9b12cb0 on May 8, 2026 at 14:04
CI Style workflow runs ruff format --check.
