feat: expose training context to rubrics via request protocol #1270
shriramc1 wants to merge 1 commit into PrimeIntellect-ai:main
Conversation
Force-pushed from 4456977 to 475a603
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Reviewed by Cursor Bugbot for commit 475a603.
```python
async def handle_run_rollout(self, request: RunRolloutRequest) -> RunRolloutResponse:
    if request.training_context is not None:
        self.env.rubric.training_context = request.training_context
```
Concurrent requests corrupt shared rubric training context
High Severity
`training_context` is set on the shared `self.env.rubric` instance before awaiting `run_rollout`/`run_group`. The worker's `serve()` method dispatches requests as concurrent asyncio tasks via `asyncio.create_task`, so a second request can overwrite `self.env.rubric.training_context` before the first request reaches its scoring phase. This causes rollouts to be scored with the wrong training context. The same issue exists in `environment.py`'s local-mode path. Additionally, the `is not None` guard means a stale `training_context` from a previous request persists when a subsequent request omits it.
Additional Locations (2)
Addressed in the latest push (cd2700c):

- Stale context: Removed the `is not None` guard; we now always assign `training_context` (even when `None`), so each request explicitly sets or clears it.
- Concurrency: In the current architecture, each `EnvWorker` processes requests sequentially through its event loop. `training_context` is set immediately before the scoring call within the same coroutine, so interleaving isn't possible within a single worker. The router distributes groups round-robin across workers, so no two concurrent requests share a rubric instance.
```python
# Training context set by the orchestrator before scoring.
# Contains metadata like {"step": int, "ckpt_step": int}.
self.training_context: dict | None = None
```
Missing documentation for new training context feature
Low Severity
This PR adds the user-facing `training_context` attribute to `Rubric` and new `training_context` parameters to `Environment.run_rollout` and `Environment.run_group`, but no corresponding updates were made to `docs/reference.md` or `docs/environments.md`, both of which document these classes and methods. Per project rules, PRs modifying core user-facing functionality described in docs/ must update the relevant documentation.
Additional Locations (1)
Triggered by project rule: BugBot Instructions
Acknowledged — will add documentation in a follow-up once the API stabilizes through review. The feature is intentionally minimal right now (optional dict, defaults to None) so existing code is unaffected.
```python
if training_context is not None:
    self.rubric.training_context = training_context
```
Training context not propagated to child rubrics in RubricGroup
High Severity
Setting `self.rubric.training_context` only assigns to the top-level rubric. Nearly all environment types (`MultiTurnEnv`, `ToolEnv`, `SandboxEnv`, etc.) call `add_rubric()` during init, which wraps the user's rubric in a `RubricGroup`. Since `RubricGroup` doesn't propagate `training_context` to its child rubrics, any custom rubric reading `self.training_context` in `score_group` or `score_rollout` will always see `None`. This makes the feature non-functional for all standard environment types.
Additional Locations (1)
Addressed in the latest push (cd2700c):
Added a property override in `RubricGroup` that propagates `training_context` to all child rubrics on assignment. This ensures user rubrics wrapped in a `RubricGroup` (which is the standard path for `MultiTurnEnv`, `ToolEnv`, etc.) receive the context correctly.
Force-pushed from 475a603 to bf0a711
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from bf0a711 to cd2700c


Summary
Adds an optional `training_context` dict that orchestrators can pass to rubrics before scoring. This enables step-aware reward functions (curriculum learning, penalty warmup, dynamic weights) without requiring environments to maintain internal step counters or process restarts via `env-args-scheduler`.

Motivation
There is currently no way for a rubric to know the current training step. This forces environment authors to use fragile workarounds like self-incrementing counters that don't survive checkpoint resume and don't reflect the true orchestrator step. Use cases blocked by this gap include curriculum learning, penalty warmup, and dynamic reward weights.
The existing `env_args_scheduler` (PR #2207 in prime-rl) solves this by hot-reloading entire environments, which is too heavy for continuous reward parameter changes.

Design
The `training_context` is a simple `dict | None` that flows through both execution paths:

Server mode (ZMQ):
Local mode (in-process):
Usage in a custom rubric:
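The original usage snippet did not survive extraction, so here is a hedged reconstruction of what a step-aware custom rubric could look like. `WarmupRubric` and `penalty_weight` are illustrative names, not library API; the only assumption taken from the PR is that the orchestrator sets `self.training_context` to a dict like `{"step": int}` before scoring.

```python
class WarmupRubric:
    """Illustrative rubric that ramps a penalty in over training steps."""

    def __init__(self, warmup_steps: int = 100):
        self.training_context: dict | None = None  # set by the orchestrator
        self.warmup_steps = warmup_steps

    def penalty_weight(self) -> float:
        # Ramp the penalty from 0.0 to 1.0 over the first warmup_steps,
        # defaulting to 0 when no context has been provided.
        step = (self.training_context or {}).get("step", 0)
        return min(1.0, step / self.warmup_steps)
```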
Type of Change
Testing

`uv run pytest` locally.

Checklist
Changes
- `verifiers/serve/types.py`: add `training_context: dict | None = None` to `RunRolloutRequest` and `RunGroupRequest`
- `verifiers/rubrics/rubric.py`: add `self.training_context: dict | None = None` to `Rubric.__init__`
- `verifiers/serve/client/env_client.py`: forward `training_context` in `run_rollout` and `run_group`
- `verifiers/serve/server/env_worker.py`: set `self.env.rubric.training_context` before handling requests
- `verifiers/envs/environment.py`: accept `training_context` in `run_rollout`/`run_group`, set on rubric in local mode, forward to env_client in server mode
Fully backward-compatible:
- Requests with a `None` `training_context` see zero behavior change
- Rubrics that never read `self.training_context` are unaffected
- Extra request fields are tolerated (`model_config` is not strict); old clients sending requests without it work fine since the field has a default

Companion PR
The prime-rl companion PR (to populate `training_context` from the scheduler) is independent; this PR is useful standalone for:

- Callers passing `training_context` directly in `run_group` calls

Additional Notes
See also: prime-rl companion PR that populates this field from the scheduler's step counter.
Note
Medium Risk
Adds new mutable `training_context` plumbing through local and server execution paths; incorrect propagation or concurrent request handling could cause context leakage between rollouts/groups.

Overview
Introduces an optional `training_context: dict | None` that can be passed into `Environment.run_rollout`/`run_group` and carried through the ZMQ request protocol.

In local mode, the context is set on `self.rubric.training_context` before scoring; in server mode it is forwarded via `RunRolloutRequest`/`RunGroupRequest` and applied in `EnvWorker` before executing the request. Test stubs were updated to accept extra kwargs for the extended call signatures.

Reviewed by Cursor Bugbot for commit bf0a711.