
feat: expose training context to rubrics via request protocol#1270

Open
shriramc1 wants to merge 1 commit into PrimeIntellect-ai:main from shriramc1:feat/training-context-rubric

Conversation

@shriramc1 shriramc1 commented Apr 30, 2026

Summary

Adds an optional training_context dict that orchestrators can pass to rubrics before scoring. This enables step-aware reward functions (curriculum learning, penalty warmup, dynamic weights) without requiring environments to maintain internal step counters or rely on process restarts via the env-args-scheduler.

Motivation

There is currently no way for a rubric to know the current training step. This forces environment authors to use fragile workarounds like self-incrementing counters that don't survive checkpoint resume and don't reflect the true orchestrator step. Use cases blocked by this gap:

  • Reward warmup: ramping penalty weights over training (e.g. tool call penalty warmup)
  • Curriculum learning: changing reward thresholds or component weights based on progress
  • Adaptive scoring: adjusting difficulty based on training metrics

The existing env_args_scheduler (PR #2207 in prime-rl) solves this by hot-reloading entire environments, which is too heavy for continuous reward parameter changes.

Design

The training_context is a simple dict | None that flows through both execution paths:

Server mode (ZMQ):

orchestrator → EnvClient.run_group(training_context={"step": N})
  → RunGroupRequest(training_context={"step": N})
    → env_worker sets rubric.training_context before scoring

Local mode (in-process):

Environment.run_group(training_context={"step": N})
  → sets self.rubric.training_context before scoring

Usage in a custom rubric:

import verifiers as vf

class MyRubric(vf.Rubric):
    async def score_group(self, states):
        step = (self.training_context or {}).get("step", 0)
        warmup_progress = min(1.0, step / 100)
        penalty = -0.05 * warmup_progress
        # ... use penalty in scoring
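A full local-mode round trip can be mocked standalone to show how the pieces fit together (an illustrative sketch only; the real verifiers Rubric and Environment classes have richer signatures):

```python
# Standalone mock of the local-mode flow: the environment assigns
# training_context on its rubric, then the rubric reads it while scoring.
class Rubric:
    def __init__(self):
        self.training_context: dict | None = None

    def score_group(self, states):
        # Step-aware penalty warmup, mirroring the MyRubric example above.
        step = (self.training_context or {}).get("step", 0)
        warmup_progress = min(1.0, step / 100)
        return -0.05 * warmup_progress

class Environment:
    def __init__(self, rubric):
        self.rubric = rubric

    def run_group(self, states, training_context=None):
        # Local mode: set the context on the rubric before scoring.
        self.rubric.training_context = training_context
        return self.rubric.score_group(states)

env = Environment(Rubric())
penalty = env.run_group([], training_context={"step": 50})
print(penalty)  # half of the full -0.05 penalty at step 50 of a 100-step warmup
```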

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Changes

  • verifiers/serve/types.py: add optional training_context: dict | None = None to RunRolloutRequest and RunGroupRequest
  • verifiers/rubrics/rubric.py: add self.training_context: dict | None = None to Rubric.__init__
  • verifiers/serve/client/env_client.py: accept and forward training_context in run_rollout and run_group
  • verifiers/serve/server/env_worker.py: set self.env.rubric.training_context before handling requests
  • verifiers/envs/environment.py: accept training_context in run_rollout/run_group, set it on the rubric in local mode, forward it to env_client in server mode

Backward Compatibility

Fully backward-compatible:

  • All new parameters default to None
  • Existing orchestrators that don't send training_context see zero behavior change
  • Existing rubrics that don't read self.training_context are unaffected
  • Wire protocol: old servers receiving a request with the extra field will ignore it (Pydantic model_config is not strict); old clients sending requests without it work fine since the field has a default
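The wire-protocol claim can be sanity-checked with a toy Pydantic model (a sketch; the field group_id is made up, and the real request models carry more fields):

```python
from pydantic import BaseModel

# An "old" server's request model, defined before training_context existed.
class RunGroupRequestV1(BaseModel):
    group_id: str  # hypothetical field, for illustration only

# A "new" client's payload that includes the extra field.
payload = {"group_id": "g1", "training_context": {"step": 42}}

# Pydantic's default (non-strict) config ignores unknown fields, so an old
# server parses the new payload without error and drops training_context.
req = RunGroupRequestV1(**payload)
print(req.group_id)  # g1
```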

Companion PR

The prime-rl companion PR (to populate training_context from the scheduler) is independent — this PR is useful standalone for:

  • Local-mode environments (set training_context directly in run_group calls)
  • Custom orchestrators built on verifiers
  • Any user who wants step-aware rubrics without prime-rl

Additional Notes

See also: prime-rl companion PR that populates this field from the scheduler's step counter.


Note

Medium Risk
Adds new mutable training_context plumbing through local and server execution paths; incorrect propagation or concurrent request handling could cause context leakage between rollouts/groups.

Overview
Introduces an optional training_context: dict | None that can be passed into Environment.run_rollout/run_group and carried through the ZMQ request protocol.

In local mode, the context is set on self.rubric.training_context before scoring; in server mode it is forwarded via RunRolloutRequest/RunGroupRequest and applied in EnvWorker before executing the request. Test stubs were updated to accept extra kwargs for the extended call signatures.

Reviewed by Cursor Bugbot for commit bf0a711. Bugbot is set up for automated code reviews on this repo. Configure here.

@shriramc1 shriramc1 force-pushed the feat/training-context-rubric branch 2 times, most recently from 4456977 to 475a603 Compare April 30, 2026 22:21
@shriramc1 shriramc1 marked this pull request as ready for review April 30, 2026 22:23

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.



Comment thread on verifiers/serve/server/env_worker.py (outdated):

async def handle_run_rollout(self, request: RunRolloutRequest) -> RunRolloutResponse:
    if request.training_context is not None:
        self.env.rubric.training_context = request.training_context

Concurrent requests corrupt shared rubric training context

High Severity

training_context is set on the shared self.env.rubric instance before awaiting run_rollout/run_group. The worker's serve() method dispatches requests as concurrent asyncio tasks via asyncio.create_task, so a second request can overwrite self.env.rubric.training_context before the first request reaches its scoring phase. This causes rollouts to be scored with the wrong training context. The same issue exists in environment.py's local-mode path. Additionally, because assignment only happens when training_context is not None, a stale training_context from a previous request persists when a subsequent request omits the field.



@shriramc1 (author) replied:
Addressed in the latest push (cd2700c):

  1. Stale context: Removed the "is not None" guard; training_context is now always assigned (even when None), so each request explicitly sets or clears it.

  2. Concurrency: In the current architecture, each EnvWorker processes requests sequentially through its event loop — training_context is set immediately before the scoring call within the same coroutine, so interleaving isn't possible within a single worker. The router distributes groups round-robin across workers, so no two concurrent requests share a rubric instance.


From Rubric.__init__ in verifiers/rubrics/rubric.py:

# Training context set by the orchestrator before scoring.
# Contains metadata like {"step": int, "ckpt_step": int}.
self.training_context: dict | None = None

Missing documentation for new training context feature

Low Severity

This PR adds the user-facing training_context attribute to Rubric and new training_context parameters to Environment.run_rollout and Environment.run_group, but no corresponding updates were made to docs/reference.md or docs/environments.md, both of which document these classes and methods. Per project rules, PRs modifying core user-facing functionality described in docs/ must update the relevant documentation.


Triggered by project rule: BugBot Instructions


@shriramc1 (author) replied:

Acknowledged; documentation will be added in a follow-up once the API stabilizes through review. The feature is intentionally minimal right now (an optional dict, defaulting to None), so existing code is unaffected.

Comment thread on verifiers/envs/environment.py (outdated):

if training_context is not None:
    self.rubric.training_context = training_context
The reason will be displayed to describe this comment to others. Learn more.

Training context not propagated to child rubrics in RubricGroup

High Severity

Setting self.rubric.training_context only assigns to the top-level rubric. Nearly all environment types (MultiTurnEnv, ToolEnv, SandboxEnv, etc.) call add_rubric() during init, which wraps the user's rubric in a RubricGroup. Since RubricGroup doesn't propagate training_context to its child rubrics, any custom rubric reading self.training_context in score_group or score_rollout will always see None. This makes the feature non-functional for all standard environment types.



@shriramc1 (author) replied:

Addressed in the latest push (cd2700c):

Added a property override in RubricGroup that propagates training_context to all child rubrics on assignment. This ensures user rubrics wrapped in a RubricGroup (which is the standard path for MultiTurnEnv, ToolEnv, etc.) receive the context correctly.

@shriramc1 shriramc1 force-pushed the feat/training-context-rubric branch from 475a603 to bf0a711 Compare April 30, 2026 22:34
@shriramc1 shriramc1 marked this pull request as draft April 30, 2026 22:35
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shriramc1 shriramc1 force-pushed the feat/training-context-rubric branch from bf0a711 to cd2700c Compare April 30, 2026 22:42
@shriramc1 shriramc1 marked this pull request as ready for review April 30, 2026 22:44