Add per-step reward field to Action and Observation schemas #183
Open
henryjcee wants to merge 1 commit into neulab:main
Conversation
Adds an optional `reward: float | None` field to the base `Action` and `Observation` classes, enabling RL training data to carry per-step reward signals. All six concrete action/observation types inherit the field. Existing datasets are unaffected, as the field defaults to `None`.
Summary
- Adds a `reward: float | None` field to the base `Action` class (`schema/action/action.py`), inherited by `ApiAction`, `CodeAction`, and `MessageAction`
- Adds a `reward: float | None` field to the base `Observation` class (`schema/observation/observation.py`), inherited by `TextObservation`, `WebObservation`, and `ImageObservation`
- Updates `schema/SCHEMA.md` to document the new field on both base classes

Motivation
ADP currently has no mechanism to attach reward signals to individual trajectory steps. This makes it difficult to use ADP-formatted data for reinforcement learning, where per-step rewards are a core primitive.
This change adds `reward` as a first-class optional field on every action and observation, allowing datasets to record the reward received at each step of a trajectory. Some RL settings deliver reward alongside an observation, while others assign it at action time; this change supports both approaches.

Design notes
- `reward` defaults to `None`, so the change is fully backwards-compatible; all existing `sample_std.json` files validate without modification
- `reward` is a plain `float` scalar (not a distribution or vector), keeping the schema simple and composable

Tests
`pytest tests/test_standardized_schemas.py`: all 33 datasets pass.

I don't think this change requires new tests, but I'm happy to add some if useful.