Filter zero-advantage samples in convert_samples_to_train_data#1901
Open
nanjiangwill wants to merge 1 commit into
Open
Filter zero-advantage samples in convert_samples_to_train_data#1901nanjiangwill wants to merge 1 commit into
nanjiangwill wants to merge 1 commit into
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR optimizes rollout→train data flow by (1) adding an option to drop zero-advantage samples (with padding back to dp_size when needed) and (2) moving rollout-derived aggregate metrics (raw_reward, rewards, response_lengths, total_lengths) to be logged on the rollout side so each W&B key has a single writer. It also updates the plugin hook contract so custom convert_samples_to_train_data implementations receive (samples, raw_rewards, rewards).
Changes:
- Add
--filter-zero-advantage-samples(requires--use-dynamic-global-batch-size) and apply filtering/padding before conversion to train data. - Split rollout vs train-side logging responsibility by logging reward/length aggregates in
RolloutManager._log_rollout_dataand skipping them in Megatron-side rollout logging. - Update custom convert hook signature and its contract test to accept
(args, samples, raw_rewards, rewards).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/plugin_contracts/test_plugin_runtime_hook_contracts.py | Updates the plugin contract test for the breaking hook signature change (convert hook now receives rewards inputs). |
| slime/utils/arguments.py | Adds the CLI flag + validation for zero-advantage filtering; updates help text for convert hook signature. |
| slime/ray/rollout.py | Computes rewards earlier, adds zero-advantage filtering/padding, refactors conversion signature, and moves rollout aggregates into rollout-side logging. |
| slime/backends/megatron_utils/data.py | Prevents duplicate W&B writers by skipping rollout-source aggregate keys that are now logged in slime/ray/rollout.py. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
1b9acf0 to
d2f22e0
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
In
_convert_samples_to_train_data, after_post_process_rewards, drop samples whose post-processed reward is 0. Limited toadvantage_estimator in {grpo, gspo}(these compute per-token advantage as a scalar broadcast ofrewards, so r==0 ⇒ zero gradient; ppo/reinforce_plus_plus mix in values/kl/GAE so this isn't safe there).Caveat: some rollout loggings(e.g.
raw_reward) semantics got changed, the denominator is the filtered size not original size. this is wrong and need further refactor to make rollout loggings happened before entering training stage.