
Filter zero-advantage samples in convert_samples_to_train_data#1901

Open
nanjiangwill wants to merge 1 commit into main from filter-zero-reward

Conversation

@nanjiangwill
Collaborator

@nanjiangwill nanjiangwill commented May 11, 2026

Summary

In _convert_samples_to_train_data, after _post_process_rewards, drop samples whose post-processed reward is 0. Limited to advantage_estimator in {grpo, gspo} (these compute per-token advantage as a scalar broadcast of rewards, so r==0 ⇒ zero gradient; ppo/reinforce_plus_plus mix in values/kl/GAE so this isn't safe there).
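The filtering rule described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name `filter_zero_reward` and the plain-list inputs are assumptions made for the example, and only the estimator names come from the description.

```python
from typing import Any, List, Sequence, Tuple

# Estimators the PR description lists as safe to filter, because the
# per-token advantage is a scalar broadcast of the post-processed reward.
SCALAR_BROADCAST_ESTIMATORS = {"grpo", "gspo"}


def filter_zero_reward(
    samples: Sequence[Any],
    rewards: Sequence[float],
    advantage_estimator: str,
) -> Tuple[List[Any], List[float]]:
    """Hypothetical helper: drop samples whose post-processed reward is 0."""
    if advantage_estimator not in SCALAR_BROADCAST_ESTIMATORS:
        # ppo / reinforce_plus_plus mix in values, KL, or GAE, so a zero
        # reward does not imply a zero gradient; keep everything.
        return list(samples), list(rewards)
    kept = [(s, r) for s, r in zip(samples, rewards) if r != 0]
    return [s for s, _ in kept], [r for _, r in kept]


samples = ["a", "b", "c", "d"]
rewards = [0.0, 0.5, -0.5, 0.0]
kept_samples, kept_rewards = filter_zero_reward(samples, rewards, "grpo")
unfiltered_samples, _ = filter_zero_reward(samples, rewards, "ppo")
```

With `grpo`, only the two samples with non-zero reward survive; with `ppo`, the batch is returned unchanged.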

Caveat: the semantics of some rollout logging metrics (e.g. raw_reward) change, because the denominator is now the filtered size rather than the original size. This is wrong and needs a further refactor so that rollout logging happens before entering the training stage.
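A small worked example of the denominator shift, with made-up numbers. It assumes post-processing is plain per-group mean-centering, which is a stand-in for `_post_process_rewards` and may differ from what that function actually does (for instance, it may also normalize by the standard deviation).

```python
from statistics import mean

# Hypothetical rollout: two prompts, two samples each.
raw_rewards = [1.0, 1.0, 1.0, 0.0]
group_size = 2


def center_by_group(raw, group_size):
    """Assumed stand-in for post-processing: subtract each group's mean."""
    out = []
    for start in range(0, len(raw), group_size):
        group = raw[start:start + group_size]
        mu = mean(group)
        out.extend(r - mu for r in group)
    return out


rewards = center_by_group(raw_rewards, group_size)

# Filtering on post-processed reward removes the whole first group,
# whose samples all scored the same.
kept_raw = [raw for raw, r in zip(raw_rewards, rewards) if r != 0]

mean_before = mean(raw_rewards)  # averaged over all 4 samples
mean_after = mean(kept_raw)      # averaged over the 2 surviving samples
```

Here the logged raw_reward mean would drop from 0.75 to 0.5 even though the rollout itself is unchanged, which is the distortion the caveat refers to.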

@nanjiangwill nanjiangwill changed the title from "Filter zero-advantage samples; split rollout/train logging boundary" to "filter zero-advantage samples; split rollout/train logging boundary" May 11, 2026
@zhuzilin zhuzilin requested a review from Copilot May 11, 2026 08:35
Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes rollout→train data flow by (1) adding an option to drop zero-advantage samples (with padding back to dp_size when needed) and (2) moving rollout-derived aggregate metrics (raw_reward, rewards, response_lengths, total_lengths) to be logged on the rollout side so each W&B key has a single writer. It also updates the plugin hook contract so custom convert_samples_to_train_data implementations receive (samples, raw_rewards, rewards).

Changes:

  • Add --filter-zero-advantage-samples (requires --use-dynamic-global-batch-size) and apply filtering/padding before conversion to train data.
  • Split rollout vs train-side logging responsibility by logging reward/length aggregates in RolloutManager._log_rollout_data and skipping them in Megatron-side rollout logging.
  • Update custom convert hook signature and its contract test to accept (args, samples, raw_rewards, rewards).
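The flag dependency in the first bullet could be enforced along these lines. This is a sketch under assumptions: both flags are modeled as `store_true` options, and the `validate` helper is hypothetical; the actual validation in slime/utils/arguments.py may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Assumption: both flags are simple boolean switches.
    parser = argparse.ArgumentParser()
    parser.add_argument("--filter-zero-advantage-samples", action="store_true")
    parser.add_argument("--use-dynamic-global-batch-size", action="store_true")
    return parser


def validate(parser: argparse.ArgumentParser, args: argparse.Namespace) -> argparse.Namespace:
    """Hypothetical check: filtering changes the batch size per step,
    so it only makes sense with a dynamic global batch size."""
    if args.filter_zero_advantage_samples and not args.use_dynamic_global_batch_size:
        # parser.error prints a usage message and raises SystemExit.
        parser.error(
            "--filter-zero-advantage-samples requires --use-dynamic-global-batch-size"
        )
    return args


parser = build_parser()
args = validate(
    parser,
    parser.parse_args(
        ["--filter-zero-advantage-samples", "--use-dynamic-global-batch-size"]
    ),
)
```

Passing `--filter-zero-advantage-samples` alone would exit with a usage error instead of silently training with a fixed batch size.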

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/plugin_contracts/test_plugin_runtime_hook_contracts.py | Updates the plugin contract test for the breaking hook signature change (convert hook now receives rewards inputs). |
| slime/utils/arguments.py | Adds the CLI flag + validation for zero-advantage filtering; updates help text for convert hook signature. |
| slime/ray/rollout.py | Computes rewards earlier, adds zero-advantage filtering/padding, refactors conversion signature, and moves rollout aggregates into rollout-side logging. |
| slime/backends/megatron_utils/data.py | Prevents duplicate W&B writers by skipping rollout-source aggregate keys that are now logged in slime/ray/rollout.py. |


Comment thread slime/ray/rollout.py Outdated
Comment thread slime/ray/rollout.py Outdated
@nanjiangwill nanjiangwill changed the title from "filter zero-advantage samples; split rollout/train logging boundary" to "Neutralize zero-advantage samples; split rollout/train logging boundary" May 11, 2026
@nanjiangwill nanjiangwill changed the title from "Neutralize zero-advantage samples; split rollout/train logging boundary" to "Neutralize zero-advantage samples to skip wasted forward compute" May 11, 2026
@nanjiangwill nanjiangwill changed the title from "Neutralize zero-advantage samples to skip wasted forward compute" to "Filter zero-advantage samples in convert_samples_to_train_data" May 26, 2026