Filter zero-advantage samples in convert_samples_to_train_data by nanjiangwill · Pull Request #1901 · THUDM/slime

nanjiangwill · 2026-05-11T04:34:14Z

Summary

In _convert_samples_to_train_data, after _post_process_rewards, drop samples whose post-processed reward is 0. This is limited to advantage_estimator in {grpo, gspo}: these compute the per-token advantage as the sample's scalar reward broadcast over its tokens, so r == 0 ⇒ zero gradient. ppo and reinforce_plus_plus mix in values, KL, or GAE, so dropping samples is not safe there.
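The filtering step can be sketched as follows. This is a minimal illustration, not slime's code: `Sample`, `filter_zero_advantage`, and the assumption that post-processed rewards arrive as a list aligned with `samples` are all hypothetical. The comparison is exact (`r != 0`), matching the description above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Estimators whose per-token advantage is the sample's scalar reward
# broadcast over its response tokens (per the PR description).
SCALAR_ADVANTAGE_ESTIMATORS = {"grpo", "gspo"}


@dataclass
class Sample:
    # Hypothetical stand-in for a rollout sample.
    prompt: str
    response: str


def filter_zero_advantage(
    samples: List[Sample],
    rewards: List[float],
    advantage_estimator: str,
) -> Tuple[List[Sample], List[float]]:
    """Drop samples whose post-processed reward is exactly 0.

    Only applied for estimators where reward == 0 implies zero advantage;
    for any other estimator the inputs are returned unchanged.
    """
    if len(samples) != len(rewards):
        raise ValueError("samples and rewards must be aligned")
    if advantage_estimator not in SCALAR_ADVANTAGE_ESTIMATORS:
        return samples, rewards
    kept = [(s, r) for s, r in zip(samples, rewards) if r != 0]
    return [s for s, _ in kept], [r for _, r in kept]
```

For ppo or reinforce_plus_plus the function returns its inputs unchanged, which mirrors the restriction above.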

Caveat: this changes the semantics of some rollout logs (e.g. raw_reward): the denominator is now the filtered sample count rather than the original one. This is wrong and needs a follow-up refactor so that rollout logging happens before the data enters the training stage.
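To make the caveat concrete, here is a small worked sketch. It assumes a simplified GRPO-style post-processing step that only subtracts each group's mean (no std scaling), via a hypothetical `center_per_group` helper; slime's actual normalization may differ. A group in which every response gets the same raw reward ends up with all-zero rewards, is filtered out, and silently drops out of the `raw_reward` average.

```python
from statistics import mean
from typing import List


def center_per_group(raw: List[float], group_size: int) -> List[float]:
    # Simplified GRPO-style post-processing (assumption): subtract each
    # group's mean reward; std scaling is omitted for clarity.
    out: List[float] = []
    for i in range(0, len(raw), group_size):
        group = raw[i : i + group_size]
        m = mean(group)
        out.extend(r - m for r in group)
    return out


raw_rewards = [1.0, 1.0, 1.0, 1.0,  # group A: all correct -> all-zero rewards
               1.0, 0.0, 1.0, 0.0]  # group B: mixed -> non-zero rewards
rewards = center_per_group(raw_rewards, group_size=4)

# Aggregate over the full rollout (what the metric is meant to report).
full_mean = mean(raw_rewards)

# Aggregate after filtering: group A is gone, so the mean is biased.
kept_raw = [raw for raw, r in zip(raw_rewards, rewards) if r != 0]
filtered_mean = mean(kept_raw)

print(full_mean, filtered_mean)  # prints: 0.75 0.5
```

Computing the rollout aggregates before the filter runs, as the caveat proposes, keeps the denominator at the original sample count (0.75 here rather than 0.5).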

Copilot

Pull request overview

This PR optimizes rollout→train data flow by (1) adding an option to drop zero-advantage samples (with padding back to dp_size when needed) and (2) moving rollout-derived aggregate metrics (raw_reward, rewards, response_lengths, total_lengths) to be logged on the rollout side so each W&B key has a single writer. It also updates the plugin hook contract so custom convert_samples_to_train_data implementations receive (args, samples, raw_rewards, rewards).
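For plugin authors, a custom hook under the new contract might look like the sketch below. Only the argument order `(args, samples, raw_rewards, rewards)` comes from this PR; the function name, the `tokens` attribute on samples, and the keys of the returned dict are hypothetical placeholders, not slime's real train-data schema.

```python
from argparse import Namespace
from types import SimpleNamespace
from typing import Any, Dict, List


def my_convert_samples_to_train_data(
    args: Namespace,
    samples: List[Any],
    raw_rewards: List[float],
    rewards: List[float],
) -> Dict[str, List[Any]]:
    """Hypothetical custom convert hook using the new argument order."""
    if not (len(samples) == len(raw_rewards) == len(rewards)):
        raise ValueError("samples, raw_rewards and rewards must be aligned")
    # Output keys are illustrative placeholders, not slime's real schema.
    return {
        "tokens": [s.tokens for s in samples],
        "raw_reward": list(raw_rewards),
        "rewards": list(rewards),
    }


# Example call with stand-in samples that only carry a token list.
example = my_convert_samples_to_train_data(
    Namespace(),
    [SimpleNamespace(tokens=[1, 2]), SimpleNamespace(tokens=[3])],
    raw_rewards=[1.0, 0.0],
    rewards=[0.5, -0.5],
)
```

Receiving both `raw_rewards` and `rewards` lets a hook log or store the unprocessed scores alongside the post-processed values used for training.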

Changes:

Add --filter-zero-advantage-samples (requires --use-dynamic-global-batch-size) and apply filtering/padding before conversion to train data.
Split rollout vs train-side logging responsibility by logging reward/length aggregates in RolloutManager._log_rollout_data and skipping them in Megatron-side rollout logging.
Update custom convert hook signature and its contract test to accept (args, samples, raw_rewards, rewards).
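The flag check and the pad-back step might be sketched like this. The flag names come from this PR, but `validate_args`, `pad_to_dp_multiple`, and the padding strategy (re-adding some of the dropped zero-reward samples until the count is a positive multiple of `dp_size`) are assumptions for illustration; the PR's actual padding logic may differ.

```python
from argparse import Namespace
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def validate_args(args: Namespace) -> None:
    # Hypothetical check mirroring the stated constraint: filtering changes
    # the number of samples per rollout, so the global batch size must vary.
    if args.filter_zero_advantage_samples and not args.use_dynamic_global_batch_size:
        raise ValueError(
            "--filter-zero-advantage-samples requires --use-dynamic-global-batch-size"
        )


def pad_to_dp_multiple(kept: List[T], dropped: Sequence[T], dp_size: int) -> List[T]:
    """Re-add dropped (zero-reward) samples until len(kept) is a positive
    multiple of dp_size, so every data-parallel rank gets the same count.
    The padding samples have zero advantage under grpo/gspo."""
    if dp_size <= 0:
        raise ValueError("dp_size must be positive")
    # Round up to the next multiple of dp_size, but never below dp_size.
    target = max(dp_size, -(-len(kept) // dp_size) * dp_size)
    need = target - len(kept)
    if need > len(dropped):
        raise ValueError("not enough dropped samples to pad to dp_size")
    return kept + list(dropped[:need])
```

Re-using dropped samples as padding keeps every data-parallel rank busy without injecting non-zero advantages, at the cost of some compute spent on the padded samples.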

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/plugin_contracts/test_plugin_runtime_hook_contracts.py | Updates the plugin contract test for the breaking hook signature change (convert hook now receives rewards inputs). |
| slime/utils/arguments.py | Adds the CLI flag + validation for zero-advantage filtering; updates help text for convert hook signature. |
| slime/ray/rollout.py | Computes rewards earlier, adds zero-advantage filtering/padding, refactors conversion signature, and moves rollout aggregates into rollout-side logging. |
| slime/backends/megatron_utils/data.py | Prevents duplicate W&B writers by skipping rollout-source aggregate keys that are now logged in `slime/ray/rollout.py`. |
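The single-writer rule on the Megatron side amounts to a key filter. In this sketch the set of rollout-owned keys is taken from the overview above, while `train_side_metrics`, the `rollout/` prefix, and plain-mean aggregation are hypothetical choices for illustration.

```python
from typing import Dict, Iterable

# Keys the overview says are now logged by RolloutManager._log_rollout_data.
ROLLOUT_OWNED_KEYS = frozenset(
    {"raw_reward", "rewards", "response_lengths", "total_lengths"}
)


def train_side_metrics(rollout_data: Dict[str, Iterable[float]]) -> Dict[str, float]:
    """Hypothetical train-side aggregation that skips keys already logged by
    the rollout side, so each W&B key has exactly one writer."""
    metrics: Dict[str, float] = {}
    for key, values in rollout_data.items():
        if key in ROLLOUT_OWNED_KEYS:
            continue  # owned by the rollout side; do not write it again
        values = list(values)
        if values:
            metrics[f"rollout/{key}"] = sum(values) / len(values)
    return metrics
```

Keeping the owned-key set in one place makes it easy to audit which side writes each metric.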


nanjiangwill changed the title ~~Filter zero-advantage samples; split rollout/train logging boundary~~ filter zero-advantage samples; split rollout/train logging boundary May 11, 2026

zhuzilin requested a review from Copilot May 11, 2026 08:35

Copilot started reviewing on behalf of zhuzilin May 11, 2026 08:35

Copilot AI reviewed May 11, 2026


Comment thread slime/ray/rollout.py Outdated

Comment thread slime/ray/rollout.py Outdated

nanjiangwill changed the title ~~filter zero-advantage samples; split rollout/train logging boundary~~ Neutralize zero-advantage samples; split rollout/train logging boundary May 11, 2026

nanjiangwill changed the title ~~Neutralize zero-advantage samples; split rollout/train logging boundary~~ Neutralize zero-advantage samples to skip wasted forward compute May 11, 2026

nanjiangwill changed the title ~~Neutralize zero-advantage samples to skip wasted forward compute~~ Filter zero-advantage samples in convert_samples_to_train_data May 26, 2026

filter zero reward

d2f22e0

nanjiangwill force-pushed the filter-zero-reward branch from 1b9acf0 to d2f22e0 May 26, 2026 18:39

nanjiangwill wants to merge 1 commit into main from filter-zero-reward

2 participants

Conversation

nanjiangwill commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nanjiangwill commented May 11, 2026 •

edited

Loading