Fix routing replay split sizes for attention #721

Merged
vivekkalyan merged 2 commits into main from fix/routing-replay-split-sizes
Jun 8, 2026

Conversation

@vivekkalyan

Collaborator

Summary

  • preserve the full attention token layout when recording routing replay token UID sets
  • keep the compacted GDN token UID layout only for GDN replay
  • add a unit test that covers the differing attention-vs-GDN replay shapes

Why

The Megatron routing replay path could fail with the error "split_with_sizes expects split_sizes to sum exactly ..." when attention replay used the compacted GDN token UID sets. Attention replay needs the original flattened token layout, whereas GDN replay uses the compact routed-token layout, so the two token UID sets have to be recorded separately.
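The mismatch can be illustrated with a minimal pure-Python sketch. It is not the repository's code: split_with_sizes below is a stand-in that only mimics the sum check behind the error message, and the token UIDs, the PAD marker, and the split sizes are made-up values. It also assumes that the compacted layout is the full layout with padded positions dropped, which this PR does not spell out.

```python
from typing import List, Sequence

PAD = -1  # hypothetical marker for padded positions (assumption)


def split_with_sizes(items: Sequence[int], split_sizes: Sequence[int]) -> List[List[int]]:
    """Pure-Python stand-in: split `items` into chunks of the given sizes.

    Raises ValueError when the sizes do not sum to the input length,
    mirroring the kind of check that produced the crash.
    """
    if sum(split_sizes) != len(items):
        raise ValueError(
            f"split_sizes sum to {sum(split_sizes)} but input has {len(items)} items"
        )
    chunks, start = [], 0
    for size in split_sizes:
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


# Original flattened layout: two packed rows of 4 slots, padding included.
full_uids = [10, 11, 12, PAD, 20, 21, PAD, PAD]
# Compacted layout (assumed): padded positions dropped.
compact_uids = [uid for uid in full_uids if uid != PAD]

# Attention-side split sizes follow the full layout.
attention_split_sizes = [4, 4]

# Splitting the full layout works.
full_chunks = split_with_sizes(full_uids, attention_split_sizes)

# Splitting the compacted layout with the same sizes fails the sum check.
try:
    split_with_sizes(compact_uids, attention_split_sizes)
    compact_error = None
except ValueError as exc:
    compact_error = str(exc)
```

Under these assumptions, the attention split succeeds only when it receives the full layout, which is why the fix records the full layout for attention and keeps the compacted one for GDN alone.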

Validation

  • uv run --with torch --with safetensors --with megatron-core==0.17.0 --with transformers==5.2.0 --group dev pytest tests/unit/test_moe_routing_replay.py tests/unit/test_dedicated_config.py
  • Sky 2x H200 Bonnie Megatron repro against this fix completed 1 training step without the split-size crash
  • Stacked LoRA PR smoke run against this branch also completed 1 Megatron training step successfully

@FurtherAI

Collaborator

Looks good. This seems to be a misconception from Codex, and the tests did not catch it because they fill the packed sequence completely and therefore have no padding.
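A small sketch of why a fully packed test input would hide the problem, under the same assumption that the compacted layout simply drops padded positions. The helper names, the PAD marker, and the sample UIDs are hypothetical and do not come from the test suite.

```python
from typing import List

PAD = -1  # hypothetical marker for padded positions (assumption)


def compact(uids: List[int]) -> List[int]:
    """Drop padded positions, as assumed for the compacted layout."""
    return [uid for uid in uids if uid != PAD]


def layouts_differ(full_uids: List[int]) -> bool:
    """True when the compacted layout is shorter than the full one."""
    return len(compact(full_uids)) != len(full_uids)


# No padding: both layouts have the same length, so the bug stays hidden.
fully_packed = [1, 2, 3, 4, 5, 6, 7, 8]
# With padding: the layouts diverge and a split-size check can fail.
with_padding = [1, 2, 3, PAD, 5, 6, PAD, PAD]

hidden = layouts_differ(fully_packed)
exposed = layouts_differ(with_padding)
```

A regression test along these lines would need at least one padded position in its input so that the attention and GDN shapes actually differ.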

@vivekkalyan vivekkalyan force-pushed the fix/routing-replay-split-sizes branch from 4834baf to 85081af on June 8, 2026 18:08
@vivekkalyan vivekkalyan merged commit f8eaa6d into main Jun 8, 2026
5 checks passed

2 participants