build(docker): bump Megatron-LM 0.12.1 -> 0.13.1 to fix count_zeros wasted work #375
JacoCheung wants to merge 1 commit into NVIDIA:main
Conversation
core_v0.12.1 ships a ChainedOptimizer.step() that calls
self.count_zeros() unconditionally. count_zeros_fp32 then Python-loops
over every dense parameter doing grad.numel() - torch.count_nonzero(grad)
(5 CUDA kernels per param, ~360 launches per step on HSTU bf16) even
when log_num_zeros_in_grad is False and the result is discarded.
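The wasted loop can be sketched in pure Python (lists stand in for gradient tensors so the sketch runs without torch; the function name mirrors Megatron's `count_zeros_fp32` but the body is an illustrative emulation, and the 5-launches-per-param figure comes from the PR description):

```python
# Sketch of the pre-0.13 behavior: count_zeros_fp32 Python-loops over every
# dense parameter, doing grad.numel() - torch.count_nonzero(grad) per param.
# Plain lists emulate grads here; this is not the actual Megatron code.

def count_zeros_fp32(grads):
    """Per-parameter zero count: numel - count_nonzero for each grad."""
    total_zeros = 0
    kernel_launches = 0
    for grad in grads:                               # one Python iteration per parameter
        numel = len(grad)                            # grad.numel()
        nonzero = sum(1 for x in grad if x != 0.0)   # torch.count_nonzero(grad)
        total_zeros += numel - nonzero
        kernel_launches += 5                         # ~5 CUDA launches per param (per the PR)
    return total_zeros, kernel_launches

# 72 dense params x 5 launches = 360 launches per step, matching the PR's number.
grads = [[0.0, 1.5, 0.0], [2.0, 0.0]] + [[1.0]] * 70
zeros, launches = count_zeros_fp32(grads)
print(zeros, launches)  # 3 360
```

All of this work is thrown away when `log_num_zeros_in_grad` is False, which is the failure mode the bump fixes.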
Megatron-LM commit d9608004f ("Add an option to skip counting zeros
in grad of ChainedOptimizer", 2025-06-04) added the missing
log_num_zeros_in_grad gate at the call site. Earliest tag containing
the fix is core_v0.13.0; using core_v0.13.1 (latest 0.13.x patch) for
stability.
Expected effect after the bump (no source changes in this repo): the
~4 ms / step count_zeros loop in ## optimizer step ## NVTX disappears
because HSTU never sets log_num_zeros_in_grad, so step() short-circuits
to None.
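The shape of the fix can be sketched as follows (a simplified stand-in for Megatron's ChainedOptimizer, not its real code; the class body and instrumentation counter are illustrative):

```python
# Sketch of the post-d9608004f call site: step() only pays for the zero count
# when log_num_zeros_in_grad is set; otherwise it short-circuits to None.

class ChainedOptimizerSketch:
    def __init__(self, log_num_zeros_in_grad: bool):
        self.log_num_zeros_in_grad = log_num_zeros_in_grad
        self.count_zeros_calls = 0      # instrumentation for this sketch only

    def count_zeros(self):
        self.count_zeros_calls += 1     # would launch ~360 kernels in 0.12.1
        return 0

    def step(self):
        # The fix in one line: gate the expensive call on the logging flag.
        num_zeros_in_grad = self.count_zeros() if self.log_num_zeros_in_grad else None
        return num_zeros_in_grad

opt = ChainedOptimizerSketch(log_num_zeros_in_grad=False)  # HSTU default
print(opt.step(), opt.count_zeros_calls)  # None 0 -- the loop is never entered
```

With the flag False (the HSTU default), the counting path is never entered, which is exactly the expected disappearance of the NVTX range.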
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/build devel
Greptile Summary: This PR bumps the Megatron-LM dependency in docker/Dockerfile.
Confidence Score: 4/5. Safe to merge — the change is a well-justified, narrow version bump with detailed benchmarks and a clear test plan. Only P2 findings (mutable tag pin); no logic errors, no security concerns, and the PR description provides strong evidence of correctness and performance improvement.
Important Files Changed: docker/Dockerfile — mutable git tag pin; otherwise no files require special attention.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["docker build"] --> B["Layer 2: Megatron-LM\ngit clone -b core_v0.13.1"]
    B --> C["pip install --no-deps -e ./megatron-lm"]
    C --> D{"ChainedOptimizer.step"}
    D -->|"0.12.1 (old)"| E["count_zeros_fp32 called unconditionally\n~360 CUDA kernels / step, ~4 ms wasted"]
    D -->|"0.13.1 (new)"| F{"log_num_zeros_in_grad?"}
    F -->|"True"| G["count_zeros_fp32 called"]
    F -->|"False (HSTU default)"| H["skipped — 0 wasted kernels"]
```
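The Dockerfile change itself is a one-line tag bump; a hypothetical fragment of what the relevant layer could look like (the actual clone URL, target path, and install flags in this repo's docker/Dockerfile may differ):

```dockerfile
# Hypothetical sketch, not the repo's actual Dockerfile lines.
# Before: -b core_v0.12.1
RUN git clone --depth 1 -b core_v0.13.1 \
        https://github.com/NVIDIA/Megatron-LM.git /opt/megatron-lm \
    && pip install --no-deps -e /opt/megatron-lm
```

The Greptile P2 finding applies here: `core_v0.13.1` is a mutable git tag, so the layer is reproducible only as long as upstream never moves the tag; pinning a commit SHA would be stricter.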
Reviews (1): Last reviewed commit: "build(docker): bump Megatron-LM core_v0...."
❌ Pipeline #49886692 -- failed
Result: 8/17 jobs passed

/build

❌ Pipeline #49911823 -- failed
Result: 3/17 jobs passed

/build

❌ Pipeline #50621233 -- failed
Result: 3/17 jobs passed

/build devel

❌ Pipeline #50631678 -- failed
Result: 1/17 jobs passed
Summary

- Bumps the Megatron-LM pin in docker/Dockerfile from core_v0.12.1 to core_v0.13.1.
- core_v0.12.1 ships a ChainedOptimizer.step() that calls self.count_zeros() unconditionally; on HSTU bf16 (single inner Float16Optimizer) this triggers count_zeros_fp32's per-parameter Python loop every step (~360 CUDA kernel launches, ~4 ms wall) even though log_num_zeros_in_grad=False and the result is discarded.
- Megatron-LM commit d9608004f ("Add an option to skip counting zeros in grad of ChainedOptimizer") gates the call. First release tag containing it is core_v0.13.0; this PR pins core_v0.13.1 (latest 0.13.x patch).

Expected runtime effect (no source changes in this repo)

- Measured on the existing 0.12.1 image with HSTU exp4_tp (2-node EOS, 50 profiled steps, rank 0): the wasted work shows up in ## optimizer step ## wall time.
- The fix is bit-exact (no parameter math change; log_num_zeros_in_grad=False means the count was already going into a return slot nobody reads).

Test plan

- /build devel to build a new devel_latest image with the bumped Megatron pin.
- Verify count_zeros_fp32 is gone from ## optimizer step ## NVTX.
- Compare against the 0.12.1 baseline outside the optimizer phase.

🤖 Generated with Claude Code