build(docker): bump Megatron-LM 0.12.1 -> 0.13.1 to fix count_zeros wasted work #375
JacoCheung wants to merge 1 commit into NVIDIA:main
Conversation
core_v0.12.1 ships a ChainedOptimizer.step() that calls
self.count_zeros() unconditionally. count_zeros_fp32 then Python-loops
over every dense parameter doing grad.numel() - torch.count_nonzero(grad)
(5 CUDA kernels per param, ~360 launches per step on HSTU bf16) even
when log_num_zeros_in_grad is False and the result is discarded.
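The wasted loop can be sketched in pure Python (lists stand in for gradient tensors so the sketch runs without torch; the function name mirrors Megatron's `count_zeros_fp32` but the body is an illustrative emulation, and the 5-launches-per-param figure comes from the PR description):

```python
# Sketch of the pre-0.13 behavior: count_zeros_fp32 Python-loops over every
# dense parameter, doing grad.numel() - torch.count_nonzero(grad) per param.
# Plain lists emulate grads here; this is not the actual Megatron code.

def count_zeros_fp32(grads):
    """Per-parameter zero count: numel - count_nonzero for each grad."""
    total_zeros = 0
    kernel_launches = 0
    for grad in grads:                               # one Python iteration per parameter
        numel = len(grad)                            # grad.numel()
        nonzero = sum(1 for x in grad if x != 0.0)   # torch.count_nonzero(grad)
        total_zeros += numel - nonzero
        kernel_launches += 5                         # ~5 CUDA launches per param (per the PR)
    return total_zeros, kernel_launches

# 72 dense params x 5 launches = 360 launches per step, matching the PR's number.
grads = [[0.0, 1.5, 0.0], [2.0, 0.0]] + [[1.0]] * 70
zeros, launches = count_zeros_fp32(grads)
print(zeros, launches)  # 3 360
```

All of this work is thrown away when `log_num_zeros_in_grad` is False, which is the failure mode the bump fixes.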
Megatron-LM commit d9608004f ("Add an option to skip counting zeros
in grad of ChainedOptimizer", 2025-06-04) added the missing
log_num_zeros_in_grad gate at the call site. Earliest tag containing
the fix is core_v0.13.0; using core_v0.13.1 (latest 0.13.x patch) for
stability.
Expected effect after the bump (no source changes in this repo): the
~4 ms / step count_zeros loop in ## optimizer step ## NVTX disappears
because HSTU never sets log_num_zeros_in_grad, so step() short-circuits
to None.
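The shape of the fix can be sketched as follows (a simplified stand-in for Megatron's ChainedOptimizer, not its real code; the class body and instrumentation counter are illustrative):

```python
# Sketch of the post-d9608004f call site: step() only pays for the zero count
# when log_num_zeros_in_grad is set; otherwise it short-circuits to None.

class ChainedOptimizerSketch:
    def __init__(self, log_num_zeros_in_grad: bool):
        self.log_num_zeros_in_grad = log_num_zeros_in_grad
        self.count_zeros_calls = 0      # instrumentation for this sketch only

    def count_zeros(self):
        self.count_zeros_calls += 1     # would launch ~360 kernels in 0.12.1
        return 0

    def step(self):
        # The fix in one line: gate the expensive call on the logging flag.
        num_zeros_in_grad = self.count_zeros() if self.log_num_zeros_in_grad else None
        return num_zeros_in_grad

opt = ChainedOptimizerSketch(log_num_zeros_in_grad=False)  # HSTU default
print(opt.step(), opt.count_zeros_calls)  # None 0 -- the loop is never entered
```

With the flag False (the HSTU default), the counting path is never entered, which is exactly the expected disappearance of the NVTX range.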
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/build devel
Greptile Summary: This PR bumps the Megatron-LM dependency in docker/Dockerfile.
Confidence Score: 4/5. Safe to merge — the change is a well-justified, narrow version bump with detailed benchmarks and a clear test plan. Only P2 findings (mutable tag pin); no logic errors, no security concerns, and the PR description provides strong evidence of correctness and performance improvement.
Important Files Changed: docker/Dockerfile — mutable git tag pin; otherwise no files require special attention.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["docker build"] --> B["Layer 2: Megatron-LM\ngit clone -b core_v0.13.1"]
    B --> C["pip install --no-deps -e ./megatron-lm"]
    C --> D{"ChainedOptimizer.step"}
    D -->|"0.12.1 (old)"| E["count_zeros_fp32 called unconditionally\n~360 CUDA kernels / step, ~4 ms wasted"]
    D -->|"0.13.1 (new)"| F{"log_num_zeros_in_grad?"}
    F -->|"True"| G["count_zeros_fp32 called"]
    F -->|"False (HSTU default)"| H["skipped — 0 wasted kernels"]
```
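The Dockerfile change itself is a one-line tag bump; a hypothetical fragment of what the relevant layer could look like (the actual clone URL, target path, and install flags in this repo's docker/Dockerfile may differ):

```dockerfile
# Hypothetical sketch, not the repo's actual Dockerfile lines.
# Before: -b core_v0.12.1
RUN git clone --depth 1 -b core_v0.13.1 \
        https://github.com/NVIDIA/Megatron-LM.git /opt/megatron-lm \
    && pip install --no-deps -e /opt/megatron-lm
```

The Greptile P2 finding applies here: `core_v0.13.1` is a mutable git tag, so the layer is reproducible only as long as upstream never moves the tag; pinning a commit SHA would be stricter.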
Reviews (1): Last reviewed commit: "build(docker): bump Megatron-LM core_v0...."
❌ Pipeline #49886692 -- failed
Result: 8/17 jobs passed

/build

❌ Pipeline #49911823 -- failed
Result: 3/17 jobs passed

/build

❌ Pipeline #50621233 -- failed
Result: 3/17 jobs passed

/build devel

❌ Pipeline #50631678 -- failed
Result: 1/17 jobs passed
Summary

- Bumps the Megatron-LM pin in docker/Dockerfile from core_v0.12.1 to core_v0.13.1.
- core_v0.12.1 ships a ChainedOptimizer.step() that calls self.count_zeros() unconditionally; on HSTU bf16 (single inner Float16Optimizer) this triggers count_zeros_fp32's per-parameter Python loop every step (~360 CUDA kernel launches, ~4 ms wall) even though log_num_zeros_in_grad=False and the result is discarded.
- Megatron-LM commit d9608004f ("Add an option to skip counting zeros in grad of ChainedOptimizer") gates the call. First release tag containing it is core_v0.13.0; this PR pins core_v0.13.1 (latest 0.13.x patch).

Expected runtime effect (no source changes in this repo)

- Measured on the existing 0.12.1 image with HSTU exp4_tp (2-node EOS, 50 profiled steps, rank 0): the wasted work shows up in ## optimizer step ## wall time.
- The fix is bit-exact (no parameter math change; log_num_zeros_in_grad=False means the count was already going into a return slot nobody reads).

Test plan

- /build devel to build a new devel_latest image with the bumped Megatron pin.
- Verify count_zeros_fp32 is gone from ## optimizer step ## NVTX.
- Compare against the 0.12.1 baseline outside the optimizer phase.

🤖 Generated with Claude Code