
build(docker): bump Megatron-LM 0.12.1 -> 0.13.1 to fix count_zeros wasted work #375

Open

JacoCheung wants to merge 1 commit into NVIDIA:main from JacoCheung:junzhang/mcore-version-bump

Conversation

@JacoCheung
Collaborator

Summary

  • Bumps Megatron-LM in docker/Dockerfile from core_v0.12.1 to core_v0.13.1.
  • The 0.12.x line ships a ChainedOptimizer.step() that calls self.count_zeros() unconditionally; on HSTU bf16 (single inner Float16Optimizer) this triggers count_zeros_fp32's Python per-param loop every step (~360 CUDA kernel launches, ~4 ms wall) even though log_num_zeros_in_grad=False and the result is discarded.
  • Megatron commit d9608004f ("Add an option to skip counting zeros in grad of ChainedOptimizer") gates the call (a sketch of the gate follows this list). The first release tag containing it is core_v0.13.0; this PR pins core_v0.13.1 (latest 0.13.x patch).
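
For illustration, a minimal sketch of the before/after gate. The class, `config` attribute, and method names below are placeholders for the shape of ChainedOptimizer.step(), not the verbatim Megatron-LM source:

```python
from types import SimpleNamespace


class ChainedOptimizerSketch:
    """Placeholder standing in for Megatron's ChainedOptimizer."""

    def __init__(self, log_num_zeros_in_grad: bool):
        self.config = SimpleNamespace(log_num_zeros_in_grad=log_num_zeros_in_grad)

    def count_zeros(self):
        # Stand-in for the real per-param zero count (~360 kernels/step).
        return 0

    def step_v0_12(self):
        # core_v0.12.1 behavior: the count runs every step, even when
        # the result is discarded.
        return self.count_zeros()

    def step_v0_13(self):
        # core_v0.13.x behavior (after d9608004f): gated on the logging
        # flag, so HSTU (flag unset/False) short-circuits to None.
        if self.config.log_num_zeros_in_grad:
            return self.count_zeros()
        return None


assert ChainedOptimizerSketch(False).step_v0_13() is None
```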

Expected runtime effect (no source changes in this repo)

Measured on the existing 0.12.1 image with HSTU exp4_tp (2-node EOS, 50 profiled steps, rank 0):

| Metric | Baseline (0.12.1) | With count_zeros loop removed |
|---|---|---|
| `## optimizer step ##` wall | 6.90 ms | ~2.77 ms |
| Kernels inside optimizer NVTX | 451 / step | ~20 / step |
| MFU | 16.79% | 17.15% |

The fix is bit-exact (no change to parameter math); with log_num_zeros_in_grad=False the count was already going into a return slot nobody reads.

Test plan

  • /build devel to build new devel_latest with bumped Megatron pin
  • Submit 2-node nsys profile on EOS with the new image; confirm count_zeros_fp32 is gone from the ## optimizer step ## NVTX range (a quick in-container sanity check is sketched after this list)
  • Confirm step time / MFU show no regression vs the 0.12.1 baseline outside the optimizer phase
  • CI green
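
Before spending a profiling run, a quick in-container sanity check can confirm the gate is present. A hypothetical one-off snippet, not part of the repo; the import path assumes the 0.13.x source layout, and string-matching the source is only a heuristic:

```python
import inspect

# Import path assumes the 0.13.x layout of megatron/core/optimizer/.
from megatron.core.optimizer.optimizer import ChainedOptimizer

src = inspect.getsource(ChainedOptimizer.step)
assert "log_num_zeros_in_grad" in src, "count_zeros() still looks ungated"
print("gate present in installed Megatron-LM")
```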

🤖 Generated with Claude Code

core_v0.12.1 ships a ChainedOptimizer.step() that calls
self.count_zeros() unconditionally. count_zeros_fp32 then Python-loops
over every dense parameter doing grad.numel() - torch.count_nonzero(grad)
(5 CUDA kernels per param, ~360 launches per step on HSTU bf16) even
when log_num_zeros_in_grad is False and the result is discarded.
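
For reference, a simplified sketch of that loop, assuming the general shape of count_zeros_fp32 (TP-duplicate filtering and the cross-rank reduction are omitted):

```python
import torch


def count_zeros_fp32_sketch(params):
    # One Python iteration per dense parameter, each issuing a handful
    # of small CUDA kernels (count_nonzero, subtraction, accumulation):
    # roughly 5 launches/param, ~360 launches/step on HSTU bf16.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    total_num_zeros = torch.zeros(1, dtype=torch.float, device=device)
    for param in params:
        if param.grad is not None:
            grad = param.grad.detach()
            total_num_zeros += grad.numel() - torch.count_nonzero(grad)
    return total_num_zeros
```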

Megatron-LM commit d9608004f ("Add an option to skip counting zeros
in grad of ChainedOptimizer", 2025-06-04) added the missing
log_num_zeros_in_grad gate at the call site. Earliest tag containing
the fix is core_v0.13.0; using core_v0.13.1 (latest 0.13.x patch) for
stability.

Expected effect after the bump (no source changes in this repo): the
~4 ms / step count_zeros loop in ## optimizer step ## NVTX disappears
because HSTU never sets log_num_zeros_in_grad, so step() short-circuits
to None.
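
For reference, the flag lives on Megatron's OptimizerConfig and defaults to off, which is what the short-circuit relies on; a minimal check, assuming the 0.13.x field name:

```python
from megatron.core.optimizer import OptimizerConfig

# HSTU never overrides log_num_zeros_in_grad, so the default (False)
# applies and step() returns None in the zero-count slot after the bump.
config = OptimizerConfig()
assert config.log_num_zeros_in_grad is False
```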

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JacoCheung
Collaborator Author

/build devel

@greptile-apps
Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR bumps the Megatron-LM dependency in docker/Dockerfile from core_v0.12.1 to core_v0.13.1 to pick up commit d9608004f, which gates ChainedOptimizer.count_zeros on log_num_zeros_in_grad, eliminating ~360 wasted CUDA kernel launches (~4 ms) per optimizer step when the stat is unused. The change is well-documented inline and the PR description includes benchmark data showing the expected improvement.

Confidence Score: 4/5

Safe to merge — the change is a well-justified, narrow version bump with detailed benchmarks and a clear test plan.

Only P2 findings (mutable tag pin); no logic errors, no security concerns, and the PR description provides strong evidence of correctness and performance improvement.

docker/Dockerfile — mutable git tag pin; otherwise no files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| docker/Dockerfile | Single-line version bump of the Megatron-LM git tag from core_v0.12.1 to core_v0.13.1, with a descriptive inline comment added; no SHA pin used, which is a minor reproducibility concern. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["docker build"] --> B["Layer 2: Megatron-LM\ngit clone -b core_v0.13.1"]
    B --> C["pip install --no-deps -e ./megatron-lm"]
    C --> D{"ChainedOptimizer.step"}
    D -->|"0.12.1 (old)"| E["count_zeros_fp32 called unconditionally\n~360 CUDA kernels / step, ~4 ms wasted"]
    D -->|"0.13.1 (new)"| F{"log_num_zeros_in_grad?"}
    F -->|"True"| G["count_zeros_fp32 called"]
    F -->|"False (HSTU default)"| H["skipped — 0 wasted kernels"]


@JacoCheung
Collaborator Author

JacoCheung commented Apr 30, 2026

Pipeline #49886692 -- failed

| Job | Status |
|---|---|
| build_inference_devel | ❌ failed |
| build_tritonserver_devel | ❌ failed |
| pre_check | ❌ failed |
| train_build | ✅ success |
| inference_build | ✅ success |
| tritonserver_build | ✅ success |
| build_whl | ✅ success |
| dynamicemb_test_fwd_bwd_8gpus | ❌ failed |
| dynamicemb_test_load_dump_8gpus | ✅ success |
| unit_test_1gpu_h100 | ❌ failed |
| unit_test_4gpu | ❌ failed |
| unit_test_tp_4gpu | ❌ failed |
| L20_unit_test_1gpu | ✅ success |
| inference_unit_test_1gpu | ✅ success |
| inference_test_1gpu | ❌ failed |
| build_devel | ✅ success |
| unit_test_1gpu_a100 | ❌ failed |

Result: 8/17 jobs passed


@JacoCheung
Collaborator Author

/build

@JacoCheung
Collaborator Author

JacoCheung commented Apr 30, 2026

Pipeline #49911823 -- failed

| Job | Status |
|---|---|
| build_devel | ❌ failed |
| build_inference_devel | ❌ failed |
| build_tritonserver_devel | ❌ failed |
| pre_check | ❌ failed |
| train_build | ❌ failed |
| inference_build | ✅ success |
| tritonserver_build | ✅ success |
| build_whl | ❌ failed |
| dynamicemb_test_fwd_bwd_8gpus | ❌ failed |
| dynamicemb_test_load_dump_8gpus | ❌ failed |
| unit_test_1gpu_a100 | ❌ failed |
| unit_test_1gpu_h100 | ❌ failed |
| unit_test_4gpu | ❌ failed |
| unit_test_tp_4gpu | ❌ failed |
| L20_unit_test_1gpu | ❌ failed |
| inference_unit_test_1gpu | ✅ success |
| inference_test_1gpu | ❌ failed |

Result: 3/17 jobs passed


@JacoCheung
Collaborator Author

/build

@JacoCheung
Collaborator Author

JacoCheung commented May 8, 2026

Pipeline #50621233 -- failed

| Job | Status |
|---|---|
| build_devel | ❌ failed |
| build_inference_devel | ❌ failed |
| build_tritonserver_devel | ❌ failed |
| pre_check | ❌ failed |
| train_build | ❌ failed |
| inference_build | ✅ success |
| tritonserver_build | ✅ success |
| build_whl | ❌ failed |
| dynamicemb_test_fwd_bwd_8gpus | ❌ failed |
| dynamicemb_test_load_dump_8gpus | ❌ failed |
| unit_test_1gpu_a100 | ❌ failed |
| unit_test_1gpu_h100 | ❌ failed |
| unit_test_4gpu | ❌ failed |
| unit_test_tp_4gpu | ❌ failed |
| L20_unit_test_1gpu | ❌ failed |
| inference_unit_test_1gpu | ✅ success |
| inference_test_1gpu | ❌ failed |

Result: 3/17 jobs passed


@JacoCheung
Collaborator Author

/build devel

@JacoCheung
Collaborator Author

JacoCheung commented May 8, 2026

Pipeline #50631678 -- failed

| Job | Status |
|---|---|
| build_devel | ✅ success |
| build_inference_devel | ❌ failed |
| build_tritonserver_devel | ❌ failed |
| pre_check | ❌ failed |
| train_build | ❌ failed |
| inference_build | ❌ failed |
| tritonserver_build | ❌ failed |
| build_whl | ❌ failed |
| dynamicemb_test_fwd_bwd_8gpus | ❌ failed |
| dynamicemb_test_load_dump_8gpus | ❌ failed |
| unit_test_1gpu_a100 | ❌ failed |
| unit_test_1gpu_h100 | ❌ failed |
| unit_test_4gpu | ❌ failed |
| unit_test_tp_4gpu | ❌ failed |
| L20_unit_test_1gpu | ❌ failed |
| inference_unit_test_1gpu | ❌ failed |
| inference_test_1gpu | ❌ failed |

Result: 1/17 jobs passed

