Expose hardcoded Megatron infrastructure parameters to user config#2230
nic-nvidia wants to merge 2 commits into NVIDIA-NeMo:main
Conversation
Read checkpoint, timeout, and diagnostic settings from `megatron_cfg` with backward-compatible defaults instead of hardcoding them.

New `megatron_cfg` fields (all optional, existing defaults preserved):
- `async_save`, `fully_parallel_save`, `fully_parallel_load`, `load_rng`
- `distributed_timeout_minutes`
- `logging_level`

New `distributed_data_parallel_config` field:
- `check_for_nan_in_grad`

Closes NVIDIA-NeMo#2229
Hi @nic-nvidia, thanks for the enhancement! LGTM except for the default config placement. Could you help to update it?
Also @yaoyu-33 @cuichenx, could you help check whether the params in this PR are well supported in MBridge?
```diff
  # Step 1: Setup distributed
- setup_distributed()
+ setup_distributed(
+     timeout_minutes=config.get("megatron_cfg", {}).get("distributed_timeout_minutes"),
```
We encourage setting default values in config.yaml instead of in code, so that people can see what features we have and their default behavior without looking into the code.
Can you help to:
- Update to the suggestion below, and also the other configs
- Add the param (set to the default value) to several base configs? (Other configs inherit from the base ones, so they don't need to change.)
  - examples/configs/distillation_math.yaml
  - examples/configs/dpo.yaml
  - examples/configs/grpo_math_1B.yaml
  - examples/configs/rm.yaml
  - examples/configs/sft.yaml
  - examples/nemo_gym/grpo_nanov3.yaml
  - examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
  - research/template_project/configs/grpo_math_1B.yaml
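A base-config addition along the lines the reviewer requests might look like the YAML sketch below. Field names come from this PR; the nesting under `policy.megatron_cfg` and the values for `distributed_timeout_minutes` and `logging_level` are assumptions, not values confirmed in this thread:

```yaml
policy:
  megatron_cfg:
    # Minutes before distributed collectives time out (value is a placeholder)
    distributed_timeout_minutes: 30
    # Checkpoint behavior; defaults match the previously hardcoded values
    async_save: false
    fully_parallel_save: true
    fully_parallel_load: true
    load_rng: false
    # Python logging level (placeholder; 20 == logging.INFO)
    logging_level: 20
    distributed_data_parallel_config:
      # Extra NaN check on gradients (placeholder default)
      check_for_nan_in_grad: false
```

Putting the defaults here, rather than in `.get(..., default)` calls, makes them discoverable from the config file alone.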
```diff
- timeout_minutes=config.get("megatron_cfg", {}).get("distributed_timeout_minutes"),
+ timeout_minutes=config["megatron_cfg"]["distributed_timeout_minutes"],
```
Similarly, this is `cfg.dist.distributed_timeout_minutes` in Megatron Bridge. Just checking that this is fine.
```python
async_save=(config or {}).get("megatron_cfg", {}).get("async_save", False),
fully_parallel_save=(config or {}).get("megatron_cfg", {}).get("fully_parallel_save", True),
fully_parallel_load=(config or {}).get("megatron_cfg", {}).get("fully_parallel_load", True),
load_rng=(config or {}).get("megatron_cfg", {}).get("load_rng", False),
```
These args are accessed from Megatron as `cfg.checkpoint.async_save`, but I don't see the "checkpoint" part here. Could you double-check this part?
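A minimal sketch of the nested read the reviewer is describing. The `checkpoint` key placement is assumed from the comment about `cfg.checkpoint.async_save`; the helper name is hypothetical, and the actual Megatron Bridge layout should be verified against its docs:

```python
def get_checkpoint_cfg(config):
    """Read checkpoint settings with backward-compatible defaults.

    Nests lookups under a "checkpoint" section to mirror how Megatron
    Bridge accesses them (cfg.checkpoint.async_save). Defaults match
    the previously hardcoded values from this PR.
    """
    ckpt = (config or {}).get("megatron_cfg", {}).get("checkpoint", {})
    return {
        "async_save": ckpt.get("async_save", False),
        "fully_parallel_save": ckpt.get("fully_parallel_save", True),
        "fully_parallel_load": ckpt.get("fully_parallel_load", True),
        "load_rng": ckpt.get("load_rng", False),
    }

# No config at all -> the hardcoded defaults are preserved.
print(get_checkpoint_cfg(None))
# An explicit override takes effect.
print(get_checkpoint_cfg({"megatron_cfg": {"checkpoint": {"async_save": True}}}))
```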
Summary
- Read checkpoint, timeout, and diagnostic settings from `megatron_cfg` with backward-compatible defaults instead of hardcoding them in `setup.py`
- New `megatron_cfg` fields: `async_save`, `fully_parallel_save`, `fully_parallel_load`, `load_rng`, `distributed_timeout_minutes`, `logging_level`
- New `distributed_data_parallel_config` field: `check_for_nan_in_grad`
- Closes #2229
Test plan
- `test_basic_checkpoint_config` passes (backward compat, no config arg)
- `test_checkpoint_config_overrides` validates all 4 checkpoint fields
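The backward-compat behavior these tests cover can be sketched with the chained-`.get` pattern from this PR. The helper name is hypothetical; the point is only that a missing or `None` config yields the old hardcoded default, while an explicit config value wins:

```python
def read_megatron_field(config, field, default):
    # Chained .get: tolerate config being None, megatron_cfg being
    # absent, or the field being unset, falling back to `default`.
    return (config or {}).get("megatron_cfg", {}).get(field, default)

# Backward compat: no config arg -> hardcoded default preserved.
assert read_megatron_field(None, "async_save", False) is False

# Explicit override takes effect.
assert read_megatron_field({"megatron_cfg": {"async_save": True}}, "async_save", False) is True
```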