Skip to content

[ckpt] feat: support MSC for fsdp_dtensors#3300

Open
pavelgein wants to merge 3 commits intoNVIDIA-NeMo:mainfrom
pavelgein:fsdp_msc_checkpoints
Open

[ckpt] feat: support MSC for fsdp_dtensors#3300
pavelgein wants to merge 3 commits intoNVIDIA-NeMo:mainfrom
pavelgein:fsdp_msc_checkpoints

Conversation

@pavelgein
Copy link
Copy Markdown
Contributor

@pavelgein pavelgein commented Apr 13, 2026

What does this PR do ?

Support checkpoints for fsdp_dtensors

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Summary by CodeRabbit

  • New Features

    • Added support for flexible checkpoint storage backend selection via feature flag.
  • Refactor

    • Centralized checkpoint reader and writer instantiation for improved consistency in checkpoint handling.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

Updated checkpoint I/O operations in fsdp_dtensor to use helper functions and conditional storage writer selection based on MultiStorageClientFeature.is_enabled(). Both checkpoint saving and loading paths now delegate to centralized reader/writer creation helpers instead of directly instantiating PyTorch checkpoint classes.

Changes

Cohort / File(s) Summary
Checkpoint I/O Abstraction
src/megatron/bridge/training/checkpointing.py
Modified fsdp_dtensor checkpoint saving to conditionally select MultiStorageFileSystemWriter or torch.distributed.checkpoint.FileSystemWriter based on feature flag. Updated checkpoint loading in fsdp_dtensor and _load_fsdp_dtensor_base_checkpoint to use centralized reader helpers _get_checkpoint_reader() and _get_filesystem_reader() instead of direct class instantiation. Adjusted imports accordingly.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR implements major MSC feature for fsdp_dtensors but lacks documented test results, and contains critical bug: line 1917 calls undefined function _get_checkpoint_reader instead of _get_filesystem_reader. Provide documented test results demonstrating MSC checkpoint functionality works; fix line 1917 bug by replacing _get_checkpoint_reader with _get_filesystem_reader; update PR description with test evidence.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding MultiStorageClient (MSC) support for fsdp_dtensor checkpoints.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/training/checkpointing.py`:
- Line 1917: The code calls an undefined function _get_checkpoint_reader at
reader = _get_checkpoint_reader(checkpoint_name); replace that call with the
existing _get_filesystem_reader (the symbol imported earlier) so the code uses
_get_filesystem_reader(checkpoint_name) instead, and verify any surrounding
logic (e.g., variables used after reader is returned) still matches the reader
API used elsewhere (such as the other call at line ~2728).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0a4e457f-920d-4b46-9283-ebb0b40390c1

📥 Commits

Reviewing files that changed from the base of the PR and between 8dfe249 and e06901f.

📒 Files selected for processing (1)
  • src/megatron/bridge/training/checkpointing.py

Comment thread src/megatron/bridge/training/checkpointing.py Outdated
@pavelgein pavelgein force-pushed the fsdp_msc_checkpoints branch from e06901f to 5654829 Compare April 13, 2026 10:46
@yaoyu-33 yaoyu-33 added area:ckpt Checkpoint conversion, loading, export, and save paths needs-review PR is ready for code review and waiting on a reviewer labels Apr 13, 2026
@cuichenx cuichenx linked an issue Apr 13, 2026 that may be closed by this pull request
yaoyu-33
yaoyu-33 previously approved these changes Apr 13, 2026
@yaoyu-33
Copy link
Copy Markdown
Contributor

/ok to test 5654829

@yaoyu-33 yaoyu-33 added ready-to-merge PR is approved, current, and only waiting for CI to pass before merge and removed needs-review PR is ready for code review and waiting on a reviewer labels Apr 13, 2026
@pavelgein pavelgein force-pushed the fsdp_msc_checkpoints branch from 5654829 to 34e97a3 Compare April 14, 2026 05:54
@yaoyu-33
Copy link
Copy Markdown
Contributor

/ok to test 34e97a3

Signed-off-by: Pavel Gein <pavel.gein@gmail.com>
Signed-off-by: Pavel Gein <pavel.gein@gmail.com>
Signed-off-by: Pavel Gein <pavel.gein@gmail.com>
@pavelgein pavelgein force-pushed the fsdp_msc_checkpoints branch from 34e97a3 to 4bb4fea Compare April 15, 2026 04:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:ckpt Checkpoint conversion, loading, export, and save paths community-request ready-to-merge PR is approved, current, and only waiting for CI to pass before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[feature] Support MultiStorageClient (MSC) for FSDP checkpoints

3 participants