[ckpt] feat: support MSC for fsdp_dtensors #3300
pavelgein wants to merge 3 commits into NVIDIA-NeMo:main
📝 Walkthrough
Updated checkpoint I/O operations in
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/training/checkpointing.py`:
- Line 1917: The code calls an undefined function _get_checkpoint_reader at
reader = _get_checkpoint_reader(checkpoint_name); replace that call with the
existing _get_filesystem_reader (the symbol imported earlier) so the code uses
_get_filesystem_reader(checkpoint_name) instead, and verify any surrounding
logic (e.g., variables used after reader is returned) still matches the reader
API used elsewhere (such as the other call at line ~2728).
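A minimal sketch of the rename described in the inline comment above. The stub `_get_filesystem_reader` here is a hypothetical stand-in, since the real helper's signature and return type are not shown on this page; `load_before_fix`, `load_after_fix`, and `raises_name_error` are likewise illustrative names, not code from `checkpointing.py`. It only shows why the undefined name fails at call time and how the replacement avoids that.

```python
def _get_filesystem_reader(checkpoint_name: str) -> dict:
    """Hypothetical stub: the real helper in checkpointing.py is not shown in this PR page."""
    return {"path": checkpoint_name}


def load_before_fix(checkpoint_name: str):
    # Mirrors the flagged line: _get_checkpoint_reader is never defined or imported,
    # so Python raises NameError only when this line actually executes.
    return _get_checkpoint_reader(checkpoint_name)  # noqa: F821


def load_after_fix(checkpoint_name: str):
    # The suggested fix: call the helper that already exists in the module.
    return _get_filesystem_reader(checkpoint_name)


def raises_name_error(fn, *args) -> bool:
    """Return True if calling fn(*args) raises NameError, False otherwise."""
    try:
        fn(*args)
    except NameError:
        return True
    return False
```

Because the failure only appears when the code path runs, a static check for undefined names (for example a linter's F821 rule) would likely catch this kind of mistake earlier than a test that never reaches the line.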
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0a4e457f-920d-4b46-9283-ebb0b40390c1
📒 Files selected for processing (1)
src/megatron/bridge/training/checkpointing.py
e06901f to 5654829
/ok to test 5654829
5654829 to 34e97a3
/ok to test 34e97a3
Signed-off-by: Pavel Gein <pavel.gein@gmail.com>
Signed-off-by: Pavel Gein <pavel.gein@gmail.com>
Signed-off-by: Pavel Gein <pavel.gein@gmail.com>
34e97a3 to 4bb4fea
What does this PR do?
Adds MultiStorageClient (MSC) support for fsdp_dtensors checkpoints.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
[feature] Support MultiStorageClient (MSC) for FSDP checkpoints #3261
Summary by CodeRabbit
New Features
Refactor