[Feature] Enable AutoEP Compatibility with ZeRO-3#7928
nathon-lee wants to merge 11 commits into deepspeedai:master
Conversation
This reverts commit ff88670. Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com>
Revert "fix: update 1 file reformatted." (ff88670)
This reverts commit b90aee5.
Revert accidental Muon optimizer code re-introduction from copilot PRs
Signed-off-by: nathon-lee <leejianwoo@gmail.com>

fix: move torch.distributed as dist

Signed-off-by: nathon-lee <leejianwoo@gmail.com>

fix: update docs _tutorials autoep.md

Signed-off-by: nathon-lee <leejianwoo@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d231f6b3bc
    if ep_group is not None:
        self.ep_group = ep_group
        self.ep_group_name = f"ep_group_{id(ep_group)}"
        for param in self.experts.parameters():
Preserve process-group handle in AutoEP parallelism setup
DeepSpeedEngine._configure_distributed_model passes a boolean (use_data_before_expert_parallel_) into every module's set_deepspeed_parallelism, but this method treats any non-None value as an EP process group and overwrites self.ep_group with True/False. Once that happens, AutoEP forward calls (all_gather/all_to_all_single) run with a boolean instead of a process group and fail at runtime as soon as the layer executes.
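A minimal sketch of the kind of guard this review suggests (the method and attribute names follow the snippet above, but the helper itself is an assumption, not the PR's code): reject booleans so the engine's `use_data_before_expert_parallel_` flag can never overwrite `self.ep_group`.

```python
class ExpertLayer:
    """Stand-in for the AutoEP expert module (illustrative only)."""

    def __init__(self):
        self.ep_group = None
        self.ep_group_name = None

    def set_ep_group(self, ep_group):
        # DeepSpeedEngine passes a boolean (use_data_before_expert_parallel_)
        # through set_deepspeed_parallelism; it must not replace a real
        # process-group handle, so booleans and None are ignored here.
        if ep_group is None or isinstance(ep_group, bool):
            return
        self.ep_group = ep_group
        self.ep_group_name = f"ep_group_{id(ep_group)}"


layer = ExpertLayer()
layer.set_ep_group(True)  # boolean from the engine: silently ignored
assert layer.ep_group is None

sentinel = object()  # stand-in for a torch.distributed process group
layer.set_ep_group(sentinel)
assert layer.ep_group is sentinel
```

With this guard in place, `all_gather`/`all_to_all_single` in the AutoEP forward path always receive either `None` or a genuine process-group handle.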
    self.reorderer = TokenReorderer(
        num_experts=self.num_local_experts,
        top_k=spec.top_k,
Use global expert cardinality for token reordering
The router emits expert IDs in the global range [0, num_experts), but TokenReorderer is initialized with num_local_experts. For ep_size > 1, expert IDs outside the local range are mis-bucketed/dropped by the histogram logic, so token counts no longer match the sorted token stream; this corrupts dispatch metadata and can trigger incorrect routing or downstream shape/index failures in multi-rank EP runs.
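The mismatch is easy to reproduce in isolation. A hypothetical sketch (plain-Python stand-in for the histogram step, not the PR's `TokenReorderer`): bucketing global expert IDs with only the local expert count silently drops every token routed to a non-local expert.

```python
def bucket_counts(expert_ids, num_buckets):
    """Count tokens per expert; IDs outside [0, num_buckets) are dropped,
    mimicking how a histogram sized by num_local_experts mis-buckets
    global expert IDs."""
    counts = [0] * num_buckets
    for eid in expert_ids:
        if 0 <= eid < num_buckets:
            counts[eid] += 1
    return counts


# Router output in the global range [0, num_experts) with num_experts = 4.
expert_ids = [0, 1, 2, 3, 3]

local = bucket_counts(expert_ids, num_buckets=2)   # num_local_experts = 2
global_ = bucket_counts(expert_ids, num_buckets=4)  # num_experts = 4

assert sum(local) != len(expert_ids)    # tokens silently lost → bad metadata
assert sum(global_) == len(expert_ids)  # counts match the sorted token stream
```

Sizing the reorderer by the global `num_experts` keeps the per-expert counts consistent with the sorted token stream that dispatch consumes.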
    if not hasattr(self, '_autoep_expert_optimizer'):
        optimizer_cls = type(self.optimizer)
        base_group = self.optimizer.param_groups[0]
        expert_group = {k: v for k, v in base_group.items() if k != 'params'}
        expert_group['params'] = expert_params
Keep expert optimizer hyperparameters in schedule sync
The dedicated AutoEP expert optimizer is created once from self.optimizer.param_groups[0] and then reused without any hyperparameter refresh. If a scheduler (or manual LR/WD update) changes the main optimizer during training, expert params keep stale hyperparameters while non-expert params follow the new values, causing silent optimization drift between parameter sets.
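One way to address this (a hypothetical helper, not code from the PR) is to re-copy schedule-driven hyperparameters from the main optimizer's first param group onto every group of the expert optimizer after each scheduler step, so the two parameter sets never diverge.

```python
def sync_expert_hyperparams(main_optimizer, expert_optimizer,
                            keys=('lr', 'weight_decay', 'betas', 'eps')):
    """Mirror scheduler/manual hyperparameter updates from the main
    optimizer onto the dedicated expert optimizer."""
    src = main_optimizer.param_groups[0]
    for group in expert_optimizer.param_groups:
        for k in keys:
            if k in src:
                group[k] = src[k]


class _Opt:
    """Minimal stand-in for torch.optim.Optimizer (param_groups only)."""

    def __init__(self, lr):
        self.param_groups = [{'lr': lr, 'weight_decay': 0.0}]


main = _Opt(1e-3)
expert = _Opt(1e-3)
main.param_groups[0]['lr'] = 5e-4  # e.g. an LR scheduler stepped the main opt
sync_expert_hyperparams(main, expert)
assert expert.param_groups[0]['lr'] == 5e-4
```

Calling such a helper right before the expert optimizer's `step()` would keep expert and non-expert parameters on the same schedule.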
Thank you, @nathon-lee! This is amazing. Maybe we should focus on merging the branch first? I have left it for a while, but I will prioritize it if you can help me.

@tohtana Thanks for pointing this out — you’re right. I did use tohtana/DeepSpeedExamples/training/expert_parallel as a reference, and I should have acknowledged that more clearly.
Hi @nathon-lee,
Hi @tohtana, thanks for the heads-up and for adding the missing features on top of your branch. |
@nathon-lee #7938 is missing Z3 support. Do you think you can add it? What about creating a new PR focused on Z3 support and merging it into #7938?
@tohtana ok |
I’ll probably wait until your AutoEP branch is merged into main before opening my PR, since my changes depend on your branch. |
[Feature] Enable AutoEP Compatibility with ZeRO-3
📌 Summary
This PR introduces compatibility between AutoEP (Expert Parallelism) and ZeRO-3.
AutoEP has historically relied on ZeRO-2 due to inherent conflicts between expert-parallel parameter partitioning and ZeRO-3’s data-parallel sharding. This PR resolves those conflicts through a minimal and targeted decoupling strategy, allowing expert parameters to follow AutoEP semantics while all other parameters remain under standard ZeRO-3 sharding.
This preserves AutoEP’s high-throughput execution while unlocking the memory efficiency of ZeRO-3 where applicable.
🔍 Design Overview
Instead of modifying core ZeRO-3 logic, this PR selectively bypasses ZeRO-3 mechanisms for expert parameters, while keeping the default behavior unchanged for all other parameters.
The implementation consists of four focused components:
1. Parameter Partition Bypass

Expert parameters are tagged (`_autoep_expert=True`) and excluded from ZeRO-3 partitioning and gathering logic.

2. Gradient Reduction Isolation

Expert gradients bypass the ZeRO-3 `reduce-scatter` and instead use `all_reduce` within the EP data-parallel group, matching AutoEP semantics.

3. Optimizer State Isolation

A dedicated optimizer is introduced for expert parameters, along with FP32 master weights to ensure numerical stability during updates.

4. Checkpoint Compatibility

Expert parameters and their optimizer states are explicitly integrated into checkpoint save/load paths to ensure correct training resumption.
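Components 1 and 2 can be sketched as follows. This is an illustrative stand-in under assumed names (only the `_autoep_expert` tag comes from the description above; the helpers and `Param` class are not the PR's code): tag expert parameters, then filter them out of the set ZeRO-3 partitions. Gradient reduction would then `all_reduce` the tagged set within the EP data-parallel group instead of running `reduce-scatter`.

```python
class Param:
    """Minimal stand-in for torch.nn.Parameter (illustrative only)."""

    def __init__(self, name):
        self.name = name


def tag_expert_params(params):
    # Component 1: mark expert parameters so ZeRO-3 paths can skip them.
    for p in params:
        p._autoep_expert = True


def partitionable_params(params):
    # Parameters ZeRO-3 should still partition/gather: everything untagged.
    return [p for p in params if not getattr(p, '_autoep_expert', False)]


expert_w = Param('experts.w')
dense_w = Param('dense.w')
tag_expert_params([expert_w])

remaining = partitionable_params([expert_w, dense_w])
assert [p.name for p in remaining] == ['dense.w']
```

The same tag would drive the gradient path: tagged parameters are collected into the `all_reduce` bucket for the EP data-parallel group, while untagged ones follow the unchanged ZeRO-3 `reduce-scatter`.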
✅ Benefits
🧪 Testing
Verified end-to-end training correctness
Added unit tests for:
Due to limited GPU resources, validation has been performed on 2 GPUs.
If additional resources (e.g., 8 GPUs) are available, I would be very happy to further validate scalability and robustness. The additional verification should only require a few hours.
🙏 Notes
Feedback and suggestions are very welcome.
If possible, I would greatly appreciate access to larger-scale testing resources to further strengthen validation.
References
Signed-off-by: nathon-lee <leejianwoo@gmail.com>