Add Gram Newton-Schulz orthogonalization for Muon optimizer #7953
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4fe6a0b4cf
```python
        raise ValueError(f"All Muon parameter groups must have the same momentum (beta). "
                         f"Found {self.muon_beta} and {group_beta}.")
    self.muon_beta = group_beta
self.muon_ns_method = param_group.get('ns_method', 'gram')
```
Preserve per-group ns_method in ZeRO-3 Muon updates
The ZeRO-3 setup stores ns_method in a single optimizer-wide field that is overwritten for each Muon param group, and _apply_distributed_muon_update later uses that single value for all Muon subgroups. If a user configures multiple use_muon=True groups with different ns_method values, earlier groups silently run with the last group's method, producing incorrect optimizer behavior and invalid experiment comparisons. This should either enforce one shared ns_method (like momentum) or track/apply ns_method per subgroup.
ns_method is actually decided by the single ns_method field in the JSON config, so it cannot diverge across param groups.
Integrate Gram Newton-Schulz (Gram NS) as the default orthogonalization
method for Muon, with a configurable ns_method switch to fall back to
standard NS when needed (e.g., for debugging convergence issues).
Gram NS iterates on the small square Gram matrix R = X @ X.T (n x n)
instead of the full rectangular X (n x m), reducing FLOPs by ~50% for
typical transformer weight matrices (aspect ratio ~5). It uses fp16
instead of bf16 for better numerical precision at the same compute cost,
with a restart at iteration 2 for half-precision stability.
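The iteration described above can be sketched as follows. This is an illustrative fp32-only sketch, not the PR's implementation: the function name is hypothetical, the coefficients are the standard Muon quintic, and the fp16 restart-at-iteration-2 logic mentioned above is omitted.

```python
import torch

def gram_newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Orthogonalize G via Newton-Schulz on the Gram matrix R = X @ X.T (sketch)."""
    a, b, c = 3.4445, -4.7750, 2.0315   # Muon quintic coefficients
    X = G.float()
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.mT                        # keep the short dimension as rows (n <= m)
    X = X / (X.norm() + eps)
    R = X @ X.mT                        # small n x n Gram matrix
    Q = torch.eye(R.size(0), dtype=X.dtype, device=X.device)
    for _ in range(steps):
        P = b * R + c * (R @ R)
        P.diagonal().add_(a)            # P = a*I + b*R + c*R^2
        Q = P @ Q                       # accumulate the polynomial applied to X
        R = P @ R @ P                   # Gram matrix of the next iterate (P is symmetric)
    X = Q @ X                           # single rectangular matmul at the end
    if transposed:
        X = X.mT.contiguous()           # restore the original orientation, contiguously
    return X.to(G.dtype)
```

In exact arithmetic this matches standard Newton-Schulz with the same polynomial: X_{k+1} = P(R_k) X_k implies R_{k+1} = P(R_k) R_k P(R_k), so only square matmuls are needed inside the loop.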
Benchmark results on A100:
- (2048, 11059): 2.25x GPU speedup, 1.85x CPU speedup
- (3584, 19353): 2.07x GPU speedup, 1.35x CPU speedup
- Falls back to standard NS for square matrices (no FLOP advantage)
Usage: set ns_method in DeepSpeed config:
{"optimizer": {"type": "muon", "params": {"ns_method": "gram"}}}
Use "standard" to disable Gram NS and revert to original behavior.
Reference: https://arxiv.org/abs/2503.02022
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Both NS functions now query the accelerator to choose the compute dtype instead of hardcoding it: standard NS uses is_bf16_supported() to select bf16 vs fp32; Gram NS uses is_fp16_supported() to select fp16 vs fp32.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
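A hedged sketch of that selection logic. The capability checks are passed in as plain booleans here for illustration; in DeepSpeed they would come from the accelerator abstraction (get_accelerator().is_bf16_supported() / is_fp16_supported()), and the function name is hypothetical.

```python
import torch

def ns_compute_dtype(ns_method: str, bf16_ok: bool, fp16_ok: bool) -> torch.dtype:
    """Pick the Newton-Schulz compute dtype from accelerator capability flags (sketch)."""
    if ns_method == "gram":
        # Gram NS prefers fp16 for its extra mantissa bits at the same compute cost
        return torch.float16 if fp16_ok else torch.float32
    # standard NS keeps its original bf16 path
    return torch.bfloat16 if bf16_ok else torch.float32
```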
Gram Newton-Schulz produces non-contiguous tensors via .mT for tall weight matrices (e.g., gate_proj/up_proj in LLaMA). This caused downstream grad norm computation (g.data.double()) to be ~1.8x slower due to strided memory access, adding ~75ms to optimizer step time. Add .contiguous() to the Gram NS return path for tall matrices, and ensure muon_update casts back to the original gradient dtype (Gram NS uses fp16 internally while gradients are bf16).

Benchmark (Qwen2.5-3B, 2xA100, ZeRO-2, 3 runs avg):
- Before fix: 945.1ms/step (optimizer: 229.9ms)
- After fix: 936.6ms/step (optimizer: 204.3ms)
- Standard NS baseline: 1054.5ms/step
- Gram NS speedup: 10.4% -> 11.2%

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Replace (Q @ X).mT.contiguous() with X.mT @ Q.mT, which produces a contiguous result directly. cuBLAS handles transposed inputs natively via transpose flags, so the matmul cost is identical but the extra memcpy from .contiguous() is eliminated.

Benchmark (Qwen2.5-3B, 2xA100, ZeRO-2, 3 runs avg):
- Before: 936.6ms/step (backward: 628.4ms)
- After: 931.5ms/step (backward: 612.8ms)
- Speedup vs standard NS: 11.2% -> 11.7%

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
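The layout point can be checked directly (illustrative shapes, not the PR's code): .mT only swaps strides, so the transpose of a matmul result is a non-contiguous view, while writing the transposed product with a fresh matmul yields contiguous memory.

```python
import torch

Q = torch.randn(64, 64)     # square Newton-Schulz factor
X = torch.randn(64, 320)    # wide weight-shaped matrix

via_view = (Q @ X).mT       # .mT only swaps strides: a non-contiguous view
direct = X.mT @ Q.mT        # matmul writes a fresh, contiguous result

assert not via_view.is_contiguous()
assert direct.is_contiguous()
# mathematically identical: (Q @ X)^T == X^T @ Q^T
assert torch.allclose(via_view, direct, atol=1e-3)
```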
Replace separate scalar-multiply + matmul + add operations with single torch.addmm calls for the Q and R updates, reducing kernel launch overhead. Remove the torch.eye allocation by using diagonal().add_() instead.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
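A sketch of that fusion on an illustrative 16 x 16 Gram-like matrix (not the PR's exact code): torch.addmm(input, mat1, mat2, beta=..., alpha=...) computes beta*input + alpha*(mat1 @ mat2) in one kernel, and the identity term is added in place on the diagonal.

```python
import torch

a, b, c = 3.4445, -4.7750, 2.0315
R0 = torch.randn(16, 16)
R = R0 @ R0.mT / 16.0                 # symmetric PSD, like a Gram matrix

# Unfused reference: scalar-multiply + matmul + add, plus a torch.eye allocation
P_ref = a * torch.eye(16) + b * R + c * (R @ R)

# Fused: one addmm kernel for b*R + c*(R @ R), then an in-place diagonal add for a*I
P = torch.addmm(R, R, R, beta=b, alpha=c)
P.diagonal().add_(a)

assert torch.allclose(P, P_ref, atol=1e-4)
```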

Authors: @delock and @PKUWZP
Summary
Integrate Gram Newton-Schulz (Gram NS) as the default orthogonalization method for the Muon optimizer, with a configurable `ns_method` switch to fall back to the original iteration when needed. Based on the Gram Newton-Schulz method from https://tridao.me/blog/2026/gram-newton-schulz/
Motivation
Standard Newton-Schulz iterates on the full rectangular matrix X (n × m). Gram NS iterates on the much smaller Gram matrix R = X @ X.T (n × n), which is significantly cheaper when m >> n — the common case for transformer weight matrices (typical aspect ratio α ≈ 5).
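A rough FLOP count makes the saving concrete. This is an illustrative sketch, not the PR's code: `ns_flops` is a hypothetical helper that counts 2nmk FLOPs per (n x k)(k x m) matmul and ignores elementwise work.

```python
def ns_flops(n: int, m: int, steps: int = 5) -> tuple[int, int]:
    """Approximate FLOPs for 'steps' Newton-Schulz iterations on an n x m matrix."""
    # Standard NS per step: A = X @ X.T (2*n*n*m), A @ A (2*n^3), B @ X (2*n*n*m)
    standard = steps * (2 * n * n * m + 2 * n**3 + 2 * n * n * m)
    # Gram NS: R0 = X @ X.T once, four n x n matmuls per step
    # (R @ R, P @ Q, and P @ R @ P), plus one final Q @ X
    gram = 2 * n * n * m + steps * 8 * n**3 + 2 * n * n * m
    return standard, gram

s, g = ns_flops(2048, 11059)          # one of the benchmarked shapes, aspect ratio ~5.4
print(f"gram/standard FLOP ratio: {g / s:.2f}")   # prints 0.52
```

For square matrices the square-matmul loop costs more than it saves, consistent with the fallback to standard NS noted above.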
Changes
- Add zeropower_via_gram_newtonschulz in original_muon.py with fp16 compute (better precision than bf16 at the same cost) and a restart at iteration 2 for half-precision stability
- Add an ns_method parameter ("gram" | "standard") to muon_update and all Muon optimizer classes
- Plumb ns_method through the ZeRO Stage 1/2/3 call sites and the DeepSpeed JSON config
Usage
Performance improvement:
