Incompatibility between DeepEP V2 ElasticBuffer non-expanded output and DeepGEMM masked/contiguous grouped GEMM input format
Problem Summary
We are integrating DeepEP V2 ElasticBuffer with DeepGEMM for sglang's MoE dispatch/combine pipeline.
DeepEP V2's ElasticBuffer.dispatch() API returns one of two formats:
- do_expand=False (non-expanded/deduplicated): returns an unsorted 2D flat tensor [num_recv_tokens, H] plus recv_topk_idx [N, topk] (which experts each token was routed to)
- do_expand=True (expanded): returns a 2D flat tensor [num_expanded_tokens, H], already sorted by expert with expert-alignment padding
However, DeepGEMM's two grouped-GEMM variants have conflicting input format requirements:
| GEMM Variant | Expected Input Shape | Key Requirement |
| --- | --- | --- |
| Masked (grouped_gemm_nt_f8f8bf16_masked) | [E_local, expected_m, K] (3D) | Requires 3D layout with expert dimension; padding/masking via masked_m[e] |
| Contiguous (grouped_gemm_nt_f8f8bf16_contig) | [M, K] (2D) | Requires 2D flat layout sorted by expert; m_indices[m] tells which expert each row belongs to |
The Core Issue
DeepEP V2 do_expand=False output matches neither format directly:
- It returns an unsorted 2D tensor [N, H] (not sorted by expert, unlike the contiguous GEMM input)
- Expert-assignment info is carried in a separate recv_topk_idx [N, topk] tensor rather than being implicit in the layout
- Forced format conversion: To use the non-expanded output with DeepGEMM masked GEMM, we must introduce additional kernels (scatter, reverse-scatter, slot-map construction, etc.) to reshape and reorder data, which adds non-trivial overhead per MoE layer
In contrast, the do_expand=True output is directly compatible with contiguous GEMM (already sorted by expert, just need to extract m_indices from psum_per_expert).
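To make the m_indices extraction concrete, here is a minimal pure-Python sketch. It assumes that psum_per_expert[e] is the exclusive end offset of local expert e's row segment (alignment padding included); that interpretation is our reading, not something confirmed by the DeepEP documentation. The helper name m_indices_from_psum is hypothetical.

```python
def m_indices_from_psum(psum_per_expert):
    """Build per-row expert ids from a per-expert prefix sum.

    Assumption: psum_per_expert[e] is the exclusive end offset of local
    expert e's segment in the expanded [M, K] tensor, padding included.
    """
    m_indices = []
    start = 0
    for expert, end in enumerate(psum_per_expert):
        # Every row in [start, end) belongs to this expert.
        m_indices.extend([expert] * (end - start))
        start = end
    return m_indices


# Toy example: three local experts with aligned segment sizes 4, 0 and 8.
psum = [4, 4, 12]
m_indices = m_indices_from_psum(psum)
```

This sketch assigns padding rows to their expert as well; whether padding rows should instead carry a sentinel value depends on the DeepGEMM version in use. On the GPU the same mapping could be produced with something like torch.repeat_interleave over the per-expert counts.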
Our Configuration
We are adapting DeepEP V2 for sglang with the following setup:
- Prefill path: CUDA Graph disabled, do_expand=True → contiguous GEMM (zero-copy, works well)
- Decode path: CUDA Graph enabled, do_expand=False → masked GEMM (requires format conversion kernels)
The decode path is where the format mismatch creates a problem: do_expand=False is necessary for CUDA Graph compatibility, but its output does not directly map to either DeepGEMM variant without intermediate conversion kernels.
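For reference, the conversion we currently need on the decode path behaves roughly like the following pure-Python sketch. It is a CPU illustration of the logic only, not our actual kernel. It assumes that recv_topk_idx holds local expert ids and that -1 marks a top-k slot routed to an expert on another rank; both are assumptions about the DeepEP output rather than documented behaviour. The function name scatter_to_masked and the slot_map structure are hypothetical.

```python
def scatter_to_masked(recv_x, recv_topk_idx, num_local_experts, expected_m):
    """Reference scatter from [N, H] rows into a [E_local, expected_m, H] layout.

    Assumptions: recv_topk_idx contains local expert ids, and -1 marks a
    top-k slot that is not served by this rank.
    """
    hidden = len(recv_x[0]) if recv_x else 0
    # Fixed-shape buffer, independent of the routing result.
    packed = [[[0.0] * hidden for _ in range(expected_m)]
              for _ in range(num_local_experts)]
    masked_m = [0] * num_local_experts          # valid rows per expert
    slot_map = [[-1] * len(row) for row in recv_topk_idx]

    for token, experts in enumerate(recv_topk_idx):
        for k, expert in enumerate(experts):
            if expert < 0:
                continue                        # not routed to a local expert
            slot = masked_m[expert]
            if slot >= expected_m:
                raise ValueError("expected_m is too small for this routing")
            packed[expert][slot] = list(recv_x[token])
            slot_map[token][k] = slot           # remembered for the reverse scatter
            masked_m[expert] += 1
    return packed, masked_m, slot_map


# Toy example: 3 received tokens, top-2 routing, 2 local experts.
recv_x = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
recv_topk_idx = [[0, 1], [1, -1], [0, -1]]
packed, masked_m, slot_map = scatter_to_masked(recv_x, recv_topk_idx, 2, 4)
```

The packed buffer has the same shape for every routing outcome, which is what makes the masked layout attractive under CUDA Graph. The cost is that each token is copied once per local expert it is routed to, and slot_map has to be kept so that expert outputs can be gathered back into token order afterwards.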
Questions for DeepGEMM Team
We would like to ask whether you plan to provide a third DeepGEMM variant that natively supports the DeepEP V2 non-expanded output format, or whether there is a recommended way to bridge this format mismatch efficiently.
Specifically:
- Is there a planned or experimental variant that accepts unsorted 2D input with explicit topk_idx and topk_weights metadata?
- If not, would you recommend optimizing the format conversion kernels (scatter/reverse-scatter) as the standard approach for DeepEP V2 integration?
- Are there any performance considerations or best practices you would suggest for this integration pattern?
Any guidance would help us determine the best path forward for production deployment.