
Incompatibility between DeepEP V2 ElasticBuffer non-expanded output and DeepGEMM masked/contiguous grouped GEMM input format #370

@litiantian00

Problem Summary

We are integrating DeepEP V2 ElasticBuffer with DeepGEMM for sglang's MoE dispatch/combine pipeline.

DeepEP V2's ElasticBuffer.dispatch() API returns two formats:

  1. do_expand=False (non-expanded/deduplicated): Returns a flat, unsorted 2D tensor [N, H] (N = num_recv_tokens) plus recv_topk_idx [N, topk], which records the experts each token was routed to
  2. do_expand=True (expanded): Returns a flat 2D tensor [num_expanded_tokens, H], already sorted by expert and padded to the per-expert alignment

However, DeepGEMM's two grouped-GEMM variants have conflicting input format requirements:

| GEMM Variant | Expected Input Shape | Key Requirement |
| --- | --- | --- |
| Masked (grouped_gemm_nt_f8f8bf16_masked) | [E_local, expected_m, K] (3D) | Requires 3D layout with expert dimension; padding/masking via masked_m[e] |
| Contiguous (grouped_gemm_nt_f8f8bf16_contig) | [M, K] (2D) | Requires 2D flat layout sorted by expert; m_indices[m] tells which expert each row belongs to |
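
To make the two layout contracts concrete, here is a minimal NumPy reference of what each variant computes, as we understand it. This is not DeepGEMM code: the function names are hypothetical, FP8 quantization and scaling factors are ignored, and we assume the "nt" convention means each expert weight is stored as [N_out, K] so that the output is `a @ b[e].T`.

```python
import numpy as np

def masked_grouped_gemm_ref(a, b, masked_m):
    """Reference semantics for the masked layout (hypothetical helper).

    a:        [E_local, expected_m, K] activations, one slab per expert
    b:        [E_local, N_out, K] weights (assumed "nt": out = a @ b.T)
    masked_m: [E_local] number of valid rows in each expert's slab
    """
    num_experts, expected_m, _ = a.shape
    out = np.zeros((num_experts, expected_m, b.shape[1]), dtype=np.float32)
    for e in range(num_experts):
        rows = int(masked_m[e])
        # Rows past masked_m[e] are padding and are left untouched (zero here).
        out[e, :rows] = a[e, :rows] @ b[e].T
    return out

def contiguous_grouped_gemm_ref(a, b, m_indices):
    """Reference semantics for the contiguous layout (hypothetical helper).

    a:         [M, K] activations, rows grouped by expert
    b:         [E_local, N_out, K] weights
    m_indices: [M] expert id owning each row of a
    """
    m_indices = np.asarray(m_indices)
    out = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    for e in range(b.shape[0]):
        rows = m_indices == e
        out[rows] = a[rows] @ b[e].T
    return out
```

The masked form has static shapes (only masked_m changes between launches), while the contiguous form has a data-dependent M.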

The Core Issue

DeepEP V2 do_expand=False output matches neither format directly:

  • It is a 2D unsorted [N, H] tensor, whereas contiguous GEMM requires rows sorted by expert and masked GEMM requires a 3D per-expert layout
  • Expert-assignment info is carried in a separate recv_topk_idx [N, topk] tensor rather than being implicit in the layout
  • Forced format conversion: To use the non-expanded output with DeepGEMM masked GEMM, we must introduce additional kernels (scatter, reverse-scatter, slot-map construction, etc.) to reshape and reorder the data, which adds non-trivial overhead to every MoE layer
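
For reference, the conversion we are describing is roughly the following, written as a slow NumPy sketch rather than our actual kernels. The helper names are hypothetical, and two things are assumptions for illustration: that recv_topk_idx holds local expert ids with -1 marking entries not owned by this rank, and that the reverse-scatter step applies topk_weights.

```python
import numpy as np

def scatter_to_masked(recv_x, recv_topk_idx, num_local_experts, expected_m):
    """Copy each (token, expert) pair into the [E_local, expected_m, H] layout."""
    num_tokens, hidden = recv_x.shape
    x_masked = np.zeros((num_local_experts, expected_m, hidden), dtype=recv_x.dtype)
    masked_m = np.zeros(num_local_experts, dtype=np.int32)
    # slot_map[t, k] = row inside the expert slab that token t's k-th choice uses.
    slot_map = np.full(recv_topk_idx.shape, -1, dtype=np.int32)
    for t in range(num_tokens):
        for k, e in enumerate(recv_topk_idx[t]):
            if e < 0:  # assumed sentinel: expert not local to this rank
                continue
            slot = int(masked_m[e])
            assert slot < expected_m, "expert slab overflow"
            x_masked[e, slot] = recv_x[t]
            slot_map[t, k] = slot
            masked_m[e] += 1
    return x_masked, masked_m, slot_map

def gather_from_masked(y_masked, recv_topk_idx, slot_map, topk_weights):
    """Reverse scatter: reduce per-expert outputs back to one row per token."""
    num_tokens, topk = recv_topk_idx.shape
    out = np.zeros((num_tokens, y_masked.shape[-1]), dtype=np.float32)
    for t in range(num_tokens):
        for k in range(topk):
            e = recv_topk_idx[t, k]
            if e >= 0:
                out[t] += topk_weights[t, k] * y_masked[e, slot_map[t, k]]
    return out
```

Both directions move every routed token once more through global memory, which is the per-layer overhead referred to above.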

In contrast, the do_expand=True output is directly compatible with contiguous GEMM (already sorted by expert, just need to extract m_indices from psum_per_expert).
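
A minimal sketch of that m_indices extraction, assuming psum_per_expert is an inclusive prefix sum of the (alignment-padded) row counts per local expert. That interpretation and the helper name are assumptions for illustration, and padding rows are simply attributed to the expert whose segment contains them.

```python
import numpy as np

def m_indices_from_psum(psum_per_expert, num_rows):
    """Map each row of the expanded [num_rows, H] tensor to its expert id."""
    psum = np.asarray(psum_per_expert, dtype=np.int64)
    rows = np.arange(num_rows)
    # Row r belongs to the first expert e with r < psum[e]; counting the
    # prefix-sum entries <= r gives exactly that index.
    return np.searchsorted(psum, rows, side="right").astype(np.int32)
```

For example, with per-expert row counts of 4, 0 and 4 (psum_per_expert = [4, 4, 8]), rows 0-3 map to expert 0 and rows 4-7 map to expert 2.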

Our Configuration

We are adapting DeepEP V2 for sglang with the following setup:

  • Prefill path: CUDA Graph disabled, do_expand=True → contiguous GEMM (zero-copy, works well)
  • Decode path: CUDA Graph enabled, do_expand=False → masked GEMM (requires format conversion kernels)

The decode path is where the format mismatch creates a problem: do_expand=False is necessary for CUDA Graph compatibility, but its output does not directly map to either DeepGEMM variant without intermediate conversion kernels.

Questions for DeepGEMM Team

Are there any plans to provide a third DeepGEMM variant that natively supports the DeepEP V2 non-expanded output format, or is there a recommended way to bridge this format mismatch efficiently?

Specifically:

  1. Is there a planned or experimental variant that accepts unsorted 2D input with explicit topk_idx and topk_weights metadata?
  2. If not, would you recommend optimizing the format conversion kernels (scatter/reverse-scatter) as the standard approach for DeepEP V2 integration?
  3. Are there any performance considerations or best practices you would suggest for this integration pattern?

Any guidance would help us determine the best path forward for production deployment.
