# Incompatibility between DeepEP V2 ElasticBuffer non-expanded output and DeepGEMM masked/contiguous grouped GEMM input format

## Problem Summary

We are integrating **DeepEP V2 ElasticBuffer** with **DeepGEMM** for sglang's MoE dispatch/combine pipeline.

DeepEP V2's `ElasticBuffer.dispatch()` API returns one of two output formats, selected by `do_expand`:
1. **`do_expand=False` (non-expanded/deduplicated)**: returns an unsorted 2D flat tensor `[num_recv_tokens, H]` plus `recv_topk_idx [N, topk]`, which lists the experts each token was routed to (`N = num_recv_tokens`)
2. **`do_expand=True` (expanded)**: returns a 2D flat tensor `[num_expanded_tokens, H]`, already sorted by expert with expert-alignment padding

However, **DeepGEMM's two grouped-GEMM variants have conflicting input format requirements**:

| GEMM Variant | Expected Input Shape | Key Requirement |
|---|---|---|
| **Masked** (`grouped_gemm_nt_f8f8bf16_masked`) | `[E_local, expected_m, K]` (3D) | Requires **3D layout** with expert dimension; padding/masking via `masked_m[e]` |
| **Contiguous** (`grouped_gemm_nt_f8f8bf16_contig`) | `[M, K]` (2D) | Requires **2D flat** layout **sorted by expert**; `m_indices[m]` tells which expert each row belongs to |
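
To make the contiguous requirement concrete, here is a minimal NumPy reference sketch (CPU only, not a DeepEP or DeepGEMM API) that turns an unsorted `[N, H]` tensor plus `recv_topk_idx` into rows sorted by expert with a matching `m_indices`. The function name `expand_sorted`, the use of `-1` for routed copies not handled on this rank, and the absence of expert-alignment padding are all assumptions made for illustration.

```python
import numpy as np

def expand_sorted(recv_x, recv_topk_idx):
    """Reference expansion: one output row per valid (token, k) pair, sorted by expert.

    recv_x:        [N, H] unsorted tokens
    recv_topk_idx: [N, topk] local expert id per routed copy; -1 marks a copy
                   that is not handled on this rank (assumed convention)
    Returns (rows [M, H], m_indices [M], src_token [M]).
    No expert-alignment padding is added in this sketch.
    """
    n, topk = recv_topk_idx.shape
    flat_expert = recv_topk_idx.reshape(-1)
    flat_token = np.repeat(np.arange(n), topk)   # source token of each (token, k) pair
    keep = flat_expert >= 0                      # drop copies not routed to this rank
    # A stable sort keeps tokens of the same expert in their arrival order.
    order = np.argsort(flat_expert[keep], kind="stable")
    m_indices = flat_expert[keep][order].astype(np.int32)
    src_token = flat_token[keep][order]
    return recv_x[src_token], m_indices, src_token
```

`src_token` records which input row each output row came from, which is what a later combine step would need to route expert outputs back to their tokens.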

## The Core Issue

**DeepEP V2 `do_expand=False` output matches neither format directly**:

- It returns **2D unsorted** `[N, H]`: rows are not sorted by expert (as contiguous GEMM requires) and there is no expert dimension (as masked GEMM requires)
- Expert-assignment info is carried in a separate `recv_topk_idx [N, topk]` tensor rather than being implicit in the layout
- **Forced format conversion**: to feed the non-expanded output to DeepGEMM masked GEMM, we must add extra kernels (scatter, reverse-scatter, slot-map construction, etc.) that reshape and reorder the data, which adds non-trivial overhead to every MoE layer
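
The conversion described above can be sketched as a NumPy reference (CPU only, written for clarity rather than speed). It is not our production kernel: the function names, the `-1` marker for copies not handled on this rank, and the name `max_m` for the padded per-expert row capacity are illustrative assumptions, and top-k weights are deliberately left out of the reverse step.

```python
import numpy as np

def scatter_to_masked(recv_x, recv_topk_idx, num_local_experts, max_m):
    """Scatter unsorted [N, H] tokens into a padded [E_local, max_m, H] layout.

    Returns (x_masked, masked_m, slot_map), where slot_map[t, k] is the flat
    slot (e * max_m + row) holding copy k of token t, or -1 if unused.
    """
    n, h = recv_x.shape
    x_masked = np.zeros((num_local_experts, max_m, h), dtype=recv_x.dtype)
    masked_m = np.zeros(num_local_experts, dtype=np.int32)   # valid rows per expert
    slot_map = np.full(recv_topk_idx.shape, -1, dtype=np.int64)
    for t in range(n):
        for k, e in enumerate(recv_topk_idx[t]):
            if e < 0:            # copy not handled on this rank (assumed convention)
                continue
            row = masked_m[e]
            assert row < max_m, "expert block overflow"
            x_masked[e, row] = recv_x[t]
            slot_map[t, k] = e * max_m + row
            masked_m[e] += 1
    return x_masked, masked_m, slot_map

def gather_from_masked(y_masked, slot_map):
    """Reverse scatter: sum each token's expert outputs back into [N, H]."""
    num_experts, max_m, h = y_masked.shape
    flat = y_masked.reshape(num_experts * max_m, h)
    valid = slot_map >= 0
    picked = flat[np.where(valid, slot_map, 0)]       # [N, topk, H]
    return (picked * valid[..., None]).sum(axis=1)    # zero out unused copies
```

On GPU, each of the three steps (slot-map construction, scatter, reverse scatter) becomes its own kernel or fused kernel per MoE layer, which is the overhead this issue is about.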

In contrast, the **`do_expand=True` output is directly compatible** with contiguous GEMM: it is already sorted by expert, so the only extra step is deriving `m_indices` from `psum_per_expert`.
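
A minimal sketch of that `m_indices` derivation is shown below. It assumes `psum_per_expert` is an inclusive prefix sum of the (padded) row count per local expert, and that padding rows may simply carry their expert's id. Both points are assumptions to verify against the DeepEP and DeepGEMM versions in use, and the helper name is hypothetical.

```python
import numpy as np

def m_indices_from_psum(psum_per_expert):
    """Build the per-row expert id for a layout already sorted by expert.

    psum_per_expert: [E_local] inclusive prefix sum of rows per expert
                     (assumed semantics; padding rows are counted in)
    Returns m_indices of shape [psum_per_expert[-1]].
    """
    psum = np.asarray(psum_per_expert, dtype=np.int64)
    counts = np.diff(psum, prepend=0)             # rows owned by each expert
    experts = np.arange(len(psum), dtype=np.int32)
    return np.repeat(experts, counts)             # expert e repeated counts[e] times
```

For example, a prefix sum of `[128, 128, 384]` describes three experts holding 128, 0, and 256 rows respectively.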

## Our Configuration

We are adapting DeepEP V2 for sglang with the following setup:
- **Prefill path**: CUDA Graph disabled, `do_expand=True` → contiguous GEMM (zero-copy, works well)
- **Decode path**: CUDA Graph enabled, `do_expand=False` → masked GEMM (requires format conversion kernels)

The decode path is where the format mismatch creates a problem: `do_expand=False` is necessary for CUDA Graph compatibility, but its output does not directly map to either DeepGEMM variant without intermediate conversion kernels.

## Questions for DeepGEMM Team

Are there **any plans to provide a third DeepGEMM variant** that natively supports the DeepEP V2 non-expanded output format, or are there **recommended solutions** for bridging this format mismatch efficiently?

Specifically:
1. Is there a **planned or experimental variant** that accepts unsorted 2D input with explicit `topk_idx` and `topk_weights` metadata?
2. If not, would you recommend **optimizing the format conversion kernels** (scatter/reverse-scatter) as the standard approach for DeepEP V2 integration?
3. Are there any **performance considerations or best practices** you would suggest for this integration pattern?

Any guidance would help us determine the best path forward for production deployment.
