[None][perf] Add GreenContext SM-partitioned overlap for MoE DenseGEMM FC1+Router by JacobHu-NV · Pull Request #12802 · NVIDIA/TensorRT-LLM

JacobHu-NV · 2026-04-07T08:32:51Z

Summary

- Add green_context.py: CUDA Driver API helpers (create_sm_only_gc_streams, create_wq_isolated_gc_streams, get_current_stream_gc_sm_count) that create cuGreenCtxStreamCreate-bound streams directly via the Driver API. Unlike streams created inside GreenContext.set_context(), these survive CUDA Graph capture/replay with their SM partition intact.
- Add DenseGEMMGCSMRunner (TunableRunner) in fused_moe_densegemm.py that sweeps FC1 SM count candidates via the AutoTuner framework to find the optimal SM split for FC1/Router overlap.
- Extend DenseGEMMFusedMoE with a _gc_stream_pool pre-created at init time for all candidate SM splits, enabling CUDA-graph-safe autotuning without re-creating GreenContext streams at runtime.
- Add sm_budget parameter to CuteDSLNVFP4DenseGemmSwigluRunner in cute_dsl_custom_ops.py (excluded from unique_id so inner tuning is shared across GC splits); register new custom ops cute_dsl_nvfp4_dynamic_dense_gemm_swiglu_blackwell, cute_dsl_bf16_bmm_blackwell, and cute_dsl_bf16_gemm_blackwell.
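The pool-then-sweep pattern described in the summary can be sketched in plain Python. This is an illustrative sketch only, not the code in this PR: `build_stream_pool`, `pick_best_split`, the stream factory, and the timing callback are all hypothetical names, and the latencies are made-up placeholder values. The point it shows is that every candidate's streams are created once up front, so the tuning loop only looks streams up and never creates them.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical stand-in for a (fc1_stream, router_stream) pair.
StreamPair = Tuple[object, object]


def build_stream_pool(
    candidates: Iterable[int],
    make_streams: Callable[[int], StreamPair],
) -> Dict[int, StreamPair]:
    """Create one stream pair per candidate FC1 SM count, once, up front."""
    return {fc1_sms: make_streams(fc1_sms) for fc1_sms in candidates}


def pick_best_split(
    pool: Dict[int, StreamPair],
    time_overlap: Callable[[StreamPair], float],
) -> int:
    """Time every pre-created pair and return the fastest FC1 SM count."""
    timings = {fc1_sms: time_overlap(pair) for fc1_sms, pair in pool.items()}
    return min(timings, key=timings.get)


# Usage with fake streams and a fake timer (no GPU required).
created = []


def fake_make_streams(fc1_sms: int) -> StreamPair:
    # Records each creation so we can check it happens exactly once per candidate.
    created.append(fc1_sms)
    return (f"fc1@{fc1_sms}", f"router@{fc1_sms}")


# Placeholder latencies in microseconds; not measurements.
fake_latency_us = {96: 41.0, 112: 37.5, 128: 39.0}

pool = build_stream_pool([96, 112, 128], fake_make_streams)
best = pick_best_split(
    pool, lambda pair: fake_latency_us[int(pair[0].split("@")[1])]
)
print(best)  # 112
```

In a real integration the factory would wrap the Driver API helpers and the timer would launch the overlapped FC1 and Router kernels; both are left abstract here because their signatures are not shown in this description.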

Motivation

The DenseGEMM MoE path overlaps FC1 and Router GEMM to hide router latency. Previous
attempts used soft sm_budget hints (max_active_clusters), which do not prevent SM
contention at the hardware level. CUDA GreenContext provides true hardware SM isolation:
FC1 and Router CTAs are dispatched to disjoint SM partitions with no interference.
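To make "disjoint SM partitions" concrete, here is a minimal sketch of how candidate splits could be enumerated so that the FC1 and Router SM counts always sum to the device total and never overlap. The function name, the `granularity` parameter, and the `min_router_sms` floor are assumptions for illustration; the actual candidate list and any hardware alignment rules used by this PR are not stated in the description.

```python
from typing import List, Tuple


def disjoint_sm_splits(
    total_sms: int, granularity: int, min_router_sms: int
) -> List[Tuple[int, int]]:
    """Return (fc1_sms, router_sms) pairs that partition total_sms exactly."""
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    splits = []
    # FC1 gets a multiple of `granularity`; the Router gets whatever remains.
    for fc1_sms in range(granularity, total_sms, granularity):
        router_sms = total_sms - fc1_sms
        if router_sms >= min_router_sms:
            splits.append((fc1_sms, router_sms))
    return splits


# Toy device with 32 SMs, steps of 8, and at least 8 SMs kept for the Router.
print(disjoint_sm_splits(32, 8, 8))  # [(8, 24), (16, 16), (24, 8)]
```

Because each pair sums to `total_sms`, giving more SMs to FC1 necessarily takes them from the Router, which is the trade-off the SM-count sweep is meant to resolve.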

…M FC1+Router

Introduce hardware-level SM isolation via CUDA GreenContext so that the FC1 GEMM and Router GEMM can execute truly in parallel without SM contention in the DenseGEMM MoE path.

Key changes:
- green_context.py: CUDA Driver API helpers (create_sm_only_gc_streams, create_wq_isolated_gc_streams, get_current_stream_gc_sm_count) that bypass PyTorch's GreenContext API to create cuGreenCtxStreamCreate-bound streams. These streams survive CUDA Graph capture/replay with their SM partition intact, unlike streams created inside GreenContext.set_context().
- fused_moe_densegemm.py: Add DenseGEMMGCSMRunner (TunableRunner) that sweeps FC1 SM count candidates via the AutoTuner framework to find the optimal SM split for FC1/Router overlap. Extend DenseGEMMFusedMoE with a _gc_stream_pool pre-created at init time for all candidate SM splits, enabling CUDA-graph-safe autotuning.
- cute_dsl_custom_ops.py: Add sm_budget parameter to CuteDSLNVFP4DenseGemmSwigluRunner (excluded from unique_id so inner tuning is shared across GC splits); register new custom ops cute_dsl_nvfp4_dynamic_dense_gemm_swiglu_blackwell, cute_dsl_bf16_bmm_blackwell, and cute_dsl_bf16_gemm_blackwell.

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>

github-actions bot assigned JacobHu-NV Apr 7, 2026

JacobHu-NV force-pushed the pr/densegemm-as-moe-overlap branch from ce51764 to e9f45af Compare April 7, 2026 08:41

JacobHu-NV wants to merge 1 commit into NVIDIA:main from JacobHu-NV:pr/densegemm-as-moe-overlap
