[None][perf] Add GreenContext SM-partitioned overlap for MoE DenseGEMM FC1+Router #12802

Draft
JacobHu-NV wants to merge 1 commit into NVIDIA:main from JacobHu-NV:pr/densegemm-as-moe-overlap


@JacobHu-NV

Summary

  • Add green_context.py: CUDA Driver API helpers (create_sm_only_gc_streams,
    create_wq_isolated_gc_streams, get_current_stream_gc_sm_count) that create
    cuGreenCtxStreamCreate-bound streams directly via the Driver API. Unlike streams
    created inside GreenContext.set_context(), these survive CUDA Graph capture/replay
    with their SM partition intact.
  • Add DenseGEMMGCSMRunner (TunableRunner) in fused_moe_densegemm.py that sweeps
    FC1 SM count candidates via the AutoTuner framework to find the optimal SM split for
    FC1/Router overlap.
  • Extend DenseGEMMFusedMoE with a _gc_stream_pool pre-created at init time for all
    candidate SM splits, enabling CUDA-graph-safe autotuning without re-creating
    GreenContext streams at runtime.
  • Add sm_budget parameter to CuteDSLNVFP4DenseGemmSwigluRunner in
    cute_dsl_custom_ops.py (excluded from unique_id so inner tuning is shared across
    GC splits); register new custom ops cute_dsl_nvfp4_dynamic_dense_gemm_swiglu_blackwell,
    cute_dsl_bf16_bmm_blackwell, and cute_dsl_bf16_gemm_blackwell.
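The split sweep and the pre-created pool described above can be sketched in plain Python. This is an illustration only, not the code in this PR: `StreamPair`, `fc1_sm_candidates`, `build_stream_pool`, `pick_best_split`, and `toy_latency` are hypothetical names, the 128-SM total and the step of 16 are arbitrary, and a toy cost model stands in for real kernel timing. In the PR itself the pool holds GreenContext-bound CUDA streams and the selection is done by the AutoTuner framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class StreamPair:
    """Hypothetical stand-in for a (FC1 stream, Router stream) pair.

    A real entry would hold two CUDA streams bound to disjoint SM partitions;
    here only the SM counts of the two partitions are recorded.
    """
    fc1_sms: int
    router_sms: int


def fc1_sm_candidates(total_sms: int, min_router_sms: int, step: int) -> List[int]:
    """Enumerate FC1 SM counts, always leaving min_router_sms for the router."""
    return list(range(step, total_sms - min_router_sms + 1, step))


def build_stream_pool(total_sms: int, candidates: List[int]) -> Dict[int, StreamPair]:
    """Create one entry per candidate up front, keyed by FC1 SM count.

    Building every entry once at init time means the tuning loop only does
    dictionary lookups and never creates new resources while it runs.
    """
    return {n: StreamPair(fc1_sms=n, router_sms=total_sms - n) for n in candidates}


def pick_best_split(pool: Dict[int, StreamPair],
                    measure: Callable[[StreamPair], float]) -> int:
    """Return the FC1 SM count whose measured latency is lowest."""
    return min(pool, key=lambda n: measure(pool[n]))


def toy_latency(pair: StreamPair) -> float:
    """Assumed cost model: both kernels scale inversely with their SM count,
    and the overlapped latency is that of the slower kernel."""
    return max(960.0 / pair.fc1_sms, 320.0 / pair.router_sms)


pool = build_stream_pool(128, fc1_sm_candidates(128, min_router_sms=16, step=16))
best = pick_best_split(pool, toy_latency)
print(best, pool[best])
```

Keying the pool by FC1 SM count keeps the lookup trivial once a split has been chosen, and the same pre-built entries can be reused for every tuning trial.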

Motivation

The DenseGEMM MoE path overlaps the FC1 and Router GEMMs to hide router latency. Previous
attempts used soft sm_budget hints (max_active_clusters), which do not prevent SM
contention at the hardware level. CUDA GreenContext provides hardware-level SM isolation:
FC1 and Router CTAs are dispatched to disjoint SM partitions, so neither kernel competes
with the other for SMs.

…M FC1+Router

Introduce hardware-level SM isolation via CUDA GreenContext so that the
FC1 GEMM and Router GEMM can execute truly in parallel without SM
contention in the DenseGEMM MoE path.

Key changes:
- green_context.py: CUDA Driver API helpers (create_sm_only_gc_streams,
  create_wq_isolated_gc_streams, get_current_stream_gc_sm_count) that
  bypass PyTorch's GreenContext API to create cuGreenCtxStreamCreate-
  bound streams.  These streams survive CUDA Graph capture/replay with
  their SM partition intact, unlike streams created inside
  GreenContext.set_context().
- fused_moe_densegemm.py: Add DenseGEMMGCSMRunner (TunableRunner) that
  sweeps FC1 SM count candidates via the AutoTuner framework to find the
  optimal SM split for FC1/Router overlap.  Extend DenseGEMMFusedMoE with
  a _gc_stream_pool pre-created at init time for all candidate SM splits,
  enabling CUDA-graph-safe autotuning.
- cute_dsl_custom_ops.py: Add sm_budget parameter to
  CuteDSLNVFP4DenseGemmSwigluRunner (excluded from unique_id so inner
  tuning is shared across GC splits); register new custom ops
  cute_dsl_nvfp4_dynamic_dense_gemm_swiglu_blackwell,
  cute_dsl_bf16_bmm_blackwell, and cute_dsl_bf16_gemm_blackwell.

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
JacobHu-NV force-pushed the pr/densegemm-as-moe-overlap branch from ce51764 to e9f45af on April 7, 2026 08:41