
{experimental stack not ready for review}[cute] add fp8 dtype mapping and tcgen05 MmaF8F6F4Op capability probe #2423

Draft

yushangdi wants to merge 2 commits into gh/yushangdi/2/base from gh/yushangdi/2/head

Conversation

@yushangdi

@yushangdi yushangdi commented May 14, 2026

Stack from ghstack (oldest at bottom):

Two pieces of plumbing for upcoming fp8 cute-backend support, kept together
because each is too small to merit a standalone review.

  1. helion/runtime/__init__.py: extend _torch_dtype_to_cutlass with
    torch.float8_e4m3fn -> cutlass.Float8E4M3FN and torch.float8_e5m2 ->
    cutlass.Float8E5M2. Lets fp8 tensors get past the cute schema builder
    without hitting BackendUnsupported at the type-mapping layer.

  2. helion/_compiler/cute/mma_support.py: add a tcgen05 fp8/f6/f4 capability
    probe to CuteMmaSupport, mirroring the existing f16/bf16 probe. Uses
    cutlass.cute.nvgpu.tcgen05.MmaF8F6F4Op (the older MmaFP8Op is deprecated
    in cutlass-dsl in favor of this one). The probe is gracefully False with
    a diagnostic error string when cutlass/cute is missing or the device
    doesn't support the atom.

    tcgen05 is surfaced in supported_impls when either the f16/bf16 or
    f8/f6/f4 probe succeeds, so a Blackwell-only fp8 build still reports
    the tcgen05 impl as available.
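The dtype-mapping extension in point 1 can be sketched roughly as follows. This is not the PR's code: dtype *names* stand in for torch.dtype objects and plain strings for the cutlass-dsl types so the example runs without torch or cutlass installed; only the two new fp8 entries and the BackendUnsupported behavior mirror the description above.

```python
class BackendUnsupported(Exception):
    """Stand-in for helion's BackendUnsupported error."""


_torch_dtype_to_cutlass = {
    "float16": "cutlass.Float16",    # pre-existing entries (illustrative)
    "bfloat16": "cutlass.BFloat16",
    # New entries added by this PR:
    "float8_e4m3fn": "cutlass.Float8E4M3FN",
    "float8_e5m2": "cutlass.Float8E5M2",
}


def to_cutlass_dtype(name: str) -> str:
    """Map a torch dtype (identified by name here) to its cutlass-dsl type,
    raising BackendUnsupported at the type-mapping layer when unmapped."""
    try:
        return _torch_dtype_to_cutlass[name]
    except KeyError:
        raise BackendUnsupported(f"no cutlass mapping for {name}") from None
```

With the two fp8 entries present, `to_cutlass_dtype("float8_e4m3fn")` resolves instead of raising, which is what lets fp8 tensors get past the cute schema builder.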

No MMA codegen, lowering, or autotune behavior changes. fp8 cute kernels
will still fail downstream until the codegen PR wires fp8 through the
_has_mma_operands gate and MMA atom selection in cute_mma.py.
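The probe-and-surface pattern from point 2 can be sketched as below. Only `MmaF8F6F4Op`, the tcgen05 module path, and the either-probe-suffices rule for supported_impls come from the description above; every other name and the exact check structure are hypothetical.

```python
def probe_tcgen05_f8f6f4() -> tuple[bool, str]:
    """Return (ok, reason): gracefully False with a diagnostic string
    when cutlass/cute is missing or the atom is unavailable."""
    try:
        from cutlass.cute.nvgpu import tcgen05  # absent on most hosts
    except ImportError as exc:
        return False, f"cutlass/cute not importable: {exc}"
    if not hasattr(tcgen05, "MmaF8F6F4Op"):
        return False, "tcgen05.MmaF8F6F4Op atom not available"
    # A real probe would additionally check that the current device
    # supports the atom; that check is elided here.
    return True, ""


def supported_impls(f16_bf16_ok: bool, f8f6f4_ok: bool) -> list[str]:
    # tcgen05 is surfaced when *either* probe succeeds, so a
    # Blackwell-only fp8 build still reports the impl as available.
    return ["tcgen05"] if (f16_bf16_ok or f8f6f4_ok) else []
```

The key design point is that each probe degrades to a `(False, reason)` pair rather than raising, so callers can both gate on availability and report why the atom is missing.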

Authored with Claude.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 14, 2026
@yushangdi yushangdi marked this pull request as draft May 14, 2026 00:31
@yushangdi yushangdi changed the title [cute] add fp8 dtype mapping and tcgen05 MmaF8F6F4Op capability probe {experimental stack }[cute] add fp8 dtype mapping and tcgen05 MmaF8F6F4Op capability probe May 14, 2026
yushangdi pushed a commit that referenced this pull request May 14, 2026

ghstack-source-id: 2d246b0
Pull Request resolved: #2423
@jansel
Contributor

jansel commented May 14, 2026

cc @oulgen

@yushangdi yushangdi changed the title {experimental stack }[cute] add fp8 dtype mapping and tcgen05 MmaF8F6F4Op capability probe {experimental stack not ready for review}[cute] add fp8 dtype mapping and tcgen05 MmaF8F6F4Op capability probe May 14, 2026