{experimental stack not ready for review}[cute] add fp8 dtype mapping and tcgen05 MmaF8F6F4Op capability probe by yushangdi · Pull Request #2423 · pytorch/helion

yushangdi · 2026-05-14T00:29:56Z

Stack from ghstack (oldest at bottom):

-> {experimental stack not ready for review}[cute] add fp8 dtype mapping and tcgen05 MmaF8F6F4Op capability probe #2423

Two pieces of plumbing for upcoming fp8 cute-backend support, kept together
because each is too small to merit a standalone review.

1. helion/runtime/__init__.py: extend _torch_dtype_to_cutlass with
   torch.float8_e4m3fn -> cutlass.Float8E4M3FN and torch.float8_e5m2 ->
   cutlass.Float8E5M2. Lets fp8 tensors get past the cute schema builder
   without hitting BackendUnsupported at the type-mapping layer.
2. helion/_compiler/cute/mma_support.py: add a tcgen05 fp8/f6/f4 capability
   probe to CuteMmaSupport, mirroring the existing f16/bf16 probe. Uses
   cutlass.cute.nvgpu.tcgen05.MmaF8F6F4Op (the older MmaFP8Op is deprecated
   in cutlass-dsl in favor of this one). The probe degrades gracefully to
   False with a diagnostic error string when cutlass/cute is missing or the
   device doesn't support the atom.
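
To make the two items above concrete, here is a minimal sketch of the pattern, not the code in this PR. `ProbeResult`, `probe_attr`, `fp8_dtype_mapping`, and `FP8_DTYPE_NAMES` are hypothetical names introduced for illustration; only the torch/cutlass attribute names and the `cutlass.cute.nvgpu.tcgen05.MmaF8F6F4Op` path come from the description. The sketch checks import-level availability only and does not attempt the device-support check that the real probe performs.

```python
from __future__ import annotations

import importlib
from dataclasses import dataclass

# torch attribute name -> cutlass attribute name, as listed in the description.
FP8_DTYPE_NAMES = {
    "float8_e4m3fn": "Float8E4M3FN",
    "float8_e5m2": "Float8E5M2",
}


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical result type: a flag plus a diagnostic string on failure."""

    supported: bool
    error: str | None = None


def probe_attr(module_name: str, attr: str) -> ProbeResult:
    """Report whether `module_name` imports and exposes `attr`, never raising."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # e.g. cutlass/cute is not installed
        return ProbeResult(False, f"{type(exc).__name__}: {exc}")
    if not hasattr(module, attr):
        return ProbeResult(False, f"{module_name} has no attribute {attr!r}")
    return ProbeResult(True)


def fp8_dtype_mapping() -> dict:
    """Resolve the name pairs to real objects, or {} if either library is absent."""
    try:
        torch = importlib.import_module("torch")
        cutlass = importlib.import_module("cutlass")
        return {
            getattr(torch, t): getattr(cutlass, c)
            for t, c in FP8_DTYPE_NAMES.items()
        }
    except Exception:
        return {}


if __name__ == "__main__":
    print(probe_attr("cutlass.cute.nvgpu.tcgen05", "MmaF8F6F4Op"))
    print(fp8_dtype_mapping())
```

Returning a result object instead of raising keeps the "False plus diagnostic string" contract in one place, so callers can log the reason without wrapping every probe in their own try/except.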

tcgen05 is surfaced in supported_impls when either the f16/bf16 or
f8/f6/f4 probe succeeds, so a Blackwell-only fp8 build still reports
the tcgen05 impl as available.
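
The either-probe rule can be sketched as follows. This is an illustration under assumptions, not the logic in CuteMmaSupport: the function name, its parameters, and the "warp" impl name used in the usage lines are hypothetical; only the "tcgen05" impl name and the OR of the two probes come from the description.

```python
def supported_impls(base_impls, f16bf16_ok, f8f6f4_ok):
    """Append "tcgen05" when at least one of the two tcgen05 probes succeeded."""
    impls = list(base_impls)
    if (f16bf16_ok or f8f6f4_ok) and "tcgen05" not in impls:
        impls.append("tcgen05")
    return tuple(impls)


if __name__ == "__main__":
    # "warp" is a placeholder impl name used only for this example.
    print(supported_impls(("warp",), f16bf16_ok=False, f8f6f4_ok=True))
    print(supported_impls(("warp",), f16bf16_ok=False, f8f6f4_ok=False))
```

With this shape, a build where only the f8/f6/f4 probe passes still lists tcgen05, which matches the fp8-only case described above.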

No MMA codegen, lowering, or autotune behavior changes. fp8 cute kernels
will still fail downstream until the codegen PR wires fp8 through the
_has_mma_operands gate and MMA atom selection in cute_mma.py.

Authored with Claude.

ghstack-source-id: 2d246b0
Pull Request resolved: #2423

jansel · 2026-05-14T03:20:58Z

cc @oulgen

meta-cla Bot added the CLA Signed label May 14, 2026

yushangdi marked this pull request as draft May 14, 2026 00:31

yushangdi mentioned this pull request May 14, 2026

[cute] add fp8 dtypes to _torch_dtype_to_cutlass mapping #2422

Closed

yushangdi changed the title ~~[cute] add fp8 dtype mapping and tcgen05 MmaF8F6F4Op capability probe~~ {experimental stack }[cute] add fp8 dtype mapping and tcgen05 MmaF8F6F4Op capability probe May 14, 2026

yushangdi changed the title ~~{experimental stack }[cute] add fp8 dtype mapping and tcgen05 MmaF8F6F4Op capability probe~~ {experimental stack not ready for review}[cute] add fp8 dtype mapping and tcgen05 MmaF8F6F4Op capability probe May 14, 2026

yushangdi wants to merge 2 commits into gh/yushangdi/2/base from gh/yushangdi/2/head
