{experimental stack not ready for review}[cute] add fp8 dtype mapping and tcgen05 MmaF8F6F4Op capability probe #2423
Draft
yushangdi wants to merge 2 commits into
yushangdi pushed a commit that referenced this pull request on May 14, 2026 (ghstack-source-id: 2d246b0, Pull Request resolved: #2423)
cc @oulgen
Stack from ghstack (oldest at bottom):
Two pieces of plumbing for upcoming fp8 cute-backend support, kept together
because each is too small to merit a standalone review.

1. helion/runtime/__init__.py: extend _torch_dtype_to_cutlass with
   torch.float8_e4m3fn -> cutlass.Float8E4M3FN and torch.float8_e5m2 ->
   cutlass.Float8E5M2. This lets fp8 tensors get past the cute schema builder
   without hitting BackendUnsupported at the type-mapping layer.
2. helion/_compiler/cute/mma_support.py: add a tcgen05 fp8/f6/f4 capability
   probe to CuteMmaSupport, mirroring the existing f16/bf16 probe. It uses
   cutlass.cute.nvgpu.tcgen05.MmaF8F6F4Op (the older MmaFP8Op is deprecated
   in cutlass-dsl in favor of this one). The probe fails gracefully, returning
   False with a diagnostic error string when cutlass/cute is missing or the
   device doesn't support the atom.

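For reviewers who want a picture of item 1, here is a rough sketch of the lookup-with-fallback pattern. It is not the code in helion/runtime/__init__.py: the dtype and cutlass types are spelled as strings so the snippet runs without torch or cutlass installed, the non-fp8 entries are assumed for illustration, and the BackendUnsupported class and the torch_dtype_to_cutlass function name are stand-ins.

```python
class BackendUnsupported(Exception):
    """Stand-in for the error named in the description (hypothetical definition)."""


# Keys and values are strings here only so the sketch has no dependencies;
# the real table would hold torch.dtype keys and cutlass type values.
_TORCH_DTYPE_TO_CUTLASS = {
    # Assumed pre-existing entries, for illustration only.
    "torch.float16": "cutlass.Float16",
    "torch.bfloat16": "cutlass.BFloat16",
    "torch.float32": "cutlass.Float32",
    # The two mappings this PR describes.
    "torch.float8_e4m3fn": "cutlass.Float8E4M3FN",
    "torch.float8_e5m2": "cutlass.Float8E5M2",
}


def torch_dtype_to_cutlass(dtype_name):
    """Return the cutlass type name, or raise BackendUnsupported if unmapped."""
    try:
        return _TORCH_DTYPE_TO_CUTLASS[dtype_name]
    except KeyError:
        raise BackendUnsupported(
            "no cutlass mapping for " + dtype_name
        ) from None
```

Without the two fp8 rows, an fp8 tensor would take the KeyError branch, which is the type-mapping failure the description says this change removes.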
tcgen05 is surfaced in supported_impls when either the f16/bf16 or
f8/f6/f4 probe succeeds, so a Blackwell-only fp8 build still reports
the tcgen05 impl as available.
No MMA codegen, lowering, or autotune behavior changes. fp8 cute kernels
will still fail downstream until the codegen PR wires fp8 through the
_has_mma_operands gate and MMA atom selection in cute_mma.py.
Authored with Claude.