Fix rope-bwd cudagraph crash and tighten gdn_fwd_h accuracy gate by choijon5 · Pull Request #2467 · pytorch/helion

choijon5 · 2026-05-16T15:46:17Z

Two perf-dashboard fixes

rope-bwd (B200 + H100, all 8 shapes)
Was failing every shape with:

CUDA error: operation failed due to a previous error during capture
RuntimeError: Offset increment outside graph capture encountered unexpectedly.

This has the same cause as the existing grouped_gemm / layer_norm-bwd / rms_norm-bwd configs: tritonbench's torch_compile backward recompiles during CUDA graph capture. Fix: disable CUDA graph (CG) capture for rope-bwd.
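
As a rough illustration of the per-kernel opt-out described above, here is a minimal sketch. The set name `CUDAGRAPH_DISABLED_KERNELS` and the helper `use_cudagraph` are hypothetical and are not taken from the Helion benchmark runner; only the kernel names come from this PR.

```python
# Hypothetical sketch: the names below are illustrative, not the actual
# Helion benchmark-runner API. Only the kernel names come from the PR text.
CUDAGRAPH_DISABLED_KERNELS = frozenset(
    {
        "grouped_gemm",
        "layer_norm-bwd",
        "rms_norm-bwd",
        "rope-bwd",  # added by this PR
    }
)


def use_cudagraph(kernel_name: str) -> bool:
    """Return False for kernels whose torch_compile baseline may recompile
    during CUDA graph capture, True for everything else."""
    return kernel_name not in CUDAGRAPH_DISABLED_KERNELS
```

Keeping the opt-outs in one place makes it easy to see which benchmarks run without graph capture and why.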

gdn_fwd_h (B200 + H100, batch=16 shapes)
At batch=16, bf16 reduction-order drift in the kernels pushes results outside the strict tolerance, so both Triton and Helion started failing the accuracy check.

Fix: for batch>=16, apply rtol=0.5, atol=2.0, the loose window the dashboard previously used.
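
The shape-conditional tolerance can be sketched as follows. The function names are hypothetical, and returning `None` to mean "keep the benchmark's strict default" is an assumption, since the PR does not state the default values. The closeness check uses the standard `|actual - expected| <= atol + rtol * |expected|` rule that `torch.allclose` documents.

```python
from typing import Optional, Tuple


def gdn_fwd_h_tolerances(batch: int) -> Optional[Tuple[float, float]]:
    """Return (rtol, atol) for gdn_fwd_h, or None to keep the strict default.

    Hypothetical helper: only the batch>=16 threshold and the
    rtol=0.5 / atol=2.0 values come from the PR description.
    """
    if batch >= 16:
        return (0.5, 2.0)
    return None


def within_tolerance(actual: float, expected: float, rtol: float, atol: float) -> bool:
    """Elementwise closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)
```

For example, with the widened window a value of 10.0 against a reference of 8.0 passes, because the allowed error is 2.0 + 0.5 * 8.0 = 6.0.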

Two perf-dashboard fixes:

1. rope-bwd was failing every shape on B200 + H100 with "Offset increment outside graph capture": tritonbench's torch_compile rope-bwd recompiles during CUDA graph capture. Disable cudagraph for rope-bwd, mirroring the existing grouped_gemm / layer_norm-bwd / rms_norm-bwd configs.

2. gdn_fwd_h accuracy was failing for batch=16 on B200 (seq>=2048) and H100 (all seqs) for *both* Triton and Helion. PR pytorch#2455 replaced the rtol=0.5/atol=2.0 tolerance with precision=tf32, but tf32 doesn't address bf16-accumulation drift in the kernels. Restore a shape-conditional accuracy patch: keep the strict default for batch=1 (always passes), and only widen to rtol=0.5/atol=2.0 for batch>=16 where the 16x longer reductions diverge from eager.

Helion autotune at full effort consistently fails on the (64, 64, 1, 8192, 256, 64, 128) chunk_state shape on H100 with "No working config found": all 100 initial-population configs error out at runtime in ~56s. The same kernel + shape works fine on a freshly-cleared GPU (manual configs all run; full-effort autotune succeeds well into generation 4 with ~98% of sampled configs OK).

The CI failure correlates with carryover from the 5 prior chunk_state shapes (~1.5h of autotune leaves behind cached buffers / JIT state / tritonbench-retained tensors), pushing the largest 4.3 GB-input shape past whatever resource limit the autotune subprocesses are hitting.

Match the existing chunk_scan gating: skip the shape on devices with < 100 GB free. B200 (192 GB) keeps running it; H100 (96 GB total) skips and the rest of the chunk_state shapes still report. Rename the constants to MAMBA2_LARGE_SHAPE since they now apply to both chunk_scan and chunk_state.
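
A minimal sketch of the free-memory gate described above. The constant and function names are hypothetical (the commit only says the constants were renamed to MAMBA2_LARGE_SHAPE), and treating 1 GB as 1024**3 bytes is an assumption. On a real device the free byte count could come from `torch.cuda.mem_get_info()`, which returns `(free, total)` in bytes; it is passed in here so the logic runs without a GPU.

```python
from typing import Tuple

GIB = 1024**3  # assumption: the commit does not say whether GB means 10**9 or 2**30

# Hypothetical names; the shape and the 100 GB threshold come from the commit message.
MAMBA2_LARGE_SHAPE: Tuple[int, ...] = (64, 64, 1, 8192, 256, 64, 128)
MAMBA2_LARGE_SHAPE_MIN_FREE_BYTES = 100 * GIB


def should_skip_shape(shape: Tuple[int, ...], free_bytes: int) -> bool:
    """Skip only the large shape, and only when free device memory is below the threshold."""
    return shape == MAMBA2_LARGE_SHAPE and free_bytes < MAMBA2_LARGE_SHAPE_MIN_FREE_BYTES
```

With these values, a 96 GB H100 skips the large shape even when fully free, a B200 with most of its 192 GB free keeps running it, and every smaller shape runs on both.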

meta-cla Bot added the CLA Signed label May 16, 2026

choijon5 requested review from jansel, karthickai and oulgen May 16, 2026 15:56

jansel approved these changes May 16, 2026

choijon5 merged commit 2e41857 into pytorch:main May 17, 2026
32 of 35 checks passed

choijon5 merged 2 commits into pytorch:main from choijon5:fix-perf-ci-failures
