[Pallas] Fix out-of-bound DMA caused by tiles from non-zero begins by AmesingFlank · Pull Request #2213 · pytorch/helion

AmesingFlank · 2026-05-03T21:53:47Z

Stacked PRs:

[Pallas] Fix out-of-bound DMA caused by tiles from non-zero begins

When pl.ds(offset, block_size) reads into a tensor, the last block
can overshoot the tensor boundary. The previous padding formula
(-shape) % block_size only accounted for rounding up to a block
boundary, but tiles from hl.tile(start, end) with non-zero start
can begin at arbitrary offsets, requiring additional headroom.

Pass an extra_pad value through _record_pad_info →
_compute_pad_info → _ds_pad_dims so the launcher pads by
(-shape) % block_size + extra_pad. The extra_pad is:

- 0 when the loop starts at offset 0
- begin % block_size for a provably constant begin
- block_size - 1 for a data-dependent begin

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
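
The padding rule described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Helion implementation: the names `extra_pad`, `host_pad`, and `last_read_end` are hypothetical, a data-dependent begin is modeled as `None`, and tiles are assumed to start at `begin`, `begin + block_size`, and so on.

```python
from typing import Optional


def extra_pad(begin: Optional[int], block_size: int) -> int:
    """Headroom for a tile loop that may not start on a block boundary.

    `begin` is the loop start when it is a known constant, or None when it
    is data-dependent (a modeling assumption for this sketch).
    """
    if begin is None:
        return block_size - 1  # worst case: any misalignment is possible
    return begin % block_size  # evaluates to 0 when the loop starts at 0


def host_pad(shape: int, block_size: int, begin: Optional[int]) -> int:
    """Round up to a block boundary, then add the extra headroom."""
    return (-shape) % block_size + extra_pad(begin, block_size)


def last_read_end(begin: int, end: int, block_size: int) -> int:
    """One past the last element read by tiles of width block_size."""
    starts = range(begin, end, block_size)
    return starts[-1] + block_size


# Brute-force check on small sizes: the padded length always covers the
# last tile, for both the constant-begin and data-dependent cases.
for shape in range(1, 40):
    for block_size in range(1, 9):
        for begin in range(shape):
            needed = last_read_end(begin, shape, block_size)
            assert needed <= shape + host_pad(shape, block_size, begin)
            assert needed <= shape + host_pad(shape, block_size, None)
```

With `shape=128`, `block_size=16`, and `begin=3`, the old formula `(-shape) % block_size` gives 0 elements of padding even though the last tile reads up to index 130, while `host_pad` gives 3.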

When `pl.ds(offset, block_size)` reads into a tensor, the last block can overshoot the tensor boundary by up to `block_size - 1` elements. The previous formula `(-shape) % block_size` only handled the case where reads start at offset 0, but tiles from `hl.tile(start, end)` with non-zero `start` can begin at arbitrary offsets.

Simplify host-side padding to always pad by `block_size - 1`, which is the worst-case overshoot for any `pl.ds()` read regardless of starting offset.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
stack-info: PR: #2213, branch: AmesingFlank/stack/35

norx1991 · 2026-05-04T17:36:28Z

+            out = torch.zeros([B], dtype=data.dtype, device=data.device)
+            for seg in hl.grid(B):
+                acc = hl.zeros([1], dtype=data.dtype)
+                for tile in hl.tile(3, 128 + 3):


I am a bit surprised by this test case: does this create a pl.ds expr?
Also, if we are explicitly reading out-of-bounds data in the Helion kernel, shouldn't we throw an error? This test can pass because 3 is small.

Ah, I agree with you that hl.tile(3, 128 + 3) is a confusing test, because it seems to explicitly request OOB access. I updated the PR to use hl.tile(3, 128, block_size=16) instead. Here the final tile still goes out of bounds, but it is up to the Helion compiler to apply the correct padding/masking, which is what this PR is trying to fix.

> does this create a pl.ds expr?

Yes. The tile is across (3, 128), so our compiler recognizes that this is tiling across the full dim of the tensor. That dimension is therefore not tiled via BlockSpec, and we use pl.ds to do sliced access for each tile. This gist contains the full generated Pallas code, in case you are curious.
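
To make the out-of-bounds tile concrete, the tile ranges for the updated test can be enumerated in plain Python. This is a sketch, not generated Pallas code: it assumes the tiled dimension has size 128 (as the reply implies) and that each tile reads a full `block_size` window starting at its offset, the way a `pl.ds(offset, block_size)` slice would.

```python
begin, end, block_size = 3, 128, 16
dim_size = 128  # assumed size of the tiled tensor dimension

# Each tile reads the half-open window [start, start + block_size).
tiles = [(start, start + block_size) for start in range(begin, end, block_size)]

# Elements the last window reads past the end of the dimension.
overshoot = max(0, tiles[-1][1] - dim_size)

print(tiles[-1], overshoot)  # (115, 131) 3
```

The last window overshoots by exactly `begin % block_size = 3` elements, which matches the extra padding the PR adds for a provably constant begin.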

norx1991 · 2026-05-04T18:09:17Z

        from helion.language.memory_ops import _record_pad_info

-        _record_pad_info(state, tensor, tensor_dim, block_id)
+        extra_pad = _loop_begin_extra_pad(block_id, state)


It seems to me that we can put the logic of extra_pad into _record_pad_info so we only need a single call here.

I just gave this a try, but it's not straightforward because _record_pad_info is called from three different contexts with different amounts of loop state available:

1. _ds_expr in codegen.py: the DeviceLoopState is registered in active_device_loops and LoopDimInfo.begin_expr is set, so _loop_begin_extra_pad works correctly here.
2. _make_block_spec in _codegen_emit_pipeline: called before the EmitPipelineLoopState is added to active_device_loops, and its LoopDimInfo doesn't set begin_expr.
3. _build_hbm_dma_slice in _codegen_fori_loop: same issue as (2).

For (2) and (3), the begin info only exists as codegen-level string expressions (begin_exprs) in the enclosing scope, not in LoopDimInfo. To make _record_pad_info self-contained, we'd need to either propagate begin_expr into the LoopDimInfo for emit_pipeline/fori_loop AND register the loop state earlier, or pass the begin info through a different channel. Both add more complexity than the current approach, so I'd prefer to leave this as is.

I see. Thanks for giving it a try!

meta-cla Bot added the CLA Signed label May 3, 2026

AmesingFlank force-pushed the AmesingFlank/stack/34 branch from 622a86c to 560bfaf Compare May 3, 2026 21:54

AmesingFlank force-pushed the AmesingFlank/stack/35 branch from a62d150 to a078f65 Compare May 3, 2026 21:54

AmesingFlank marked this pull request as draft May 3, 2026 23:14

AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 3, 2026 23:14

AmesingFlank changed the base branch from main to AmesingFlank/stack/34 May 3, 2026 23:14

AmesingFlank marked this pull request as ready for review May 3, 2026 23:14

AmesingFlank marked this pull request as draft May 4, 2026 01:46

AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 4, 2026 01:46

AmesingFlank force-pushed the AmesingFlank/stack/35 branch from a078f65 to b60f9ab Compare May 4, 2026 01:46

AmesingFlank changed the title from "[Pallas] Add extra host-side padding for data-dependent tile loops" to "[Pallas] Fix out-of-bound DMA caused by tiles from non-zero begins" May 4, 2026

AmesingFlank changed the base branch from main to AmesingFlank/stack/34 May 4, 2026 01:46

AmesingFlank marked this pull request as ready for review May 4, 2026 01:46

AmesingFlank marked this pull request as draft May 4, 2026 01:52

AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 4, 2026 01:52

AmesingFlank force-pushed the AmesingFlank/stack/35 branch from b60f9ab to 1579c00 Compare May 4, 2026 01:52

AmesingFlank changed the base branch from main to AmesingFlank/stack/34 May 4, 2026 01:52

AmesingFlank marked this pull request as ready for review May 4, 2026 01:52

AmesingFlank requested review from jansel, norx1991 and oulgen May 4, 2026 15:08

AmesingFlank marked this pull request as draft May 4, 2026 16:44

AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 4, 2026 16:44

AmesingFlank changed the base branch from main to AmesingFlank/stack/34 May 4, 2026 16:45

AmesingFlank mentioned this pull request May 4, 2026

[Pallas] Fix pre-broadcasting transformation bug when non-broadcast dims exceed PRE_BROADCAST_SIZE #2223

Merged

AmesingFlank marked this pull request as ready for review May 4, 2026 16:45

norx1991 reviewed May 4, 2026

AmesingFlank marked this pull request as draft May 4, 2026 17:54

AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 4, 2026 17:54

AmesingFlank force-pushed the AmesingFlank/stack/35 branch from 1579c00 to ac3209e Compare May 4, 2026 17:55

AmesingFlank changed the base branch from main to AmesingFlank/stack/34 May 4, 2026 17:55

AmesingFlank marked this pull request as ready for review May 4, 2026 17:55

norx1991 reviewed May 4, 2026

norx1991 approved these changes May 4, 2026

AmesingFlank marked this pull request as draft May 4, 2026 18:54

AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 4, 2026 18:54

AmesingFlank changed the base branch from main to AmesingFlank/stack/34 May 4, 2026 18:54

AmesingFlank marked this pull request as ready for review May 4, 2026 18:55

AmesingFlank marked this pull request as draft May 4, 2026 19:47

AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 4, 2026 19:47

AmesingFlank force-pushed the AmesingFlank/stack/35 branch from ac3209e to 8312720 Compare May 4, 2026 19:48

AmesingFlank marked this pull request as ready for review May 4, 2026 19:48

AmesingFlank merged commit 6f7f165 into main May 4, 2026
22 of 23 checks passed

AmesingFlank merged 1 commit into main from AmesingFlank/stack/35
