Skip to content

[Pallas] Fix out-of-bound DMA caused by tiles from non-zero begins#2213

Merged
AmesingFlank merged 1 commit into
mainfrom
AmesingFlank/stack/35
May 4, 2026
Merged

[Pallas] Fix out-of-bound DMA caused by tiles from non-zero begins#2213
AmesingFlank merged 1 commit into
mainfrom
AmesingFlank/stack/35

Conversation

@AmesingFlank
Copy link
Copy Markdown
Contributor

@AmesingFlank AmesingFlank commented May 3, 2026

Stacked PRs:


[Pallas] Fix out-of-bound DMA caused by tiles from non-zero begins

When pl.ds(offset, block_size) reads into a tensor, the last block
can overshoot the tensor boundary. The previous padding formula
(-shape) % block_size only accounted for rounding up to a block
boundary, but tiles from hl.tile(start, end) with non-zero start
can begin at arbitrary offsets, requiring additional headroom.

Pass an extra_pad value through _record_pad_info
_compute_pad_info_ds_pad_dims so the launcher pads by
(-shape) % block_size + extra_pad. The extra_pad is:

  • 0 when the loop starts at offset 0
  • begin % block_size for a provably constant begin
  • block_size - 1 for a data-dependent begin

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 3, 2026
@AmesingFlank AmesingFlank force-pushed the AmesingFlank/stack/34 branch from 622a86c to 560bfaf Compare May 3, 2026 21:54
@AmesingFlank AmesingFlank force-pushed the AmesingFlank/stack/35 branch from a62d150 to a078f65 Compare May 3, 2026 21:54
@AmesingFlank AmesingFlank marked this pull request as draft May 3, 2026 23:14
@AmesingFlank AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 3, 2026 23:14
@AmesingFlank AmesingFlank changed the base branch from main to AmesingFlank/stack/34 May 3, 2026 23:14
@AmesingFlank AmesingFlank marked this pull request as ready for review May 3, 2026 23:14
AmesingFlank added a commit that referenced this pull request May 4, 2026
When `pl.ds(offset, block_size)` reads into a tensor, the last block
can overshoot the tensor boundary by up to `block_size - 1` elements.
The previous formula `(-shape) % block_size` only handled the case
where reads start at offset 0, but tiles from `hl.tile(start, end)`
with non-zero `start` can begin at arbitrary offsets.

Simplify host-side padding to always pad by `block_size - 1`, which
is the worst-case overshoot for any `pl.ds()` read regardless of
starting offset.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stack-info: PR: #2213, branch: AmesingFlank/stack/35
AmesingFlank added a commit that referenced this pull request May 4, 2026
When `pl.ds(offset, block_size)` reads into a tensor, the last block
can overshoot the tensor boundary.  The previous padding formula
`(-shape) % block_size` only accounted for rounding up to a block
boundary, but tiles from `hl.tile(start, end)` with non-zero `start`
can begin at arbitrary offsets, requiring additional headroom.

Pass an `extra_pad` value through `_record_pad_info` →
`_compute_pad_info` → `_ds_pad_dims` so the launcher pads by
`(-shape) % block_size + extra_pad`.  The extra_pad is:
- 0 when the loop starts at offset 0
- `begin % block_size` for a provably constant begin
- `block_size - 1` for a data-dependent begin

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stack-info: PR: #2213, branch: AmesingFlank/stack/35
AmesingFlank added a commit that referenced this pull request May 4, 2026
When `pl.ds(offset, block_size)` reads into a tensor, the last block
can overshoot the tensor boundary.  The previous padding formula
`(-shape) % block_size` only accounted for rounding up to a block
boundary, but tiles from `hl.tile(start, end)` with non-zero `start`
can begin at arbitrary offsets, requiring additional headroom.

Pass an `extra_pad` value through `_record_pad_info` →
`_compute_pad_info` → `_ds_pad_dims` so the launcher pads by
`(-shape) % block_size + extra_pad`.  The extra_pad is:
- 0 when the loop starts at offset 0
- `begin % block_size` for a provably constant begin
- `block_size - 1` for a data-dependent begin

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stack-info: PR: #2213, branch: AmesingFlank/stack/35
AmesingFlank added a commit that referenced this pull request May 4, 2026
When `pl.ds(offset, block_size)` reads into a tensor, the last block
can overshoot the tensor boundary.  The previous padding formula
`(-shape) % block_size` only accounted for rounding up to a block
boundary, but tiles from `hl.tile(start, end)` with non-zero `start`
can begin at arbitrary offsets, requiring additional headroom.

Pass an `extra_pad` value through `_record_pad_info` →
`_compute_pad_info` → `_ds_pad_dims` so the launcher pads by
`(-shape) % block_size + extra_pad`.  The extra_pad is:
- 0 when the loop starts at offset 0
- `begin % block_size` for a provably constant begin
- `block_size - 1` for a data-dependent begin

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stack-info: PR: #2213, branch: AmesingFlank/stack/35
@AmesingFlank AmesingFlank marked this pull request as draft May 4, 2026 01:46
@AmesingFlank AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 4, 2026 01:46
@AmesingFlank AmesingFlank force-pushed the AmesingFlank/stack/35 branch from a078f65 to b60f9ab Compare May 4, 2026 01:46
@AmesingFlank AmesingFlank changed the title [Pallas] Add extra host-side padding for data-dependent tile loops [Pallas] Fix out-of-bound DMA caused by tiles from non-zero begins May 4, 2026
@AmesingFlank AmesingFlank changed the base branch from main to AmesingFlank/stack/34 May 4, 2026 01:46
@AmesingFlank AmesingFlank marked this pull request as ready for review May 4, 2026 01:46
AmesingFlank added a commit that referenced this pull request May 4, 2026
When `pl.ds(offset, block_size)` reads into a tensor, the last block
can overshoot the tensor boundary.  The previous padding formula
`(-shape) % block_size` only accounted for rounding up to a block
boundary, but tiles from `hl.tile(start, end)` with non-zero `start`
can begin at arbitrary offsets, requiring additional headroom.

Pass an `extra_pad` value through `_record_pad_info` →
`_compute_pad_info` → `_ds_pad_dims` so the launcher pads by
`(-shape) % block_size + extra_pad`.  The extra_pad is:
- 0 when the loop starts at offset 0
- `begin % block_size` for a provably constant begin
- `block_size - 1` for a data-dependent begin

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stack-info: PR: #2213, branch: AmesingFlank/stack/35
@AmesingFlank AmesingFlank marked this pull request as draft May 4, 2026 01:52
@AmesingFlank AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 4, 2026 01:52
@AmesingFlank AmesingFlank force-pushed the AmesingFlank/stack/35 branch from b60f9ab to 1579c00 Compare May 4, 2026 01:52
@AmesingFlank AmesingFlank changed the base branch from main to AmesingFlank/stack/34 May 4, 2026 01:52
@AmesingFlank AmesingFlank marked this pull request as ready for review May 4, 2026 01:52
@AmesingFlank AmesingFlank requested review from jansel, norx1991 and oulgen May 4, 2026 15:08
@AmesingFlank AmesingFlank marked this pull request as draft May 4, 2026 16:44
@AmesingFlank AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 4, 2026 16:44
@AmesingFlank AmesingFlank changed the base branch from main to AmesingFlank/stack/34 May 4, 2026 16:45
@AmesingFlank AmesingFlank marked this pull request as ready for review May 4, 2026 16:45
Comment thread test/test_pallas.py Outdated
out = torch.zeros([B], dtype=data.dtype, device=data.device)
for seg in hl.grid(B):
acc = hl.zeros([1], dtype=data.dtype)
for tile in hl.tile(3, 128 + 3):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit surprised by this test case: does this create a pl.ds expr?
Also, if we are explicitly reading out of bound data in the helion kernel, shouldn't we throw an error? This test can pass because 3 is small.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I agree with you that hl.tile(3, 128 + 3) is a confusing test, because it seems to be explicitly requested OOB access. I updated the PR to use hl.tile(3, 128, block_size=16) instead. Here, the final tile still goes out-of-bound, but it is up to helion compiler to apply the correct padding/masking, which is what this PR is trying to fix.

does this create a pl.ds expr?

Yes, the tile is across (3, 128), so our compiler recognizes that this is tiling across the full dim of the tensor, so that dimension is not tiled via BlockSpec, and we are using pl.ds to do sliced access of each tile. This gist contains the full generated Pallas code, in case you are curious

AmesingFlank added a commit that referenced this pull request May 4, 2026
When `pl.ds(offset, block_size)` reads into a tensor, the last block
can overshoot the tensor boundary.  The previous padding formula
`(-shape) % block_size` only accounted for rounding up to a block
boundary, but tiles from `hl.tile(start, end)` with non-zero `start`
can begin at arbitrary offsets, requiring additional headroom.

Pass an `extra_pad` value through `_record_pad_info` →
`_compute_pad_info` → `_ds_pad_dims` so the launcher pads by
`(-shape) % block_size + extra_pad`.  The extra_pad is:
- 0 when the loop starts at offset 0
- `begin % block_size` for a provably constant begin
- `block_size - 1` for a data-dependent begin

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stack-info: PR: #2213, branch: AmesingFlank/stack/35
AmesingFlank added a commit that referenced this pull request May 4, 2026
When `pl.ds(offset, block_size)` reads into a tensor, the last block
can overshoot the tensor boundary.  The previous padding formula
`(-shape) % block_size` only accounted for rounding up to a block
boundary, but tiles from `hl.tile(start, end)` with non-zero `start`
can begin at arbitrary offsets, requiring additional headroom.

Pass an `extra_pad` value through `_record_pad_info` →
`_compute_pad_info` → `_ds_pad_dims` so the launcher pads by
`(-shape) % block_size + extra_pad`.  The extra_pad is:
- 0 when the loop starts at offset 0
- `begin % block_size` for a provably constant begin
- `block_size - 1` for a data-dependent begin

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stack-info: PR: #2213, branch: AmesingFlank/stack/35
@AmesingFlank AmesingFlank marked this pull request as draft May 4, 2026 17:54
@AmesingFlank AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 4, 2026 17:54
@AmesingFlank AmesingFlank force-pushed the AmesingFlank/stack/35 branch from 1579c00 to ac3209e Compare May 4, 2026 17:55
@AmesingFlank AmesingFlank changed the base branch from main to AmesingFlank/stack/34 May 4, 2026 17:55
@AmesingFlank AmesingFlank marked this pull request as ready for review May 4, 2026 17:55
from helion.language.memory_ops import _record_pad_info

_record_pad_info(state, tensor, tensor_dim, block_id)
extra_pad = _loop_begin_extra_pad(block_id, state)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that we can put the logic of extra_pad into _record_pad_info so we only need a single call here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just gave this a try,but it's not straightforward because _record_pad_info is called from three different contexts with different amounts of loop state available:

  1. _ds_expr in codegen.py — the DeviceLoopState is registered in active_device_loops and LoopDimInfo.begin_expr is set, so
    _loop_begin_extra_pad works correctly here.
  2. _make_block_spec in _codegen_emit_pipeline — called before the EmitPipelineLoopState is added to active_device_loops, and its
    LoopDimInfo doesn't set begin_expr.
  3. _build_hbm_dma_slice in _codegen_fori_loop — same issue as (2).

For (2) and (3), the begin info only exists as codegen-level string expressions (begin_exprs) in the enclosing scope, not in LoopDimInfo. To make _record_pad_info self-contained, we'd need to either propagate begin_expr into the LoopDimInfo for emit_pipeline/fori_loop AND register the loop state earlier, or pass the begin info through a different channel — both add more complexity than the current approach. So I'd prefer to leave this as is

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks for giving it a try!

AmesingFlank added a commit that referenced this pull request May 4, 2026
When `pl.ds(offset, block_size)` reads into a tensor, the last block
can overshoot the tensor boundary.  The previous padding formula
`(-shape) % block_size` only accounted for rounding up to a block
boundary, but tiles from `hl.tile(start, end)` with non-zero `start`
can begin at arbitrary offsets, requiring additional headroom.

Pass an `extra_pad` value through `_record_pad_info` →
`_compute_pad_info` → `_ds_pad_dims` so the launcher pads by
`(-shape) % block_size + extra_pad`.  The extra_pad is:
- 0 when the loop starts at offset 0
- `begin % block_size` for a provably constant begin
- `block_size - 1` for a data-dependent begin

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stack-info: PR: #2213, branch: AmesingFlank/stack/35
@AmesingFlank AmesingFlank marked this pull request as draft May 4, 2026 18:54
@AmesingFlank AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 4, 2026 18:54
@AmesingFlank AmesingFlank changed the base branch from main to AmesingFlank/stack/34 May 4, 2026 18:54
@AmesingFlank AmesingFlank marked this pull request as ready for review May 4, 2026 18:55
AmesingFlank added a commit that referenced this pull request May 4, 2026
When `pl.ds(offset, block_size)` reads into a tensor, the last block
can overshoot the tensor boundary.  The previous padding formula
`(-shape) % block_size` only accounted for rounding up to a block
boundary, but tiles from `hl.tile(start, end)` with non-zero `start`
can begin at arbitrary offsets, requiring additional headroom.

Pass an `extra_pad` value through `_record_pad_info` →
`_compute_pad_info` → `_ds_pad_dims` so the launcher pads by
`(-shape) % block_size + extra_pad`.  The extra_pad is:
- 0 when the loop starts at offset 0
- `begin % block_size` for a provably constant begin
- `block_size - 1` for a data-dependent begin

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stack-info: PR: #2213, branch: AmesingFlank/stack/35
When `pl.ds(offset, block_size)` reads into a tensor, the last block
can overshoot the tensor boundary.  The previous padding formula
`(-shape) % block_size` only accounted for rounding up to a block
boundary, but tiles from `hl.tile(start, end)` with non-zero `start`
can begin at arbitrary offsets, requiring additional headroom.

Pass an `extra_pad` value through `_record_pad_info` →
`_compute_pad_info` → `_ds_pad_dims` so the launcher pads by
`(-shape) % block_size + extra_pad`.  The extra_pad is:
- 0 when the loop starts at offset 0
- `begin % block_size` for a provably constant begin
- `block_size - 1` for a data-dependent begin

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stack-info: PR: #2213, branch: AmesingFlank/stack/35
@AmesingFlank AmesingFlank marked this pull request as draft May 4, 2026 19:47
@AmesingFlank AmesingFlank changed the base branch from AmesingFlank/stack/34 to main May 4, 2026 19:47
@AmesingFlank AmesingFlank force-pushed the AmesingFlank/stack/35 branch from ac3209e to 8312720 Compare May 4, 2026 19:48
@AmesingFlank AmesingFlank marked this pull request as ready for review May 4, 2026 19:48
@AmesingFlank AmesingFlank merged commit 6f7f165 into main May 4, 2026
22 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants