[Pallas] Fix out-of-bound DMA caused by tiles from non-zero begins #2213
When `pl.ds(offset, block_size)` reads into a tensor, the last block can overshoot the tensor boundary by up to `block_size - 1` elements. The previous formula `(-shape) % block_size` only handled the case where reads start at offset 0, but tiles from `hl.tile(start, end)` with non-zero `start` can begin at arbitrary offsets.

Simplify host-side padding to always pad by `block_size - 1`, which is the worst-case overshoot for any `pl.ds()` read regardless of starting offset.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
stack-info: PR: #2213, branch: AmesingFlank/stack/35
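The overshoot described in this commit message can be reproduced with plain integer arithmetic. The sketch below is illustrative only: `last_read_end` and `overshoot` are hypothetical helpers, not Helion or Pallas functions, and they assume that fixed-size windows start at `begin` and advance by `block_size` while the start is still inside the tensor.

```python
def last_read_end(shape: int, begin: int, block_size: int) -> int:
    """Exclusive end of the last fixed-size window (hypothetical helper).

    Windows start at begin, begin + block_size, ... while start < shape.
    """
    if not 0 <= begin < shape:
        raise ValueError("begin must lie inside the tensor")
    num_tiles = -(-(shape - begin) // block_size)  # ceiling division
    last_start = begin + (num_tiles - 1) * block_size
    return last_start + block_size


def overshoot(shape: int, begin: int, block_size: int) -> int:
    """Number of elements the last window reads past the tensor boundary."""
    return last_read_end(shape, begin, block_size) - shape


shape, block_size = 128, 16

# The previous formula only rounds the shape up to a block boundary.
old_pad = (-shape) % block_size
print(old_pad)                            # 0: no padding at all
print(overshoot(shape, 3, block_size))    # 3: but the last window ends at 131

# Worst case over every possible begin inside the tensor.
worst = max(overshoot(shape, b, block_size) for b in range(shape))
print(worst)                              # 15 == block_size - 1
```

Under these assumptions the last window always starts at or before `shape - 1`, so it can never end later than `shape + block_size - 1`, which is why padding by `block_size - 1` is always sufficient.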
When `pl.ds(offset, block_size)` reads into a tensor, the last block can overshoot the tensor boundary. The previous padding formula `(-shape) % block_size` only accounted for rounding up to a block boundary, but tiles from `hl.tile(start, end)` with non-zero `start` can begin at arbitrary offsets, requiring additional headroom.

Pass an `extra_pad` value through `_record_pad_info` → `_compute_pad_info` → `_ds_pad_dims` so the launcher pads by `(-shape) % block_size + extra_pad`. The extra_pad is:
- 0 when the loop starts at offset 0
- `begin % block_size` for a provably constant begin
- `block_size - 1` for a data-dependent begin

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
stack-info: PR: #2213, branch: AmesingFlank/stack/35
out = torch.zeros([B], dtype=data.dtype, device=data.device)
for seg in hl.grid(B):
    acc = hl.zeros([1], dtype=data.dtype)
    for tile in hl.tile(3, 128 + 3):
I am a bit surprised by this test case: does this create a `pl.ds` expr?
Also, if we are explicitly reading out-of-bounds data in the Helion kernel, shouldn't we throw an error? This test can pass because 3 is small.
Ah, I agree with you that `hl.tile(3, 128 + 3)` is a confusing test, because it seems to explicitly request OOB access. I updated the PR to use `hl.tile(3, 128, block_size=16)` instead. Here, the final tile still goes out of bounds, but it is up to the Helion compiler to apply the correct padding/masking, which is what this PR is trying to fix.
> does this create a pl.ds expr?

Yes. The tile spans (3, 128), so our compiler recognizes that it tiles across the full dim of the tensor. That dimension is therefore not tiled via `BlockSpec`, and we use `pl.ds` for sliced access to each tile. This gist contains the full generated Pallas code, in case you are curious.
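To make the difference between the two test ranges concrete, here is a small sketch that lists the windows each one produces. It assumes the tiled dimension has size 128 (suggested by the discussion, not confirmed by the snippet) and a block size of 16 for both ranges; `tile_windows` is a hypothetical helper, not part of Helion.

```python
DIM = 128  # assumed size of the tiled tensor dimension


def tile_windows(begin: int, end: int, block_size: int):
    """Return (offset, logical_stop, physical_stop) for each tile.

    logical_stop is where the kernel's requested range ends for the tile;
    physical_stop is where a fixed-size window of block_size elements ends.
    """
    return [
        (off, min(off + block_size, end), off + block_size)
        for off in range(begin, end, block_size)
    ]


old_test = tile_windows(3, 128 + 3, 16)  # original test range
new_test = tile_windows(3, 128, 16)      # updated test range

print(old_test[-1])  # (115, 131, 131): the kernel itself asks for indices >= 128
print(new_test[-1])  # (115, 128, 131): requested range is in bounds,
                     # only the fixed-size window runs past the tensor
```

In the updated range every logical index is valid, so any out-of-bounds access comes purely from the fixed window size, which is the situation the padding is meant to cover.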
from helion.language.memory_ops import _record_pad_info

_record_pad_info(state, tensor, tensor_dim, block_id)
extra_pad = _loop_begin_extra_pad(block_id, state)
It seems to me that we can put the logic of extra_pad into _record_pad_info so we only need a single call here.
I just gave this a try, but it's not straightforward because `_record_pad_info` is called from three different contexts with different amounts of loop state available:
1. `_ds_expr` in `codegen.py`: the `DeviceLoopState` is registered in `active_device_loops` and `LoopDimInfo.begin_expr` is set, so `_loop_begin_extra_pad` works correctly here.
2. `_make_block_spec` in `_codegen_emit_pipeline`: called before the `EmitPipelineLoopState` is added to `active_device_loops`, and its `LoopDimInfo` doesn't set `begin_expr`.
3. `_build_hbm_dma_slice` in `_codegen_fori_loop`: same issue as (2).

For (2) and (3), the begin info only exists as codegen-level string expressions (`begin_exprs`) in the enclosing scope, not in `LoopDimInfo`. To make `_record_pad_info` self-contained, we'd need to either propagate `begin_expr` into the `LoopDimInfo` for emit_pipeline/fori_loop AND register the loop state earlier, or pass the begin info through a different channel. Both options add more complexity than the current approach, so I'd prefer to leave this as is.
I see. Thanks for giving it a try!
Stacked PRs:
[Pallas] Fix out-of-bound DMA caused by tiles from non-zero begins
When `pl.ds(offset, block_size)` reads into a tensor, the last block can overshoot the tensor boundary. The previous padding formula `(-shape) % block_size` only accounted for rounding up to a block boundary, but tiles from `hl.tile(start, end)` with non-zero `start` can begin at arbitrary offsets, requiring additional headroom.

Pass an `extra_pad` value through `_record_pad_info` → `_compute_pad_info` → `_ds_pad_dims` so the launcher pads by `(-shape) % block_size + extra_pad`. The extra_pad is:
- 0 when the loop starts at offset 0
- `begin % block_size` for a provably constant begin
- `block_size - 1` for a data-dependent begin

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
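As a rough illustration of the padding rule in this description, the sketch below models the three `extra_pad` cases and brute-force checks that the padded size covers the last window. It is not the Helion implementation: `extra_pad`, `pad_amount`, and `last_read_end` are hypothetical stand-ins, and using `None` to represent a data-dependent begin is an assumption made only for this example.

```python
from typing import Optional


def extra_pad(begin: Optional[int], block_size: int) -> int:
    """Extra headroom for a loop begin (None models a data-dependent begin)."""
    if begin is None:
        return block_size - 1      # unknown begin: assume the worst case
    return begin % block_size      # constant begin; evaluates to 0 for begin == 0


def pad_amount(shape: int, block_size: int, begin: Optional[int]) -> int:
    """Total padding: round up to a block boundary, then add the headroom."""
    return (-shape) % block_size + extra_pad(begin, block_size)


def last_read_end(shape: int, begin: int, block_size: int) -> int:
    """Exclusive end of the last block_size-wide window starting inside the tensor."""
    num_tiles = -(-(shape - begin) // block_size)  # ceiling division
    return begin + num_tiles * block_size


print(pad_amount(128, 16, 0))     # 0
print(pad_amount(128, 16, 3))     # 3
print(pad_amount(128, 16, None))  # 15
print(pad_amount(130, 16, 3))     # 17 (14 to reach 144, plus 3)

# Brute-force check on small sizes: the padded tensor covers every window.
for shape in range(1, 40):
    for block_size in range(1, 9):
        for begin in range(shape):
            end = last_read_end(shape, begin, block_size)
            assert shape + pad_amount(shape, block_size, begin) >= end
            assert shape + pad_amount(shape, block_size, None) >= end
```

Compared with always padding by `block_size - 1`, the constant-begin case pads only as much as the begin's offset within a block requires, and a loop that starts at 0 keeps the previous padding unchanged.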