
Add deepseek_v3_2_prefill_back_scope1 & 2 & 12 #113

Open

xzhxzhxzh123 wants to merge 3 commits into hw-native-sys:main from xzhxzhxzh123:main

Conversation

@xzhxzhxzh123 (Collaborator) commented Apr 15, 2026

Add deepseek_v3_2_prefill_back_scope1 & 2 & 12
All precision checks pass.

Add deepseek_v3_2_prefill_back_scope1&2

coderabbitai bot commented Apr 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds three new DeepSeek V3.2 PyPTO example programs implementing prefill-back scopes: scope1 (combine + matmul + residual), scope12 (scope1 projection + post-RMSNorm), and scope2 (post-RMSNorm), each with tensor specs, PyTorch golden references, and compile/run CLIs.

Changes

Cohort / File(s) | Summary
Prefill-back Scope1
examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope1.py
New program builder build_deepseek_v3_2_prefill_back_scope1_program, tensor spec generator, golden_scope1 reference, and compile_and_run driver. Implements tiled BF16 combine → matmul → FP32 residual-add pipeline with CLI.
Prefill-back Scope12
examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope12.py
New program builder build_deepseek_v3_2_prefill_back_scope12_program, tensor specs, golden_scope12, and compile_and_run. Implements combine/projection plus a post-normalization (RMS) phase, returns BF16 post_norm.
Prefill-back Scope2 (RMSNorm)
examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope2.py
New program builder build_deepseek_v3_2_prefill_back_scope2_program, tensor spec builder, golden_scope2, and compile_and_run. Implements per-token RMS stat computation, inv-rms, and gated normalization in tiled BF16 with padding logic and CLI.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I stitched the scopes with careful hops,
Combining tiles and counting tops,
From BF16 dances to FP32 light,
Post-RMS hums into the night,
Hooray — the prefill runs just right! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.67%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check | ✅ Passed | The title directly and concisely summarizes the main change: adding three new DeepSeek V3.2 example modules (scope1, scope2, and scope12).
  • Description check | ✅ Passed | The pull request description is related to the changeset. It mentions the three scopes being added (scope1, scope2, and scope12) and notes precision results.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces implementation scripts for DeepSeek V3.2 prefill-back scopes, specifically handling projection, residual, and RMSNorm operations. The review feedback identifies critical race conditions due to shared tensors in parallel loops and performance inefficiencies from redundant memory operations and global-memory round-trips. Additionally, partial token blocks are not correctly handled, requiring shape validation and padding logic.

Comment on lines +61 to +63
combine_local = pl.create_tensor([TOK_TILE, ATTN_OUT_CFG], dtype=pl.BF16)

for b in pl.parallel(0, BATCH_CFG, 1):

critical

The tensor combine_local is created outside the pl.parallel loop. In pypto, tensors created outside a parallel loop are shared across all parallel iterations. Since each iteration (representing a batch element) writes to and reads from this tensor independently, this will cause a race condition. Move the tensor creation inside the pl.parallel loop to ensure it is thread-local.

Suggested change
- combine_local = pl.create_tensor([TOK_TILE, ATTN_OUT_CFG], dtype=pl.BF16)
- for b in pl.parallel(0, BATCH_CFG, 1):
+ for b in pl.parallel(0, BATCH_CFG, 1):
+     combine_local = pl.create_tensor([TOK_TILE, ATTN_OUT_CFG], dtype=pl.BF16)
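The hazard is easy to reproduce outside pypto. The sketch below is plain Python (no pypto; the two-step generator is a stand-in for one parallel iteration's write-then-read): a buffer allocated outside the loop lets one iteration clobber another's staged data once their steps interleave, while per-iteration buffers stay correct.

```python
def run_iterations(buffers):
    """Run two interleaved two-step iterations; buffers[i] is iteration i's scratch."""
    def iteration(i, buf):
        buf[0] = i        # step 1: stage this iteration's data
        yield
        yield buf[0]      # step 2: consume it (a correct run reads back i)

    tasks = [iteration(i, buffers[i]) for i in range(2)]
    for t in tasks:       # interleave: both iterations stage before either consumes
        next(t)
    return [next(t) for t in tasks]

shared = [None]
print(run_iterations([shared, shared]))   # [1, 1] -- iteration 0's data was clobbered
print(run_iterations([[None], [None]]))   # [0, 1] -- per-iteration buffers are correct
```

The deterministic round-robin scheduler stands in for whatever interleaving the hardware actually produces; a real race is merely nondeterministic about *when* this happens.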

Comment on lines +71 to +73
combine_local = pl.create_tensor([TOK_TILE, ATTN_OUT_CFG], dtype=pl.BF16)

for b in pl.parallel(0, BATCH_CFG, 1):

critical

The tensor combine_local is created outside the pl.parallel loop, which will lead to race conditions across parallel iterations. Move the tensor creation inside the pl.parallel loop.

Suggested change
- combine_local = pl.create_tensor([TOK_TILE, ATTN_OUT_CFG], dtype=pl.BF16)
- for b in pl.parallel(0, BATCH_CFG, 1):
+ for b in pl.parallel(0, BATCH_CFG, 1):
+     combine_local = pl.create_tensor([TOK_TILE, ATTN_OUT_CFG], dtype=pl.BF16)

Comment on lines +71 to +78
with pl.incore():
    for cb in pl.range(COMBINE_BLOCKS):
        c0 = cb * COMBINE_CHUNK
        chunk = pl.reshape(
            pl.slice(combine_buf, [1, 1, TOK_TILE, COMBINE_CHUNK], [node_id, b, p0, c0]),
            [TOK_TILE, COMBINE_CHUNK],
        )
        combine_local = pl.assemble(combine_local, chunk, [0, c0])

high

The population of combine_local is currently inside the ob loop (over Q_OUT_BLOCKS). However, the content of combine_local only depends on the batch index b and the token block p0. Moving this logic outside the ob loop will avoid redundant memory reads and significantly improve performance. Additionally, valid_shape and fillpad should be used to correctly handle partial token blocks at the end of sequences and prevent data leakage from previous iterations.
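As a language-agnostic illustration of the hoisting being suggested (plain Python; `load_combine_tile` and `per_block_work` are stand-ins, not functions from the PR), counting the loads makes the savings concrete:

```python
# Counter stands in for combine_buf memory reads.
loads = {"count": 0}

def load_combine_tile(p0):
    loads["count"] += 1               # one "combine_buf" read per call
    return [p0 + k for k in range(4)]

def per_block_work(tile, ob):
    return sum(tile) + ob             # stand-in for the matmul over one output block

TOK_BLOCKS, OUT_BLOCKS = 3, 8

# Naive: reload inside the output-block loop -> TOK_BLOCKS * OUT_BLOCKS loads.
loads["count"] = 0
for p0 in range(TOK_BLOCKS):
    for ob in range(OUT_BLOCKS):
        tile = load_combine_tile(p0)
        per_block_work(tile, ob)
naive = loads["count"]

# Hoisted: the tile depends only on p0, so load once per token tile.
loads["count"] = 0
for p0 in range(TOK_BLOCKS):
    tile = load_combine_tile(p0)
    for ob in range(OUT_BLOCKS):
        per_block_work(tile, ob)
hoisted = loads["count"]

print(naive, hoisted)  # 24 3
```

The same reasoning applies to the pypto kernel: the assembly of `combine_local` depends only on `b` and `p0`, so it belongs one loop level up from `ob`.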

Comment on lines +81 to +88
with pl.incore():
    for cb in pl.range(COMBINE_BLOCKS):
        c0 = cb * COMBINE_CHUNK
        chunk = pl.reshape(
            pl.slice(combine_buf, [1, 1, TOK_TILE, COMBINE_CHUNK], [node_id, b, p0, c0]),
            [TOK_TILE, COMBINE_CHUNK],
        )
        combine_local = pl.assemble(combine_local, chunk, [0, c0])

high

Populating combine_local inside the ob loop is redundant. Move this logic outside the ob loop to improve efficiency. Also, ensure valid_shape and fillpad are used to handle partial tiles correctly.

Comment on lines +117 to +120
x_chunk = pl.reshape(
    pl.slice(resid1, [1, TOK_TILE, K_CHUNK], [b, p0, k0]),
    [TOK_TILE, K_CHUNK],
)

high

The pl.slice operation here is missing the valid_shape parameter, and the resulting chunk is not padded. Since valid_tok is calculated at line 110, it should be used here to correctly handle partial token blocks, similar to the implementation in deepseek_v3_2_prefill_back_scope2.py.

Suggested change
- x_chunk = pl.reshape(
-     pl.slice(resid1, [1, TOK_TILE, K_CHUNK], [b, p0, k0]),
-     [TOK_TILE, K_CHUNK]
- )
+ x_chunk_raw = pl.reshape(
+     pl.slice(resid1, [1, TOK_TILE, K_CHUNK], [b, p0, k0],
+              valid_shape=[1, valid_tok, K_CHUNK]),
+     [TOK_TILE, K_CHUNK]
+ )
+ x_chunk = pl.fillpad(x_chunk_raw, pad_value=pl.PadValue.zero)
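What `valid_shape` plus `fillpad` accomplish can be sketched in plain Python (names and shapes here are illustrative, not pypto APIs): only the first `valid_tok` rows of the last tile carry real data, and the remainder must be zeroed so stale values in a reused scratch buffer don't leak into later reductions.

```python
TOK_TILE = 4

def load_tile(rows, valid_tok, scratch):
    """Copy valid rows into a reused scratch tile and zero-pad the rest."""
    for r in range(TOK_TILE):
        if r < valid_tok:
            scratch[r] = list(rows[r])            # real data
        else:
            scratch[r] = [0.0] * len(scratch[r])  # explicit zero padding
    return scratch

scratch = [[9.0, 9.0] for _ in range(TOK_TILE)]   # stale data from a prior tile
tile = load_tile([[1.0, 2.0], [3.0, 4.0]], valid_tok=2, scratch=scratch)
col_sum = [sum(row[c] for row in tile) for c in range(2)]
print(col_sum)  # [4.0, 6.0] -- without padding, the stale 9.0 rows would pollute the sum
```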

Comment on lines +129 to +132
x_chunk = pl.reshape(
    pl.slice(resid1, [1, TOK_TILE, K_CHUNK], [b, p0, k0]),
    [TOK_TILE, K_CHUNK],
)

high

Missing valid_shape and fillpad for partial tile handling. Use valid_tok to mask the slice and pad the resulting chunk.

Suggested change
- x_chunk = pl.reshape(
-     pl.slice(resid1, [1, TOK_TILE, K_CHUNK], [b, p0, k0]),
-     [TOK_TILE, K_CHUNK]
- )
+ x_chunk_raw = pl.reshape(
+     pl.slice(resid1, [1, TOK_TILE, K_CHUNK], [b, p0, k0],
+              valid_shape=[1, valid_tok, K_CHUNK]),
+     [TOK_TILE, K_CHUNK]
+ )
+ x_chunk = pl.fillpad(x_chunk_raw, pad_value=pl.PadValue.zero)

Comment on lines +88 to +95
resid1 = pl.assemble(resid1, o_acc, [b, p0, o0])
with pl.incore():
    proj = pl.slice(resid1, [1, TOK_TILE, Q_OUT_CHUNK], [b, p0, o0])
    resid = pl.cast(
        pl.slice(hidden_states, [1, TOK_TILE, Q_OUT_CHUNK], [b, p0, o0]),
        target_type=pl.FP32,
    )
    resid1 = pl.assemble(resid1, pl.add(proj, resid), [b, p0, o0])

medium

This implementation performs an inefficient round-trip to global memory. It assembles the intermediate matmul result o_acc into the output tensor resid1, then immediately slices it back to perform the residual addition. It is much more efficient to perform the addition in-core before assembling the final result to the output tensor.
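A plain-Python stand-in for the memory traffic (counters in place of real global-memory stores and loads; not pypto code) shows why fusing the residual add before the final assemble saves a round-trip:

```python
traffic = {"stores": 0, "loads": 0}

def store(dst, i, v):
    traffic["stores"] += 1   # stand-in for assembling to global memory
    dst[i] = v

def load(src, i):
    traffic["loads"] += 1    # stand-in for slicing back from global memory
    return src[i]

resid1 = [0.0]
o_acc, residual = 3.0, 1.5

# Round-trip: assemble the matmul result, slice it back, add, assemble again.
traffic.update(stores=0, loads=0)
store(resid1, 0, o_acc)
store(resid1, 0, load(resid1, 0) + residual)
round_trip = (traffic["stores"], traffic["loads"])

# Fused: add while the tile is still in-core, then store once.
traffic.update(stores=0, loads=0)
store(resid1, 0, o_acc + residual)
fused = (traffic["stores"], traffic["loads"])

print(round_trip, fused)  # (2, 1) (1, 0) -- same final value, one store and no re-load
```

Both paths leave `resid1[0] == 4.5`; only the traffic differs.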

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope2.py (1)

199-201: Inconsistent CLI defaults compared to scope1/scope12.

This file uses full production-size defaults (BATCH=16, MAX_SEQ=4096, HIDDEN=7168), while scope1 and scope12 use smaller defaults (batch=2, max_seq=64, hidden=512) for quick testing. Consider aligning for consistency.

♻️ Proposed fix
-    parser.add_argument("--batch", type=int, default=BATCH)
-    parser.add_argument("--max-seq", type=int, default=MAX_SEQ)
-    parser.add_argument("--hidden", type=int, default=HIDDEN)
+    parser.add_argument("--batch", type=int, default=2)
+    parser.add_argument("--max-seq", type=int, default=64)
+    parser.add_argument("--hidden", type=int, default=512)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope2.py` around
lines 199 - 201, The CLI defaults are set to production-size constants (BATCH,
MAX_SEQ, HIDDEN) in parser.add_argument calls; change these to the smaller
quick-test defaults used in scope1/scope12 (batch=2, max_seq=64, hidden=512) so
parser.add_argument("--batch", "--max-seq", and "--hidden" use those values
instead of BATCH/ MAX_SEQ/ HIDDEN; update any related help text or docstring
references if present to reflect the new test-friendly defaults.
examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope12.py (3)

110-110: Unused variable valid_tok.

valid_tok is computed but never used in the post-normalization phase. This appears to be dead code—either remove it or use it for padding as done in scope2.

♻️ Proposed fix
-                    valid_tok = pl.min(TOK_TILE, seq_len_b - p0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope12.py` at line
110, Remove or use the dead variable `valid_tok` computed as `valid_tok =
pl.min(TOK_TILE, seq_len_b - p0)` in deepseek_v3_2_prefill_back_scope12.py:
either delete this unused assignment or apply it to the post-normalization
padding step (similar to the logic in `scope2`) so padding respects the
remaining valid token count; locate the computation of `valid_tok` and mirror
the `scope2` padding behavior by using `valid_tok` when constructing or trimming
the padded tensor instead of leaving it unused.

107-123: Inconsistent padding behavior compared to scope2.

The post-normalization phase in scope12 computes RMS statistics without zero-padding invalid tokens beyond seq_len_b, while scope2 explicitly uses fillpad with valid_shape for partial tiles. This could cause numerical differences when the sequence length is not a multiple of TOK_TILE.

Both the program (here) and golden_scope12 omit padding, so they're internally consistent, but this differs from scope2's approach. If this is intentional (e.g., scope12 assumes full tiles), consider adding a comment. Otherwise, align the padding behavior for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope12.py` around
lines 107 - 123, The RMS computation in scope12 omits zero-padding for partial
TOK_TILE tiles, causing inconsistency with scope2; modify the block around
tok_blocks/tok loop so x_chunk is created using pl.fillpad (or equivalent) with
valid_shape=[valid_tok, K_CHUNK] before reshaping/row-squaring, ensuring any
tokens beyond seq_len_b are zeroed before accumulating into sq_sum (symbols to
change: the x_chunk creation that slices resid1, the pl.incore loop that builds
sq_sum across HIDDEN_BLOCKS, and the subsequent inv_rms computation);
alternatively, if the no-padding behavior is intentional, add a concise comment
in that region explaining the assumption that tiles are full to make the
divergence explicit.

224-225: Global state for _NODE_ID is acceptable for examples but fragile.

Using a module-level global for node_id works but isn't thread-safe. For example code this is fine; for production, consider passing it via the params argument to the golden function.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope12.py` around
lines 224 - 225, The module-level _NODE_ID assignment is fragile; remove the
global reliance and instead accept node_id via the golden function's params
argument: update the golden function to read node_id = params.get('node_id') (or
a named parameter) and fall back to _NODE_ID only if params lacks it for
compatibility, then use that local variable everywhere instead of the global
_NODE_ID; this ensures thread-safety while keeping backward compatibility.
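A minimal plain-Python sketch of the suggested refactor (the function body is a stand-in; only the params plumbing and the `_NODE_ID` fallback mirror the suggestion):

```python
_NODE_ID = 0  # legacy module-level state, kept only as a fallback

def golden_scope12(tensors, params=None):
    """Golden stand-in: prefer an explicit node_id, fall back to the module global."""
    params = params or {}
    node_id = params.get("node_id", _NODE_ID)  # no mutation of module state needed
    return {"node_id_used": node_id, "out": tensors}

print(golden_scope12({}, {"node_id": 3})["node_id_used"])  # 3
print(golden_scope12({})["node_id_used"])                  # 0 (fallback)
```

Because each call reads its own `node_id`, two validations with different shards can run in the same process without stepping on each other.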

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 30faa6ab-8141-4121-9402-c0b86fdb350a

📥 Commits

Reviewing files that changed from the base of the PR and between 773aab2 and 9e20c42.

📒 Files selected for processing (3)
  • examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope1.py
  • examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope12.py
  • examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope2.py

Comment on lines +137 to +138
resid1_tile_full = resid1[b, p0:p0 + TOK_TILE, :]
resid1_tile_full[valid_tok:, :] = 0.0

⚠️ Potential issue | 🟡 Minor

Bug: Golden function mutates input tensor resid1.

resid1_tile_full is a view into tensors["resid1"]. Line 138 zeros rows in-place, corrupting the input tensor. This could cause issues if the tensor is inspected after the golden function runs (e.g., for debugging).

🐛 Proposed fix: Clone the tile before modifying
-            resid1_tile_full = resid1[b, p0:p0 + TOK_TILE, :]
-            resid1_tile_full[valid_tok:, :] = 0.0
+            resid1_tile_full = resid1[b, p0:p0 + TOK_TILE, :].clone()
+            resid1_tile_full[valid_tok:, :] = 0.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope2.py` around
lines 137 - 138, The golden function is mutating the input view resid1_tile_full
(a view of tensors["resid1"]) when it zeros rows, corrupting resid1; fix by
making an explicit copy of the slice before modification (e.g.,
clone/contiguous+clone the result of resid1[b, p0:p0 + TOK_TILE, :]) and perform
the zeroing on that cloned tensor so tensors["resid1"] is not altered.
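The same aliasing bug exists in plain Python with nested lists, which makes the clone-before-mutate fix easy to see without torch (names mirror the PR, but the bodies are illustrative):

```python
def golden_mutating(resid1, valid_tok):
    tile = resid1[0]                 # reference into resid1, like a tensor view
    for c in range(valid_tok, len(tile)):
        tile[c] = 0.0                # mutates the caller's resid1 in place
    return tile

def golden_cloned(resid1, valid_tok):
    tile = list(resid1[0])           # explicit copy, like .clone()
    for c in range(valid_tok, len(tile)):
        tile[c] = 0.0                # only the copy is zeroed
    return tile

resid1 = [[1.0, 2.0, 3.0, 4.0]]
golden_cloned(resid1, 2)
print(resid1)   # [[1.0, 2.0, 3.0, 4.0]] -- input untouched
golden_mutating(resid1, 2)
print(resid1)   # [[1.0, 2.0, 0.0, 0.0]] -- input corrupted
```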

@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope1.py (1)

68-78: Avoid rebuilding combine_local for every output block.

combine_local does not depend on ob, but Lines 71-77 re-read and reassemble the same combine_buf tile inside every output-block iteration. With the default shapes that is a large amount of redundant memory traffic. Build it once per p0 tile, then reuse it across all ob iterations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope1.py` around
lines 68 - 78, The loop currently rebuilds combine_local inside the Q_OUT_BLOCKS
(ob) loop even though it only depends on p0 and combine_buf; move the entire
with pl.incore() / for cb in pl.range(COMBINE_BLOCKS) assembly of combine_local
out of the for ob in pl.range(Q_OUT_BLOCKS) loop so combine_local is constructed
once per p0 tile and then reused for each ob iteration; update any variable
scope so combine_local is defined before the ob loop and remove the duplicated
assembly code within the ob loop (references: combine_local, combine_buf, p0,
COMBINE_BLOCKS, COMBINE_CHUNK, TOK_TILE, Q_OUT_BLOCKS, ob).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 08feab95-66b9-4ed0-9c6f-d35f568f34ad

📥 Commits

Reviewing files that changed from the base of the PR and between 9e20c42 and bd52126.

📒 Files selected for processing (1)
  • examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope1.py

Comment on lines +30 to +31
_NODE_ID = 0 # Module-level variable for golden function


⚠️ Potential issue | 🟠 Major

Make the golden path stateless.

golden_scope1() reads node_id from the module-global _NODE_ID, and compile_and_run() mutates that global right before run(). That makes validation non-reentrant: concurrent runs, or two invocations with different node_ids in the same process, can validate against the wrong shard. Pass node_id through params or capture it in a closure instead of relying on shared module state.

Also applies to: 122-128, 166-167

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope1.py` around
lines 30 - 31, The golden path is using module-global _NODE_ID, making
golden_scope1() and related validation non-reentrant; change golden_scope1 to
accept node_id via its params (or return a closure that captures node_id) and
update compile_and_run() to pass the correct node_id into golden_scope1 (instead
of mutating _NODE_ID), and similarly update other usages mentioned (lines
referenced) so all validation uses the passed-in node_id rather than the
module-global.

Comment on lines +63 to +95
tok_blocks = (seq_len_b + TOK_TILE - 1) // TOK_TILE
for p0_idx in pl.range(tok_blocks):
    p0 = p0_idx * TOK_TILE
    combine_local = pl.create_tensor([TOK_TILE, ATTN_OUT_CFG], dtype=pl.BF16)

    for ob in pl.range(Q_OUT_BLOCKS):
        o0 = ob * Q_OUT_CHUNK
        with pl.incore():
            for cb in pl.range(COMBINE_BLOCKS):
                c0 = cb * COMBINE_CHUNK
                chunk = pl.reshape(
                    pl.slice(combine_buf, [1, 1, TOK_TILE, COMBINE_CHUNK], [node_id, b, p0, c0]),
                    [TOK_TILE, COMBINE_CHUNK],
                )
                combine_local = pl.assemble(combine_local, chunk, [0, c0])
        with pl.incore():
            c_tile_0 = pl.slice(combine_local, [TOK_TILE, COMBINE_CHUNK], [0, 0])
            w_tile_0 = pl.slice(wo, [COMBINE_CHUNK, Q_OUT_CHUNK], [0, o0])
            o_acc = pl.matmul(c_tile_0, w_tile_0, out_dtype=pl.FP32)
            for cb in pl.range(1, COMBINE_BLOCKS):
                c0 = cb * COMBINE_CHUNK
                c_tile_i = pl.slice(combine_local, [TOK_TILE, COMBINE_CHUNK], [0, c0])
                w_tile_i = pl.slice(wo, [COMBINE_CHUNK, Q_OUT_CHUNK], [c0, o0])
                o_acc = pl.matmul_acc(o_acc, c_tile_i, w_tile_i)
        resid1 = pl.assemble(resid1, o_acc, [b, p0, o0])
        with pl.incore():
            proj = pl.slice(resid1, [1, TOK_TILE, Q_OUT_CHUNK], [b, p0, o0])
            resid = pl.cast(
                pl.slice(hidden_states, [1, TOK_TILE, Q_OUT_CHUNK], [b, p0, o0]),
                target_type=pl.FP32,
            )
            resid1 = pl.assemble(resid1, pl.add(proj, resid), [b, p0, o0])


⚠️ Potential issue | 🔴 Critical

Handle tail tiles explicitly.

This loop counts token/output/combine blocks with ceil division, but every slice/assemble still uses full TOK_TILE, COMBINE_CHUNK, and Q_OUT_CHUNK. For non-aligned seq_len_b, max_seq_len, attn_out_size, or hidden_size, the last iteration will process past the valid region and can go out of bounds; even when it stays in bounds, it still writes padded tokens past seq_len_b. golden_scope1() has the same assumption, so the reference masks the bug instead of catching it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_prefill_back_scope1.py` around
lines 63 - 95, The code assumes full tile/chunk sizes (TOK_TILE, COMBINE_CHUNK,
Q_OUT_CHUNK) for every block; compute the tail sizes for the last
token/combine/output blocks (e.g., remaining_tok = min(TOK_TILE, seq_len_b -
p0), remaining_combine = min(COMBINE_CHUNK, combine_buf_cols - c0), remaining_q
= min(Q_OUT_CHUNK, attn_out_size - o0)) and use those when
creating/slicing/assembling buffers and when matmul inputs are shaped; update
uses of combine_local, pl.reshape(pl.slice(combine_buf,...)), pl.slice(wo,...),
pl.assemble(resid1,...), and pl.slice(hidden_states,...) to pass the computed
dynamic shapes/offsets (or assemble only the valid subregion) so the final
iteration does not read/write past valid ranges while preserving full-tile
buffers for intermediate accumulation when needed.
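The tail-size computation the prompt describes can be sketched in plain Python (TOK_TILE and the sequence lengths are illustrative):

```python
TOK_TILE = 4

def tile_ranges(seq_len):
    """Yield (offset, size) per token block, clamping the final partial tile."""
    tok_blocks = (seq_len + TOK_TILE - 1) // TOK_TILE   # ceil division
    for p in range(tok_blocks):
        p0 = p * TOK_TILE
        yield p0, min(TOK_TILE, seq_len - p0)           # tail may be < TOK_TILE

print(list(tile_ranges(10)))  # [(0, 4), (4, 4), (8, 2)] -- last tile holds 2 valid rows
```

In the kernel, the clamped size would feed the `valid_shape` of each slice/assemble (with `fillpad` keeping the full-tile buffers safe for accumulation), so the last iteration never reads or writes past the valid region.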
