Fix DeepSeek scope1 cache prep outputs #134
Conversation
📝 Walkthrough

The DeepSeek V3.2 decode front-scope program is updated with new inputs (sequence lengths, RoPE parameters, KV normalization weight) and restructured outputs. The control flow now performs RMS normalization on KV latents, splits Q heads into no-PE and RoPE-applied components, and writes normalized KV and rotated PE to token-indexed caches.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
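For reference, the KV-latent normalization mentioned in the walkthrough is plain RMS normalization followed by a learned scale. A minimal NumPy sketch, assuming an `eps` value and shapes not taken from the kernel (the actual code operates on `pl` tensors):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMS-normalize the last axis of x, then scale by a learned weight.

    Plain-NumPy sketch of the KV-latent normalization this PR adds;
    `eps` and the shapes below are assumptions for illustration.
    """
    rms = np.sqrt(np.mean(x.astype(np.float32) ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

# Toy batch of KV latents; 512 stands in for KV_LORA_RANK_CFG.
kv_latent = np.random.default_rng(0).standard_normal((4, 512)).astype(np.float32)
weight = np.ones(512, dtype=np.float32)
normed = rms_norm(kv_latent, weight)
```

With a unit weight, each normalized row has root-mean-square close to 1, which is a quick way to sanity-check a kernel implementation against a reference.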
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
Code Review
This pull request expands the DeepSeek V3.2-EXP decode front path by implementing Q/K RoPE application, KV normalization, and cache updates. Review feedback identifies opportunities for performance optimization, including the removal of redundant writes to a temporary tensor and the fusion of two parallel loops to reduce overhead from repeated tensor reads and slicing.
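For orientation, the Q/k_pe rotation reviewed below is the standard "rotate-half" RoPE form: `rot_lo = lo * cos - hi * sin` and `rot_hi = hi * cos + lo * sin`, applied to the two halves of each head's PE slice. A plain-NumPy sketch under assumed dimensions (illustrative names, not the `pl` API):

```python
import numpy as np

def rope_rotate_half(x, cos, sin):
    """Rotate-half RoPE: treat each (lo, hi) half-pair as a 2-D point and
    rotate it by its per-frequency angle. Mirrors the q_rot_lo/q_rot_hi
    computation in the kernel; NumPy names here are stand-ins."""
    half = x.shape[-1] // 2
    lo, hi = x[..., :half], x[..., half:]
    rot_lo = lo * cos - hi * sin
    rot_hi = hi * cos + lo * sin
    return np.concatenate([rot_lo, rot_hi], axis=-1)

dim, pos = 64, 7  # assumed RoPE head dim and token position
inv_freq = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
angles = pos * inv_freq              # one angle per (lo, hi) pair
cos, sin = np.cos(angles), np.sin(angles)

q = np.random.default_rng(1).standard_normal(dim)
q_rot = rope_rotate_half(q, cos, sin)
```

Because each half-pair undergoes a pure rotation, the transform preserves the vector's norm, which is a convenient sanity check when validating a kernel port and its casts.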
```python
q_proj_out = pl.assemble(
    q_proj_out,
    pl.cast(q_rot_lo, target_type=pl.BF16),
    [b, q_col + QK_NOPE_HEAD_DIM_CFG],
)
q_proj_out = pl.assemble(
    q_proj_out,
    pl.cast(q_rot_hi, target_type=pl.BF16),
    [b, q_col + QK_NOPE_HEAD_DIM_CFG + QK_ROPE_HEAD_DIM_CFG // 2],
)
```
The `pl.assemble` calls updating `q_proj_out` with rotated values are redundant. `q_proj_out` is a local temporary tensor created at line 125 and is not used after this loop (the function returns `q_pe_out`). Removing these unnecessary Global Memory writes will improve performance.
```python
q_pe_out = pl.assemble(q_pe_out, pl.cast(q_rot_lo, target_type=pl.BF16), [b, q_pe_col])
q_pe_out = pl.assemble(
    q_pe_out,
    pl.cast(q_rot_hi, target_type=pl.BF16),
    [b, q_pe_col + QK_ROPE_HEAD_DIM_CFG // 2],
)
```

```python
with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer):
    for b in pl.parallel(0, BATCH_CFG, 1, chunk=4):
        ctx_len = pl.tensor.read(seq_lens, [b])
        pos = ctx_len - 1
        cache_row = b * MAX_SEQ_CFG + pos

        cos_lo = pl.slice(rope_cos, [1, QK_ROPE_HEAD_DIM_CFG // 2], [pos, 0])
        cos_hi = pl.slice(
            rope_cos, [1, QK_ROPE_HEAD_DIM_CFG // 2], [pos, QK_ROPE_HEAD_DIM_CFG // 2]
        )
        sin_lo = pl.slice(rope_sin, [1, QK_ROPE_HEAD_DIM_CFG // 2], [pos, 0])
        sin_hi = pl.slice(
            rope_sin, [1, QK_ROPE_HEAD_DIM_CFG // 2], [pos, QK_ROPE_HEAD_DIM_CFG // 2]
        )

        for h in pl.range(NUM_HEADS_CFG):
            q_col = h * QK_HEAD_DIM_CFG
            q_nope_col = h * QK_NOPE_HEAD_DIM_CFG
            q_pe_col = h * QK_ROPE_HEAD_DIM_CFG
            q_nope = pl.slice(q_proj_out, [1, QK_NOPE_HEAD_DIM_CFG], [b, q_col])
            q_nope_out = pl.assemble(q_nope_out, q_nope, [b, q_nope_col])
            q_lo = pl.cast(
                pl.slice(
                    q_proj_out,
                    [1, QK_ROPE_HEAD_DIM_CFG // 2],
                    [b, q_col + QK_NOPE_HEAD_DIM_CFG],
                ),
                target_type=pl.FP32,
            )
            q_hi = pl.cast(
                pl.slice(
                    q_proj_out,
                    [1, QK_ROPE_HEAD_DIM_CFG // 2],
                    [b, q_col + QK_NOPE_HEAD_DIM_CFG + QK_ROPE_HEAD_DIM_CFG // 2],
                ),
                target_type=pl.FP32,
            )
            q_rot_lo = pl.sub(pl.col_expand_mul(q_lo, cos_lo), pl.col_expand_mul(q_hi, sin_lo))
            q_rot_hi = pl.add(pl.col_expand_mul(q_hi, cos_hi), pl.col_expand_mul(q_lo, sin_hi))
            q_proj_out = pl.assemble(
                q_proj_out,
                pl.cast(q_rot_lo, target_type=pl.BF16),
                [b, q_col + QK_NOPE_HEAD_DIM_CFG],
            )
            q_proj_out = pl.assemble(
                q_proj_out,
                pl.cast(q_rot_hi, target_type=pl.BF16),
                [b, q_col + QK_NOPE_HEAD_DIM_CFG + QK_ROPE_HEAD_DIM_CFG // 2],
            )
            q_pe_out = pl.assemble(q_pe_out, pl.cast(q_rot_lo, target_type=pl.BF16), [b, q_pe_col])
            q_pe_out = pl.assemble(
                q_pe_out,
                pl.cast(q_rot_hi, target_type=pl.BF16),
                [b, q_pe_col + QK_ROPE_HEAD_DIM_CFG // 2],
            )

with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer):
    for b in pl.parallel(0, BATCH_CFG, 1, chunk=4):
        ctx_len = pl.tensor.read(seq_lens, [b])
        pos = ctx_len - 1
        cache_row = b * MAX_SEQ_CFG + pos

        cos_lo = pl.slice(rope_cos, [1, QK_ROPE_HEAD_DIM_CFG // 2], [pos, 0])
        cos_hi = pl.slice(
            rope_cos, [1, QK_ROPE_HEAD_DIM_CFG // 2], [pos, QK_ROPE_HEAD_DIM_CFG // 2]
        )
        sin_lo = pl.slice(rope_sin, [1, QK_ROPE_HEAD_DIM_CFG // 2], [pos, 0])
        sin_hi = pl.slice(
            rope_sin, [1, QK_ROPE_HEAD_DIM_CFG // 2], [pos, QK_ROPE_HEAD_DIM_CFG // 2]
        )
        kv_normed_row = pl.slice(kv_normed_out, [1, KV_LORA_RANK_CFG], [b, 0])

        pe_lo = pl.cast(
            pl.slice(kv_a_out, [1, QK_ROPE_HEAD_DIM_CFG // 2], [b, KV_LORA_RANK_CFG]),
            target_type=pl.FP32,
        )
        pe_hi = pl.cast(
            pl.slice(
                kv_a_out,
                [1, QK_ROPE_HEAD_DIM_CFG // 2],
                [b, KV_LORA_RANK_CFG + QK_ROPE_HEAD_DIM_CFG // 2],
            ),
            target_type=pl.FP32,
        )

        pe_rot_lo = pl.sub(pl.col_expand_mul(pe_lo, cos_lo), pl.col_expand_mul(pe_hi, sin_lo))
        pe_rot_hi = pl.add(pl.col_expand_mul(pe_hi, cos_hi), pl.col_expand_mul(pe_lo, sin_hi))

        kv_cache = pl.assemble(kv_cache, kv_normed_row, [cache_row, 0])
        pe_cache = pl.assemble(pe_cache, pl.cast(pe_rot_lo, target_type=pl.BF16), [cache_row, 0])
        pe_cache = pl.assemble(
            pe_cache,
            pl.cast(pe_rot_hi, target_type=pl.BF16),
            [cache_row, QK_ROPE_HEAD_DIM_CFG // 2],
        )
```
Actionable comments posted: 1
🧹 Nitpick comments (1)
examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope1.py (1)
264-362: Consider merging the two per-batch parallel loops.

Both `with pl.at(..., optimization=pl.chunked_loop_optimizer)` blocks iterate `pl.parallel(0, BATCH_CFG, 1, chunk=4)` and independently re-read `seq_lens[b]`, recompute `pos`/`cache_row`, and re-slice the same four `rope_cos`/`rope_sin` halves. Fusing them into a single loop would remove the duplicated reads and let the PE rotation share `cos_lo`/`cos_hi`/`sin_lo`/`sin_hi` with the Q-PE rotation. It would also let you drop the misplaced "Cache preparation: write RMS-normalized KV latent and rotated k_pe" portion of the comment at 264-267 (which currently describes the second loop but sits above the first).

Leaving this as optional in case the split is intentional for scheduling/tiling reasons — please verify whether the chunked-loop optimizer or downstream compile passes prefer them separate before applying.
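The effect of the suggested fusion can be shown with a toy plain-Python version (not the `pl` DSL; `BATCH`, `MAX_SEQ`, and the list outputs are invented for the example). The two-pass and fused forms produce identical cache rows; the fused form simply reads `seq_lens[b]` and computes `pos`/`cache_row` once per batch element:

```python
import numpy as np

rng = np.random.default_rng(2)
BATCH, MAX_SEQ = 4, 16
seq_lens = rng.integers(1, MAX_SEQ + 1, size=BATCH)

def two_passes(seq_lens):
    rows_q, rows_kv = [], []
    for b in range(BATCH):                 # pass 1: stands in for the Q-side loop
        pos = seq_lens[b] - 1
        rows_q.append(b * MAX_SEQ + pos)
    for b in range(BATCH):                 # pass 2: stands in for the KV-cache loop
        pos = seq_lens[b] - 1              # duplicated read and recompute
        rows_kv.append(b * MAX_SEQ + pos)
    return rows_q, rows_kv

def fused(seq_lens):
    rows_q, rows_kv = [], []
    for b in range(BATCH):
        pos = seq_lens[b] - 1              # computed once, shared by both uses
        row = b * MAX_SEQ + pos
        rows_q.append(row)
        rows_kv.append(row)
    return rows_q, rows_kv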
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope1.py` around lines 264 - 362, The two chunked-loop optimizer blocks both iterate pl.parallel over BATCH_CFG and duplicate reading seq_lens/pos/cache_row and slicing rope_cos/rope_sin; merge them into a single with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer) loop that contains both the Q split/RoPE operations (q_nope/q_pe, q_rot_lo/q_rot_hi, q_proj_out/q_pe_out assembly) and the KV cache preparation (kv_normed_row, pe_lo/pe_hi, pe_rot_lo/pe_rot_hi, kv_cache/pe_cache assembly), reuse the computed cos_lo/cos_hi/sin_lo/sin_hi and pos/cache_row to avoid redundant work, and remove or relocate the misleading cache-preparation comment currently sitting above the first block; keep the same variable names (q_proj_out, q_pe_out, q_nope_out, kv_normed_out, kv_a_out, kv_cache, pe_cache, rope_cos, rope_sin, seq_lens) so schedule semantics remain clear and run tests to ensure chunked_loop_optimizer behavior is unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 39726079-cc1a-47c0-bc9d-3068464bbbfc
📒 Files selected for processing (1)
examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope1.py
```python
q_proj_out = pl.assemble(
    q_proj_out,
    pl.cast(q_rot_lo, target_type=pl.BF16),
    [b, q_col + QK_NOPE_HEAD_DIM_CFG],
)
q_proj_out = pl.assemble(
    q_proj_out,
    pl.cast(q_rot_hi, target_type=pl.BF16),
    [b, q_col + QK_NOPE_HEAD_DIM_CFG + QK_ROPE_HEAD_DIM_CFG // 2],
)
```
Dead writes: rotated q_pe values stored back into local q_proj_out are never read.
`q_proj_out` is a local tensor (line 125) and is not consumed after this loop — the function returns `q_pe_out`, and the second parallel loop (lines 324-362) doesn't read from `q_proj_out`. The in-place RoPE write-back into the PE slice of `q_proj_out` is therefore wasted work inside a tight per-head loop. The comment at 264-265 about "keeping the internal full Q layout available for the in-place RoPE writes" no longer reflects the actual data flow — `q_pe_out` is assembled directly from `q_rot_lo`/`q_rot_hi` two lines below.
🧹 Proposed cleanup

```diff
 q_rot_lo = pl.sub(pl.col_expand_mul(q_lo, cos_lo), pl.col_expand_mul(q_hi, sin_lo))
 q_rot_hi = pl.add(pl.col_expand_mul(q_hi, cos_hi), pl.col_expand_mul(q_lo, sin_hi))
-q_proj_out = pl.assemble(
-    q_proj_out,
-    pl.cast(q_rot_lo, target_type=pl.BF16),
-    [b, q_col + QK_NOPE_HEAD_DIM_CFG],
-)
-q_proj_out = pl.assemble(
-    q_proj_out,
-    pl.cast(q_rot_hi, target_type=pl.BF16),
-    [b, q_col + QK_NOPE_HEAD_DIM_CFG + QK_ROPE_HEAD_DIM_CFG // 2],
-)
 q_pe_out = pl.assemble(q_pe_out, pl.cast(q_rot_lo, target_type=pl.BF16), [b, q_pe_col])
 q_pe_out = pl.assemble(
     q_pe_out,
     pl.cast(q_rot_hi, target_type=pl.BF16),
     [b, q_pe_col + QK_ROPE_HEAD_DIM_CFG // 2],
 )
```

And refresh the comment block around 264-265 accordingly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope1.py` around
lines 307 - 316, The writes that assemble q_rot_lo/q_rot_hi back into the local
q_proj_out buffer (via q_proj_out = pl.assemble(...)) are dead because
q_proj_out is never read later; only q_pe_out is returned and used — remove the
two in-place RoPE assemble calls that write to q_proj_out (the ones using
q_rot_lo and q_rot_hi with QK_NOPE_HEAD_DIM_CFG and QK_ROPE_HEAD_DIM_CFG
offsets) and clean up the surrounding comment near the earlier "keeping the
internal full Q layout" notes to reflect that we now directly assemble q_pe_out
from q_rot_lo/q_rot_hi instead of maintaining a full Q layout in q_proj_out.
Ensure references to q_rot_lo, q_rot_hi, q_proj_out, q_pe_out,
QK_NOPE_HEAD_DIM_CFG and QK_ROPE_HEAD_DIM_CFG are updated or removed in comments
as appropriate.
Summary
Related Issues
None