Update: replace decode-front with ds32exp and add scope3 #137
Conversation
- Replace deepseek_v3_2_decode_front.py with ds32exp.py from ds32 branch: reorganised into 4 explicit scopes (qkv proj, indexer proj, score+topk, sparse MQA dispatch), adds Hadamard/FP8 placeholders, k_cache_idx write, and per-head weighted q_idx aggregation
- Add deepseek_v3_2_decode_front_scope3.py (scope3.py from ds32): standalone scope covering score+topk via tiled QK matmul, sort32+mrgsort, and gather to produce topk_vals_out / topk_idx_out
📝 Walkthrough
Refactored the DeepSeek V3.2 decode FRONT program from a single monolithic implementation into 4 explicit stages with updated tensor interfaces. Added new indexer-related kernel inputs, replaced cross-node dispatch indexing logic, and introduced a new top-k scoring program (scope3) for ranking cached keys. Removed prior dynamic shape configuration and adopted fixed compile-time constants.
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Code Review
This pull request refactors the DeepSeek V3.2-EXP single-layer decode FRONT implementation into four distinct scopes: qkv projection/RoPE, indexer projection/RoPE, scoring/topk, and post-topk sparse attention. It introduces a new file for Scope 3 (scoring and topk) using sort32 and mrgsort. Feedback includes addressing performance inefficiencies in the online softmax accumulation and redundant tensor initializations within loops, as well as correcting a type mismatch in the Scope 3 function signature.
```diff
 oi = pl.full([1, KV_LORA_RANK], dtype=pl.FP32, value=0.0)
 li = pl.full([1, 1], dtype=pl.FP32, value=0.0)
 mi = pl.full([1, 1], dtype=pl.FP32, value=0.0)

-sparse_k = pl.min(INDEX_TOPK_CFG, ctx_len)
 for kk in pl.range(sparse_k):
-    topk_pos = pl.tensor.read(topk_idx, [0, kk])
+    topk_pos = pl.tensor.read(topk_idx, [b, kk])
     if topk_pos >= 0:
-        cache_s = b * MAX_SEQ_CFG + topk_pos
+        cache_s = b * MAX_SEQ + topk_pos
         kv_s = pl.cast(
-            pl.slice(kv_cache, [1, KV_LORA_RANK_CFG], [cache_s, 0]),
+            pl.slice(kv_cache, [1, KV_LORA_RANK], [cache_s, 0]),
             target_type=pl.FP32,
         )
         pe_s = pl.cast(
-            pl.slice(pe_cache, [1, QK_ROPE_HEAD_DIM_CFG], [cache_s, 0]),
+            pl.slice(pe_cache, [1, QK_ROPE_HEAD_DIM], [cache_s, 0]),
             target_type=pl.FP32,
         )
         score_nope = pl.row_sum(pl.mul(q_nope_latent, kv_s))
-        score_pe = pl.row_sum(pl.mul(q_rot, pe_s))
-        score = pl.mul(pl.add(score_nope, score_pe), ATTN_SCALE)
-        cur_mi = score
-        cur_li = pl.exp(pl.sub(score, cur_mi))
-        oi_tmp = pl.row_expand_mul(kv_s, cur_li)
+        score_pe = pl.row_sum(pl.mul(q_pe, pe_s))
+        cur_mi = pl.mul(pl.add(score_nope, score_pe), ATTN_SCALE)
+        cur_li = pl.full([1, 1], dtype=pl.FP32, value=1.0)
         if kk == 0:
-            oi = oi_tmp
+            oi = kv_s
             li = cur_li
             mi = cur_mi
         else:
             mi_new = pl.maximum(mi, cur_mi)
             alpha = pl.exp(pl.sub(mi, mi_new))
             beta = pl.exp(pl.sub(cur_mi, mi_new))
             li = pl.add(pl.mul(alpha, li), pl.mul(beta, cur_li))
-            oi = pl.add(pl.row_expand_mul(oi, alpha), pl.row_expand_mul(oi_tmp, beta))
+            oi = pl.add(
+                pl.row_expand_mul(oi, alpha),
+                pl.row_expand_mul(kv_s, beta),
+            )
             mi = mi_new
```
The online softmax accumulation inside the kk loop is inefficient. Creating the cur_li tensor (pl.full) in every iteration of a loop that can run up to 2048 times significantly impacts performance. Additionally, the if kk == 0 branch can be eliminated by initializing mi to the minimum possible float value and li to zero. This avoids branching and redundant allocations in a tight loop, making the code cleaner and more performant.
```python
oi = pl.full([1, KV_LORA_RANK], dtype=pl.FP32, value=0.0)
li = pl.full([1, 1], dtype=pl.FP32, value=0.0)
mi = pl.full([1, 1], dtype=pl.FP32, value=-3.4028235e38)
for kk in pl.range(sparse_k):
    topk_pos = pl.tensor.read(topk_idx, [b, kk])
    if topk_pos >= 0:
        cache_s = b * MAX_SEQ + topk_pos
        kv_s = pl.cast(
            pl.slice(kv_cache, [1, KV_LORA_RANK], [cache_s, 0]),
            target_type=pl.FP32,
        )
        pe_s = pl.cast(
            pl.slice(pe_cache, [1, QK_ROPE_HEAD_DIM], [cache_s, 0]),
            target_type=pl.FP32,
        )
        score_nope = pl.row_sum(pl.mul(q_nope_latent, kv_s))
        score_pe = pl.row_sum(pl.mul(q_pe, pe_s))
        cur_mi = pl.mul(pl.add(score_nope, score_pe), ATTN_SCALE)
        mi_new = pl.maximum(mi, cur_mi)
        alpha = pl.exp(pl.sub(mi, mi_new))
        beta = pl.exp(pl.sub(cur_mi, mi_new))
        li = pl.add(pl.mul(alpha, li), beta)
        oi = pl.add(
            pl.row_expand_mul(oi, alpha),
            pl.row_expand_mul(kv_s, beta),
        )
        mi = mi_new
```

```python
q_idx: pl.Tensor[[BATCH, INDEX_HEAD_DIM], pl.BF16],
k_cache_idx: pl.Tensor[[CACHE_ROWS_IDX, INDEX_HEAD_DIM], pl.BF16],
seq_lens: pl.Tensor[[BATCH], pl.INT32],
idx_init: pl.Tensor[[1, MAX_SEQ], pl.UINT32],
```
There is a type mismatch for the idx_init parameter. The function signature uses pl.UINT32, but the corresponding TensorSpec in build_tensor_specs and the initialization in init_idx_init use torch.int32. Using pl.INT32 in the signature will ensure consistency with the input data type and avoid potential runtime errors.
```diff
-idx_init: pl.Tensor[[1, MAX_SEQ], pl.UINT32],
+idx_init: pl.Tensor[[1, MAX_SEQ], pl.INT32],
```
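To see why the signed type matters here, the sentinel convention can be sanity-checked outside the kernel. A NumPy sketch (NumPy stands in for the pl dtypes; this is an illustration, not the repo's test code):

```python
import numpy as np

# INT32_MIN is the padding sentinel used by the topk pipeline.
INT32_MIN = np.iinfo(np.int32).min  # -2147483648

pad = np.full((1, 4), INT32_MIN, dtype=np.int32)
# The scope-4 `topk_pos >= 0` filter relies on the sentinel being negative.
assert (pad < 0).all()

# Reinterpreting the same bits as uint32 flips the sentinel to a large
# positive value, so a `>= 0` guard would no longer reject padded slots.
assert (pad.view(np.uint32) == 2147483648).all()
```

This is exactly the failure mode the UINT32 declaration invites once the INT32_MIN padding lands in idx_init.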
```python
# Transient GM buffers.
scores = pl.create_tensor([BATCH, MAX_SEQ], dtype=pl.FP32)
sorted_gm = pl.create_tensor([BATCH, 2 * MAX_SEQ], dtype=pl.FP32)
raw_idx_gm = pl.create_tensor([BATCH, INDEX_TOPK], dtype=pl.INT32)

for b in pl.range(0, BATCH, 1):
    ctx_len = pl.tensor.read(seq_lens, [b])
    ctx_blocks = (ctx_len + SEQ_TILE - 1) // SEQ_TILE

    # Stage 0: pre-fill scores[b] with -inf. Stage 2's parallel
    # loop only covers [0, ctx_blocks), so untouched tail slots
    # keep the create_tensor default of 0.0 without this.
    with pl.at(level=pl.Level.CORE_GROUP):
        neg_inf_row = pl.full([1, MAX_SEQ], dtype=pl.FP32, value=FP32_NEG_INF)
        scores = pl.assemble(scores, neg_inf_row, [b, 0])
```
The neg_inf_row tensor is initialized inside the sequential batch loop. Moving this initialization outside the loop avoids redundant tensor creation and initialization for each batch, which improves performance.
```diff
 # Transient GM buffers.
 scores = pl.create_tensor([BATCH, MAX_SEQ], dtype=pl.FP32)
 sorted_gm = pl.create_tensor([BATCH, 2 * MAX_SEQ], dtype=pl.FP32)
 raw_idx_gm = pl.create_tensor([BATCH, INDEX_TOPK], dtype=pl.INT32)
+neg_inf_row = pl.full([1, MAX_SEQ], dtype=pl.FP32, value=FP32_NEG_INF)
 for b in pl.range(0, BATCH, 1):
     ctx_len = pl.tensor.read(seq_lens, [b])
     ctx_blocks = (ctx_len + SEQ_TILE - 1) // SEQ_TILE
     # Stage 0: pre-fill scores[b] with -inf. Stage 2's parallel
     # loop only covers [0, ctx_blocks), so untouched tail slots
     # keep the create_tensor default of 0.0 without this.
     with pl.at(level=pl.Level.CORE_GROUP):
-        neg_inf_row = pl.full([1, MAX_SEQ], dtype=pl.FP32, value=FP32_NEG_INF)
         scores = pl.assemble(scores, neg_inf_row, [b, 0])
```
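The hoisting is behavior-preserving: the row is read-only inside the loop. A NumPy analogue of the stage-0 pre-fill with toy shapes (shapes shrunk for illustration):

```python
import numpy as np

BATCH, MAX_SEQ = 4, 16  # toy sizes, not the kernel's constants
FP32_NEG_INF = np.float32(-3.4028235e38)

# Hoisted: allocated once, reused for every batch row.
neg_inf_row = np.full((1, MAX_SEQ), FP32_NEG_INF, dtype=np.float32)

scores = np.zeros((BATCH, MAX_SEQ), dtype=np.float32)
for b in range(BATCH):
    # Stage-0 pre-fill; no per-iteration allocation needed.
    scores[b : b + 1, :] = neg_inf_row

assert (scores == FP32_NEG_INF).all()
```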
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py (1)
36-37: ⚠️ Potential issue | 🟠 Major
MAX_SEQ diverges from decode_front_scope3.py (4096 vs 8192).
decode_front_scope3.py (the real score+topk implementation intended to replace Stage 3's stub) declares MAX_SEQ = 8192, while this file uses 4096. Because CACHE_ROWS = BATCH * MAX_SEQ and k_cache_idx/kv_cache/pe_cache are sized off this constant, the two scopes cannot be wired together as-is — cache shapes won't line up. Please align the constants (ideally by extracting them into a shared module) before integrating scope3.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py` around lines 36 - 37, MAX_SEQ in deepseek_v3_2_decode_front.py is set to 4096 while decode_front_scope3.py uses 8192, causing mismatched cache shapes because CACHE_ROWS = BATCH * MAX_SEQ and k_cache_idx / kv_cache / pe_cache are sized from it; fix by making the two files use the same constant (preferably extract BATCH and MAX_SEQ into a shared config module and import them in both files) and then update CACHE_ROWS and any cache allocation logic (k_cache_idx, kv_cache, pe_cache) to reference the shared constants so shapes line up when wiring scope3.
🧹 Nitpick comments (2)
examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py (2)
567-569: Inconsistent chunk= values across parallel loops.
Other parallel-over-batch loops in this file use chunk=BATCH // BATCH_TILE (= 4) or chunk=BATCH, while scope 4's outer batch loop and the dispatch loop use a literal chunk=4 (lines 569, 658), and the inner head loop uses a literal chunk=8 (line 574). That happens to equal BATCH // BATCH_TILE today, but if BATCH or BATCH_TILE changes, scope 4 silently drifts out of sync. Prefer deriving these from the existing constants (e.g., chunk=BATCH // BATCH_TILE, chunk=NUM_HEADS // HEAD_TILE).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py` around lines 567 - 569, The parallel loops use hard-coded chunk sizes that can diverge from constants; update the pl.parallel calls around attn_front and related loops (the outer batch loop, the dispatch loop, and the inner head loop) to derive chunk values from the existing constants instead of literals — use chunk=BATCH // BATCH_TILE for batch-level loops and chunk=NUM_HEADS // HEAD_TILE for head-level loops (or chunk=BATCH where other loops use full-batch tiling) so all pl.parallel(...) invocations remain consistent with BATCH, BATCH_TILE, NUM_HEADS, and HEAD_TILE.
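The derivation the nitpick asks for is a one-liner. A sketch with illustrative values (the concrete NUM_HEADS/HEAD_TILE numbers are assumptions, not read from the repo):

```python
# Illustrative constants; BATCH matches the PR, the others are assumed.
BATCH, BATCH_TILE = 16, 4
NUM_HEADS, HEAD_TILE = 64, 8

# Derived chunk sizes track the constants instead of silently drifting.
batch_chunk = BATCH // BATCH_TILE    # 4 today
head_chunk = NUM_HEADS // HEAD_TILE  # 8 today

assert batch_chunk == 4 and head_chunk == 8
```

With this, changing BATCH or BATCH_TILE updates every pl.parallel call site at once.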
522-564: Stage 3 computes scores that are then thrown away.
Stages 3.1a/3.1b materialize the full [BATCH, MAX_SEQ] scores tensor, but Stage 3.2 ignores it and produces topk_idx as all zeros. Only topk_idx feeds scope 4, so the scoring work is currently dead on-device compute and transient GM. Two options, pick whichever fits your timeline:
- Guard the scoring under an if 0: / behind a config flag until the real topk lands, to save kernel compile/run time during iteration.
- Integrate deepseek_v3_2_decode_front_scope3.py here now — it already emits topk_vals_out/topk_idx_out with the INT32_MIN padding convention that matches the topk_pos >= 0 filter in scope 4 (line 607).
Additionally, note that scores here is never explicitly initialized to -inf before Stage 3.1b writes only the [0, ctx_blocks) range; decode_front_scope3.py:100-105 has exactly the pre-fill you'll need once topk is real.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py` around lines 522 - 564, The scoring loop currently computes a full scores tensor (scores) which is never used because topk_idx is hardcoded to zeros, wasting on-device compute and memory; either guard the scoring stage behind a config flag or if 0 to skip it during iteration, or replace the placeholder with the real scope3 implementation (import/inline logic from deepseek_v3_2_decode_front_scope3.py to produce topk_vals_out/topk_idx_out using the INT32_MIN padding convention consumed by the topk_pos >= 0 filter in scope4). Also explicitly initialize scores to -inf (use PadValue.min or the same pre-fill used in decode_front_scope3.py) before Stage 3.1 so partial writes and fillpad behavior are correct once a real topk is used.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 72f01501-50d0-4ec4-8c70-8c6e49e07647
📒 Files selected for processing (2)
- examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py
- examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py
```python
BATCH = 16
MAX_SEQ = 8192
INDEX_HEAD_DIM = 128
INDEX_TOPK = 2048
CACHE_ROWS_IDX = BATCH * MAX_SEQ

SEQ_TILE = 64
MAX_SEQ_BLOCKS = (MAX_SEQ + SEQ_TILE - 1) // SEQ_TILE

# Q pad: a2a3 TExtract requires row % 16 == 0, so pad the 1-row query to 16.
Q_VALID = 1
Q_PAD = 16

# sort32 + 4 mrgsort iterations (block_len 64,256,1024,4096) sort MAX_SEQ=8192.
MRGSORT_ITERS = 4
```
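The MRGSORT_ITERS = 4 constant can be derived rather than hard-coded. A small sketch, assuming the first mrgsort pass starts from 64-element sorted runs and each pass merges 4 runs (both assumptions read off the comment above):

```python
import math

RUN0 = 64   # assumed sorted-run length entering the first mrgsort pass
WAYS = 4    # assumed merge fan-in per mrgsort iteration
MAX_SEQ = 8192

# Passes needed for run length RUN0 * WAYS**iters to reach MAX_SEQ.
iters = math.ceil(math.log(MAX_SEQ / RUN0, WAYS))
block_lens = [RUN0 * WAYS**i for i in range(iters)]

assert iters == 4
assert block_lens == [64, 256, 1024, 4096]  # matches the comment's sequence
```

Deriving the constant this way would keep MRGSORT_ITERS correct if MAX_SEQ changes.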
MAX_SEQ = 8192 diverges from deepseek_v3_2_decode_front.py (4096).
The AI summary calls out that this file provides the real topk intended to replace the stub in ds32exp_decode_front. For that integration to work, BATCH, MAX_SEQ, INDEX_HEAD_DIM, INDEX_TOPK, and CACHE_ROWS_IDX must match between the two files. Today they don't (MAX_SEQ in particular). Please extract the shared constants into a common module (or at least add a matching assertion) so the two scopes can be composed without a silent shape mismatch on k_cache_idx.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py` around
lines 35 - 49, The constants BATCH, MAX_SEQ, INDEX_HEAD_DIM, INDEX_TOPK, and
CACHE_ROWS_IDX in deepseek_v3_2_decode_front_scope3.py must match
deepseek_v3_2_decode_front.py to avoid silent shape mismatches (k_cache_idx);
either extract these shared constants into a common module (e.g., shared_config)
and import them in both files, or add a runtime assertion in
deepseek_v3_2_decode_front_scope3.py that compares its BATCH, MAX_SEQ,
INDEX_HEAD_DIM, INDEX_TOPK, CACHE_ROWS_IDX against the values from
deepseek_v3_2_decode_front.py and fails fast with a clear message; update
references to CACHE_ROWS_IDX = BATCH * MAX_SEQ to use the shared constant so
both scopes remain consistent.
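A minimal sketch of the shared-module option the prompt describes; the module and function names are hypothetical:

```python
# ds32_config.py (hypothetical shared module; names are suggestions only)
BATCH = 16
MAX_SEQ = 8192
INDEX_HEAD_DIM = 128
INDEX_TOPK = 2048
CACHE_ROWS_IDX = BATCH * MAX_SEQ


def check_consistent(batch: int, max_seq: int) -> None:
    """Fail fast if a scope file drifted from the shared constants."""
    if (batch, max_seq) != (BATCH, MAX_SEQ):
        raise AssertionError(
            f"constant mismatch: got {(batch, max_seq)}, expected {(BATCH, MAX_SEQ)}"
        )


# Each scope file either imports the constants or, as the lighter fix,
# asserts its local copies against them at import time:
check_consistent(16, 8192)
```

Either variant turns the current silent shape mismatch on k_cache_idx into an immediate, readable failure.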
```python
q_idx: pl.Tensor[[BATCH, INDEX_HEAD_DIM], pl.BF16],
k_cache_idx: pl.Tensor[[CACHE_ROWS_IDX, INDEX_HEAD_DIM], pl.BF16],
seq_lens: pl.Tensor[[BATCH], pl.INT32],
idx_init: pl.Tensor[[1, MAX_SEQ], pl.UINT32],
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Confirm how idx_init is declared/typed elsewhere in the repo, and how sort32
# expects its index operand.
rg -nP --type=py -C3 '\bidx_init\b'
rg -nP --type=py -C3 'pl\.tile\.sort32\s*\('
```
Repository: hw-native-sys/pypto-lib
Length of output: 3726
🏁 Script executed:
# Check lines around 236 for init_idx_init initialization
sed -n '230,245p' examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py
# Search for INT32_MIN in the file
rg -n 'INT32_MIN|INT32MIN|int32_min|int.min' examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py
# Check the kernel implementation for any sentinel values
rg -n 'sentinel|INT32' examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.pyRepository: hw-native-sys/pypto-lib
Length of output: 1638
🏁 Script executed:
# Search for sort32 definition or documentation
rg -n 'def sort32|sort32.*signature|sort32.*idx' examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py
# Check if there are other kernel declarations using similar patterns
rg -n 'pl\.Tensor.*pl\.(INT32|UINT32)' examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.pyRepository: hw-native-sys/pypto-lib
Length of output: 379
Change idx_init from UINT32 to INT32 in the kernel signature (line 64); the TensorSpec (line 253) already uses torch.int32 and needs no change.
The kernel declares idx_init as pl.UINT32, but it is initialized with torch.arange(MAX_SEQ, dtype=torch.int32) (line 236) and defined as torch.int32 in the TensorSpec (line 253). More critically, the algorithm uses INT32_MIN as a sentinel value for padding (lines 212, 175, 226), which cannot be represented in an unsigned integer. Align to INT32 to match the actual data type, enable the sentinel value, and maintain consistency with topk_idx_out which is already declared as INT32.
Changes needed
Line 64:
```diff
- idx_init: pl.Tensor[[1, MAX_SEQ], pl.UINT32],
+ idx_init: pl.Tensor[[1, MAX_SEQ], pl.INT32],
```
Line 253:
```diff
- TensorSpec("idx_init", [1, MAX_SEQ], torch.int32, init_value=init_idx_init),
+ TensorSpec("idx_init", [1, MAX_SEQ], torch.int32, init_value=init_idx_init),
```
(Line 253 is already correct and requires no change.)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-idx_init: pl.Tensor[[1, MAX_SEQ], pl.UINT32],
+idx_init: pl.Tensor[[1, MAX_SEQ], pl.INT32],
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py` at line
64, The kernel signature currently declares idx_init as pl.UINT32 but the code
initializes it with torch.int32 and uses INT32_MIN sentinel values; update the
kernel parameter type for idx_init to pl.INT32 to match the actual data and the
TensorSpec (and to align with topk_idx_out which is pl.INT32); locate the
declaration of idx_init in the kernel signature and change its type from
pl.UINT32 to pl.INT32, and ensure any associated TensorSpec references (e.g.,
where idx_init is described) are consistent with INT32.
```python
# Stage 1.5: q_pe RoPE on every MLA head, k_pe RoPE on kv_a, kv_norm,
# write kv_cache and pe_cache at row b*MAX_SEQ + (seq_lens[b]-1).
# NOTE: official applies interleaved=True for MLA, but the existing
# decode/prefill paths in this repo use the lo/hi half split form
# (see deepseek_v3_2_decode_front.py:241-274). We follow the same
# convention so the cached pe matches the in-tree consumers.
```
Stale self-reference in the comment.
"the existing decode/prefill paths in this repo use the lo/hi half split form (see deepseek_v3_2_decode_front.py:241-274)" — this is deepseek_v3_2_decode_front.py, and the referenced line range points inside the RoPE block you're writing, not the prior monolithic implementation you intend to cite. Please update the reference to point to the actual source (e.g., ds_q0_rope.py, or whatever remained after the rename).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py` around lines 206
- 211, The comment in deepseek_v3_2_decode_front.py incorrectly self-references
lines 241-274; update the comment in the RoPE block (the Stage 1.5 / MLA RoPE
comment) to reference the actual prior implementation file or module (for
example ds_q0_rope.py or the correct renamed source) or remove the line-numbered
self-reference altogether; ensure the comment mentions the correct symbol/module
that contains the monolithic implementation so readers can find the original
lo/hi half split convention.
```diff
 topk_idx = pl.create_tensor([BATCH, INDEX_TOPK], dtype=pl.INT32)
 with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer):
     for b in pl.parallel(0, BATCH, 1, chunk=BATCH):
         zero_row = pl.full([1, INDEX_TOPK], dtype=pl.INT32, value=0)
         topk_idx = pl.assemble(topk_idx, zero_row, [b, 0])

 # ── Scope 4: post topk (sparse MQA + dispatch) ──
 attn_front = pl.create_tensor([BATCH, ATTN_OUT], dtype=pl.FP32)
 with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer):
     for b in pl.parallel(0, BATCH, 1, chunk=4):
         ctx_len = pl.tensor.read(seq_lens, [b])
         sparse_k = pl.min(INDEX_TOPK, ctx_len)
         attn_row = pl.full([1, ATTN_OUT], dtype=pl.FP32, value=0.0)

         for h in pl.parallel(0, NUM_HEADS, 1, chunk=8):
             q_col = h * QK_HEAD_DIM
             # q_pe was already RoPE-rotated in Scope 1, so we read it
             # back as-is. q_nope is projected to the latent space
             # via per-head w_q_nope_to_latent.
             q_nope = pl.cast(
-                pl.slice(q_proj, [1, QK_NOPE_HEAD_DIM_CFG], [b, q_col]),
+                pl.slice(q_proj, [1, QK_NOPE_HEAD_DIM], [b, q_col]),
                 target_type=pl.FP32,
             )
             q_pe = pl.cast(
-                pl.slice(q_proj, [1, QK_ROPE_HEAD_DIM_CFG], [b, q_col + QK_NOPE_HEAD_DIM_CFG]),
+                pl.slice(
+                    q_proj, [1, QK_ROPE_HEAD_DIM], [b, q_col + QK_NOPE_HEAD_DIM]
+                ),
                 target_type=pl.FP32,
             )
-            q_lo = pl.slice(q_pe, [1, QK_ROPE_HEAD_DIM_CFG // 2], [0, 0])
-            q_hi = pl.slice(q_pe, [1, QK_ROPE_HEAD_DIM_CFG // 2], [0, QK_ROPE_HEAD_DIM_CFG // 2])
-            q_rot = pl.create_tensor([1, QK_ROPE_HEAD_DIM_CFG], dtype=pl.FP32)
-            q_rot = pl.assemble(
-                q_rot,
-                pl.sub(pl.col_expand_mul(q_lo, cos_lo), pl.col_expand_mul(q_hi, sin_lo)),
-                [0, 0],
-            )
-            q_rot = pl.assemble(
-                q_rot,
-                pl.add(pl.col_expand_mul(q_hi, cos_hi), pl.col_expand_mul(q_lo, sin_hi)),
-                [0, QK_ROPE_HEAD_DIM_CFG // 2],
+            w_qn_h = pl.reshape(
+                pl.slice(
+                    w_q_nope_to_latent,
+                    [1, QK_NOPE_HEAD_DIM, KV_LORA_RANK],
+                    [h, 0, 0],
+                ),
+                [QK_NOPE_HEAD_DIM, KV_LORA_RANK],
             )
             q_nope_latent = pl.matmul(
-                pl.cast(q_nope, target_type=pl.BF16),
-                pl.reshape(
-                    pl.slice(
-                        w_q_nope_to_latent, [1, QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG], [h, 0, 0]
-                    ),
-                    [QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG],
-                ),
+                pl.cast(q_nope, target_type=pl.BF16), w_qn_h, out_dtype=pl.FP32
             )

-            oi = pl.create_tensor([1, KV_LORA_RANK_CFG], dtype=pl.FP32)
-            li = pl.create_tensor([1, 1], dtype=pl.FP32)
-            mi = pl.create_tensor([1, 1], dtype=pl.FP32)
-            oi = pl.mul(oi, 0.0)
-            li = pl.mul(li, 0.0)
-            mi = pl.mul(mi, 0.0)
+            oi = pl.full([1, KV_LORA_RANK], dtype=pl.FP32, value=0.0)
+            li = pl.full([1, 1], dtype=pl.FP32, value=0.0)
+            mi = pl.full([1, 1], dtype=pl.FP32, value=0.0)

-            sparse_k = pl.min(INDEX_TOPK_CFG, ctx_len)
             for kk in pl.range(sparse_k):
-                topk_pos = pl.tensor.read(topk_idx, [0, kk])
+                topk_pos = pl.tensor.read(topk_idx, [b, kk])
                 if topk_pos >= 0:
-                    cache_s = b * MAX_SEQ_CFG + topk_pos
+                    cache_s = b * MAX_SEQ + topk_pos
                     kv_s = pl.cast(
-                        pl.slice(kv_cache, [1, KV_LORA_RANK_CFG], [cache_s, 0]),
+                        pl.slice(kv_cache, [1, KV_LORA_RANK], [cache_s, 0]),
                         target_type=pl.FP32,
                     )
                     pe_s = pl.cast(
-                        pl.slice(pe_cache, [1, QK_ROPE_HEAD_DIM_CFG], [cache_s, 0]),
+                        pl.slice(pe_cache, [1, QK_ROPE_HEAD_DIM], [cache_s, 0]),
                         target_type=pl.FP32,
                     )
                     score_nope = pl.row_sum(pl.mul(q_nope_latent, kv_s))
-                    score_pe = pl.row_sum(pl.mul(q_rot, pe_s))
-                    score = pl.mul(pl.add(score_nope, score_pe), ATTN_SCALE)
-                    cur_mi = score
-                    cur_li = pl.exp(pl.sub(score, cur_mi))
-                    oi_tmp = pl.row_expand_mul(kv_s, cur_li)
+                    score_pe = pl.row_sum(pl.mul(q_pe, pe_s))
+                    cur_mi = pl.mul(pl.add(score_nope, score_pe), ATTN_SCALE)
+                    cur_li = pl.full([1, 1], dtype=pl.FP32, value=1.0)
                     if kk == 0:
-                        oi = oi_tmp
+                        oi = kv_s
                         li = cur_li
                         mi = cur_mi
                     else:
                         mi_new = pl.maximum(mi, cur_mi)
                         alpha = pl.exp(pl.sub(mi, mi_new))
                         beta = pl.exp(pl.sub(cur_mi, mi_new))
                         li = pl.add(pl.mul(alpha, li), pl.mul(beta, cur_li))
-                        oi = pl.add(pl.row_expand_mul(oi, alpha), pl.row_expand_mul(oi_tmp, beta))
+                        oi = pl.add(
+                            pl.row_expand_mul(oi, alpha),
+                            pl.row_expand_mul(kv_s, beta),
+                        )
                         mi = mi_new
```
Scope 4 silently degenerates to "attend only to position 0" with the current stub.
With topk_idx all-zero and topk_pos = 0 >= 0 every iteration, the inner loop (lines 605–634) invokes the online softmax update sparse_k = min(INDEX_TOPK, ctx_len) times against the same kv_s/pe_s at cache_row = b*MAX_SEQ + 0. The math is self-consistent (α=β=1 each step so ctx_latent == kv_s[0]) and matches the golden, so the test will pass, but the result is not meaningful attention.
This is documented as a TODO, but I'd flag two risks:
- If someone benchmarks this program they'll be measuring sparse_k redundant online-softmax updates against position 0, not real sparse MQA; perf numbers may be misleading.
- The placeholder also hides any bug in pl.tensor.read(topk_idx, [b, kk]) / bounds behavior that would show up only once topk_idx has non-zero entries.
Consider at minimum short-circuiting scope 4 to a single kk=0 step while the stub is in place, or wiring in decode_front_scope3 so scope 4 runs against real indices.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py` around lines 560
- 634, Scope 4 uses a zeroed topk_idx stub so the inner loop repeatedly attends
to position 0; to fix, either (A) short-circuit the inner loop when topk_idx is
a stub by capping sparse_k to 1 (i.e., set sparse_k = pl.minimum(sparse_k, 1) or
otherwise force a single kk iteration) to avoid redundant updates, or (B) wire
in the real index generation by invoking decode_front_scope3 (or the routine
that fills topk_idx) before Scope 4 so topk_idx contains real positions; locate
uses of topk_idx, sparse_k, topk_pos and cache_s in Scope 4 and implement one of
these fixes so the loop does not repeatedly read cache row b*MAX_SEQ+0.
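For reference, the recurrence scope 4 implements can be checked outside the pl DSL. A NumPy sketch of the same online softmax update, including the degenerate all-same-row case the comment describes:

```python
import numpy as np

def online_softmax_attend(scores, values):
    """Streaming softmax-weighted sum: one (score, value) pair per step."""
    oi = np.zeros(values.shape[1], dtype=np.float64)
    li, mi = 0.0, -np.inf  # -inf init removes the need for a kk == 0 branch
    for s, v in zip(scores, values):
        mi_new = max(mi, s)
        alpha, beta = np.exp(mi - mi_new), np.exp(s - mi_new)
        li = alpha * li + beta
        oi = alpha * oi + beta * v
        mi = mi_new
    return oi / li

rng = np.random.default_rng(0)
v = rng.standard_normal((8, 4))
s = rng.standard_normal(8)
# Matches the direct softmax-weighted sum.
w = np.exp(s - s.max())
ref = (w / w.sum()) @ v
assert np.allclose(online_softmax_attend(s, v), ref)

# With an all-zero topk stub every step sees the same score and the same
# cached row, so the result collapses to that row no matter how many
# updates run (alpha = beta = 1 each step).
same = online_softmax_attend(np.full(8, 1.23), np.tile(v[0], (8, 1)))
assert np.allclose(same, v[0])
```

This is an illustration of the math only, not the repo's kernel; it also demonstrates why capping sparse_k to 1 while the stub is in place changes nothing numerically.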
Summary
- Replace deepseek_v3_2_decode_front.py with ds32exp.py from the ds32 branch: reorganised into 4 explicit scopes (qkv proj, indexer proj, score+topk, sparse MQA dispatch), adds Hadamard/FP8 placeholders, k_cache_idx write, and per-head weighted q_idx aggregation
- Add deepseek_v3_2_decode_front_scope3.py (from scope3.py on ds32): standalone scope covering score+topk via tiled QK matmul, sort32+mrgsort pipeline, and gather to produce topk_vals_out / topk_idx_out