
Update: replace decode-front with ds32exp and add scope3 #137

Merged
zhangqi-chen merged 1 commit into hw-native-sys:main from zhangqi-chen:feat/ds32-decode-front-scope3 on Apr 21, 2026

Conversation

@zhangqi-chen (Collaborator)

Summary

  • Replace deepseek_v3_2_decode_front.py with ds32exp.py from the ds32 branch: reorganises the kernel into 4 explicit scopes (qkv proj, indexer proj, score+topk, sparse MQA dispatch), adds Hadamard/FP8 placeholders, a k_cache_idx write, and per-head weighted q_idx aggregation
  • Add deepseek_v3_2_decode_front_scope3.py (from scope3.py on the ds32 branch): a standalone scope covering score+topk via a tiled QK matmul, a sort32+mrgsort pipeline, and a gather producing topk_vals_out / topk_idx_out


coderabbitai bot commented Apr 21, 2026

📝 Walkthrough

Refactored the DeepSeek V3.2 decode FRONT program from a single monolithic implementation into 4 explicit stages with updated tensor interfaces. Added new indexer-related kernel inputs, replaced cross-node dispatch indexing logic, and introduced a new top-k scoring program (scope3) for ranking cached keys. Removed prior dynamic shape configuration and adopted fixed compile-time constants.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Main decode FRONT refactoring — `examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py` | Restructured the monolithic implementation into 4 explicit stages: qkv proj+RoPE, indexer proj+RoPE+placeholders, score+topk placeholder, sparse post-topk attention+dispatch. Updated the kernel signature to add k_cache_idx, wq_b_idx, wk_idx, k_norm_weight, k_norm_bias, and weights_proj inputs, and replaced dynamic *_CFG parameters with fixed constants. Changed dispatch indexing from (b + layer_id) % EP_NODES_CFG with an intermediate buffer to a fixed dispatch_buf[EP_NODES, BATCH, ATTN_OUT]. Replaced the prior top-k implementation with an all-zero stub. |
| New top-k scoring program — `examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py` | Added a new program computing per-batch top-INDEX_TOPK dot-product scores between the query vector and cached keys via tiled QK matmul, sort-merge ranking, and gather operations. Outputs both topk_vals_out (FP32) and topk_idx_out (INT32), padding invalid index slots to INT32_MIN. |
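The scope3 dataflow summarized above (tiled scoring, ranking, gather, INT32_MIN padding) can be sketched in NumPy; the function and variable names here are illustrative stand-ins, not the actual pl kernel API:

```python
import numpy as np

INT32_MIN = np.iinfo(np.int32).min

def topk_scores(q, k_cache, ctx_len, topk, seq_tile=64):
    """Score a [D] query against k_cache[:ctx_len] tile by tile, then
    return (vals, idx) for the top-`topk` positions, padding invalid
    index slots with INT32_MIN."""
    max_seq = k_cache.shape[0]
    scores = np.full(max_seq, -np.inf, dtype=np.float32)  # stage-0 pre-fill
    for s in range(0, ctx_len, seq_tile):                 # tiled Q x K^T
        e = min(s + seq_tile, ctx_len)
        scores[s:e] = k_cache[s:e] @ q
    order = np.argsort(-scores, kind="stable")[:topk]     # stand-in for sort32+mrgsort
    vals = scores[order]
    idx = order.astype(np.int32)
    valid = min(ctx_len, topk)                            # pad invalid slots
    idx[valid:] = INT32_MIN
    return vals, idx
```

A full argsort stands in for the sort32+mrgsort pipeline here; the kernel's tiling and merge network change how the ranking is computed, not what it returns.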

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Poem

🐰 Four stages dance where once stood one,
Topk scoring shines beneath the sun,
From monolithic depths we hop and weave,
New tensor threads our hopes reprieve! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.56%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: replacing decode-front with ds32exp and adding scope3, which matches the file modifications. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing context on the replacement of deepseek_v3_2_decode_front.py and the addition of deepseek_v3_2_decode_front_scope3.py. |



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the DeepSeek V3.2-EXP single-layer decode FRONT implementation into four distinct scopes: qkv projection/RoPE, indexer projection/RoPE, scoring/topk, and post-topk sparse attention. It introduces a new file for Scope 3 (scoring and topk) using sort32 and mrgsort. Feedback includes addressing performance inefficiencies in the online softmax accumulation and redundant tensor initializations within loops, as well as correcting a type mismatch in the Scope 3 function signature.

Comment on lines +601 to 634
```diff
 oi = pl.full([1, KV_LORA_RANK], dtype=pl.FP32, value=0.0)
 li = pl.full([1, 1], dtype=pl.FP32, value=0.0)
 mi = pl.full([1, 1], dtype=pl.FP32, value=0.0)

-sparse_k = pl.min(INDEX_TOPK_CFG, ctx_len)
 for kk in pl.range(sparse_k):
-    topk_pos = pl.tensor.read(topk_idx, [0, kk])
+    topk_pos = pl.tensor.read(topk_idx, [b, kk])
     if topk_pos >= 0:
-        cache_s = b * MAX_SEQ_CFG + topk_pos
+        cache_s = b * MAX_SEQ + topk_pos
         kv_s = pl.cast(
-            pl.slice(kv_cache, [1, KV_LORA_RANK_CFG], [cache_s, 0]),
+            pl.slice(kv_cache, [1, KV_LORA_RANK], [cache_s, 0]),
             target_type=pl.FP32,
         )
         pe_s = pl.cast(
-            pl.slice(pe_cache, [1, QK_ROPE_HEAD_DIM_CFG], [cache_s, 0]),
+            pl.slice(pe_cache, [1, QK_ROPE_HEAD_DIM], [cache_s, 0]),
             target_type=pl.FP32,
         )
         score_nope = pl.row_sum(pl.mul(q_nope_latent, kv_s))
-        score_pe = pl.row_sum(pl.mul(q_rot, pe_s))
-        score = pl.mul(pl.add(score_nope, score_pe), ATTN_SCALE)
-        cur_mi = score
-        cur_li = pl.exp(pl.sub(score, cur_mi))
-        oi_tmp = pl.row_expand_mul(kv_s, cur_li)
+        score_pe = pl.row_sum(pl.mul(q_pe, pe_s))
+        cur_mi = pl.mul(pl.add(score_nope, score_pe), ATTN_SCALE)
+        cur_li = pl.full([1, 1], dtype=pl.FP32, value=1.0)
         if kk == 0:
-            oi = oi_tmp
+            oi = kv_s
             li = cur_li
             mi = cur_mi
         else:
             mi_new = pl.maximum(mi, cur_mi)
             alpha = pl.exp(pl.sub(mi, mi_new))
             beta = pl.exp(pl.sub(cur_mi, mi_new))
             li = pl.add(pl.mul(alpha, li), pl.mul(beta, cur_li))
-            oi = pl.add(pl.row_expand_mul(oi, alpha), pl.row_expand_mul(oi_tmp, beta))
+            oi = pl.add(
+                pl.row_expand_mul(oi, alpha),
+                pl.row_expand_mul(kv_s, beta),
+            )
             mi = mi_new
```

medium

The online softmax accumulation inside the kk loop is inefficient. Creating the cur_li tensor (pl.full) in every iteration of a loop that can run up to 2048 times significantly impacts performance. Additionally, the if kk == 0 branch can be eliminated by initializing mi to the minimum possible float value and li to zero. This avoids branching and redundant allocations in a tight loop, making the code cleaner and more performant.

```python
                        oi = pl.full([1, KV_LORA_RANK], dtype=pl.FP32, value=0.0)
                        li = pl.full([1, 1], dtype=pl.FP32, value=0.0)
                        mi = pl.full([1, 1], dtype=pl.FP32, value=-3.4028235e38)

                        for kk in pl.range(sparse_k):
                            topk_pos = pl.tensor.read(topk_idx, [b, kk])
                            if topk_pos >= 0:
                                cache_s = b * MAX_SEQ + topk_pos
                                kv_s = pl.cast(
                                    pl.slice(kv_cache, [1, KV_LORA_RANK], [cache_s, 0]),
                                    target_type=pl.FP32,
                                )
                                pe_s = pl.cast(
                                    pl.slice(pe_cache, [1, QK_ROPE_HEAD_DIM], [cache_s, 0]),
                                    target_type=pl.FP32,
                                )
                                score_nope = pl.row_sum(pl.mul(q_nope_latent, kv_s))
                                score_pe = pl.row_sum(pl.mul(q_pe, pe_s))
                                cur_mi = pl.mul(pl.add(score_nope, score_pe), ATTN_SCALE)

                                mi_new = pl.maximum(mi, cur_mi)
                                alpha = pl.exp(pl.sub(mi, mi_new))
                                beta = pl.exp(pl.sub(cur_mi, mi_new))
                                li = pl.add(pl.mul(alpha, li), beta)
                                oi = pl.add(
                                    pl.row_expand_mul(oi, alpha),
                                    pl.row_expand_mul(kv_s, beta),
                                )
                                mi = mi_new
```
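As a cross-check on the suggestion above, the streaming-softmax recurrence (running max mi, normalizer li, weighted sum oi) can be verified against a direct softmax in NumPy; the function name and setup here are illustrative, not part of the pl API:

```python
import numpy as np

def online_softmax_attn(scores, values):
    # One pass over (score, value) pairs with O(1) softmax state:
    # mi = running max, li = running normalizer, oi = running weighted sum.
    mi, li = -np.inf, 0.0
    oi = np.zeros(values.shape[1])
    for s, v in zip(scores, values):
        mi_new = max(mi, s)
        alpha = np.exp(mi - mi_new)   # rescale the old state
        beta = np.exp(s - mi_new)     # weight of the new element
        li = alpha * li + beta
        oi = alpha * oi + beta * v
        mi = mi_new
    return oi / li

# Initializing mi to -inf makes the first iteration a pure overwrite
# (alpha = exp(-inf) = 0), which is exactly why the `if kk == 0` branch
# can be dropped when mi starts at the minimum float.
scores = np.array([0.3, 2.0, -1.2, 0.7])
values = np.arange(32, dtype=np.float64).reshape(4, 8)
w = np.exp(scores - scores.max())
w /= w.sum()
assert np.allclose(online_softmax_attn(scores, values), w @ values)
```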

```python
q_idx: pl.Tensor[[BATCH, INDEX_HEAD_DIM], pl.BF16],
k_cache_idx: pl.Tensor[[CACHE_ROWS_IDX, INDEX_HEAD_DIM], pl.BF16],
seq_lens: pl.Tensor[[BATCH], pl.INT32],
idx_init: pl.Tensor[[1, MAX_SEQ], pl.UINT32],
```

medium

There is a type mismatch for the idx_init parameter. The function signature uses pl.UINT32, but the corresponding TensorSpec in build_tensor_specs and the initialization in init_idx_init use torch.int32. Using pl.INT32 in the signature will ensure consistency with the input data type and avoid potential runtime errors.

Suggested change
```diff
-idx_init: pl.Tensor[[1, MAX_SEQ], pl.UINT32],
+idx_init: pl.Tensor[[1, MAX_SEQ], pl.INT32],
```

Comment on lines +91 to +105
```python
# Transient GM buffers.
scores = pl.create_tensor([BATCH, MAX_SEQ], dtype=pl.FP32)
sorted_gm = pl.create_tensor([BATCH, 2 * MAX_SEQ], dtype=pl.FP32)
raw_idx_gm = pl.create_tensor([BATCH, INDEX_TOPK], dtype=pl.INT32)

for b in pl.range(0, BATCH, 1):
    ctx_len = pl.tensor.read(seq_lens, [b])
    ctx_blocks = (ctx_len + SEQ_TILE - 1) // SEQ_TILE

    # Stage 0: pre-fill scores[b] with -inf. Stage 2's parallel
    # loop only covers [0, ctx_blocks), so untouched tail slots
    # keep the create_tensor default of 0.0 without this.
    with pl.at(level=pl.Level.CORE_GROUP):
        neg_inf_row = pl.full([1, MAX_SEQ], dtype=pl.FP32, value=FP32_NEG_INF)
        scores = pl.assemble(scores, neg_inf_row, [b, 0])
```

medium

The neg_inf_row tensor is initialized inside the sequential batch loop. Moving this initialization outside the loop avoids redundant tensor creation and initialization for each batch, which improves performance.

Suggested change

```diff
  # Transient GM buffers.
  scores = pl.create_tensor([BATCH, MAX_SEQ], dtype=pl.FP32)
  sorted_gm = pl.create_tensor([BATCH, 2 * MAX_SEQ], dtype=pl.FP32)
  raw_idx_gm = pl.create_tensor([BATCH, INDEX_TOPK], dtype=pl.INT32)
+ neg_inf_row = pl.full([1, MAX_SEQ], dtype=pl.FP32, value=FP32_NEG_INF)

  for b in pl.range(0, BATCH, 1):
      ctx_len = pl.tensor.read(seq_lens, [b])
      ctx_blocks = (ctx_len + SEQ_TILE - 1) // SEQ_TILE

      # Stage 0: pre-fill scores[b] with -inf. Stage 2's parallel
      # loop only covers [0, ctx_blocks), so untouched tail slots
      # keep the create_tensor default of 0.0 without this.
      with pl.at(level=pl.Level.CORE_GROUP):
-         neg_inf_row = pl.full([1, MAX_SEQ], dtype=pl.FP32, value=FP32_NEG_INF)
          scores = pl.assemble(scores, neg_inf_row, [b, 0])
```


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py (1)

36-37: ⚠️ Potential issue | 🟠 Major

MAX_SEQ diverges from decode_front_scope3.py (4096 vs 8192).

decode_front_scope3.py (the real score+topk implementation intended to replace Stage 3's stub) declares MAX_SEQ = 8192, while this file uses 4096. Because CACHE_ROWS = BATCH * MAX_SEQ and k_cache_idx / kv_cache / pe_cache are sized off this constant, the two scopes cannot be wired together as-is — cache shapes won't line up. Please align the constants (ideally by extracting them into a shared module) before integrating scope3.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py` around lines 36
- 37, MAX_SEQ in deepseek_v3_2_decode_front.py is set to 4096 while
decode_front_scope3.py uses 8192, causing mismatched cache shapes because
CACHE_ROWS = BATCH * MAX_SEQ and k_cache_idx / kv_cache / pe_cache are sized
from it; fix by making the two files use the same constant (preferably extract
BATCH and MAX_SEQ into a shared config module and import them in both files) and
then update CACHE_ROWS and any cache allocation logic (k_cache_idx, kv_cache,
pe_cache) to reference the shared constants so shapes line up when wiring
scope3.
🧹 Nitpick comments (2)
examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py (2)

567-569: Inconsistent chunk= values across parallel loops.

Other parallel-over-batch loops in this file use chunk=BATCH // BATCH_TILE (= 4) or chunk=BATCH, while scope 4's outer batch loop and the dispatch loop use a literal chunk=4 (line 569, 658), and the inner head loop uses a literal chunk=8 (line 574). That happens to equal BATCH // BATCH_TILE today, but if BATCH or BATCH_TILE changes, scope 4 silently drifts out of sync. Prefer deriving these from the existing constants (e.g., chunk=BATCH // BATCH_TILE, chunk=NUM_HEADS // HEAD_TILE).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py` around lines 567
- 569, The parallel loops use hard-coded chunk sizes that can diverge from
constants; update the pl.parallel calls around attn_front and related loops (the
outer batch loop, the dispatch loop, and the inner head loop) to derive chunk
values from the existing constants instead of literals — use chunk=BATCH //
BATCH_TILE for batch-level loops and chunk=NUM_HEADS // HEAD_TILE for head-level
loops (or chunk=BATCH where other loops use full-batch tiling) so all
pl.parallel(...) invocations remain consistent with BATCH, BATCH_TILE,
NUM_HEADS, and HEAD_TILE.

522-564: Stage 3 computes scores that are then thrown away.

Stages 3.1a/3.1b materialize the full [BATCH, MAX_SEQ] scores tensor, but Stage 3.2 ignores it and produces topk_idx as all zeros. Only topk_idx feeds scope 4, so the scoring work is currently dead on-device compute and transient GM. Two options, pick whichever fits your timeline:

  1. Guard the scoring under an if 0: / behind a config flag until the real topk lands, to save kernel compile/run time during iteration.
  2. Integrate deepseek_v3_2_decode_front_scope3.py here now — it already emits topk_vals_out/topk_idx_out with the INT32_MIN padding convention that matches the topk_pos >= 0 filter in scope 4 (line 607).

Additionally, note that scores here is never explicitly initialized to -inf before Stage 3.1b writes only the [0, ctx_blocks) range; decode_front_scope3.py:100-105 has exactly the pre-fill you'll need once topk is real.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py` around lines 522
- 564, The scoring loop currently computes a full scores tensor (scores) which
is never used because topk_idx is hardcoded to zeros, wasting on-device compute
and memory; either guard the scoring stage behind a config flag or if 0 to skip
it during iteration, or replace the placeholder with the real scope3
implementation (import/inline logic from deepseek_v3_2_decode_front_scope3.py to
produce topk_vals_out/topk_idx_out using the INT32_MIN padding convention
consumed by the topk_pos >= 0 filter in scope4). Also explicitly initialize
scores to -inf (use PadValue.min or the same pre-fill used in
decode_front_scope3.py) before Stage 3.1 so partial writes and fillpad behavior
are correct once a real topk is used.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 72f01501-50d0-4ec4-8c70-8c6e49e07647

📥 Commits

Reviewing files that changed from the base of the PR and between 14670ee and f54a3bd.

📒 Files selected for processing (2)
  • examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py
  • examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py

Comment on lines +35 to +49
```python
BATCH = 16
MAX_SEQ = 8192
INDEX_HEAD_DIM = 128
INDEX_TOPK = 2048
CACHE_ROWS_IDX = BATCH * MAX_SEQ

SEQ_TILE = 64
MAX_SEQ_BLOCKS = (MAX_SEQ + SEQ_TILE - 1) // SEQ_TILE

# Q pad: a2a3 TExtract requires row % 16 == 0, so pad the 1-row query to 16.
Q_VALID = 1
Q_PAD = 16

# sort32 + 4 mrgsort iterations (block_len 64,256,1024,4096) sort MAX_SEQ=8192.
MRGSORT_ITERS = 4
```
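One way to read the MRGSORT_ITERS comment: sorted runs are merged with a fixed fan-in, and each mrgsort pass consumes blocks of the listed block_len (64, 256, 1024, 4096), quadrupling the run length until a single run covers MAX_SEQ. A generic sketch of that schedule (our names, not the pl primitives — the real sort32/mrgsort semantics are hardware-specific):

```python
import heapq

def merge_passes(total, run, fanin):
    # Count k-way merge passes needed to grow sorted runs of length `run`
    # until one run covers `total` elements.
    passes = 0
    while run < total:
        run *= fanin
        passes += 1
    return passes

def kway_merge_sort(xs, run=64, fanin=4):
    # Descending sort: seed sorted runs, then repeatedly fan-in merge them.
    runs = [sorted(xs[i:i + run], reverse=True) for i in range(0, len(xs), run)]
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fanin], reverse=True))
                for i in range(0, len(runs), fanin)]
    return runs[0]

# Input block lengths per pass: 64 -> 256 -> 1024 -> 4096, so MAX_SEQ = 8192
# is covered after exactly 4 merge passes.
assert merge_passes(8192, 64, 4) == 4
```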

⚠️ Potential issue | 🟠 Major

MAX_SEQ = 8192 diverges from deepseek_v3_2_decode_front.py (4096).

The AI summary calls out that this file provides the real topk intended to replace the stub in ds32exp_decode_front. For that integration to work, BATCH, MAX_SEQ, INDEX_HEAD_DIM, INDEX_TOPK, and CACHE_ROWS_IDX must match between the two files. Today they don't (MAX_SEQ in particular). Please extract the shared constants into a common module (or at least add a matching assertion) so the two scopes can be composed without a silent shape mismatch on k_cache_idx.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py` around
lines 35 - 49, The constants BATCH, MAX_SEQ, INDEX_HEAD_DIM, INDEX_TOPK, and
CACHE_ROWS_IDX in deepseek_v3_2_decode_front_scope3.py must match
deepseek_v3_2_decode_front.py to avoid silent shape mismatches (k_cache_idx);
either extract these shared constants into a common module (e.g., shared_config)
and import them in both files, or add a runtime assertion in
deepseek_v3_2_decode_front_scope3.py that compares its BATCH, MAX_SEQ,
INDEX_HEAD_DIM, INDEX_TOPK, CACHE_ROWS_IDX against the values from
deepseek_v3_2_decode_front.py and fails fast with a clear message; update
references to CACHE_ROWS_IDX = BATCH * MAX_SEQ to use the shared constant so
both scopes remain consistent.
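The shared-constants fix both bots ask for could look like the following (the file name ds32_config.py is hypothetical):

```python
# Hypothetical ds32_config.py — a single source of truth for the cache
# geometry that both decode_front and scope3 would import.
BATCH = 16
MAX_SEQ = 8192              # must agree across both scope files
INDEX_HEAD_DIM = 128
INDEX_TOPK = 2048
CACHE_ROWS_IDX = BATCH * MAX_SEQ

def check_geometry(batch, max_seq, cache_rows):
    # Fail-fast guard a scope file can call instead of (or as well as)
    # importing the constants, per the review's alternative suggestion.
    assert (batch, max_seq) == (BATCH, MAX_SEQ), "scope constants diverge"
    assert cache_rows == batch * max_seq, "CACHE_ROWS_IDX != BATCH * MAX_SEQ"

check_geometry(BATCH, MAX_SEQ, CACHE_ROWS_IDX)
```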

```python
q_idx: pl.Tensor[[BATCH, INDEX_HEAD_DIM], pl.BF16],
k_cache_idx: pl.Tensor[[CACHE_ROWS_IDX, INDEX_HEAD_DIM], pl.BF16],
seq_lens: pl.Tensor[[BATCH], pl.INT32],
idx_init: pl.Tensor[[1, MAX_SEQ], pl.UINT32],
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Confirm how idx_init is declared/typed elsewhere in the repo, and how sort32
# expects its index operand.
rg -nP --type=py -C3 '\bidx_init\b'
rg -nP --type=py -C3 'pl\.tile\.sort32\s*\('
```

Repository: hw-native-sys/pypto-lib

Length of output: 3726


🏁 Script executed:

```shell
# Check lines around 236 for init_idx_init initialization
sed -n '230,245p' examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py

# Search for INT32_MIN in the file
rg -n 'INT32_MIN|INT32MIN|int32_min|int.min' examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py

# Check the kernel implementation for any sentinel values
rg -n 'sentinel|INT32' examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py
```

Repository: hw-native-sys/pypto-lib

Length of output: 1638


🏁 Script executed:

```shell
# Search for sort32 definition or documentation
rg -n 'def sort32|sort32.*signature|sort32.*idx' examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py

# Check if there are other kernel declarations using similar patterns
rg -n 'pl\.Tensor.*pl\.(INT32|UINT32)' examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py
```

Repository: hw-native-sys/pypto-lib

Length of output: 379


Change idx_init from UINT32 to INT32 in both kernel signature (line 64) and TensorSpec (line 253).

The kernel declares idx_init as pl.UINT32, but it is initialized with torch.arange(MAX_SEQ, dtype=torch.int32) (line 236) and defined as torch.int32 in the TensorSpec (line 253). More critically, the algorithm uses INT32_MIN as a sentinel value for padding (lines 212, 175, 226), which cannot be represented in an unsigned integer. Align to INT32 to match the actual data type, enable the sentinel value, and maintain consistency with topk_idx_out which is already declared as INT32.

Changes needed

Line 64:

```diff
-            idx_init: pl.Tensor[[1, MAX_SEQ], pl.UINT32],
+            idx_init: pl.Tensor[[1, MAX_SEQ], pl.INT32],
```

Line 253 already declares torch.int32 in its TensorSpec and requires no change.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-            idx_init: pl.Tensor[[1, MAX_SEQ], pl.UINT32],
+            idx_init: pl.Tensor[[1, MAX_SEQ], pl.INT32],
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py` at line
64, The kernel signature currently declares idx_init as pl.UINT32 but the code
initializes it with torch.int32 and uses INT32_MIN sentinel values; update the
kernel parameter type for idx_init to pl.INT32 to match the actual data and the
TensorSpec (and to align with topk_idx_out which is pl.INT32); locate the
declaration of idx_init in the kernel signature and change its type from
pl.UINT32 to pl.INT32, and ensure any associated TensorSpec references (e.g.,
where idx_init is described) are consistent with INT32.
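The bit-level consequence of the UINT32/INT32 mismatch is easy to demonstrate in NumPy (illustrative; the pl type system is the authority here):

```python
import numpy as np

# INT32_MIN, the padding sentinel, has no unsigned representation: viewed
# through a uint32 lens the same bit pattern becomes a huge positive index,
# which would slip past a `topk_pos >= 0` style validity filter.
INT32_MIN = np.iinfo(np.int32).min                        # -2147483648
bits = np.array([INT32_MIN], dtype=np.int32).view(np.uint32)
assert int(bits[0]) == 2147483648
assert INT32_MIN < 0 <= int(bits[0])
```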

Comment on lines +206 to +211
```python
# Stage 1.5: q_pe RoPE on every MLA head, k_pe RoPE on kv_a, kv_norm,
# write kv_cache and pe_cache at row b*MAX_SEQ + (seq_lens[b]-1).
# NOTE: official applies interleaved=True for MLA, but the existing
# decode/prefill paths in this repo use the lo/hi half split form
# (see deepseek_v3_2_decode_front.py:241-274). We follow the same
# convention so the cached pe matches the in-tree consumers.
```

⚠️ Potential issue | 🟡 Minor

Stale self-reference in the comment.

"the existing decode/prefill paths in this repo use the lo/hi half split form (see deepseek_v3_2_decode_front.py:241-274)" — this is deepseek_v3_2_decode_front.py, and the referenced line range points inside the RoPE block you're writing, not the prior monolithic implementation you intend to cite. Please update the reference to point to the actual source (e.g., ds_q0_rope.py, or whatever remained after the rename).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py` around lines 206
- 211, The comment in deepseek_v3_2_decode_front.py incorrectly self-references
lines 241-274; update the comment in the RoPE block (the Stage 1.5 / MLA RoPE
comment) to reference the actual prior implementation file or module (for
example ds_q0_rope.py or the correct renamed source) or remove the line-numbered
self-reference altogether; ensure the comment mentions the correct symbol/module
that contains the monolithic implementation so readers can find the original
lo/hi half split convention.

Comment on lines +560 to 634
```diff
 topk_idx = pl.create_tensor([BATCH, INDEX_TOPK], dtype=pl.INT32)
 with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer):
     for b in pl.parallel(0, BATCH, 1, chunk=BATCH):
         zero_row = pl.full([1, INDEX_TOPK], dtype=pl.INT32, value=0)
         topk_idx = pl.assemble(topk_idx, zero_row, [b, 0])

 # ── Scope 4: post topk (sparse MQA + dispatch) ──
 attn_front = pl.create_tensor([BATCH, ATTN_OUT], dtype=pl.FP32)
 with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer):
     for b in pl.parallel(0, BATCH, 1, chunk=4):
         ctx_len = pl.tensor.read(seq_lens, [b])
         sparse_k = pl.min(INDEX_TOPK, ctx_len)
         attn_row = pl.full([1, ATTN_OUT], dtype=pl.FP32, value=0.0)

         for h in pl.parallel(0, NUM_HEADS, 1, chunk=8):
             q_col = h * QK_HEAD_DIM
             # q_pe was already RoPE-rotated in Scope 1, so we read it
             # back as-is. q_nope is projected to the latent space
             # via per-head w_q_nope_to_latent.
             q_nope = pl.cast(
-                pl.slice(q_proj, [1, QK_NOPE_HEAD_DIM_CFG], [b, q_col]),
+                pl.slice(q_proj, [1, QK_NOPE_HEAD_DIM], [b, q_col]),
                 target_type=pl.FP32,
             )
             q_pe = pl.cast(
-                pl.slice(q_proj, [1, QK_ROPE_HEAD_DIM_CFG], [b, q_col + QK_NOPE_HEAD_DIM_CFG]),
+                pl.slice(
+                    q_proj, [1, QK_ROPE_HEAD_DIM], [b, q_col + QK_NOPE_HEAD_DIM]
+                ),
                 target_type=pl.FP32,
             )
-            q_lo = pl.slice(q_pe, [1, QK_ROPE_HEAD_DIM_CFG // 2], [0, 0])
-            q_hi = pl.slice(q_pe, [1, QK_ROPE_HEAD_DIM_CFG // 2], [0, QK_ROPE_HEAD_DIM_CFG // 2])
-            q_rot = pl.create_tensor([1, QK_ROPE_HEAD_DIM_CFG], dtype=pl.FP32)
-            q_rot = pl.assemble(
-                q_rot,
-                pl.sub(pl.col_expand_mul(q_lo, cos_lo), pl.col_expand_mul(q_hi, sin_lo)),
-                [0, 0],
-            )
-            q_rot = pl.assemble(
-                q_rot,
-                pl.add(pl.col_expand_mul(q_hi, cos_hi), pl.col_expand_mul(q_lo, sin_hi)),
-                [0, QK_ROPE_HEAD_DIM_CFG // 2],
-            )
+            w_qn_h = pl.reshape(
+                pl.slice(
+                    w_q_nope_to_latent,
+                    [1, QK_NOPE_HEAD_DIM, KV_LORA_RANK],
+                    [h, 0, 0],
+                ),
+                [QK_NOPE_HEAD_DIM, KV_LORA_RANK],
+            )
             q_nope_latent = pl.matmul(
-                pl.cast(q_nope, target_type=pl.BF16),
-                pl.reshape(
-                    pl.slice(
-                        w_q_nope_to_latent, [1, QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG], [h, 0, 0]
-                    ),
-                    [QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG],
-                ),
+                pl.cast(q_nope, target_type=pl.BF16), w_qn_h, out_dtype=pl.FP32
             )

-            oi = pl.create_tensor([1, KV_LORA_RANK_CFG], dtype=pl.FP32)
-            li = pl.create_tensor([1, 1], dtype=pl.FP32)
-            mi = pl.create_tensor([1, 1], dtype=pl.FP32)
-            oi = pl.mul(oi, 0.0)
-            li = pl.mul(li, 0.0)
-            mi = pl.mul(mi, 0.0)
+            oi = pl.full([1, KV_LORA_RANK], dtype=pl.FP32, value=0.0)
+            li = pl.full([1, 1], dtype=pl.FP32, value=0.0)
+            mi = pl.full([1, 1], dtype=pl.FP32, value=0.0)

-            sparse_k = pl.min(INDEX_TOPK_CFG, ctx_len)
             for kk in pl.range(sparse_k):
-                topk_pos = pl.tensor.read(topk_idx, [0, kk])
+                topk_pos = pl.tensor.read(topk_idx, [b, kk])
                 if topk_pos >= 0:
-                    cache_s = b * MAX_SEQ_CFG + topk_pos
+                    cache_s = b * MAX_SEQ + topk_pos
                     kv_s = pl.cast(
-                        pl.slice(kv_cache, [1, KV_LORA_RANK_CFG], [cache_s, 0]),
+                        pl.slice(kv_cache, [1, KV_LORA_RANK], [cache_s, 0]),
                         target_type=pl.FP32,
                     )
                     pe_s = pl.cast(
-                        pl.slice(pe_cache, [1, QK_ROPE_HEAD_DIM_CFG], [cache_s, 0]),
+                        pl.slice(pe_cache, [1, QK_ROPE_HEAD_DIM], [cache_s, 0]),
                         target_type=pl.FP32,
                     )
                     score_nope = pl.row_sum(pl.mul(q_nope_latent, kv_s))
-                    score_pe = pl.row_sum(pl.mul(q_rot, pe_s))
-                    score = pl.mul(pl.add(score_nope, score_pe), ATTN_SCALE)
-                    cur_mi = score
-                    cur_li = pl.exp(pl.sub(score, cur_mi))
-                    oi_tmp = pl.row_expand_mul(kv_s, cur_li)
+                    score_pe = pl.row_sum(pl.mul(q_pe, pe_s))
+                    cur_mi = pl.mul(pl.add(score_nope, score_pe), ATTN_SCALE)
+                    cur_li = pl.full([1, 1], dtype=pl.FP32, value=1.0)
                     if kk == 0:
-                        oi = oi_tmp
+                        oi = kv_s
                         li = cur_li
                         mi = cur_mi
                     else:
                         mi_new = pl.maximum(mi, cur_mi)
                         alpha = pl.exp(pl.sub(mi, mi_new))
                         beta = pl.exp(pl.sub(cur_mi, mi_new))
                         li = pl.add(pl.mul(alpha, li), pl.mul(beta, cur_li))
-                        oi = pl.add(pl.row_expand_mul(oi, alpha), pl.row_expand_mul(oi_tmp, beta))
+                        oi = pl.add(
+                            pl.row_expand_mul(oi, alpha),
+                            pl.row_expand_mul(kv_s, beta),
+                        )
                         mi = mi_new
```

⚠️ Potential issue | 🟡 Minor

Scope 4 silently degenerates to "attend only to position 0" with the current stub.

With topk_idx all-zero and topk_pos = 0 >= 0 every iteration, the inner loop (lines 605–634) invokes the online softmax update sparse_k = min(INDEX_TOPK, ctx_len) times against the same kv_s/pe_s at cache_row = b*MAX_SEQ + 0. The math is self-consistent (α=β=1 each step so ctx_latent == kv_s[0]) and matches the golden, so the test will pass, but the result is not meaningful attention.

This is documented as a TODO, but I'd flag two risks:

  • If someone benchmarks this program they'll be measuring sparse_k redundant online-softmax updates against position 0, not real sparse MQA; perf numbers may be misleading.
  • The placeholder also hides any bug in pl.tensor.read(topk_idx, [b, kk]) / bounds behavior that would show up only once topk_idx has non-zero entries.

Consider at minimum short-circuiting scope 4 to a single kk=0 step while the stub is in place, or wiring in decode_front_scope3 so scope 4 runs against real indices.
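The degeneracy is easy to confirm with a toy online-softmax loop (illustrative NumPy, not the pl API): replaying one (score, value) pair sparse_k times normalizes back to exactly that value, so the stub's output per head equals kv_s[0].

```python
import numpy as np

def online_attend(pairs):
    """Toy online-softmax accumulator (illustrative, not the pl API)."""
    mi, li, oi = -np.inf, 0.0, 0.0
    for s, v in pairs:
        mi_new = max(mi, s)
        a, b = np.exp(mi - mi_new), np.exp(s - mi_new)
        li = a * li + b
        oi = a * oi + b * v
        mi = mi_new
    return oi / li

# All-zero topk_idx means every iteration replays the same cache slot:
# after normalization the result is that slot's value, i.e. sparse_k
# redundant updates reproduce kv_s[0] rather than real sparse attention.
v0 = 3.25
assert online_attend([(1.7, v0)] * 2048) == v0
```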

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py` around lines 560
- 634, Scope 4 uses a zeroed topk_idx stub so the inner loop repeatedly attends
to position 0; to fix, either (A) short-circuit the inner loop when topk_idx is
a stub by capping sparse_k to 1 (i.e., set sparse_k = pl.minimum(sparse_k, 1) or
otherwise force a single kk iteration) to avoid redundant updates, or (B) wire
in the real index generation by invoking decode_front_scope3 (or the routine
that fills topk_idx) before Scope 4 so topk_idx contains real positions; locate
uses of topk_idx, sparse_k, topk_pos and cache_s in Scope 4 and implement one of
these fixes so the loop does not repeatedly read cache row b*MAX_SEQ+0.

zhangqi-chen merged commit da2f2d2 into hw-native-sys:main on Apr 21, 2026
6 checks passed
zhangqi-chen deleted the feat/ds32-decode-front-scope3 branch on April 21, 2026 02:16