Update: migrate scope3 sort/gather to tensor-level API and reduce MAX_SEQ #140
Walkthrough
The deepseek_v3_2 decode front scope3 kernel is restructured to use two buffer size constants, `MAX_SEQ` and `SORT_LEN`.
Summary
- Replace `pl.tile.sort32/mrgsort/gather` + explicit `pl.load/pl.store` with `pl.tensor.sort32/mrgsort/gather` + `pl.slice/pl.assemble`, adapting to the new tensor-level ops merged in pypto (#1097)
- Reduce `MAX_SEQ` from 8192 to 4096; introduce `SORT_LEN=8192` to keep the sort buffer at full width. The `scores` tensor is `[BATCH, SORT_LEN]` and Stage 0 fills the entire row with `-inf`, so the `[MAX_SEQ, SORT_LEN)` tail is always `-inf` without an extra `fillpad` in the sort kernel
- `idx_init` signature changed to `pl.UINT32` (required by `tensor.sort32`); `TensorSpec` keeps `torch.int32` (same bit layout, matches simpler runtime)

Related Issues
N/A
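
Illustration for the Summary: the sketch below is a plain-Python stand-in, not the pypto kernel. It shows why filling a full `SORT_LEN` row with `-inf` before writing the valid scores keeps every padded slot out of the top results, and why non-negative indices read the same whether their 32-bit storage is interpreted as signed or unsigned. The helper names (`fill_scores_row`, `topk_indices`, `reinterpret_int32_as_uint32`) and the value `TOP_K = 2048` are hypothetical and chosen only for this example; the real kernel uses `pl.tensor.sort32/mrgsort/gather`.

```python
import math
import random
import struct

MAX_SEQ = 4096    # valid sequence positions (value from this PR)
SORT_LEN = 8192   # sort buffer width (value from this PR)
TOP_K = 2048      # hypothetical selection size, for illustration only


def fill_scores_row(valid_scores):
    """Stand-in for Stage 0: start from an all -inf row, then write valid scores."""
    row = [-math.inf] * SORT_LEN
    row[:len(valid_scores)] = valid_scores
    return row


def topk_indices(row, k):
    """Stand-in for the sort stage: indices of the k largest scores, descending."""
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return order[:k]


def reinterpret_int32_as_uint32(values):
    """Reinterpret int32 storage as uint32 without changing any bits."""
    packed = struct.pack(f"<{len(values)}i", *values)
    return list(struct.unpack(f"<{len(values)}I", packed))


random.seed(0)
valid = [random.uniform(-1.0, 1.0) for _ in range(MAX_SEQ)]
row = fill_scores_row(valid)
top = topk_indices(row, TOP_K)

# Every selected index points at a valid position; the -inf tail never wins.
padding_selected = any(i >= MAX_SEQ for i in top)

# Non-negative indices are identical under either interpretation.
idx_init = list(range(SORT_LEN))
same_under_uint32 = reinterpret_int32_as_uint32(idx_init) == idx_init
```

Because the tail `[MAX_SEQ, SORT_LEN)` already holds `-inf`, the sort can run at the full buffer width with no separate padding pass, which is the property the second Summary bullet relies on. The reinterpretation only agrees for non-negative values: a negative `int32` such as `-1` would read as `4294967295` when viewed as `uint32`.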