Add manual-scope v0 to tensormap runtime #568
uv-xiao wants to merge 4 commits into hw-native-sys:main
Code Review
This pull request introduces a lighter 'manual-scope' mode for the a2a3/tensormap_and_ringbuffer service, allowing for explicit task dependency management while maintaining the existing AUTO-mode submit API. The design avoids complex delayed wiring, opting for explicit Arg.add_dep(task_id) annotations within a PTO2_SCOPE(PTO2ScopeMode::MANUAL) block. I have identified a critical issue where PTO2_SCOPE_GUARD(); is used incorrectly, resulting in an immediate destruction of the guard and leaving the loop body unprotected. Additionally, the design document contains a future date that should be corrected.
    const Tensor &mi_update = alloc_outs.get_ref(2);

    for (uint64_t bn = 0; bn < bn_this_batch; bn++) {
        PTO2_SCOPE_GUARD();
The statement PTO2_SCOPE_GUARD(); creates a temporary PTO2ScopeGuard object that is immediately destroyed at the semicolon. This means the scope it's intended to guard is empty, and the rest of the loop body executes outside of any PTO2 scope. This is likely not the intended behavior and can lead to subtle bugs.
To ensure the scope is active for the entire for loop block, you should declare a named PTO2ScopeGuard variable.
    PTO2_SCOPE_GUARD guard;
References
- Use RAII guards to manage resources like thread-specific data and device contexts, ensuring cleanup is automatically handled on all function exit paths.
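The lifetime rule this comment relies on can be shown in isolation. The sketch below uses a hypothetical `DemoScopeGuard` (not the PTO2 class or macro) that records when it is constructed and destroyed. Whether the same rule applies to `PTO2_SCOPE_GUARD();` depends on how that macro expands, which this sketch does not assume.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a scope guard; NOT the PTO2 runtime's class.
// It logs "begin" on construction and "end" on destruction.
struct DemoScopeGuard {
    std::vector<std::string>& log;
    explicit DemoScopeGuard(std::vector<std::string>& l) : log(l) { log.push_back("begin"); }
    ~DemoScopeGuard() { log.push_back("end"); }
    DemoScopeGuard(const DemoScopeGuard&) = delete;
    DemoScopeGuard& operator=(const DemoScopeGuard&) = delete;
};

// Unnamed temporary: it is destroyed at the end of the full expression,
// so "body" is logged after the guard has already ended.
std::vector<std::string> run_with_temporary() {
    std::vector<std::string> log;
    {
        DemoScopeGuard{log};
        log.push_back("body");
    }
    return log;
}

// Named guard: it lives until the closing brace of the enclosing block,
// so "body" is logged while the guard is still active.
std::vector<std::string> run_with_named() {
    std::vector<std::string> log;
    {
        DemoScopeGuard guard{log};
        log.push_back("body");
    }
    return log;
}
```

With the temporary, the log order is begin, end, body; with the named variable it is begin, body, end.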
@@ -0,0 +1,423 @@
# Manual Scope V0 Design

Date: 2026-04-15
Latest update on top of the current PR branch:
What changed
Fresh benchmark results
30 rounds, trimmed average, device
Fresh TensorMap profiling
Non-unroll
What these numbers show:
The design note in

Alignment update against
I also benchmarked the current branch and poursoul’s branch directly in an isolated worktree on the same device (
So the current state is:
5e4de6a to 84763d8
Update after rerunning the full batch on real device.
Cleaned state
This PR is still the same narrow v0 scope only:
The branch is kept as 3 logical commits:
Fresh real-device benchmark
Rerun on
Golden status:
Reading of the current batch:
TensorMap lookup / insert comparison
Profiling comparison from the current manual-scope implementation on non-unroll
What this still shows:
- Add manual scope mode and explicit Arg dependency plumbing
- Attach submit-result task ids independently from output tensors
- Bypass TensorMap lookup and insert while manual scope is active
- Keep the runtime/examples/docs scope without adding test changes

- Add non-unroll and unroll manual-scope examples for a2a3 TMR
- Wire task ids through Arg.add_dep at submit time
- Keep AUTO paged-attention available as the comparison path

- Document v0 API constraints and submit-time dependency model
- Record TensorMap bypass behavior and boundary-edge rules
- Include the current device benchmark and validation notes
84763d8 to 0e0d3ee
- Make non-unroll manual paged-attention use the same update-chain dependency shape as the unroll manual path
- Gate alloc-task retention with is_first/is_last instead of attaching it on every update
- Verified with fresh hardware golden and 100-round reruns on device 9
Small follow-up after aligning the two manual-scope paged-attention examples.
What changed
The non-unroll manual example now uses the same update-chain dependency shape as the unroll manual example:
This removes the older conservative pattern where non-unroll attached
Targeted real-device rerun
Rerun only the affected non-unroll manual example on
Golden:
100-round trimmed benchmark for the affected rows:
So this is mostly a consistency cleanup, with a small measured improvement on the non-unroll manual path.
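For readers comparing these rows, a trimmed average is the mean taken after discarding a fixed number of the smallest and largest samples. The sketch below is one plausible way to compute it; the function name `trimmed_mean` and the `trim_each_side` parameter are assumptions, since the trim amount used by the benchmark script is not stated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical helper: mean of the samples that remain after dropping the
// `trim_each_side` smallest and `trim_each_side` largest values.
// The trim amount is an assumption, not the benchmark's actual setting.
double trimmed_mean(std::vector<double> samples, std::size_t trim_each_side) {
    if (samples.size() <= 2 * trim_each_side) {
        throw std::invalid_argument("not enough samples to trim");
    }
    std::sort(samples.begin(), samples.end());
    const auto skip = static_cast<std::ptrdiff_t>(trim_each_side);
    const auto first = samples.begin() + skip;
    const auto last = samples.end() - skip;
    const double sum = std::accumulate(first, last, 0.0);
    return sum / static_cast<double>(last - first);
}
```

Trimming makes the reported figure less sensitive to a few unusually slow or fast rounds, which matters when the measured difference between two rows is small.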
Summary
- Add manual_scope v0 mode to a2a3/tensormap_and_ringbuffer without introducing a separate manual submit API family
- Arg.add_dep(task_id)
- docs/manual-scope-v0-design.md
What Is In Scope
This PR is the narrow v0 version.
- PTO2_SCOPE() stays AUTO by default
- PTO2_SCOPE(PTO2ScopeMode::MANUAL) enables manual mode
- pto2_rt_submit_aic_task(...)
- pto2_rt_submit_aiv_task(...)
- pto2_rt_submit_task(...)
- Arg.add_dep(task_id)
- alloc_tensors(...) remains output-only but returns a producer task id
What Is Explicitly Out Of Scope
- *_manual(...) submit APIs
- scope_end()
Core Runtime Model
Inside manual scope, the submit path is still the AUTO submit path plus two manual rules:
- explicit dependencies on Arg are validated and materialized as ordinary fanins before publish
Boundary behavior stays explicit:
- Arg.add_dep(...) must point to tasks from the current top scope
- Arg.add_dep(...) is still allowed for boundary edges from earlier producers
The runtime state for v0 is intentionally small:
- manual_begin_depth
User-Facing Shape
Manual scope uses the same submit calls as AUTO mode:
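A minimal sketch of that shape, using toy stand-ins rather than the real runtime. `ToyRuntime`, `ToyArg`, `ToyScopeMode`, the `last_writer_` map, and the depth counter are all hypothetical; only the `add_dep` name mirrors the PR. The sketch assumes that AUTO mode infers fanins from a tensor map while manual mode skips both the lookup and the insert and relies on explicit dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <vector>

using TaskId = std::uint64_t;
using TensorId = int;

enum class ToyScopeMode { AUTO, MANUAL };

// Hypothetical argument bundle; only the add_dep name follows the PR.
struct ToyArg {
    std::vector<TensorId> inputs;
    std::vector<TensorId> outputs;
    std::vector<TaskId> deps;
    ToyArg& add_dep(TaskId id) { deps.push_back(id); return *this; }
};

// Toy runtime: one submit path shared by both modes.
class ToyRuntime {
public:
    void scope_begin(ToyScopeMode mode) {
        modes_.push_back(mode);
        if (mode == ToyScopeMode::MANUAL) ++manual_depth_;
    }
    void scope_end() {
        assert(!modes_.empty());
        if (modes_.back() == ToyScopeMode::MANUAL) --manual_depth_;
        modes_.pop_back();
    }
    TaskId submit(const ToyArg& arg) {
        const TaskId id = fanins_.size();
        const bool manual = manual_depth_ > 0;
        std::vector<TaskId> fanin;
        if (!manual) {
            // AUTO: look up the last writer of each input tensor.
            for (TensorId t : arg.inputs) {
                auto it = last_writer_.find(t);
                if (it != last_writer_.end()) fanin.push_back(it->second);
            }
        }
        // Explicit deps are validated, then stored as ordinary fanins.
        // The toy accepts them in either mode; the PR may differ.
        for (TaskId dep : arg.deps) {
            if (dep >= id) throw std::invalid_argument("unknown task id");
            fanin.push_back(dep);
        }
        if (!manual) {
            // AUTO: record this task as the writer of its outputs.
            for (TensorId t : arg.outputs) last_writer_[t] = id;
        }
        fanins_.push_back(fanin);
        return id;
    }
    const std::vector<TaskId>& fanins(TaskId id) const { return fanins_.at(id); }

private:
    std::vector<ToyScopeMode> modes_;
    std::size_t manual_depth_ = 0;  // toy counter, not the PR's manual_begin_depth
    std::map<TensorId, TaskId> last_writer_;
    std::vector<std::vector<TaskId>> fanins_;
};
```

In this toy, a reader submitted inside a manual scope gets no fanin unless the caller adds one with `add_dep`, which is the trade the manual mode makes: less bookkeeping per submit, more responsibility on the caller.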
For repeated zero-output updater chains, the standalone task id is the key:
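A hedged sketch of such a chain, again with toy types: `ToySubmitter`, `ToyArg`, and `submit_update_chain` are hypothetical names, and the allocation-like first task only stands in for a producer whose id is returned at submit time. Because the updaters produce no output tensor, the returned task id is the only handle the next updater can depend on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using TaskId = std::uint64_t;

// Hypothetical argument bundle carrying only explicit dependencies.
struct ToyArg {
    std::vector<TaskId> deps;
    ToyArg& add_dep(TaskId id) { deps.push_back(id); return *this; }
};

// Hypothetical submitter: records each task's fanins and returns its id.
struct ToySubmitter {
    std::vector<std::vector<TaskId>> fanins;
    TaskId submit(const ToyArg& arg) {
        fanins.push_back(arg.deps);
        return fanins.size() - 1;
    }
};

// Submits one producer followed by `n` zero-output updaters.
// Each updater depends on the task submitted just before it, so the
// producer is attached only to the first updater, not to every one.
TaskId submit_update_chain(ToySubmitter& rt, std::size_t n) {
    TaskId prev = rt.submit(ToyArg{});  // stands in for the producer task
    for (std::size_t i = 0; i < n; ++i) {
        ToyArg arg;
        arg.add_dep(prev);       // first pass: producer; later: previous updater
        prev = rt.submit(arg);   // keep the standalone task id for the next link
    }
    return prev;
}
```

Ordering against the producer is still preserved transitively through the chain, which is why attaching it on every update is not required in this toy.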
Validation And Benchmarks
Primary design note:
- docs/manual-scope-v0-design.md
Fresh real-device validation was rerun on a2a3, device 9, PTO-ISA d96c8784.
Golden status:
- paged_attention: Case1 PASS, Case2 PASS
- paged_attention_manual_scope: Case1 PASS, Case2 PASS
- paged_attention_unroll: Case1 PASS, Case2 PASS
- paged_attention_unroll_manual_scope: Case1 PASS, Case2 PASS
- paged_attention: Case1 PASS, Case2 PASS
- paged_attention_unroll: Case1 PASS, Case2 FAIL
Fresh 100-round trimmed benchmark:
- paged_attention Case1
- paged_attention Case2
- paged_attention_unroll Case1
- paged_attention_unroll Case2
Reading of the current batch:
- Case2 is not a correctness-clean baseline in this rerun
Testing
- ctest --test-dir tests/ut/cpp/build -R 'test_a2a3_pto2_manual_scope_(api|runtime)' --output-on-failure
- python -m pytest tests/st/a2a3/tensormap_and_ringbuffer/test_manual_scope_validation.py --platform a2a3sim --device 0 -q
- a2a3, device 9