Add manual-scope v0 to tensormap runtime#568

Open
uv-xiao wants to merge 4 commits into hw-native-sys:main from uv-xiao:manual_scope_v0

Conversation

@uv-xiao
Contributor

@uv-xiao uv-xiao commented Apr 15, 2026

Summary

  • add a lighter manual_scope v0 mode to a2a3/tensormap_and_ringbuffer without introducing a separate manual submit API family
  • keep AUTO-mode submit entry points and express explicit ordering through Arg.add_dep(task_id)
  • publish tasks at submit time; do not add delayed wiring, delayed linking, or manual scope-end replay
  • carry standalone submit-result task ids so zero-output updater chains can be expressed explicitly
  • bypass TensorMap lookup/insert for current-manual-scope-local tensors while keeping explicit boundary-edge support
  • add manual-scope validation coverage, paged-attention manual-scope examples, and the design/benchmark note in docs/manual-scope-v0-design.md

What Is In Scope

This PR is the narrow v0 version.

  • PTO2_SCOPE() stays AUTO by default
  • PTO2_SCOPE(PTO2ScopeMode::MANUAL) enables manual mode
  • submit APIs stay unchanged:
    • pto2_rt_submit_aic_task(...)
    • pto2_rt_submit_aiv_task(...)
    • pto2_rt_submit_task(...)
  • explicit ordering is expressed through Arg.add_dep(task_id)
  • manual tasks are published immediately at submit time
  • alloc_tensors(...) remains output-only but returns a producer task id

What Is Explicitly Out Of Scope

  • no separate *_manual(...) submit APIs
  • no post-submit dependency API
  • no delayed wiring or delayed linking
  • no batch publish barrier at scope_end()
  • no nested manual scopes in v0
  • no TensorMap fallback for current-manual-scope-local tensors

Core Runtime Model

Inside manual scope, the submit path is still the AUTO submit path plus two manual rules:

  1. explicit deps from Arg are validated and materialized as ordinary fanins before publish
  2. TensorMap lookup and TensorMap insert are skipped for current-manual-scope-local tensors
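The two rules can be sketched against a mock runtime; everything below (the struct shapes, the `manual_begin_depth` flag, the address-keyed TensorMap) is a simplified stand-in for illustration, not the real implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Task { int64_t id; std::vector<int64_t> fanins; };

struct Runtime {
    int manual_begin_depth = 0;      // v0 manual-mode state
    int64_t next_task_id = 0;
    std::unordered_map<uint64_t, int64_t> tensor_map;  // tensor addr -> producer id

    Task submit(const std::vector<int64_t> &explicit_deps,
                const std::vector<uint64_t> &input_tensors,
                bool inputs_are_scope_local) {
        Task t{next_task_id++, {}};
        // Rule 1: explicit deps are validated, then become ordinary fanins.
        for (int64_t dep : explicit_deps) {
            assert(dep >= 0 && dep < t.id);  // must name an already-published task
            t.fanins.push_back(dep);
        }
        // Rule 2: skip TensorMap lookup/insert for scope-local tensors.
        if (!(manual_begin_depth > 0 && inputs_are_scope_local)) {
            for (uint64_t addr : input_tensors) {
                auto it = tensor_map.find(addr);
                if (it != tensor_map.end()) t.fanins.push_back(it->second);
            }
        }
        return t;  // published immediately at submit time
    }
};
```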

Boundary behavior stays explicit:

  • inside manual scope, Arg.add_dep(...) must point to tasks from the current top scope
  • outside manual scope, Arg.add_dep(...) is still allowed for boundary edges from earlier producers
  • creator retention still uses the existing tensor ownership metadata

The runtime state for v0 is intentionally small:

  • manual behavior is controlled by manual_begin_depth
  • older per-tensor / per-slot manual metadata from the previous branch has been removed

User-Facing Shape

Manual scope uses the same submit calls as AUTO mode:

```cpp
PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
    Arg qk = make_qk_args(...);
    auto qk_out = pto2_rt_submit_aic_task(FUNC_QK_MATMUL, qk);

    Arg sf = make_sf_args(...);
    sf.add_dep(qk_out.task_id());
    auto sf_out = pto2_rt_submit_aiv_task(FUNC_SOFTMAX_PREPARE, sf);

    Arg pv = make_pv_args(...);
    pv.add_dep(sf_out.task_id());
    auto pv_out = pto2_rt_submit_aic_task(FUNC_PV_MATMUL, pv);

    Arg up = make_update_args(...);
    up.add_dep(sf_out.task_id());
    up.add_dep(pv_out.task_id());
    (void)pto2_rt_submit_aiv_task(FUNC_ONLINE_UPDATE, up);
}
```

For repeated zero-output updater chains, the standalone task id is the key:

```cpp
PTO2TaskId prev_update = PTO2TaskId::invalid();

for (...) {
    Arg up = make_update_args(...);
    if (prev_update.is_valid()) {
        up.add_dep(prev_update);
    }
    TaskOutputTensors update_out = pto2_rt_submit_aiv_task(FUNC_ONLINE_UPDATE, up);
    prev_update = update_out.task_id();
}
```

Validation And Benchmarks

Primary design note:

  • docs/manual-scope-v0-design.md

Fresh real-device validation was rerun on a2a3, device 9, PTO-ISA d96c8784.

Golden status:

  • TMR AUTO paged_attention: Case1 PASS, Case2 PASS
  • TMR manual paged_attention_manual_scope: Case1 PASS, Case2 PASS
  • TMR AUTO paged_attention_unroll: Case1 PASS, Case2 PASS
  • TMR manual paged_attention_unroll_manual_scope: Case1 PASS, Case2 PASS
  • ABG paged_attention: Case1 PASS, Case2 PASS
  • ABG paged_attention_unroll: Case1 PASS, Case2 FAIL

Fresh 100-round trimmed benchmark:

| Example | Case | TMR AUTO Elapsed (us) | TMR AUTO Orch (us) | TMR Manual Elapsed (us) | TMR Manual Orch (us) | ABG Elapsed (us) | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| paged_attention | Case1 | 77.9 | 62.1 | 124.5 | 108.3 | 31625.5 | all correctness checks pass |
| paged_attention | Case2 | 93.9 | 72.9 | 141.9 | 118.9 | 16611.7 | all correctness checks pass |
| paged_attention_unroll | Case1 | 1135.8 | 762.2 | 1124.6 | 638.3 | 1384.1 | all correctness checks pass |
| paged_attention_unroll | Case2 | 517.1 | 305.9 | 494.5 | 251.2 | 675.7 | ABG golden fails |

Reading of the current batch:

  • non-unroll manual scope is still slower than TMR AUTO, and the gap is mostly orchestration time
  • unroll manual scope is slightly faster than TMR AUTO on both kept cases
  • ABG unroll Case2 is not a correctness-clean baseline in this rerun

Testing

  • ctest --test-dir tests/ut/cpp/build -R 'test_a2a3_pto2_manual_scope_(api|runtime)' --output-on-failure
  • python -m pytest tests/st/a2a3/tensormap_and_ringbuffer/test_manual_scope_validation.py --platform a2a3sim --device 0 -q
  • real-device golden reruns on a2a3, device 9
  • real-device 100-round timing reruns with device-log parsing for the paths above

@uv-xiao uv-xiao requested a review from poursoul April 15, 2026 08:03
@uv-xiao uv-xiao marked this pull request as ready for review April 15, 2026 08:03

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a lighter 'manual-scope' mode for the a2a3/tensormap_and_ringbuffer service, allowing for explicit task dependency management while maintaining the existing AUTO-mode submit API. The design avoids complex delayed wiring, opting for explicit Arg.add_dep(task_id) annotations within a PTO2_SCOPE(PTO2ScopeMode::MANUAL) block. I have identified a critical issue where PTO2_SCOPE_GUARD(); is used incorrectly, resulting in an immediate destruction of the guard and leaving the loop body unprotected. Additionally, the design document contains a future date that should be corrected.

```cpp
const Tensor &mi_update = alloc_outs.get_ref(2);

for (uint64_t bn = 0; bn < bn_this_batch; bn++) {
    PTO2_SCOPE_GUARD();
```

critical

The statement PTO2_SCOPE_GUARD(); creates a temporary PTO2ScopeGuard object that is immediately destroyed at the semicolon. This means the scope it's intended to guard is empty, and the rest of the loop body executes outside of any PTO2 scope. This is likely not the intended behavior and can lead to subtle bugs.

To ensure the scope is active for the entire for loop block, you should declare a named PTO2ScopeGuard variable.

                    PTO2_SCOPE_GUARD guard;
References
  1. Use RAII guards to manage resources like thread-specific data and device contexts, ensuring cleanup is automatically handled on all function exit paths.

@@ -0,0 +1,423 @@
# Manual Scope V0 Design

Date: 2026-04-15

medium

The date 2026-04-15 appears to be a typo, as it is in the future. Please consider correcting it.

@uv-xiao
Contributor Author

uv-xiao commented Apr 15, 2026

Latest update on top of the current PR branch:

What changed

  • TaskOutputTensors now carries a standalone task_id, so submit/alloc results expose a usable task-id handle even for zero-output updater tasks.
  • The runtime now fully bypasses TensorMap lookup and insert for current-manual-scope-local tensors; boundary tensors still keep the conservative TensorMap path.
  • The paged-attention manual-scope examples now explicitly chain repeated updater tasks with prev_update_task instead of relying on manual-local TensorMap fallback.
  • The design note was refreshed to match the implemented semantics, fresh validation, benchmark results, and the new TensorMap profiling breakdown.

Fresh benchmark results

30 rounds, trimmed average, device 9, PTO-ISA d96c8784.

| Example | Case | Auto Elapsed (us) | Auto Orch (us) | Manual Elapsed (us) | Manual Orch (us) | Elapsed Delta | Orch Delta |
| --- | --- | --- | --- | --- | --- | --- |
| paged_attention | Case1 | 77.5 | 62.4 | 124.0 | 108.6 | +46.5 | +46.2 |
| paged_attention | Case2 | 96.2 | 74.1 | 146.7 | 124.6 | +50.5 | +50.5 |
| paged_attention_unroll | Case1 | 1138.4 | 800.7 | 1131.3 | 694.9 | -7.1 | -105.8 |
| paged_attention_unroll | Case2 | 519.7 | 319.1 | 511.3 | 282.6 | -8.4 | -36.5 |

Fresh TensorMap profiling

Non-unroll paged_attention, 30 rounds, device-log parsing from a profiling-only rebuild.

| Case | Mode | lookup+dep (us) | tensormap_ins (us) | Lookups | Inserts | Full Orch (us) |
| --- | --- | --- | --- | --- | --- | --- |
| Case1 | AUTO | 4.132 | 1.842 | 40.0 | 12.0 | 194.508 |
| Case1 | MANUAL | 1.944 | 1.414 | 16.0 | 3.0 | 259.318 |
| Case2 | AUTO | 6.320 | 2.638 | 105.0 | 32.0 | 210.274 |
| Case2 | MANUAL | 2.598 | 1.728 | 41.0 | 8.0 | 285.182 |

What these numbers show:

  • the manual-local TensorMap bypass is working: lookups dropped about 60%, inserts dropped about 75%, and lookup+dep time dropped about 53% to 59%
  • the remaining non-unroll regression is not coming from TensorMap lookup / insert anymore
  • the next optimization target should be explicit-dep construction in orchestration and explicit-dep validation / dedupe in the runtime submit path

The design note in docs/manual-scope-v0-design.md now records these results and the corresponding interpretation.
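As a hypothetical illustration of what explicit-dep validation / dedupe in the submit path could look like, a sort-and-unique pass keeps repeated `add_dep(...)` calls on the same producer from materializing duplicate fanins; this is a sketch of the idea, not the runtime's code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Deduplicate and validate an explicit-dep list in O(n log n), so a
// producer named several times costs one fanin, not many.
// current_task_id is the id the submitting task will receive; every dep
// must name an already-published (strictly smaller) task id.
std::vector<int64_t> dedupe_deps(std::vector<int64_t> deps, int64_t current_task_id) {
    std::sort(deps.begin(), deps.end());
    deps.erase(std::unique(deps.begin(), deps.end()), deps.end());
    for (int64_t d : deps) {
        assert(d >= 0 && d < current_task_id);  // v0 rule: deps point backwards
    }
    return deps;
}
```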

@uv-xiao
Contributor Author

uv-xiao commented Apr 20, 2026

Alignment update against poursoul/manual_scope (dd76880):

  • The current manual_scope_v0 branch now follows the same hot-path structure in the main places that matter:
    • pto2_prepare_task() prebinds next_task_id() / next_task_slot() and zero-inits slot state with memset
    • TaskOutputTensors carries task_id() directly
    • payload->init(...) materializes outputs from PTO2TaskAllocResult / PTO2OutputLayout
    • manual-scope submit skips TensorMap lookup/insert and uses explicit deps directly
    • the unroll manual example uses the same narrowed update dependency chain shape
  • The intentional difference we keep is allocator-failure signaling:
    • this branch keeps the old negative sentinel {-1, -1, nullptr, nullptr}
    • poursoul’s visible patch returned {0, 0, nullptr, nullptr} while failed() still checked task_id < 0
    • we kept the negative sentinel because task id 0 is valid, so the zero-sentinel form is internally inconsistent unless the failure check also changes
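A minimal mock makes the inconsistency concrete (the field names are illustrative, not the real `PTO2TaskAllocResult` layout):

```cpp
#include <cassert>
#include <cstdint>

// Mock of the alloc-result shape discussed above. With failed() defined as
// task_id < 0, a zero sentinel {0, 0, nullptr, nullptr} is indistinguishable
// from a successful allocation that happened to get task id 0, which is why
// this branch keeps the negative sentinel.
struct MockAllocResult {
    int64_t task_id;
    int64_t slot;
    void *a;
    void *b;
    bool failed() const { return task_id < 0; }
};
```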

I also benchmarked the current branch and poursoul’s branch directly in an isolated worktree on the same device (a2a3, device 9, 100 rounds, PTO-ISA d96c8784). The result is that the two are materially the same; there is no clear evidence that our current branch is slower because we mis-merged the colleague design.

| Example | Case | Current Elapsed / Orch | Poursoul Elapsed / Orch | Poursoul vs Current |
| --- | --- | --- | --- | --- |
| paged_attention | Case1 | 74.6 / 59.2 | 77.3 / 61.4 | +3.6% / +3.7% |
| paged_attention | Case2 | 93.2 / 72.3 | 93.1 / 71.9 | -0.1% / -0.6% |
| paged_attention_manual_scope | Case1 | 118.1 / 101.3 | 120.9 / 105.6 | +2.4% / +4.2% |
| paged_attention_manual_scope | Case2 | 138.6 / 115.3 | 128.7 / 115.7 | -7.1% / +0.4% |
| paged_attention_unroll | Case1 | 1135.2 / 774.1 | 1140.1 / 766.9 | +0.4% / -0.9% |
| paged_attention_unroll | Case2 | 516.0 / 305.5 | 520.2 / 319.1 | +0.8% / +4.5% |
| paged_attention_unroll_manual_scope | Case1 | 1129.6 / 651.7 | 1128.9 / 646.2 | -0.1% / -0.8% |
| paged_attention_unroll_manual_scope | Case2 | 494.3 / 253.6 | 495.4 / 252.5 | +0.2% / -0.4% |

So the current state is:

  • alignment with poursoul’s design is in place for the main runtime/example paths
  • performance is effectively the same between the two branches within normal run-to-run noise for most rows
  • the remaining non-unroll gap vs AUTO is a real manual-scope-v0 issue, not just a divergence from poursoul’s branch

@uv-xiao
Contributor Author

uv-xiao commented Apr 20, 2026

Update after rerunning the full batch on real device.

Cleaned state

This PR is still the same narrow v0 scope only:

  • runtime/API support for PTO2_SCOPE(PTO2ScopeMode::MANUAL)
  • explicit deps through Arg.add_dep(task_id)
  • submit-time publish only
  • standalone submit-result task_id for zero-output updater chaining
  • TensorMap lookup/insert bypass for current-manual-scope-local tensors
  • two a2a3 manual-scope examples:
    • paged_attention_manual_scope
    • paged_attention_unroll_manual_scope
  • unit / sim validation for the v0 rules

The branch is kept as 3 logical commits:

  1. runtime support
  2. paged-attention examples
  3. design / benchmark doc

Fresh real-device benchmark

Rerun on a2a3, device 9, PTO-ISA d96c8784, 100 rounds, trimmed average.

Golden status:

  • TMR AUTO/manual all PASS on the two paged-attention workloads
  • ABG paged_attention PASS
  • ABG paged_attention_unroll Case2 FAIL again, so that ABG row is still not correctness-clean

| Example | Case | TMR AUTO Elapsed (us) | TMR AUTO Orch (us) | TMR Manual Elapsed (us) | TMR Manual Orch (us) | ABG Elapsed (us) | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| paged_attention | Case1 | 73.4 | 60.2 | 119.6 | 104.8 | 31385.1 | all correctness checks pass |
| paged_attention | Case2 | 93.9 | 73.3 | 137.1 | 114.6 | 16429.4 | all correctness checks pass |
| paged_attention_unroll | Case1 | 1137.0 | 772.7 | 1132.2 | 647.4 | 1383.3 | all correctness checks pass |
| paged_attention_unroll | Case2 | 523.0 | 317.7 | 492.6 | 251.2 | 676.7 | ABG golden fails |

Reading of the current batch:

  • non-unroll manual scope is still slower than TMR AUTO, and the gap is mostly orchestration time
  • unroll manual scope is still slightly faster than TMR AUTO on both kept cases
  • ABG paged_attention_unroll Case2 remains an unstable / not correctness-clean baseline in reruns

TensorMap lookup / insert comparison

Profiling comparison from the current manual-scope implementation on non-unroll paged_attention:

| Case | Mode | lookup+dep Trim (us) | tensormap_ins Trim (us) | TensorMap Lookups Avg | TensorMap Inserts Avg | Full Orch Trim (us) |
| --- | --- | --- | --- | --- | --- | --- |
| Case1 | AUTO | 4.132 | 1.842 | 40.0 | 12.0 | 194.508 |
| Case1 | MANUAL | 1.944 | 1.414 | 16.0 | 3.0 | 259.318 |
| Case2 | AUTO | 6.320 | 2.638 | 105.0 | 32.0 | 210.274 |
| Case2 | MANUAL | 2.598 | 1.728 | 41.0 | 8.0 | 285.182 |

What this still shows:

  • the manual-local TensorMap bypass is working
  • lookup / insert traffic is materially reduced in manual mode
  • the remaining non-unroll gap is not explained by TensorMap lookup / insert alone
  • the remaining cost is in the explicit-dep / orchestration path, not the old TensorMap path

uv-xiao added 3 commits April 21, 2026 01:47
- Add manual scope mode and explicit Arg dependency plumbing
- Attach submit-result task ids independently from output tensors
- Bypass TensorMap lookup and insert while manual scope is active
- Keep the runtime/examples/docs scope without adding test changes
- Add non-unroll and unroll manual-scope examples for a2a3 TMR
- Wire task ids through Arg.add_dep at submit time
- Keep AUTO paged-attention available as the comparison path
- Document v0 API constraints and submit-time dependency model
- Record TensorMap bypass behavior and boundary-edge rules
- Include the current device benchmark and validation notes
- Make non-unroll manual paged-attention use the same update-chain dependency shape as the unroll manual path
- Gate alloc-task retention with is_first/is_last instead of attaching it on every update
- Verified with fresh hardware golden and 100-round reruns on device 9
@uv-xiao
Contributor Author

uv-xiao commented Apr 21, 2026

Small follow-up after aligning the two manual-scope paged-attention examples.

What changed

The non-unroll manual example now uses the same update-chain dependency shape as the unroll manual example:

  • every update keeps the direct pv_outs.task_id() producer edge
  • the first update depends on the allocation task
  • later updates depend on the previous update task
  • the last non-first update also retains the allocation task, matching the unroll path

This removes the older conservative pattern where non-unroll attached alloc_task to every update.

Targeted real-device rerun

Rerun only the affected non-unroll manual example on a2a3, device 9, PTO-ISA d96c8784, with --build.

Golden:

  • paged_attention_manual_scope Case1: PASS
  • paged_attention_manual_scope Case2: PASS

100-round trimmed benchmark for the affected rows:

| Example | Case | Before Manual Elapsed (us) | Before Manual Orch (us) | After Manual Elapsed (us) | After Manual Orch (us) |
| --- | --- | --- | --- | --- | --- |
| paged_attention_manual_scope | Case1 | 119.6 | 104.8 | 117.3 | 102.8 |
| paged_attention_manual_scope | Case2 | 137.1 | 114.6 | 133.8 | 112.4 |

So this is mostly a consistency cleanup, with a small measured improvement on the non-unroll manual path.
