Add manual-scope v0 to tensormap runtime#568

Open
uv-xiao wants to merge 4 commits into hw-native-sys:main from uv-xiao:manual_scope_v0

Conversation

@uv-xiao
Contributor

@uv-xiao uv-xiao commented Apr 15, 2026

Summary

  • add a lighter manual_scope v0 mode to a2a3/tensormap_and_ringbuffer without introducing a separate manual submit API family
  • keep AUTO-mode submit entry points and express explicit ordering through Arg.add_dep(task_id)
  • publish tasks at submit time; do not add delayed wiring, delayed linking, or manual scope-end replay
  • carry standalone submit-result task ids so zero-output updater chains can be expressed explicitly
  • bypass TensorMap lookup/insert for current-manual-scope-local tensors while keeping explicit boundary-edge support
  • add manual-scope validation coverage, paged-attention manual-scope examples, and the design/benchmark note in docs/manual-scope-v0-design.md

What Is In Scope

This PR is the narrow v0 version.

  • PTO2_SCOPE() stays AUTO by default
  • PTO2_SCOPE(PTO2ScopeMode::MANUAL) enables manual mode
  • submit APIs stay unchanged:
    • pto2_rt_submit_aic_task(...)
    • pto2_rt_submit_aiv_task(...)
    • pto2_rt_submit_task(...)
  • explicit ordering is expressed through Arg.add_dep(task_id)
  • manual tasks are published immediately at submit time
  • alloc_tensors(...) remains output-only but returns a producer task id

What Is Explicitly Out Of Scope

  • no separate *_manual(...) submit APIs
  • no post-submit dependency API
  • no delayed wiring or delayed linking
  • no batch publish barrier at scope_end()
  • no nested manual scopes in v0
  • no TensorMap fallback for current-manual-scope-local tensors

Core Runtime Model

Inside manual scope, the submit path is still the AUTO submit path plus two manual rules:

  1. explicit deps from Arg are validated and materialized as ordinary fanins before publish
  2. TensorMap lookup and TensorMap insert are skipped for current-manual-scope-local tensors
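The two rules can be sketched against a mock runtime; everything below (the struct shapes, the `manual_begin_depth` flag, the address-keyed TensorMap) is a simplified stand-in for illustration, not the real implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Task { int64_t id; std::vector<int64_t> fanins; };

struct Runtime {
    int manual_begin_depth = 0;      // v0 manual-mode state
    int64_t next_task_id = 0;
    std::unordered_map<uint64_t, int64_t> tensor_map;  // tensor addr -> producer id

    Task submit(const std::vector<int64_t> &explicit_deps,
                const std::vector<uint64_t> &input_tensors,
                bool inputs_are_scope_local) {
        Task t{next_task_id++, {}};
        // Rule 1: explicit deps are validated, then become ordinary fanins.
        for (int64_t dep : explicit_deps) {
            assert(dep >= 0 && dep < t.id);  // must name an already-published task
            t.fanins.push_back(dep);
        }
        // Rule 2: skip TensorMap lookup/insert for scope-local tensors.
        if (!(manual_begin_depth > 0 && inputs_are_scope_local)) {
            for (uint64_t addr : input_tensors) {
                auto it = tensor_map.find(addr);
                if (it != tensor_map.end()) t.fanins.push_back(it->second);
            }
        }
        return t;  // published immediately at submit time
    }
};
```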

Boundary behavior stays explicit:

  • inside manual scope, Arg.add_dep(...) must point to tasks from the current top scope
  • outside manual scope, Arg.add_dep(...) is still allowed for boundary edges from earlier producers
  • creator retention still uses the existing tensor ownership metadata

The runtime state for v0 is intentionally small:

  • manual behavior is controlled by manual_begin_depth
  • older per-tensor / per-slot manual metadata from the previous branch has been removed

User-Facing Shape

Manual scope uses the same submit calls as AUTO mode:

```cpp
PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
    Arg qk = make_qk_args(...);
    auto qk_out = pto2_rt_submit_aic_task(FUNC_QK_MATMUL, qk);

    Arg sf = make_sf_args(...);
    sf.add_dep(qk_out.task_id());
    auto sf_out = pto2_rt_submit_aiv_task(FUNC_SOFTMAX_PREPARE, sf);

    Arg pv = make_pv_args(...);
    pv.add_dep(sf_out.task_id());
    auto pv_out = pto2_rt_submit_aic_task(FUNC_PV_MATMUL, pv);

    Arg up = make_update_args(...);
    up.add_dep(sf_out.task_id());
    up.add_dep(pv_out.task_id());
    (void)pto2_rt_submit_aiv_task(FUNC_ONLINE_UPDATE, up);
}
```

For repeated zero-output updater chains, the standalone task id is the key:

```cpp
PTO2TaskId prev_update = PTO2TaskId::invalid();

for (...) {
    Arg up = make_update_args(...);
    if (prev_update.is_valid()) {
        up.add_dep(prev_update);
    }
    TaskOutputTensors update_out = pto2_rt_submit_aiv_task(FUNC_ONLINE_UPDATE, up);
    prev_update = update_out.task_id();
}
```

Validation And Benchmarks

Primary design note:

  • docs/manual-scope-v0-design.md

Fresh real-device validation was rerun on a2a3, device 9, PTO-ISA d96c8784.

Golden status:

  • TMR AUTO paged_attention: Case1 PASS, Case2 PASS
  • TMR manual paged_attention_manual_scope: Case1 PASS, Case2 PASS
  • TMR AUTO paged_attention_unroll: Case1 PASS, Case2 PASS
  • TMR manual paged_attention_unroll_manual_scope: Case1 PASS, Case2 PASS
  • ABG paged_attention: Case1 PASS, Case2 PASS
  • ABG paged_attention_unroll: Case1 PASS, Case2 FAIL

Fresh 100-round trimmed benchmark:

| Example | Case | TMR AUTO Elapsed (us) | TMR AUTO Orch (us) | TMR Manual Elapsed (us) | TMR Manual Orch (us) | ABG Elapsed (us) | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| paged_attention | Case1 | 77.9 | 62.1 | 124.5 | 108.3 | 31625.5 | all correctness checks pass |
| paged_attention | Case2 | 93.9 | 72.9 | 141.9 | 118.9 | 16611.7 | all correctness checks pass |
| paged_attention_unroll | Case1 | 1135.8 | 762.2 | 1124.6 | 638.3 | 1384.1 | all correctness checks pass |
| paged_attention_unroll | Case2 | 517.1 | 305.9 | 494.5 | 251.2 | 675.7 | ABG golden fails |

Reading of the current batch:

  • non-unroll manual scope is still slower than TMR AUTO, and the gap is mostly orchestration time
  • unroll manual scope is slightly faster than TMR AUTO on both kept cases
  • ABG unroll Case2 is not a correctness-clean baseline in this rerun

Testing

  • ctest --test-dir tests/ut/cpp/build -R 'test_a2a3_pto2_manual_scope_(api|runtime)' --output-on-failure
  • python -m pytest tests/st/a2a3/tensormap_and_ringbuffer/test_manual_scope_validation.py --platform a2a3sim --device 0 -q
  • real-device golden reruns on a2a3, device 9
  • real-device 100-round timing reruns with device-log parsing for the paths above

@uv-xiao uv-xiao requested a review from poursoul April 15, 2026 08:03
@uv-xiao uv-xiao marked this pull request as ready for review April 15, 2026 08:03

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a lighter 'manual-scope' mode for the a2a3/tensormap_and_ringbuffer service, allowing for explicit task dependency management while maintaining the existing AUTO-mode submit API. The design avoids complex delayed wiring, opting for explicit Arg.add_dep(task_id) annotations within a PTO2_SCOPE(PTO2ScopeMode::MANUAL) block. I have identified a critical issue where PTO2_SCOPE_GUARD(); is used incorrectly, resulting in an immediate destruction of the guard and leaving the loop body unprotected. Additionally, the design document contains a future date that should be corrected.

```cpp
const Tensor &mi_update = alloc_outs.get_ref(2);

for (uint64_t bn = 0; bn < bn_this_batch; bn++) {
    PTO2_SCOPE_GUARD();
```

critical

The statement PTO2_SCOPE_GUARD(); creates a temporary PTO2ScopeGuard object that is immediately destroyed at the semicolon. This means the scope it's intended to guard is empty, and the rest of the loop body executes outside of any PTO2 scope. This is likely not the intended behavior and can lead to subtle bugs.

To ensure the scope is active for the entire for loop block, you should declare a named PTO2ScopeGuard variable.

                    PTO2_SCOPE_GUARD guard;
References
  1. Use RAII guards to manage resources like thread-specific data and device contexts, ensuring cleanup is automatically handled on all function exit paths.

@@ -0,0 +1,423 @@
# Manual Scope V0 Design

Date: 2026-04-15

medium

The date 2026-04-15 appears to be a typo, as it is in the future. Please consider correcting it.

@uv-xiao
Contributor Author

uv-xiao commented Apr 15, 2026

Latest update on top of the current PR branch:

What changed

  • TaskOutputTensors now carries a standalone task_id, so submit/alloc results expose a usable task-id handle even for zero-output updater tasks.
  • The runtime now fully bypasses TensorMap lookup and insert for current-manual-scope-local tensors; boundary tensors still keep the conservative TensorMap path.
  • The paged-attention manual-scope examples now explicitly chain repeated updater tasks with prev_update_task instead of relying on manual-local TensorMap fallback.
  • The design note was refreshed to match the implemented semantics, fresh validation, benchmark results, and the new TensorMap profiling breakdown.

Fresh benchmark results

30 rounds, trimmed average, device 9, PTO-ISA d96c8784.

| Example | Case | Auto Elapsed (us) | Auto Orch (us) | Manual Elapsed (us) | Manual Orch (us) | Elapsed Delta | Orch Delta |
| --- | --- | --- | --- | --- | --- | --- |
| paged_attention | Case1 | 77.5 | 62.4 | 124.0 | 108.6 | +46.5 | +46.2 |
| paged_attention | Case2 | 96.2 | 74.1 | 146.7 | 124.6 | +50.5 | +50.5 |
| paged_attention_unroll | Case1 | 1138.4 | 800.7 | 1131.3 | 694.9 | -7.1 | -105.8 |
| paged_attention_unroll | Case2 | 519.7 | 319.1 | 511.3 | 282.6 | -8.4 | -36.5 |

Fresh TensorMap profiling

Non-unroll paged_attention, 30 rounds, device-log parsing from a profiling-only rebuild.

| Case | Mode | lookup+dep (us) | tensormap_ins (us) | Lookups | Inserts | Full Orch (us) |
| --- | --- | --- | --- | --- | --- | --- |
| Case1 | AUTO | 4.132 | 1.842 | 40.0 | 12.0 | 194.508 |
| Case1 | MANUAL | 1.944 | 1.414 | 16.0 | 3.0 | 259.318 |
| Case2 | AUTO | 6.320 | 2.638 | 105.0 | 32.0 | 210.274 |
| Case2 | MANUAL | 2.598 | 1.728 | 41.0 | 8.0 | 285.182 |

What these numbers show:

  • the manual-local TensorMap bypass is working: lookups dropped about 60%, inserts dropped about 75%, and lookup+dep time dropped about 53% to 59%
  • the remaining non-unroll regression is not coming from TensorMap lookup / insert anymore
  • the next optimization target should be explicit-dep construction in orchestration and explicit-dep validation / dedupe in the runtime submit path

The design note in docs/manual-scope-v0-design.md now records these results and the corresponding interpretation.
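As a hypothetical illustration of what explicit-dep validation / dedupe in the submit path could look like, a sort-and-unique pass keeps repeated `add_dep(...)` calls on the same producer from materializing duplicate fanins; this is a sketch of the idea, not the runtime's code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Deduplicate and validate an explicit-dep list in O(n log n), so a
// producer named several times costs one fanin, not many.
// current_task_id is the id the submitting task will receive; every dep
// must name an already-published (strictly smaller) task id.
std::vector<int64_t> dedupe_deps(std::vector<int64_t> deps, int64_t current_task_id) {
    std::sort(deps.begin(), deps.end());
    deps.erase(std::unique(deps.begin(), deps.end()), deps.end());
    for (int64_t d : deps) {
        assert(d >= 0 && d < current_task_id);  // v0 rule: deps point backwards
    }
    return deps;
}
```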

@uv-xiao
Contributor Author

uv-xiao commented Apr 20, 2026

Alignment update against poursoul/manual_scope (dd76880):

  • The current manual_scope_v0 branch now follows the same hot-path structure in the main places that matter:
    • pto2_prepare_task() prebinds next_task_id() / next_task_slot() and zero-inits slot state with memset
    • TaskOutputTensors carries task_id() directly
    • payload->init(...) materializes outputs from PTO2TaskAllocResult / PTO2OutputLayout
    • manual-scope submit skips TensorMap lookup/insert and uses explicit deps directly
    • the unroll manual example uses the same narrowed update dependency chain shape
  • The intentional difference we keep is allocator-failure signaling:
    • this branch keeps the old negative sentinel {-1, -1, nullptr, nullptr}
    • poursoul’s visible patch returned {0, 0, nullptr, nullptr} while failed() still checked task_id < 0
    • we kept the negative sentinel because task id 0 is valid, so the zero-sentinel form is internally inconsistent unless the failure check also changes
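A minimal mock makes the inconsistency concrete (the field names are illustrative, not the real `PTO2TaskAllocResult` layout):

```cpp
#include <cassert>
#include <cstdint>

// Mock of the alloc-result shape discussed above. With failed() defined as
// task_id < 0, a zero sentinel {0, 0, nullptr, nullptr} is indistinguishable
// from a successful allocation that happened to get task id 0, which is why
// this branch keeps the negative sentinel.
struct MockAllocResult {
    int64_t task_id;
    int64_t slot;
    void *a;
    void *b;
    bool failed() const { return task_id < 0; }
};
```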

I also benchmarked the current branch and poursoul’s branch directly in an isolated worktree on the same device (a2a3, device 9, 100 rounds, PTO-ISA d96c8784). The result is that the two are materially the same; there is no clear evidence that our current branch is slower because we mis-merged the colleague design.

| Example | Case | Current Elapsed / Orch | Poursoul Elapsed / Orch | Poursoul vs Current |
| --- | --- | --- | --- | --- |
| paged_attention | Case1 | 74.6 / 59.2 | 77.3 / 61.4 | +3.6% / +3.7% |
| paged_attention | Case2 | 93.2 / 72.3 | 93.1 / 71.9 | -0.1% / -0.6% |
| paged_attention_manual_scope | Case1 | 118.1 / 101.3 | 120.9 / 105.6 | +2.4% / +4.2% |
| paged_attention_manual_scope | Case2 | 138.6 / 115.3 | 128.7 / 115.7 | -7.1% / +0.4% |
| paged_attention_unroll | Case1 | 1135.2 / 774.1 | 1140.1 / 766.9 | +0.4% / -0.9% |
| paged_attention_unroll | Case2 | 516.0 / 305.5 | 520.2 / 319.1 | +0.8% / +4.5% |
| paged_attention_unroll_manual_scope | Case1 | 1129.6 / 651.7 | 1128.9 / 646.2 | -0.1% / -0.8% |
| paged_attention_unroll_manual_scope | Case2 | 494.3 / 253.6 | 495.4 / 252.5 | +0.2% / -0.4% |

So the current state is:

  • alignment with poursoul’s design is in place for the main runtime/example paths
  • performance is effectively the same between the two branches within normal run-to-run noise for most rows
  • the remaining non-unroll gap vs AUTO is a real manual-scope-v0 issue, not just a divergence from poursoul’s branch

@uv-xiao
Contributor Author

uv-xiao commented Apr 20, 2026

Update after rerunning the full batch on real device.

Cleaned state

This PR is still the same narrow v0 scope only:

  • runtime/API support for PTO2_SCOPE(PTO2ScopeMode::MANUAL)
  • explicit deps through Arg.add_dep(task_id)
  • submit-time publish only
  • standalone submit-result task_id for zero-output updater chaining
  • TensorMap lookup/insert bypass for current-manual-scope-local tensors
  • two a2a3 manual-scope examples:
    • paged_attention_manual_scope
    • paged_attention_unroll_manual_scope
  • unit / sim validation for the v0 rules

The branch is kept as 3 logical commits:

  1. runtime support
  2. paged-attention examples
  3. design / benchmark doc

Fresh real-device benchmark

Rerun on a2a3, device 9, PTO-ISA d96c8784, 100 rounds, trimmed average.

Golden status:

  • TMR AUTO/manual all PASS on the two paged-attention workloads
  • ABG paged_attention PASS
  • ABG paged_attention_unroll Case2 FAIL again, so that ABG row is still not correctness-clean

| Example | Case | TMR AUTO Elapsed (us) | TMR AUTO Orch (us) | TMR Manual Elapsed (us) | TMR Manual Orch (us) | ABG Elapsed (us) | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| paged_attention | Case1 | 73.4 | 60.2 | 119.6 | 104.8 | 31385.1 | all correctness checks pass |
| paged_attention | Case2 | 93.9 | 73.3 | 137.1 | 114.6 | 16429.4 | all correctness checks pass |
| paged_attention_unroll | Case1 | 1137.0 | 772.7 | 1132.2 | 647.4 | 1383.3 | all correctness checks pass |
| paged_attention_unroll | Case2 | 523.0 | 317.7 | 492.6 | 251.2 | 676.7 | ABG golden fails |

Reading of the current batch:

  • non-unroll manual scope is still slower than TMR AUTO, and the gap is mostly orchestration time
  • unroll manual scope is still slightly faster than TMR AUTO on both kept cases
  • ABG paged_attention_unroll Case2 remains an unstable / not correctness-clean baseline in reruns

TensorMap lookup / insert comparison

Profiling comparison from the current manual-scope implementation on non-unroll paged_attention:

| Case | Mode | lookup+dep Trim (us) | tensormap_ins Trim (us) | TensorMap Lookups Avg | TensorMap Inserts Avg | Full Orch Trim (us) |
| --- | --- | --- | --- | --- | --- | --- |
| Case1 | AUTO | 4.132 | 1.842 | 40.0 | 12.0 | 194.508 |
| Case1 | MANUAL | 1.944 | 1.414 | 16.0 | 3.0 | 259.318 |
| Case2 | AUTO | 6.320 | 2.638 | 105.0 | 32.0 | 210.274 |
| Case2 | MANUAL | 2.598 | 1.728 | 41.0 | 8.0 | 285.182 |

What this still shows:

  • the manual-local TensorMap bypass is working
  • lookup / insert traffic is materially reduced in manual mode
  • the remaining non-unroll gap is not explained by TensorMap lookup / insert alone
  • the remaining cost is in the explicit-dep / orchestration path, not the old TensorMap path

uv-xiao added 3 commits April 21, 2026 01:47
- Add manual scope mode and explicit Arg dependency plumbing
- Attach submit-result task ids independently from output tensors
- Bypass TensorMap lookup and insert while manual scope is active
- Keep the runtime/examples/docs scope without adding test changes
- Add non-unroll and unroll manual-scope examples for a2a3 TMR
- Wire task ids through Arg.add_dep at submit time
- Keep AUTO paged-attention available as the comparison path
- Document v0 API constraints and submit-time dependency model
- Record TensorMap bypass behavior and boundary-edge rules
- Include the current device benchmark and validation notes
- Make non-unroll manual paged-attention use the same update-chain dependency shape as the unroll manual path
- Gate alloc-task retention with is_first/is_last instead of attaching it on every update
- Verified with fresh hardware golden and 100-round reruns on device 9
@uv-xiao
Contributor Author

uv-xiao commented Apr 21, 2026

Small follow-up after aligning the two manual-scope paged-attention examples.

What changed

The non-unroll manual example now uses the same update-chain dependency shape as the unroll manual example:

  • every update keeps the direct pv_outs.task_id() producer edge
  • the first update depends on the allocation task
  • later updates depend on the previous update task
  • the last non-first update also retains the allocation task, matching the unroll path

This removes the older conservative pattern where non-unroll attached alloc_task to every update.

Targeted real-device rerun

Rerun only the affected non-unroll manual example on a2a3, device 9, PTO-ISA d96c8784, with --build.

Golden:

  • paged_attention_manual_scope Case1: PASS
  • paged_attention_manual_scope Case2: PASS

100-round trimmed benchmark for the affected rows:

| Example | Case | Before Manual Elapsed (us) | Before Manual Orch (us) | After Manual Elapsed (us) | After Manual Orch (us) |
| --- | --- | --- | --- | --- | --- |
| paged_attention_manual_scope | Case1 | 119.6 | 104.8 | 117.3 | 102.8 |
| paged_attention_manual_scope | Case2 | 137.1 | 114.6 | 133.8 | 112.4 |

So this is mostly a consistency cleanup, with a small measured improvement on the non-unroll manual path.
