Refactor: introduce tiered profiling levels for a2a3 tensormap_and_ringbuffer swimlane export#500

Open
indigo1973 wants to merge 1 commit into hw-native-sys:main from indigo1973:prof_0409

Conversation

@indigo1973
Contributor

@indigo1973 indigo1973 commented Apr 9, 2026

Replace the boolean enable_profiling flag with an integer perf_level
throughout the profiling pipeline (CLI → Python bindings → C++ runtime →
AICPU executor → PerformanceCollector → JSON export).

Profiling levels:
  0 = off
  1 = AICore task start/end timing only (JSON version 0)
  2 = + dispatch timestamps, finish timestamps, fanout edges (JSON version 1)
  3 = + AICPU scheduler/orchestrator phase buffers (JSON version 2)

Key changes:
- ChipCallConfig: bool enable_profiling → int perf_level
- CLI --enable-profiling: store_true → optional int (bare flag defaults to 3)
- nanobind property: backward-compatible bool→3 coercion for legacy callers
- AICPU executor: split into task_recording_enabled (>0) vs
  phase_recording_enabled (>=3) to skip phase overhead at lower levels
- PerformanceCollector: skip phase buffer allocation when perf_level < 3;
  version selection based on perf_level and presence of phase data
- swimlane_converter.py: accept version 0, tolerate missing fanout field
- Fix scene_test.py: `val and cond` truncated int to bool; use ternary
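The CLI change ("store_true → optional int, bare flag defaults to 3") maps naturally onto argparse's `nargs="?"`/`const` pattern. A minimal sketch of what such a flag definition might look like — the parser setup and `dest` name here are illustrative assumptions, not the project's actual CLI code:

```python
import argparse

parser = argparse.ArgumentParser()
# Bare --enable-profiling -> const (3, full profiling);
# flag absent -> default (0, off); explicit value -> that level.
parser.add_argument(
    "--enable-profiling",
    nargs="?", type=int, const=3, default=0,
    choices=[0, 1, 2, 3], metavar="LEVEL",
    dest="perf_level",
)

print(parser.parse_args([]).perf_level)                           # 0
print(parser.parse_args(["--enable-profiling"]).perf_level)       # 3
print(parser.parse_args(["--enable-profiling", "2"]).perf_level)  # 2
```

This keeps `--enable-profiling` backward compatible for callers who passed it as a boolean switch, since the bare flag now means "full profiling" rather than becoming an error.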


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a more granular performance profiling system by replacing the boolean 'enable_profiling' flag with an integer 'perf_level' across the codebase. This change allows for multiple profiling modes (0=off, 1=AICore-only, 2=task+fanout, 3=full) and includes updates to the runtime, device runner, and performance collector to handle these levels. Additionally, it updates the swimlane JSON export logic to support versioning based on the profiling level and improves robustness in the swimlane converter by handling optional fanout data. I have no feedback to provide as there are no review comments to evaluate.
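The converter-robustness change the summary mentions — tolerating optional fanout data in older exports — could be handled with a defaulted lookup. A sketch under assumed record/field names (not the actual swimlane_converter.py API):

```python
SUPPORTED_VERSIONS = {0, 1, 2}  # version 0 is newly accepted by this PR

def convert_record(record: dict) -> dict:
    """Normalize one exported task record; field names are illustrative."""
    version = record.get("version", 0)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported swimlane JSON version: {version}")
    return {
        "start": record["start"],
        "end": record["end"],
        # Level-1 exports (version 0) carry no fanout edges; default to [].
        "fanout": record.get("fanout", []),
    }
```

The key point is that a version-0 record with no `fanout` key converts cleanly instead of raising `KeyError`.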

@indigo1973 indigo1973 removed this from pto project Apr 10, 2026
@indigo1973 indigo1973 force-pushed the prof_0409 branch 2 times, most recently from 0b39cb0 to 5b10ea4 on April 15, 2026 07:02
@indigo1973
Contributor Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request replaces the boolean profiling flag with a multi-level integer perf_level (0-3) to provide granular control over performance data collection, ranging from AICore-only to full profiling with phase data. The changes span Python CLI tools, C++ runtime structures, and shared memory allocation logic in the PerformanceCollector. Review feedback highlights opportunities to simplify redundant logical checks in the performance-critical scheduler path where perf_level is validated alongside derived boolean flags.

Comment on lines +857 to 859
if (task_recording_enabled && runtime->perf_level >= 2) {
core_exec_state.pending_dispatch_timestamp = get_sys_cnt_aicpu();
}

medium

The condition task_recording_enabled && runtime->perf_level >= 2 is redundant. Since task_recording_enabled is true when runtime->perf_level > 0, this check is equivalent to just runtime->perf_level >= 2. Simplifying this avoids redundant checks on a performance-critical path.

            if (runtime->perf_level >= 2) {
                core_exec_state.pending_dispatch_timestamp = get_sys_cnt_aicpu();
            }
References
  1. On performance-critical paths, avoid redundant validation checks for variables that are guaranteed to be valid by the logic at their point of entry.

Comment on lines +866 to 868
if (task_recording_enabled && runtime->perf_level >= 2) {
core_exec_state.running_dispatch_timestamp = get_sys_cnt_aicpu();
}

medium

Similar to the previous check, the condition task_recording_enabled && runtime->perf_level >= 2 is redundant and can be simplified to runtime->perf_level >= 2 to improve performance and readability on this hot path.

            if (runtime->perf_level >= 2) {
                core_exec_state.running_dispatch_timestamp = get_sys_cnt_aicpu();
            }
References
  1. Avoid redundant data validation checks on performance-critical paths if the data has already been validated or is logically guaranteed.
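The redundancy the bot flags in both spots follows directly from how the derived flag is defined: `task_recording_enabled` is true exactly when `perf_level > 0`, so it is implied by `perf_level >= 2`. A Python stand-in for the C++ logic (variable names taken from the PR description; the check itself is a hypothetical reconstruction) makes the equivalence easy to verify:

```python
def derived_flags(perf_level: int) -> tuple[bool, bool]:
    # Mirrors the split described in the PR: task recording at level >= 1,
    # phase recording only at level >= 3.
    task_recording_enabled = perf_level > 0
    phase_recording_enabled = perf_level >= 3
    return task_recording_enabled, phase_recording_enabled

# The compound condition and the simplified one agree at every level.
for level in range(4):
    task, _ = derived_flags(level)
    assert (task and level >= 2) == (level >= 2)
```

Since the two predicates coincide for all levels 0–3, dropping `task_recording_enabled` from the hot-path condition changes nothing semantically.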

@ChaoWao
Collaborator

ChaoWao commented Apr 21, 2026

@indigo1973 Rebase isn't viable, and the design wants refinement before landing.

Rebase blocked by

30+ commits of drift. Key blocker: 7740dd2 (Refactor: Optimize the AICPU scheduler hot loop) rewrote the exact scheduler loop this PR targets, producing a 267-line single-block conflict in tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp where main's side is essentially empty — the code this PR edits largely no longer exists. Plus d72c991 removed run_example.py / code_runner.py, 6881fcc dropped the Dist* prefix, e4188cc rewrote performance_collector.cpp for per-subprocess output. Pushing the rebase through = re-implementing on main's new structure while carrying this PR's commit hash.

What to keep

  • Tiered profiling instead of boolean
  • Levels: AICore-only → +dispatch/finish/fanout → +AICPU phase buffers
  • Phase-buffer allocation gated by level
  • JSON schema version tied to level
  • CLI --enable-profiling[=LEVEL]

What to change

1. Pack all DFX toggles into uint64_t dfx_flags — 1 bit per feature. One shared dfx_flags.h, host / AICPU / AICore all include the same layout:

[0] chip_swimlane_aicore_timing    # AICore task start/end
[1] chip_swimlane_task_lifecycle   # dispatch + finish + fanout (implies [0])
[2] chip_swimlane_phase_buffers    # AICPU phases            (implies [1])
[3] dump_tensor_enable
[4:] reserved — one bit per future DFX toggle

Today's int perf_level decomposes into bits 0..2. Python keeps chip_swimlane_level: int as a user-facing convenience that maps to bits; C++ stores and transports raw bits. Adding a new DFX feature = one more bit, no signature changes on ChipCallConfig / pto_runtime_c_api / Runtime / KernelArgs. The HasEnableProfilingMember SFINAE helper goes away — no dual enable_profiling / perf_level fields to keep in sync.
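The level→bits decomposition proposed in point 1 could be sketched as follows; bit positions come from the layout above, while the function name is hypothetical. Each level sets all lower swimlane bits, matching the "implies" notes in the layout:

```python
# Bit positions from the proposed dfx_flags layout.
CHIP_SWIMLANE_AICORE_TIMING  = 1 << 0  # AICore task start/end
CHIP_SWIMLANE_TASK_LIFECYCLE = 1 << 1  # dispatch + finish + fanout
CHIP_SWIMLANE_PHASE_BUFFERS  = 1 << 2  # AICPU phases
DUMP_TENSOR_ENABLE           = 1 << 3

def chip_swimlane_level_to_dfx_flags(level: int) -> int:
    """Map the user-facing 0-3 level onto the low three flag bits."""
    flags = 0
    if level >= 1:
        flags |= CHIP_SWIMLANE_AICORE_TIMING
    if level >= 2:
        flags |= CHIP_SWIMLANE_TASK_LIFECYCLE
    if level >= 3:
        flags |= CHIP_SWIMLANE_PHASE_BUFFERS
    return flags
```

With this mapping, C++ only ever transports the raw `uint64_t`, and a future DFX feature is a new constant rather than a new struct field.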

2. Rename perf_level → chip_swimlane_level. "Profiling" is too generic — this feature specifically produces L2 chip-scope swimlane output. Reserving the name leaves dist_swimlane_level available for L3+. CLI: --chip-swimlane[=LEVEL].

3. Separate PR: ChipCallConfig → CallConfig. The struct is used by ChipWorker (L2) and hierarchical::Worker / Orchestrator (L3+). The Chip prefix is misleading. Trivial rename, independent of the above.

Path forward

Close this PR, replace with three independent PRs co-authored with you:

  1. ChipCallConfig → CallConfig (pure rename)
  2. dfx_flags.h + chip_swimlane_level + tiered AICPU phase recording (carries this PR's intent: tiered levels, phase gating, JSON versioning)
  3. PerformanceCollector → ChipSwimlaneCollector + CLI rename
