
Add: examples/workers/l3/allreduce_distributed example #307

Open
PKUZHOU wants to merge 2 commits into hw-native-sys:main from PKUZHOU:main

Conversation

@PKUZHOU

@PKUZHOU PKUZHOU commented Mar 17, 2026

Closes #303. With the original PR split into main layer by layer, this PR wraps the line of work up as the end-to-end demo.

What this PR contains now

One 2-card allreduce example plus the matching hardware ST, exercising the full L1a..L6 stack:

  • examples/workers/l3/allreduce_distributed/
    • kernels/aiv/allreduce_kernel.cpp —— the original kernel by @PKUZHOU / echo_stone, reused as-is except for one include-path change ("common/comm_context.h" → "platform_comm/comm_context.h"), matching the header's location after L1b moved it under src/common/platform_comm/.
    • kernels/orchestration/allreduce_orch.cpp —— the orchestration passes the 5 scalars in ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx) straight through to the AIV task. Scalars are used to sidestep the Tensor path's pointer wrapping (the Tensor path rewrites the value into a Tensor struct address, so the kernel's reinterpret_cast recovers the wrong pointer).
    • main.py —— the 2-card harness: per-rank input is fed into the HCCL window via SharedMemory + HostBufferStaging during the bootstrap_context phase → Worker.init() forks + bootstraps → orch_fn does add_scalar × 5 per chip and submits via submit_next_level → worker.copy_from reads the output back and checks it against golden.
  • tests/st/workers_l3/test_allreduce_distributed_hw.py —— marked device_count(2) + platforms(["a2a3"]) so the st-onboard-a2a3 CI job launches main() automatically.
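The SharedMemory hand-off used by main.py can be sketched with the stdlib alone. This is a minimal sketch of the pattern, not the PR's actual API: stage_input, read_input, and the segment name are illustrative names.

```python
from multiprocessing import shared_memory

def stage_input(name: str, payload: bytes) -> shared_memory.SharedMemory:
    """Parent side: create a named segment and copy the per-rank input in."""
    shm = shared_memory.SharedMemory(name=name, create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    return shm

def read_input(name: str, size: int) -> bytes:
    """Child side (post-fork): attach to the segment and copy the payload out."""
    shm = shared_memory.SharedMemory(name=name)
    data = bytes(shm.buf[:size])
    shm.close()
    return data

# Parent stages the input, the forked child reads it, parent unlinks after init.
payload = bytes(range(16))
shm = stage_input("demo_rank0_input", payload)
assert read_input("demo_rank0_input", len(payload)) == payload
shm.close()
shm.unlink()  # mirrors unlinking the shm once Worker.init() has completed
```

On Linux the segment surfaces as a file under /dev/shm until unlink, which is why the test plan below checks for leftovers there.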

Context for this line of work

The original #307 implemented the same functionality, but via a subprocess + ctypes harness (distributed_worker.py / distributed_code_runner.py) and a separate DISTRIBUTED_CONFIG / run_example.py path. While splitting the work into 7 reviewable layers, we decided that distributed execution should go through the framework mainline (Worker(chip_bootstrap_configs=...)) rather than keeping a second entry point. This PR therefore force-pushed an implementation rebased on the latest main onto the head branch; the original two commits remain reachable by SHA (7a8bafd, 31c2030), but the PR Files view shows the new content. Authorship is preserved via the commit trailer (Co-authored-by: echo_stone <liulei281@huawei.com>).

The seven layers already merged:

Layer PR Purpose
L1a #592 HCCL backend (comm_hccl.cpp) + C++ hardware UT
L1b #597 sim backend + ChipWorker.comm_* Python wrappers
L4 #605 child-worker error propagation to Worker.run()
L2 #608 ChipBootstrapChannel parent/child handshake mailbox
L3 #609 acquire/release atomic helpers for the mailbox int32 state
L5 #610 ChipWorker.bootstrap_context() one-shot per-chip bring-up
L6 #613 Worker(chip_bootstrap_configs=...)-level orchestration + ChipContext

The CommDeviceContext ABI and the CommRemotePtr design come entirely from the original PR and are credited separately in each layer's PR body.

Known points still to verify

At submission time this PR has only gone through local static checks (AST parse + name cross-check); nothing has been compiled or run. The following points need verification on a 2-card a2a3 environment:

  1. Whether ChipCallable.build(signature=[]) plus a TaskArgs carrying only 5 scalars is accepted by the dispatch path (expected_arg_count=5 validates tensor_count+scalar_count, so it should be fine in theory, but it needs a real run)
  2. Whether the kernel's cross-rank windowsIn[pe] MTE2 reads work after bootstrap completes (the L6 hw smoke test only covered bootstrap and never performed a real MTE2)
  3. There is no explicit comm_barrier call; we rely on each rank's copy_to having landed by the time both ranks finish bootstrap_context. If this gets blocked by HCCL error 507018, it degrades into "bootstrap succeeds but the kernel reads dirty data"
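For point 3 specifically, a dirty read is at least detectable at the harness level: the golden for allreduce(sum) is simply the elementwise sum of all ranks' inputs, identical on every rank. A pure-Python sketch of that check (illustrative; the PR's actual golden comparison lives in main.py):

```python
def allreduce_golden(per_rank_inputs):
    """Expected allreduce(sum) result: elementwise sum over ranks,
    the same on every rank."""
    return [sum(vals) for vals in zip(*per_rank_inputs)]

def verify(per_rank_inputs, per_rank_outputs):
    """Every rank's output must equal the golden; a rank that raced past
    bootstrap with a stale window shows up as a mismatch here."""
    golden = allreduce_golden(per_rank_inputs)
    return all(out == golden for out in per_rank_outputs)

inputs = [[1, 2, 3], [10, 20, 30]]   # rank 0 input, rank 1 input
assert allreduce_golden(inputs) == [11, 22, 33]
assert verify(inputs, [[11, 22, 33], [11, 22, 33]])
assert not verify(inputs, [[11, 22, 33], [0, 0, 0]])  # rank 1 read dirty data
```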

Test plan

  • Bring up python examples/workers/l3/allreduce_distributed/main.py -d 0-1 locally on a 2-card a2a3 machine
  • pytest tests/st/workers_l3/test_allreduce_distributed_hw.py --platform a2a3 --device 0-1 passes
  • st-onboard-a2a3 CI is green
  • /dev/shm and /tmp/pto_allreduce_distributed_rootinfo_* are cleaned up after worker.close()
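The last checklist item is easy to script. A small sweep sketch: the rootinfo pattern comes from this PR body, while the "*allreduce*" filter under /dev/shm is an assumption about segment naming, and the directories are parametrized so the helper can be exercised anywhere:

```python
import glob
import os

def leftover_artifacts(shm_dir="/dev/shm",
                       rootinfo_glob="/tmp/pto_allreduce_distributed_rootinfo_*"):
    """Return files that should be gone after worker.close().

    On Linux, multiprocessing.shared_memory segments surface as files
    under /dev/shm; the name filter here is an assumed convention.
    """
    shm_hits = glob.glob(os.path.join(shm_dir, "*allreduce*"))
    rootinfo_hits = glob.glob(rootinfo_glob)
    return sorted(set(shm_hits) | set(rootinfo_hits))

# After worker.close(), this should hold:
# assert leftover_artifacts() == []
```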

Related

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust framework for distributed kernel execution on Ascend NPUs, leveraging HCCL for inter-device communication. It provides a flexible Python-based orchestration layer to manage the lifecycle of distributed tasks across various simpler runtimes, from compilation to verification. The core C++ worker is designed to be generic, abstracting away the complexities of device-specific setup and focusing on executing kernels efficiently in a multi-card environment.

Highlights

  • Distributed Execution Framework: Introduced a new distributed extension for simpler, enabling multi-card kernel execution on Ascend NPU environments using HCCL for RDMA communication.
  • Multi-Runtime Support: Implemented support for three simpler runtimes: host_build_graph, aicpu_build_graph, and tensormap_and_ringbuffer, allowing flexibility in orchestration and scheduling.
  • Generic Worker Process: Developed a generic C++ distributed_worker executable that handles HCCL initialization, resource allocation, and delegates kernel execution to simpler's runtime via dynamic loading, making it case-agnostic.
  • Python Orchestration Layer: Added a Python DistributedRunner class and CLI entry point (run_distributed_example.py) to orchestrate the compilation, data preparation, worker launching, and result verification for distributed examples.
  • Distributed TREDUCE Examples: Provided comprehensive examples for distributed TREDUCE (collective reduce sum) across 8 cards, validated for all three supported runtimes, demonstrating the framework's capabilities.
  • Zero Intrusion Design: Ensured that the new distributed functionality is implemented without modifying existing core simpler source code, maintaining backward compatibility and modularity.


Changelog
  • distributed/CMakeLists.txt
    • Added CMake configuration for building the distributed_worker executable, including finding necessary Ascend CANN SDK libraries like ACL and HCCL.
  • distributed/README.md
    • Added comprehensive documentation in Chinese, detailing the architecture, supported runtimes, relationship to existing simpler code, directory structure, quick start guide, CLI parameters, and instructions for adding new distributed kernels.
  • distributed/include/hccl_context.h
    • Added the HcclDeviceContext structure definition, which describes the device-side communication context for HCCL, including RDMA window addresses.
  • distributed/python/distributed_runner.py
    • Added the DistributedRunner Python class, responsible for compiling artifacts, preparing per-rank input data, building the C++ worker, launching distributed processes, and verifying results against golden references.
  • distributed/python/run_distributed_example.py
    • Added a Python CLI entry point for running distributed simpler examples, providing options for kernel directories, golden files, platform, number of ranks, and various execution control flags.
  • distributed/src/distributed_worker.cpp
    • Added the distributed_worker C++ executable, which handles per-device processes, including ACL/HCCL initialization, RootInfo exchange, HCCL resource allocation, dynamic loading of simpler's runtime, and execution of kernels based on CLI arguments.
  • examples/a2a3/aicpu_build_graph/treduce_distributed/golden.py
    • Added a Python script defining the golden reference for distributed TREDUCE, including functions to generate per-rank inputs and compute the expected golden output for verification.
  • examples/a2a3/aicpu_build_graph/treduce_distributed/kernels/aiv/treduce_kernel.cpp
    • Added an AIV kernel implementation for distributed TREDUCE, utilizing PTO communication instructions and HcclDeviceContext for remote memory access.
  • examples/a2a3/aicpu_build_graph/treduce_distributed/kernels/kernel_config.py
    • Added kernel configuration for the aicpu_build_graph runtime, specifying orchestration source, kernel details, runtime parameters, and distributed configuration including buffer layouts and arguments.
  • examples/a2a3/aicpu_build_graph/treduce_distributed/kernels/orchestration/treduce_orch.cpp
    • Added a C++ orchestration plugin for the aicpu_build_graph runtime, which reads arguments and builds a single AIV task using the aicpu_build_api.
  • examples/a2a3/host_build_graph/treduce_distributed/golden.py
    • Added a Python script defining the golden reference for distributed TREDUCE, including functions to generate per-rank inputs and compute the expected golden output for verification.
  • examples/a2a3/host_build_graph/treduce_distributed/kernels/aiv/treduce_kernel.cpp
    • Added an AIV kernel implementation for distributed TREDUCE, utilizing PTO communication instructions and HcclDeviceContext for remote memory access.
  • examples/a2a3/host_build_graph/treduce_distributed/kernels/kernel_config.py
    • Added kernel configuration for the host_build_graph runtime, specifying orchestration source, kernel details, runtime parameters, and distributed configuration including buffer layouts and arguments.
  • examples/a2a3/host_build_graph/treduce_distributed/kernels/orchestration/treduce_orch.cpp
    • Added a C++ orchestration function for the host_build_graph runtime, which processes device pointers and scalars to create a single AIV task.
  • examples/a2a3/tensormap_and_ringbuffer/treduce_distributed/golden.py
    • Added a Python script defining the golden reference for distributed TREDUCE, including functions to generate per-rank inputs and compute the expected golden output for verification.
  • examples/a2a3/tensormap_and_ringbuffer/treduce_distributed/kernels/aiv/treduce_kernel.cpp
    • Added an AIV kernel implementation for distributed TREDUCE, utilizing PTO communication instructions and HcclDeviceContext for remote memory access.
  • examples/a2a3/tensormap_and_ringbuffer/treduce_distributed/kernels/kernel_config.py
    • Added kernel configuration for the tensormap_and_ringbuffer runtime, specifying orchestration source, kernel details, runtime parameters, and distributed configuration including buffer layouts and arguments.
  • examples/a2a3/tensormap_and_ringbuffer/treduce_distributed/kernels/orchestration/treduce_orch.cpp
    • Added a C++ orchestration function for the tensormap_and_ringbuffer runtime, which uses the PTO2 API to submit an AIV task with scalar parameters.


@gemini-code-assist bot left a comment


Code Review

This PR introduces a complete distributed execution framework for the simpler project, which is a well-structured and significant addition. It includes a generic C++ worker, Python-based orchestration scripts, and examples for three different runtimes. The overall design is clean and honors the "zero intrusion" principle toward the existing codebase.

My review found several areas for improvement, mainly around correctness, maintainability, and efficiency. Key points:

  • Default dtype handling in the Python runner can cause silent data corruption.
  • The C++ worker performs unnecessary memory copies.
  • The C++ worker uses magic numbers and handles HCCL-internal data structures in a fragile way.
  • The example kernels hard-code a rank-count limit, which can produce incorrect results.
  • There is heavy code duplication across the example directories (e.g. golden.py and treduce_kernel.cpp are identical in all three examples). Consider extracting these shared files to a common location.

Addressing these points will make the new distributed framework more robust and maintainable. Overall, this is an excellent contribution.

Comment thread examples/scripts/distributed_code_runner.py Outdated
Comment thread distributed/src/distributed_worker.cpp Outdated
Comment thread distributed/src/distributed_worker.cpp Outdated
Comment thread distributed/src/distributed_worker.cpp Outdated
Comment thread examples/a2a3/host_build_graph/treduce_distributed/kernels/aiv/treduce_kernel.cpp Outdated
Comment thread examples/scripts/distributed_worker.py Outdated
Comment thread examples/scripts/run_example.py Outdated
PKUZHOU pushed a commit to PKUZHOU/simpler that referenced this pull request Mar 18, 2026
- validate distributed buffer metadata and simplify output verification
- support explicit device selection in run_example.py and ci.sh for CI
- shrink treduce examples to 4 ranks, remove stale config, and guard invalid rank/root values
- rename the per-rank helper to distributed_worker.py and document buffer layout

Made-with: Cursor
PKUZHOU pushed a commit to PKUZHOU/simpler that referenced this pull request Mar 18, 2026
- add backend-agnostic `comm_*` host APIs plus a2a3/a5 hardware and sim implementations so distributed runs share one communication abstraction
- add Python bindings, distributed runner orchestration, and per-rank worker support to drive multi-rank examples through `run_example.py`
- add distributed treduce examples for all three runtimes and fold in the PR hw-native-sys#307 review fixes for CI-friendly rank counts, explicit device selection, and stronger validation

Made-with: Cursor
@ChaoWao ChaoWao marked this pull request as draft March 19, 2026 07:39
PKUZHOU added a commit to PKUZHOU/simpler that referenced this pull request Apr 3, 2026
- add backend-agnostic `comm_*` host APIs plus a2a3/a5 hardware and sim implementations so distributed runs share one communication abstraction
- add Python bindings, distributed runner orchestration, and per-rank worker support to drive multi-rank examples through `run_example.py`
- add distributed treduce examples for all three runtimes and fold in the PR hw-native-sys#307 review fixes for CI-friendly rank counts, explicit device selection, and stronger validation

Made-with: Cursor
Wires up the distributed stack assembled from hw-native-sys#592 hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609 hw-native-sys#610 hw-native-sys#613.
Via Worker(level=3, chip_bootstrap_configs=...), each of the two cards
sums all ranks' inputs over cross-rank MTE2 through CommRemotePtr,
writes the result back to its own output, and reads it back for
verification with worker.copy_from.

Files:
- kernels/aiv/allreduce_kernel.cpp —— taken verbatim from hw-native-sys#307
  (PKUZHOU / echo_stone), with one include-path change
  ("common/comm_context.h" → "platform_comm/comm_context.h") matching
  the header location after the L1b move.
- kernels/orchestration/allreduce_orch.cpp —— passes the 5 scalars in
  ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx)
  straight through to the AIV task, avoiding the Tensor wrapping
  (the Tensor path rewrites pointers).
- main.py —— 2-card harness: per-rank input goes through SharedMemory +
  HostBufferStaging into the window during bootstrap, and the shm is
  unlinked after init; orch_fn does add_scalar × 5 per chip and submits
  via submit_next_level; copy_from reads the output back for verification.
- tests/st/workers_l3/test_allreduce_distributed_hw.py —— marked
  device_count(2) + platforms(["a2a3"]) so st-onboard-a2a3 launches
  main() automatically.

WIP: only local static checks so far (AST parse + import-name
cross-check); not compiled, not run. Next step is bring-up on a 2-card
a2a3 environment; the known points to verify are in the PR body.

Co-authored-by: echo_stone <liulei281@huawei.com>
@ChaoWao ChaoWao marked this pull request as ready for review April 21, 2026 08:59
@ChaoWao ChaoWao changed the title Distributed HCCL harness and examples for three runtimes Add: examples/workers/l3/allreduce_distributed 端到端 demo (WIP) Apr 21, 2026
…s-fork output readback

- Fix type name: CommDeviceContext → CommContext (matching platform_comm/comm_context.h)
- Implement store_to_host in chip child's main loop so post-kernel output is
  flushed to SharedMemory, working around copy_from IPC being broken across fork
- Use SharedMemory for output readback in main.py instead of worker.copy_from

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChaoWao ChaoWao changed the title Add: examples/workers/l3/allreduce_distributed 端到端 demo (WIP) Add: examples/workers/l3/allreduce_distributed example Apr 21, 2026


Development

Successfully merging this pull request may close these issues.

L1-L4 distributed multi-card support
