
Add: examples/workers/l3/allreduce_distributed example #307

Open
PKUZHOU wants to merge 2 commits into hw-native-sys:main from PKUZHOU:main

Conversation

@PKUZHOU

@PKUZHOU PKUZHOU commented Mar 17, 2026

Closes #303. With the original PR split into main layer by layer, this PR wraps the line of work up as the end-to-end demo.

What this PR contains now

One 2-card allreduce example plus the matching hardware ST, exercising the full L1a..L6 stack:

  • examples/workers/l3/allreduce_distributed/
    • kernels/aiv/allreduce_kernel.cpp —— the original kernel by @PKUZHOU / echo_stone, reused as-is except for one include-path change ("common/comm_context.h" → "platform_comm/comm_context.h"), matching the header's location after L1b moved it under src/common/platform_comm/.
    • kernels/orchestration/allreduce_orch.cpp —— the orchestration passes the 5 scalars in ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx) straight through to the AIV task. Scalars are used to sidestep the Tensor path's pointer wrapping (the Tensor path rewrites the value into a Tensor struct address, so the kernel's reinterpret_cast recovers the wrong pointer).
    • main.py —— the 2-card harness: per-rank input is fed into the HCCL window via SharedMemory + HostBufferStaging during the bootstrap_context phase → Worker.init() forks + bootstraps → orch_fn does add_scalar × 5 per chip and submits via submit_next_level → worker.copy_from reads the output back and checks it against golden.
  • tests/st/workers_l3/test_allreduce_distributed_hw.py —— marked device_count(2) + platforms(["a2a3"]) so the st-onboard-a2a3 CI job launches main() automatically.
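The SharedMemory hand-off used by main.py can be sketched with the stdlib alone. This is a minimal sketch of the pattern, not the PR's actual API: stage_input, read_input, and the segment name are illustrative names.

```python
from multiprocessing import shared_memory

def stage_input(name: str, payload: bytes) -> shared_memory.SharedMemory:
    """Parent side: create a named segment and copy the per-rank input in."""
    shm = shared_memory.SharedMemory(name=name, create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    return shm

def read_input(name: str, size: int) -> bytes:
    """Child side (post-fork): attach to the segment and copy the payload out."""
    shm = shared_memory.SharedMemory(name=name)
    data = bytes(shm.buf[:size])
    shm.close()
    return data

# Parent stages the input, the forked child reads it, parent unlinks after init.
payload = bytes(range(16))
shm = stage_input("demo_rank0_input", payload)
assert read_input("demo_rank0_input", len(payload)) == payload
shm.close()
shm.unlink()  # mirrors unlinking the shm once Worker.init() has completed
```

On Linux the segment surfaces as a file under /dev/shm until unlink, which is why the test plan below checks for leftovers there.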

Context for this line of work

The original #307 implemented the same functionality, but via a subprocess + ctypes harness (distributed_worker.py / distributed_code_runner.py) and a separate DISTRIBUTED_CONFIG / run_example.py path. While splitting the work into 7 reviewable layers, we decided that distributed execution should go through the framework mainline (Worker(chip_bootstrap_configs=...)) rather than keeping a second entry point. This PR therefore force-pushed an implementation rebased on the latest main onto the head branch; the original two commits remain reachable by SHA (7a8bafd, 31c2030), but the PR Files view shows the new content. Authorship is preserved via the commit trailer (Co-authored-by: echo_stone <liulei281@huawei.com>).

The seven layers already merged:

Layer PR Purpose
L1a #592 HCCL backend (comm_hccl.cpp) + C++ hardware UT
L1b #597 sim backend + ChipWorker.comm_* Python wrappers
L4 #605 child-worker error propagation to Worker.run()
L2 #608 ChipBootstrapChannel parent/child handshake mailbox
L3 #609 acquire/release atomic helpers for the mailbox int32 state
L5 #610 ChipWorker.bootstrap_context() one-shot per-chip bring-up
L6 #613 Worker(chip_bootstrap_configs=...)-level orchestration + ChipContext

The CommDeviceContext ABI and the CommRemotePtr design come entirely from the original PR and are credited separately in each layer's PR body.

Known points still to verify

At submission time this PR has only gone through local static checks (AST parse + name cross-check); nothing has been compiled or run. The following points need verification on a 2-card a2a3 environment:

  1. Whether ChipCallable.build(signature=[]) plus a TaskArgs carrying only 5 scalars is accepted by the dispatch path (expected_arg_count=5 validates tensor_count+scalar_count, so it should be fine in theory, but it needs a real run)
  2. Whether the kernel's cross-rank windowsIn[pe] MTE2 reads work after bootstrap completes (the L6 hw smoke test only covered bootstrap and never performed a real MTE2)
  3. There is no explicit comm_barrier call; we rely on each rank's copy_to having landed by the time both ranks finish bootstrap_context. If this gets blocked by HCCL error 507018, it degrades into "bootstrap succeeds but the kernel reads dirty data"
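For point 3 specifically, a dirty read is at least detectable at the harness level: the golden for allreduce(sum) is simply the elementwise sum of all ranks' inputs, identical on every rank. A pure-Python sketch of that check (illustrative; the PR's actual golden comparison lives in main.py):

```python
def allreduce_golden(per_rank_inputs):
    """Expected allreduce(sum) result: elementwise sum over ranks,
    the same on every rank."""
    return [sum(vals) for vals in zip(*per_rank_inputs)]

def verify(per_rank_inputs, per_rank_outputs):
    """Every rank's output must equal the golden; a rank that raced past
    bootstrap with a stale window shows up as a mismatch here."""
    golden = allreduce_golden(per_rank_inputs)
    return all(out == golden for out in per_rank_outputs)

inputs = [[1, 2, 3], [10, 20, 30]]   # rank 0 input, rank 1 input
assert allreduce_golden(inputs) == [11, 22, 33]
assert verify(inputs, [[11, 22, 33], [11, 22, 33]])
assert not verify(inputs, [[11, 22, 33], [0, 0, 0]])  # rank 1 read dirty data
```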

Test plan

  • Bring up python examples/workers/l3/allreduce_distributed/main.py -d 0-1 locally on a 2-card a2a3 machine
  • pytest tests/st/workers_l3/test_allreduce_distributed_hw.py --platform a2a3 --device 0-1 passes
  • st-onboard-a2a3 CI is green
  • /dev/shm and /tmp/pto_allreduce_distributed_rootinfo_* are cleaned up after worker.close()
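The last checklist item is easy to script. A small sweep sketch: the rootinfo pattern comes from this PR body, while the "*allreduce*" filter under /dev/shm is an assumption about segment naming, and the directories are parametrized so the helper can be exercised anywhere:

```python
import glob
import os

def leftover_artifacts(shm_dir="/dev/shm",
                       rootinfo_glob="/tmp/pto_allreduce_distributed_rootinfo_*"):
    """Return files that should be gone after worker.close().

    On Linux, multiprocessing.shared_memory segments surface as files
    under /dev/shm; the name filter here is an assumed convention.
    """
    shm_hits = glob.glob(os.path.join(shm_dir, "*allreduce*"))
    rootinfo_hits = glob.glob(rootinfo_glob)
    return sorted(set(shm_hits) | set(rootinfo_hits))

# After worker.close(), this should hold:
# assert leftover_artifacts() == []
```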

Related

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust framework for distributed kernel execution on Ascend NPUs, leveraging HCCL for inter-device communication. It provides a flexible Python-based orchestration layer to manage the lifecycle of distributed tasks across various simpler runtimes, from compilation to verification. The core C++ worker is designed to be generic, abstracting away the complexities of device-specific setup and focusing on executing kernels efficiently in a multi-card environment.

Highlights

  • Distributed Execution Framework: Introduced a new distributed extension for simpler, enabling multi-card kernel execution on Ascend NPU environments using HCCL for RDMA communication.
  • Multi-Runtime Support: Implemented support for three simpler runtimes: host_build_graph, aicpu_build_graph, and tensormap_and_ringbuffer, allowing flexibility in orchestration and scheduling.
  • Generic Worker Process: Developed a generic C++ distributed_worker executable that handles HCCL initialization, resource allocation, and delegates kernel execution to simpler's runtime via dynamic loading, making it case-agnostic.
  • Python Orchestration Layer: Added a Python DistributedRunner class and CLI entry point (run_distributed_example.py) to orchestrate the compilation, data preparation, worker launching, and result verification for distributed examples.
  • Distributed TREDUCE Examples: Provided comprehensive examples for distributed TREDUCE (collective reduce sum) across 8 cards, validated for all three supported runtimes, demonstrating the framework's capabilities.
  • Zero Intrusion Design: Ensured that the new distributed functionality is implemented without modifying existing core simpler source code, maintaining backward compatibility and modularity.


Changelog
  • distributed/CMakeLists.txt
    • Added CMake configuration for building the distributed_worker executable, including finding necessary Ascend CANN SDK libraries like ACL and HCCL.
  • distributed/README.md
    • Added comprehensive documentation in Chinese, detailing the architecture, supported runtimes, relationship to existing simpler code, directory structure, quick start guide, CLI parameters, and instructions for adding new distributed kernels.
  • distributed/include/hccl_context.h
    • Added the HcclDeviceContext structure definition, which describes the device-side communication context for HCCL, including RDMA window addresses.
  • distributed/python/distributed_runner.py
    • Added the DistributedRunner Python class, responsible for compiling artifacts, preparing per-rank input data, building the C++ worker, launching distributed processes, and verifying results against golden references.
  • distributed/python/run_distributed_example.py
    • Added a Python CLI entry point for running distributed simpler examples, providing options for kernel directories, golden files, platform, number of ranks, and various execution control flags.
  • distributed/src/distributed_worker.cpp
    • Added the distributed_worker C++ executable, which handles per-device processes, including ACL/HCCL initialization, RootInfo exchange, HCCL resource allocation, dynamic loading of simpler's runtime, and execution of kernels based on CLI arguments.
  • examples/a2a3/aicpu_build_graph/treduce_distributed/golden.py
    • Added a Python script defining the golden reference for distributed TREDUCE, including functions to generate per-rank inputs and compute the expected golden output for verification.
  • examples/a2a3/aicpu_build_graph/treduce_distributed/kernels/aiv/treduce_kernel.cpp
    • Added an AIV kernel implementation for distributed TREDUCE, utilizing PTO communication instructions and HcclDeviceContext for remote memory access.
  • examples/a2a3/aicpu_build_graph/treduce_distributed/kernels/kernel_config.py
    • Added kernel configuration for the aicpu_build_graph runtime, specifying orchestration source, kernel details, runtime parameters, and distributed configuration including buffer layouts and arguments.
  • examples/a2a3/aicpu_build_graph/treduce_distributed/kernels/orchestration/treduce_orch.cpp
    • Added a C++ orchestration plugin for the aicpu_build_graph runtime, which reads arguments and builds a single AIV task using the aicpu_build_api.
  • examples/a2a3/host_build_graph/treduce_distributed/golden.py
    • Added a Python script defining the golden reference for distributed TREDUCE, including functions to generate per-rank inputs and compute the expected golden output for verification.
  • examples/a2a3/host_build_graph/treduce_distributed/kernels/aiv/treduce_kernel.cpp
    • Added an AIV kernel implementation for distributed TREDUCE, utilizing PTO communication instructions and HcclDeviceContext for remote memory access.
  • examples/a2a3/host_build_graph/treduce_distributed/kernels/kernel_config.py
    • Added kernel configuration for the host_build_graph runtime, specifying orchestration source, kernel details, runtime parameters, and distributed configuration including buffer layouts and arguments.
  • examples/a2a3/host_build_graph/treduce_distributed/kernels/orchestration/treduce_orch.cpp
    • Added a C++ orchestration function for the host_build_graph runtime, which processes device pointers and scalars to create a single AIV task.
  • examples/a2a3/tensormap_and_ringbuffer/treduce_distributed/golden.py
    • Added a Python script defining the golden reference for distributed TREDUCE, including functions to generate per-rank inputs and compute the expected golden output for verification.
  • examples/a2a3/tensormap_and_ringbuffer/treduce_distributed/kernels/aiv/treduce_kernel.cpp
    • Added an AIV kernel implementation for distributed TREDUCE, utilizing PTO communication instructions and HcclDeviceContext for remote memory access.
  • examples/a2a3/tensormap_and_ringbuffer/treduce_distributed/kernels/kernel_config.py
    • Added kernel configuration for the tensormap_and_ringbuffer runtime, specifying orchestration source, kernel details, runtime parameters, and distributed configuration including buffer layouts and arguments.
  • examples/a2a3/tensormap_and_ringbuffer/treduce_distributed/kernels/orchestration/treduce_orch.cpp
    • Added a C++ orchestration function for the tensormap_and_ringbuffer runtime, which uses the PTO2 API to submit an AIV task with scalar parameters.


@gemini-code-assist bot left a comment


Code Review

This PR introduces a complete distributed execution framework for the simpler project, which is a well-structured and significant addition. It includes a generic C++ worker, Python-based orchestration scripts, and examples for three different runtimes. The overall design is clean and honors the "zero intrusion" principle toward the existing codebase.

My review found several areas for improvement, mainly around correctness, maintainability, and efficiency. Key points:

  • Default dtype handling in the Python runner can cause silent data corruption.
  • The C++ worker performs unnecessary memory copies.
  • The C++ worker uses magic numbers and handles HCCL-internal data structures in a fragile way.
  • The example kernels hard-code a rank-count limit, which can produce incorrect results.
  • There is heavy code duplication across the example directories (e.g. golden.py and treduce_kernel.cpp are identical in all three examples). Consider extracting these shared files to a common location.

Addressing these points will make the new distributed framework more robust and maintainable. Overall, this is an excellent contribution.

Comment thread examples/scripts/distributed_code_runner.py Outdated
Comment thread distributed/src/distributed_worker.cpp Outdated
Comment thread distributed/src/distributed_worker.cpp Outdated
Comment thread distributed/src/distributed_worker.cpp Outdated
Comment thread examples/a2a3/host_build_graph/treduce_distributed/kernels/aiv/treduce_kernel.cpp Outdated
Comment thread examples/scripts/distributed_worker.py Outdated
Comment thread examples/scripts/run_example.py Outdated
PKUZHOU pushed a commit to PKUZHOU/simpler that referenced this pull request Mar 18, 2026
- validate distributed buffer metadata and simplify output verification
- support explicit device selection in run_example.py and ci.sh for CI
- shrink treduce examples to 4 ranks, remove stale config, and guard invalid rank/root values
- rename the per-rank helper to distributed_worker.py and document buffer layout

Made-with: Cursor
PKUZHOU pushed a commit to PKUZHOU/simpler that referenced this pull request Mar 18, 2026
- add backend-agnostic `comm_*` host APIs plus a2a3/a5 hardware and sim implementations so distributed runs share one communication abstraction
- add Python bindings, distributed runner orchestration, and per-rank worker support to drive multi-rank examples through `run_example.py`
- add distributed treduce examples for all three runtimes and fold in the PR hw-native-sys#307 review fixes for CI-friendly rank counts, explicit device selection, and stronger validation

Made-with: Cursor
@ChaoWao ChaoWao marked this pull request as draft March 19, 2026 07:39
PKUZHOU added a commit to PKUZHOU/simpler that referenced this pull request Apr 3, 2026
- add backend-agnostic `comm_*` host APIs plus a2a3/a5 hardware and sim implementations so distributed runs share one communication abstraction
- add Python bindings, distributed runner orchestration, and per-rank worker support to drive multi-rank examples through `run_example.py`
- add distributed treduce examples for all three runtimes and fold in the PR hw-native-sys#307 review fixes for CI-friendly rank counts, explicit device selection, and stronger validation

Made-with: Cursor
Wires up the distributed stack assembled from hw-native-sys#592 hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609 hw-native-sys#610 hw-native-sys#613.
Via Worker(level=3, chip_bootstrap_configs=...), each of the two cards
sums all ranks' inputs over cross-rank MTE2 through CommRemotePtr,
writes the result back to its own output, and reads it back for
verification with worker.copy_from.

Files:
- kernels/aiv/allreduce_kernel.cpp —— taken verbatim from hw-native-sys#307
  (PKUZHOU / echo_stone), with one include-path change
  ("common/comm_context.h" → "platform_comm/comm_context.h") matching
  the header location after the L1b move.
- kernels/orchestration/allreduce_orch.cpp —— passes the 5 scalars in
  ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx)
  straight through to the AIV task, avoiding the Tensor wrapping
  (the Tensor path rewrites pointers).
- main.py —— 2-card harness: per-rank input goes through SharedMemory +
  HostBufferStaging into the window during bootstrap, and the shm is
  unlinked after init; orch_fn does add_scalar × 5 per chip and submits
  via submit_next_level; copy_from reads the output back for verification.
- tests/st/workers_l3/test_allreduce_distributed_hw.py —— marked
  device_count(2) + platforms(["a2a3"]) so st-onboard-a2a3 launches
  main() automatically.

WIP: only local static checks so far (AST parse + import-name
cross-check); not compiled, not run. Next step is bring-up on a 2-card
a2a3 environment; the known points to verify are in the PR body.

Co-authored-by: echo_stone <liulei281@huawei.com>
@ChaoWao ChaoWao marked this pull request as ready for review April 21, 2026 08:59
@ChaoWao ChaoWao changed the title Distributed HCCL harness and examples for three runtimes Add: examples/workers/l3/allreduce_distributed 端到端 demo (WIP) Apr 21, 2026
…s-fork output readback

- Fix type name: CommDeviceContext → CommContext (matching platform_comm/comm_context.h)
- Implement store_to_host in chip child's main loop so post-kernel output is
  flushed to SharedMemory, working around copy_from IPC being broken across fork
- Use SharedMemory for output readback in main.py instead of worker.copy_from

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChaoWao ChaoWao changed the title Add: examples/workers/l3/allreduce_distributed 端到端 demo (WIP) Add: examples/workers/l3/allreduce_distributed example Apr 21, 2026


Development

Successfully merging this pull request may close these issues.

L1-L4 distributed multi-card support
