Add: examples/workers/l3/allreduce_distributed example #307
PKUZHOU wants to merge 2 commits into hw-native-sys:main
Conversation
Summary of Changes: This pull request introduces a framework for distributed kernel execution on Ascend NPUs, leveraging HCCL for inter-device communication. It provides a Python-based orchestration layer that manages the lifecycle of distributed tasks across the simpler runtimes, from compilation to verification. The core C++ worker is generic: it abstracts away device-specific setup and focuses on executing kernels in a multi-card environment.
Code Review
This PR introduces a complete distributed execution framework to the simpler project — a well-structured and significant addition. It includes a generic C++ worker, Python-based orchestration scripts, and examples for three different runtimes. The overall design is clean and follows the principle of zero intrusion into the existing codebase.

My review found several areas for improvement, mainly around correctness, maintainability, and efficiency. Key points:

- The default dtype handling in the Python runner can silently corrupt data.
- The C++ worker performs an unnecessary memory copy.
- The C++ worker uses magic numbers and handles HCCL-internal data structures in a fragile way.
- The example kernels hard-code a rank-count limit, which can produce incorrect results.
- There is substantial code duplication across the example directories (e.g. golden.py and treduce_kernel.cpp are identical in all three examples). Consider extracting these shared files to a common location.

Addressing these issues will improve the robustness and maintainability of this new distributed framework. Overall, this is an excellent contribution.
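The first review point — silent corruption from default dtype handling — typically arises when a runner assumes one element type while the kernel was compiled for another, so the bytes get reinterpreted without error. A minimal sketch of the kind of guard the review asks for; the function name and runner shape here are illustrative, not from the PR:

```python
import numpy as np

# Hypothetical dtype guard for a Python runner: refuse to launch when the
# host buffer's dtype does not match what the kernel was compiled for,
# instead of silently reinterpreting (or implicitly casting) the bytes.
def check_buffer_dtype(buf: np.ndarray, expected) -> np.ndarray:
    expected = np.dtype(expected)
    if buf.dtype != expected:
        raise TypeError(
            f"kernel expects {expected.name} but host buffer is {buf.dtype.name}; "
            "launching anyway would silently corrupt the reduction"
        )
    return buf

x = np.ones(8, dtype=np.float16)
try:
    check_buffer_dtype(x, np.float32)
except TypeError as e:
    print("rejected:", e)
```

Failing fast at the runner boundary turns a wrong-answer bug into an immediate, diagnosable error.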
- validate distributed buffer metadata and simplify output verification
- support explicit device selection in run_example.py and ci.sh for CI
- shrink treduce examples to 4 ranks, remove stale config, and guard invalid rank/root values
- rename the per-rank helper to distributed_worker.py and document buffer layout

Made-with: Cursor
- add backend-agnostic `comm_*` host APIs plus a2a3/a5 hardware and sim implementations so distributed runs share one communication abstraction
- add Python bindings, distributed runner orchestration, and per-rank worker support to drive multi-rank examples through `run_example.py`
- add distributed treduce examples for all three runtimes and fold in the PR hw-native-sys#307 review fixes for CI-friendly rank counts, explicit device selection, and stronger validation

Made-with: Cursor
End-to-end run of the distributed stack assembled from hw-native-sys#592 hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609 hw-native-sys#610 hw-native-sys#613. Through Worker(level=3, chip_bootstrap_configs=...), each of two cards sums the inputs of all ranks via CommRemotePtr cross-rank MTE2, writes the result back to its own output, and worker.copy_from reads the output back for verification.

Files:
- kernels/aiv/allreduce_kernel.cpp — taken verbatim from hw-native-sys#307 (PKUZHOU / echo_stone); the only change is one include path ("common/comm_context.h" → "platform_comm/comm_context.h") to match the header location after the L1b move.
- kernels/orchestration/allreduce_orch.cpp — passes the 5 scalars in ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx) straight through to the AIV task, bypassing the Tensor wrapper (the Tensor path rewrites pointers).
- main.py — 2-card harness: per-rank input is delivered into the window via SharedMemory + HostBufferStaging during the bootstrap phase; the shm is unlinked after init; orch_fn does add_scalar × 5 per chip and submits to submit_next_level; copy_from reads back the output for verification.
- tests/st/workers_l3/test_allreduce_distributed_hw.py — marked with device_count(2) + platforms(["a2a3"]) so st-onboard-a2a3 automatically runs main().

WIP: only local static checks were done (AST parse + import-name verification); this has not been compiled or run. Next step is to bring it up on a 2-card a2a3 environment; known points to verify are listed in the PR body.

Co-authored-by: echo_stone <liulei281@huawei.com>
…s-fork output readback

- Fix type name: CommDeviceContext → CommContext (matching platform_comm/comm_context.h)
- Implement store_to_host in the chip child's main loop so post-kernel output is flushed to SharedMemory, working around copy_from IPC being broken across fork
- Use SharedMemory for output readback in main.py instead of worker.copy_from

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
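The SharedMemory readback workaround can be sketched with the stdlib multiprocessing.shared_memory API. The segment name and shape below are illustrative, not from the PR, and the "chip-child side" block stands in for the forked worker's store_to_host step:

```python
import numpy as np
from multiprocessing import shared_memory

SHM_NAME = "pto_demo_output"  # illustrative name, not from the PR
SHAPE, DTYPE = (8,), np.float32

# Parent (harness) side: create the segment before forking the chip child.
shm = shared_memory.SharedMemory(name=SHM_NAME, create=True, size=8 * 4)

# Chip-child side (would run in the forked worker): attach by name and
# flush the kernel's output into the segment, standing in for store_to_host.
child_view = shared_memory.SharedMemory(name=SHM_NAME)
out = np.ndarray(SHAPE, dtype=DTYPE, buffer=child_view.buf)
out[:] = np.arange(8, dtype=DTYPE)  # pretend this is the kernel result
child_view.close()

# Parent side again: read the result back without any copy_from IPC,
# then unlink so nothing is left behind in /dev/shm.
result = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf).copy()
shm.close()
shm.unlink()
print(result.tolist())
```

Because both sides attach the same named segment, the child's write is visible to the parent with no pipe or pickle traffic — which is exactly why it survives the broken cross-fork copy_from path. The explicit unlink matches the test plan's requirement that /dev/shm be clean after worker.close().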
Closes #303. After the original PR was split into main layer by layer, this wraps things up as the end-to-end demo.

What this PR now contains

A 2-card allreduce example plus the corresponding hardware ST, exercising the L1a..L6 stack:
- examples/workers/l3/allreduce_distributed/kernels/aiv/allreduce_kernel.cpp — the original kernel by @PKUZHOU / echo_stone, used verbatim; the only change is one include path ("common/comm_context.h" → "platform_comm/comm_context.h") to match the header's location after L1b moved it under src/common/platform_comm/.
- kernels/orchestration/allreduce_orch.cpp — the orchestration passes the 5 scalars in ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx) straight through to the AIV task. Scalars are used to sidestep the Tensor path's pointer wrapping (Tensor rewrites the value into a Tensor struct address, so the kernel's reinterpret_cast would recover the wrong pointer).
- main.py — 2-card harness: per-rank input is delivered into the HCCL window via SharedMemory + HostBufferStaging during the bootstrap_context phase → Worker.init() forks + bootstraps → orch_fn does add_scalar × 5 per chip and submits to submit_next_level → worker.copy_from reads back the output and compares against golden.
- tests/st/workers_l3/test_allreduce_distributed_hw.py — marked with device_count(2) + platforms(["a2a3"]) so the st-onboard-a2a3 CI job automatically runs main().

Context for this line of work
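The scalar pass-through described above can be sketched as follows. TaskArgs, add_scalar, and expected_arg_count are names from this PR, but their real signatures have not been verified here, so this is an illustrative shape only, modeled with a stand-in args class:

```python
# Illustrative stand-in for the orchestration's scalar pass-through: the
# five values ride as raw scalars, never wrapped in a Tensor, so the
# kernel's reinterpret_cast sees the original device pointers.
class TaskArgs:  # stand-in, not the project's real TaskArgs
    def __init__(self):
        self.scalars = []

    def add_scalar(self, value: int):
        self.scalars.append(value)

def build_allreduce_args(input_ptr, output_ptr, nranks, root, device_ctx):
    args = TaskArgs()
    # Order matters: the AIV kernel unpacks the scalars positionally.
    for v in (input_ptr, output_ptr, nranks, root, device_ctx):
        args.add_scalar(v)
    # Mirrors the dispatch-side check: expected_arg_count=5 should be
    # satisfied by tensor_count (0) + scalar_count (5).
    assert len(args.scalars) == 5
    return args

args = build_allreduce_args(0x1000, 0x2000, nranks=2, root=0, device_ctx=0x3000)
print(args.scalars)  # → [4096, 8192, 2, 0, 12288]
```

The design choice is the key point: the Tensor path would replace each pointer value with the address of a wrapping Tensor struct, which the kernel cannot distinguish from the real buffer address; raw scalars avoid that rewrite entirely.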
The original #307 implemented the same functionality, but through a subprocess + ctypes harness (distributed_worker.py / distributed_code_runner.py) and a separate DISTRIBUTED_CONFIG / run_example.py path. When splitting the feature into 7 reviewable layers, we decided to route distributed execution through the framework mainline (Worker(chip_bootstrap_configs=...)) rather than keep a second entry point. This PR therefore force-pushed an implementation based on the latest main onto the head branch; the original two commits remain reachable by SHA (7a8bafd, 31c2030) but the PR Files view shows the new content. Authorship is preserved via the commit trailer (Co-authored-by: echo_stone <liulei281@huawei.com>).

The seven merged layers:
- HCCL comm backend (comm_hccl.cpp) + C++ hardware UT
- ChipWorker.comm_* Python wrapper
- Worker.run() propagation
- ChipBootstrapChannel parent-child handshake mailbox
- ChipWorker.bootstrap_context() one-shot per-chip bring-up
- Worker(chip_bootstrap_configs=...) level orchestration + ChipContext

The CommDeviceContext ABI and the CommRemotePtr design come entirely from the original PR, credited individually in each layer's PR body.

Known points to verify
At submission time only local static checks were done (AST parse + name verification); nothing has been compiled or run. The following points need verification on a 2-card a2a3 environment:
- Whether ChipCallable.build(signature=[]) + a TaskArgs carrying only 5 scalars is accepted by the dispatch path (expected_arg_count=5 validates tensor_count + scalar_count, which should be fine in theory, but needs a real run).
- Whether windowsIn[pe] is readable via MTE2 after bootstrap completes (the L6 hw smoke test only covered bootstrap, with no real MTE2 traffic).
- comm_barrier: correctness relies on both ranks having finished bootstrap_context and each rank's copy_to having landed. If this is blocked by HCCL error 507018, it degrades to "bootstrap succeeds but the kernel reads stale data".

Test plan
- python examples/workers/l3/allreduce_distributed/main.py -d 0-1 passes
- pytest tests/st/workers_l3/test_allreduce_distributed_hw.py --platform a2a3 --device 0-1 passes
- st-onboard-a2a3 CI green
- /dev/shm + /tmp/pto_allreduce_distributed_rootinfo_* cleaned up after worker.close()

Related