Add: ChipBootstrapChannel for per-chip bootstrap handshake (L2) #608

Merged
ChaoWao merged 1 commit into hw-native-sys:main from hw-native-sys-bot:feat/chip-bootstrap-channel-l2
Apr 20, 2026

Conversation

@hw-native-sys-bot
Collaborator

Summary

  • Introduce ChipBootstrapChannel — a one-shot 4096 B cross-process mailbox for parent-child chip bootstrap handshake
  • Three-state machine (IDLE/SUCCESS/ERROR) with acquire/release memory barriers (aarch64 ldar/stlr, x86_64 compiler barrier)
  • Nanobind Python bindings exposed via _task_interface
  • 7 unit tests covering in-process and fork-based cross-process scenarios (no hardware required)

Design Decisions

  1. Mailbox size: 4096 B (one page). HEADER=64, ERROR_MSG=1024, PTR_CAPACITY=376
  2. State machine: three states, IDLE/SUCCESS/ERROR, encoded as 0/1/2, leaving headroom for future intermediate states
  3. Memory ordering: Same 3-branch pattern as WorkerThread mailbox in worker_manager.cpp
  4. Error message: strncpy with null termination at size-1, compatible with L4 convention
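
The size budget in decision 1 can be checked at compile time. The sketch below is not the PR's header: the struct name, the field order, and the 8-byte width of each pointer slot are assumptions, chosen because 64 + 1024 + 376 × 8 adds up to exactly 4096.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Constants quoted in the PR description.
constexpr std::size_t CHIP_BOOTSTRAP_HEADER_SIZE = 64;
constexpr std::size_t CHIP_BOOTSTRAP_ERROR_MSG_SIZE = 1024;
constexpr std::size_t CHIP_BOOTSTRAP_PTR_CAPACITY = 376;
// Hypothetical name for the one-page total.
constexpr std::size_t SKETCH_MAILBOX_SIZE = 4096;

// Hypothetical layout: header, then error text, then pointer slots.
// The real field order and header contents are not shown in the PR text.
struct SketchMailboxLayout {
    alignas(64) unsigned char header[CHIP_BOOTSTRAP_HEADER_SIZE];
    char error_msg[CHIP_BOOTSTRAP_ERROR_MSG_SIZE];
    std::uint64_t buffer_ptrs[CHIP_BOOTSTRAP_PTR_CAPACITY];  // assumes 8-byte slots
};

// 64 + 1024 + 376 * 8 == 4096, so the three regions fill one page exactly.
static_assert(sizeof(SketchMailboxLayout) == SKETCH_MAILBOX_SIZE,
              "mailbox must occupy exactly one 4096 B page");
static_assert(offsetof(SketchMailboxLayout, buffer_ptrs) ==
                  CHIP_BOOTSTRAP_HEADER_SIZE + CHIP_BOOTSTRAP_ERROR_MSG_SIZE,
              "pointer region starts right after header and error text");
```

Under these assumptions there is no slack left in the page, so growing any region would require shrinking another.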

Testing

  • All 7 UT cases pass (pytest tests/ut/py/test_worker/test_bootstrap_channel.py)
  • No hardware dependency — runs on no-hw CI runners
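
For readers who want to see the shape of a fork-based scenario, here is a rough C++ analogue. It does not reproduce the PR's pytest cases or the ChipBootstrapChannel API; `DemoMailbox` and `run_fork_handshake` are hypothetical names, and the sketch assumes a POSIX system with anonymous shared mappings.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical miniature mailbox: a state word plus a small payload.
struct DemoMailbox {
    std::uint32_t state;  // 0 = IDLE, 1 = SUCCESS, 2 = ERROR
    std::uint32_t count;
    std::uint64_t ptrs[4];
};

// Returns the state the parent observed, or -1 on setup failure or
// if the payload did not match what the child published.
int run_fork_handshake() {
    void* mem = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    auto* mb = static_cast<DemoMailbox*>(mem);  // zero-filled, so state is IDLE

    pid_t pid = fork();
    if (pid < 0) {
        munmap(mem, 4096);
        return -1;
    }
    if (pid == 0) {
        // Child: write the payload first, then publish with a release store.
        mb->count = 1;
        mb->ptrs[0] = 0x1000;
        __atomic_store_n(&mb->state, 1u, __ATOMIC_RELEASE);
        _exit(0);
    }

    // Parent: poll with acquire loads, bounded to roughly five seconds.
    std::uint32_t seen = 0;
    for (int i = 0; i < 50000; ++i) {
        seen = __atomic_load_n(&mb->state, __ATOMIC_ACQUIRE);
        if (seen != 0) break;
        usleep(100);
    }
    bool payload_ok = (mb->count == 1 && mb->ptrs[0] == 0x1000);
    waitpid(pid, nullptr, 0);
    munmap(mem, 4096);
    return (seen == 1 && payload_ok) ? 1 : -1;
}
```

Like the PR's tests, this needs no device: the handshake only touches host shared memory.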

Part of PR #571 split plan (L2).


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the ChipBootstrapChannel class, a shared-memory mailbox for cross-process chip bootstrapping, including its C++ implementation, Python bindings, and unit tests. The review feedback highlights several safety improvements for handling shared memory, such as validating buffer capacities in the constructor and ensuring that counts and strings read from the mailbox are bounds-checked to prevent memory safety vulnerabilities.

3 comment threads on src/common/hierarchical/chip_bootstrap_channel.cpp
@ChaoWao ChaoWao force-pushed the feat/chip-bootstrap-channel-l2 branch from 970225f to 8f2acbb Compare April 20, 2026 12:30
Introduce a one-shot cross-process mailbox class for parent-child
bootstrap communication, independent of the task-mailbox protocol.
Includes C++ implementation, nanobind Python bindings, and 7 UT cases
covering in-process and fork-based cross-process scenarios.

Design decisions:
- Mailbox size: 4096 B (one page). HEADER_SIZE=64, ERROR_MSG_SIZE=1024,
  PTR_CAPACITY=376 — sufficient for all foreseeable chip buffer counts.
- State machine: IDLE/SUCCESS/ERROR three states. Values 0/1/2 leave
  headroom for future intermediate states without serialization migration.
- Memory ordering: aarch64 ldar/stlr inline asm (first, per codestyle hw-native-sys#6),
  x86_64 compiler barrier, __atomic_load/store fallback — same pattern as
  WorkerThread mailbox in worker_manager.cpp.
- Error message: strncpy with explicit null termination at size-1,
  compatible with L4 task-mailbox error message convention.
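
The three-branch memory-ordering pattern described above can be sketched as follows. This is an illustration only, not the code in worker_manager.cpp or the PR: the helper names `load_acquire` and `store_release` are hypothetical, and GCC or Clang is assumed for the inline assembly and `__atomic` builtins.

```cpp
#include <cassert>
#include <cstdint>

// Acquire load of the state word: later reads cannot move above it.
inline std::uint32_t load_acquire(const volatile std::uint32_t* p) {
#if defined(__aarch64__)
    std::uint32_t v;
    __asm__ volatile("ldar %w0, [%1]" : "=r"(v) : "r"(p) : "memory");
    return v;
#elif defined(__x86_64__)
    // x86_64 loads already have acquire semantics in hardware;
    // only compiler reordering has to be blocked.
    std::uint32_t v = *p;
    __asm__ volatile("" ::: "memory");
    return v;
#else
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
#endif
}

// Release store of the state word: earlier writes cannot move below it.
inline void store_release(volatile std::uint32_t* p, std::uint32_t v) {
#if defined(__aarch64__)
    __asm__ volatile("stlr %w0, [%1]" : : "r"(v), "r"(p) : "memory");
#elif defined(__x86_64__)
    // x86_64 stores already have release semantics in hardware.
    __asm__ volatile("" ::: "memory");
    *p = v;
#else
    __atomic_store_n(p, v, __ATOMIC_RELEASE);
#endif
}
```

A writer would fill in the payload and then call `store_release` on the state word; a reader would call `load_acquire` on the state word before touching the payload.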

Cross-process read hardening:
- Ctor rejects max_buffer_count > CHIP_BOOTSTRAP_PTR_CAPACITY so the
  clamp invariant holds for every subsequent read.
- buffer_ptrs() clamps the shared-memory count against max_buffer_count_
  so a corrupted or premature read cannot overrun the pointer region.
- error_message() uses strnlen(CHIP_BOOTSTRAP_ERROR_MSG_SIZE) instead of
  trusting the null-terminator in shared memory.
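
A minimal sketch of the three hardening rules, assuming hypothetical free functions rather than the PR's actual ChipBootstrapChannel methods (only the two `CHIP_BOOTSTRAP_*` constant names come from the commit message, and 8-byte pointer slots are an assumption):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <string.h>  // strnlen (POSIX)
#include <vector>

constexpr std::size_t CHIP_BOOTSTRAP_ERROR_MSG_SIZE = 1024;
constexpr std::size_t CHIP_BOOTSTRAP_PTR_CAPACITY = 376;

// Constructor-time check: reject capacities the pointer region cannot hold.
void check_max_buffer_count(std::size_t max_buffer_count) {
    if (max_buffer_count > CHIP_BOOTSTRAP_PTR_CAPACITY) {
        throw std::invalid_argument("max_buffer_count exceeds pointer capacity");
    }
}

// Read side: never trust the count stored in shared memory.
std::vector<std::uint64_t> read_buffer_ptrs(const std::uint64_t* ptrs,
                                            std::uint32_t shared_count,
                                            std::size_t max_buffer_count) {
    std::size_t n = std::min<std::size_t>(shared_count, max_buffer_count);
    return std::vector<std::uint64_t>(ptrs, ptrs + n);
}

// Write side: strncpy plus explicit termination at size-1.
void write_error_message(char* dst, const char* msg) {
    std::strncpy(dst, msg, CHIP_BOOTSTRAP_ERROR_MSG_SIZE - 1);
    dst[CHIP_BOOTSTRAP_ERROR_MSG_SIZE - 1] = '\0';
}

// Read side: bound the scan instead of trusting a terminator to exist.
std::string read_error_message(const char* src) {
    return std::string(src, ::strnlen(src, CHIP_BOOTSTRAP_ERROR_MSG_SIZE));
}
```

The constructor check is what makes the clamp sufficient: once `max_buffer_count` is known to fit the region, clamping to it bounds every later read.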

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ChaoWao ChaoWao force-pushed the feat/chip-bootstrap-channel-l2 branch from 8f2acbb to b0b0f28 Compare April 20, 2026 12:42
@ChaoWao ChaoWao merged commit 25fc65f into hw-native-sys:main Apr 20, 2026
14 checks passed
@ChaoWao ChaoWao deleted the feat/chip-bootstrap-channel-l2 branch April 20, 2026 12:57
ChaoWao added a commit to PKUZHOU/simpler that referenced this pull request Apr 21, 2026
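Exercise end to end the distributed stack assembled from hw-native-sys#592 hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609 hw-native-sys#610 hw-native-sys#613.
Using Worker(level=3, chip_bootstrap_configs=...), each of the two cards
sums the inputs of all ranks via cross-rank MTE2 through CommRemotePtr,
writes the result back to its own output, and the output is read back
with worker.copy_from for verification.

Files:
- kernels/aiv/allreduce_kernel.cpp: carried over directly from
  hw-native-sys#307 (PKUZHOU / echo_stone); the only change is one include
  path ("common/comm_context.h" → "platform_comm/comm_context.h"), to
  match the header location after the L1b move.
- kernels/orchestration/allreduce_orch.cpp: passes the 5 scalars in
  ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx)
  through to the AIV task unchanged, bypassing Tensor wrapping (the
  Tensor path would rewrite the pointers).
- main.py: 2-card harness. Per-rank input is delivered into the window
  during the bootstrap phase via SharedMemory + HostBufferStaging, and
  the shm is unlinked after init; orch_fn, for each chip, calls
  add_scalar × 5 and submits to submit_next_level; copy_from reads the
  output back for verification.
- tests/st/workers_l3/test_allreduce_distributed_hw.py: attaches
  device_count(2) + platforms(["a2a3"]) so that st-onboard-a2a3 launches
  main() automatically.

WIP: only static checks (AST parse + import name verification) were done
locally; nothing has been compiled or run. The next step is to bring it up
in a 2-card a2a3 environment; the known points needing verification are
listed in the PR body.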

Co-authored-by: echo_stone <liulei281@huawei.com>

2 participants