Skip to content

Fix: Python-side acquire/release on mailbox state#609

Merged
ChaoWao merged 1 commit intohw-native-sys:mainfrom
ChaoWao:fix/python-mailbox-state-acquire-release
Apr 21, 2026
Merged

Fix: Python-side acquire/release on mailbox state#609
ChaoWao merged 1 commit intohw-native-sys:mainfrom
ChaoWao:fix/python-mailbox-state-acquire-release

Conversation

@ChaoWao
Copy link
Copy Markdown
Collaborator

@ChaoWao ChaoWao commented Apr 20, 2026

Summary

Part 3 of splitting #571. Already landed: L1a (#592 HCCL platform C API), L1b (#597 sim+ChipWorker+Python), L4 (#605 error propagation). Independent follow-ups (L2 DistChipBootstrapChannel, L5/L6/L7) are separate PRs.

The C++ side of the mailbox handshake (WorkerThread::read_mailbox_state / write_mailbox_state in worker_manager.cpp) uses ldar/stlr on aarch64 and a compiler-barriered plain store on x86_64 for MAILBOX_OFF_STATE. The Python side — three worker loops, _chip_control, and init/close — used plain struct.pack_into("i", buf, _OFF_STATE, …) / unpack_from with no memory barrier.

On aarch64 that lets the child's OFF_STATE = TASK_DONE release-store pass the preceding OFF_ERROR / OFF_ERROR_MSG plain writes, so a parent that acquire-loads TASK_DONE can read stale error text. Program order is not memory order on a weakly-consistent CPU.

Changes

  • python/bindings/worker_bind.h: add inline mailbox_load_i32 / mailbox_store_i32 (aarch64 ldar/stlr first per codestyle.md support extern func define in aicore #6, x86_64 compiler barrier + plain access, __atomic_{load,store} with ACQUIRE / RELEASE fallback). Expose on _task_interface as _mailbox_load_i32 / _mailbox_store_i32. Underscore prefix keeps them out of task_interface.__all__; only simpler.worker imports them.
  • python/simpler/worker.py: every OFF_STATE read/write (18 sites across _sub_worker_loop, _chip_process_loop, _child_worker_loop, _chip_control, _init_hierarchical, close()) now goes through the helper. Error-path fields stay plain — they are published by the subsequent release-store. Adds _buffer_field_addr(buf, offset) helper.
  • tests/ut/py/test_worker/test_mailbox_atomics.py: 4 cases
    • single-process roundtrip (binding signatures, negative values, offset disambiguation)
    • fork cross-process visibility (parent observes child's final state value)
    • 1000-iteration payload-before-state ordering — child writes 0xDEADBEEFCAFEBABE payload then release-stores state=1; parent acquire-loads state==1 then must observe the sentinel. Exercises the exact invariant the three loops rely on to publish OFF_ERROR_MSG with TASK_DONE.
    • sub-worker dispatch smoke using a MAP_SHARED counter to verify the refactored loop round-trips cleanly.

Design decisions

  1. Exposure: C++ inline helpers in worker_bind.h bound via nanobind — one implementation, one ABI, no ctypes wrapper on the Python side.
  2. ABI ordering: __aarch64____x86_64__ → fallback, per codestyle.md support extern func define in aicore #6.
  3. Address type: uint64_t. Python computes ctypes.addressof(ctypes.c_char.from_buffer(buf)) + offset; C++ reinterpret_casts to volatile int32_t*. No void* (nanobind would treat it as a Python object).
  4. Invariant: in every loop and _chip_control, the state store is the last mailbox write of a transition; the state load is the first mailbox read. No field other than state needs atomic helpers.

Test plan

  • pytest tests/ut/py/test_worker/ — 48 passed, 1 skipped (platform_comm needs hw), 0 regressions
  • test_mailbox_atomics.py — 6 passed including 1000-iter fork race
  • test_error_propagation.py — 5 passed (L4 paths exercise every refactored site end-to-end)
  • CI on aarch64 self-hosted runner — this is where the race would actually show up if the invariant broke

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces atomic acquire-load and release-store helpers in C++ and exposes them to Python to ensure correct memory ordering for mailbox state transitions across multiple processes. By replacing standard struct packing and unpacking with these helpers, the implementation guarantees that payload writes, such as error messages, are visible to other processes before state changes are observed, which is critical for weakly-ordered architectures like aarch64. The changes include a new test suite to verify these atomic invariants and cross-process visibility. I have no feedback to provide as the existing review comments were purely validating the implementation.

The C++ WorkerThread::dispatch_process already uses ldar/stlr (aarch64) and
compiler-barriered plain store (x86_64) when it reads/writes the mailbox
OFF_STATE word. The Python side of the handshake — the three worker loops
(_sub_worker_loop, _chip_process_loop, _child_worker_loop), _chip_control,
and the init/close paths — was using plain struct.pack_into("i",
buf, _OFF_STATE, …) / unpack_from. On aarch64 that lets the state flip
leak ahead of the preceding OFF_ERROR / OFF_ERROR_MSG writes, so a parent
that observes TASK_DONE can read a stale error message.

Design decisions (per .docs/l3-audit.md):

1. Exposure: add inline mailbox_load_i32 / mailbox_store_i32 in
   worker_bind.h and bind them as _mailbox_load_i32 /
   _mailbox_store_i32 on _task_interface. Underscore prefix keeps them
   out of task_interface.__all__ — only simpler.worker imports them.

2. ABI: aarch64 ldar/stlr first (per .claude/rules/codestyle.md hw-native-sys#6),
   x86_64 second with __asm__ volatile("" ::: "memory") to stop the
   compiler from reordering across the TSO store, fallback to
   __atomic_{load,store} with ACQUIRE / RELEASE.

3. addr type: uint64_t. Python computes via
   ctypes.addressof(ctypes.c_char.from_buffer(buf)) + offset; C++
   reinterpret_casts to volatile int32_t*. No void*.

4. Field ordering: all OFF_ERROR / OFF_ERROR_MSG / _CTRL_OFF_RESULT
   writes happen BEFORE the release-store of OFF_STATE, and all
   non-state reads happen AFTER the acquire-load of OFF_STATE. Every
   state access in python/simpler/worker.py now goes through the
   helper — 18 sites across the three loops, _chip_control, init, and
   close — so the invariant is mechanical.

Adds tests/ut/py/test_worker/test_mailbox_atomics.py with four cases
(roundtrip, cross-process visibility, payload-before-state ordering over
1000 fork iterations, refactored sub-worker dispatch). Existing L4
error-propagation tests still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ChaoWao ChaoWao force-pushed the fix/python-mailbox-state-acquire-release branch from 6bfbee3 to f49abb1 Compare April 20, 2026 13:04
@ChaoWao ChaoWao merged commit bd8258a into hw-native-sys:main Apr 21, 2026
27 of 28 checks passed
@ChaoWao ChaoWao deleted the fix/python-mailbox-state-acquire-release branch April 21, 2026 00:52
ChaoWao added a commit to PKUZHOU/simpler that referenced this pull request Apr 21, 2026
走通 hw-native-sys#592 hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609 hw-native-sys#610 hw-native-sys#613 拼起来的分布式 stack。
通过 Worker(level=3, chip_bootstrap_configs=...) 让两卡各自把所有
rank 的 input 经 CommRemotePtr 跨 rank MTE2 求和,再写回自己的
output,用 worker.copy_from 读回校验。

文件:
- kernels/aiv/allreduce_kernel.cpp —— 从 hw-native-sys#307 (PKUZHOU / echo_stone)
  直接搬过来,只改了一处 include 路径 ("common/comm_context.h" →
  "platform_comm/comm_context.h"),对齐 L1b 移动后的 header 位置。
- kernels/orchestration/allreduce_orch.cpp —— 把 ChipStorageTaskArgs
  里的 5 个 scalar (input_ptr, output_ptr, nranks, root, device_ctx)
  原样透给 AIV task,不走 Tensor 包装(Tensor 路径会改写指针)。
- main.py —— 2 卡 harness:per-rank input 用 SharedMemory + HostBufferStaging
  在 bootstrap 阶段送进 window,init 后 unlink shm;orch_fn 每 chip
  add_scalar × 5 提交到 submit_next_level;copy_from 读回 output 校验。
- tests/st/workers_l3/test_allreduce_distributed_hw.py —— 挂 device_count(2)
  + platforms(["a2a3"]) 让 st-onboard-a2a3 自动拉起 main()。

WIP:本机只做了静态检查 (AST parse + import name 核对),没编译过
没跑过。下一步带到 2 卡 a2a3 环境调通;已知需要验证的点见 PR body。

Co-authored-by: echo_stone <liulei281@huawei.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant