
Add: ChipWorker.bootstrap_context one-shot chip bring-up (L5) #610

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:pr-571-l5-bootstrap-context
Apr 21, 2026

Conversation

Collaborator

@ChaoWao ChaoWao commented Apr 20, 2026

Summary

  • Adds ChipWorker.bootstrap_context(device_id, cfg, channel=None) — a single entry point that composes L1b's set_device + comm_* + copy_to and L2's ChipBootstrapChannel into the one-shot per-chip bring-up L6 will call for every forked chip child.
  • Adds ChipWorker.shutdown_bootstrap() — idempotent release of the HCCL comm handle stashed by bootstrap_context.
  • Adds 5 dataclasses in python/simpler/task_interface.py to describe the inputs/outputs: ChipCommBootstrapConfig, ChipBufferSpec, HostBufferStaging, ChipBootstrapConfig, ChipBootstrapResult.
  • Re-exports CHIP_BOOTSTRAP_MAILBOX_SIZE, ChipBootstrapChannel, and ChipBootstrapMailboxState from simpler.task_interface so callers need a single import.
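
A rough sketch of how the five dataclasses might fit together. The class names, `cfg.comm`, `cfg.buffers`, `placement`, `nbytes`, `load_from_host`, and the four `ChipBootstrapResult` fields come from this PR. Every field marked hypothetical is a placeholder for illustration, not the actual definition in task_interface.py.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChipCommBootstrapConfig:
    rank: int            # hypothetical field
    nranks: int          # hypothetical field
    rootinfo_path: str   # hypothetical field
    window_size: int     # hypothetical field: requested per-rank window bytes


@dataclass
class ChipBufferSpec:
    name: str                      # assumed: key for L6's name->ptr dict
    nbytes: int
    placement: str = "window"
    load_from_host: bool = False


@dataclass
class HostBufferStaging:
    name: str       # hypothetical: matches ChipBufferSpec.name
    shm_name: str   # hypothetical: POSIX shm segment created by the parent
    size: int       # hypothetical


@dataclass
class ChipBootstrapConfig:
    comm: Optional[ChipCommBootstrapConfig] = None
    buffers: list[ChipBufferSpec] = field(default_factory=list)
    host_staging: list[HostBufferStaging] = field(default_factory=list)  # hypothetical name


@dataclass
class ChipBootstrapResult:
    device_ctx: int
    local_window_base: int
    actual_window_size: int
    buffer_ptrs: list[int]


# buffer_ptrs is 1:1 with cfg.buffers, so a name->ptr dict is a plain zip.
cfg = ChipBootstrapConfig(buffers=[ChipBufferSpec("x", 64), ChipBufferSpec("y", 32)])
result = ChipBootstrapResult(
    device_ctx=1,
    local_window_base=0x1000,
    actual_window_size=4096,
    buffer_ptrs=[0x1000, 0x1040],
)
name_to_ptr = dict(zip((b.name for b in cfg.buffers), result.buffer_ptrs))
```

The last three lines illustrate why the result keeps `buffer_ptrs` positionally aligned with `cfg.buffers` instead of returning a dict itself.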

Part of the PR #571 split (see the L1a/L1b/L2/L4 predecessors). L6 (parent-side Worker.init fork orchestration) is a separate PR that builds on this API.

Design decisions

  1. Channel is optional. Sim unit tests drive bootstrap_context directly and consume the return value — requiring a channel would force every test to allocate a mailbox shm. Channel is the L6 publish hook, not a structural component of L5.
  2. Dataclasses live in task_interface.py, not worker.py. They describe ChipWorker inputs, so they belong alongside ChipWorker. worker.py is L3+ concerns (scheduler, ring, mailbox).
  3. All errors → code=1 + "<ExceptionType>: <message>". Single try/except wraps the whole bring-up; callers never need to distinguish "before" vs "after" comm came up. code=1 aligns with L4.
  4. Comm handle lifecycle is explicit. bootstrap_context stashes the HCCL handle on self._comm_handle; shutdown_bootstrap() releases it (zero-handle guard makes double-call a no-op). finalize() deliberately does NOT chain into shutdown_bootstrap — L6 owns the teardown order.
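
A minimal sketch of decisions 1 and 3 together: one try/except around the whole bring-up, an optional channel, and every failure collapsed to code=1 with a "<ExceptionType>: <message>" body before re-raising. `FakeChannel` and `run_bootstrap` are hypothetical names for illustration; only `write_success` and `write_error` are taken from this PR.

```python
class FakeChannel:
    """Stand-in for ChipBootstrapChannel; records calls instead of writing shm."""

    def __init__(self):
        self.calls = []

    def write_success(self, result):
        self.calls.append(("success", result))

    def write_error(self, code, message):
        self.calls.append(("error", code, message))


def run_bootstrap(bring_up, channel=None):
    """Run bring_up(); publish the outcome if a channel is given, then return or re-raise."""
    try:
        result = bring_up()
        if channel is not None:
            channel.write_success(result)
        return result
    except Exception as exc:
        if channel is not None:
            # Every failure maps to code=1; the type name is kept in the message.
            channel.write_error(1, f"{type(exc).__name__}: {exc}")
        raise


def bad_bring_up():
    raise ValueError("unsupported placement 'bogus'")


channel = FakeChannel()
try:
    run_bootstrap(bad_bring_up, channel)
except ValueError:
    pass  # the exception still reaches the caller after the mailbox write
```

With `channel=None` the same function just returns the result or lets the exception propagate, which is the path the sim unit tests rely on.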

Test plan

  • pytest tests/ut/py/test_worker/test_bootstrap_context_sim.py (no hardware; 4 cases: happy path, load_from_host round-trip, channel integration, error path)
  • pytest tests/ut/py/test_worker/test_bootstrap_context_hw.py --platform a2a3 --device 0-1 on a2a3 hardware (1 case: 2-rank HCCL bootstrap, no barrier — avoids known 507018)
  • Full tests/ut/py/test_worker still green under Linux CI (macOS sim masks some Linux-only failures — watching the CI run is load-bearing here)


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a one-shot bootstrap mechanism for chip workers, enabling communicator initialization, memory window allocation, and host-to-device data staging via shared memory. It adds several configuration dataclasses and implements the bootstrap_context and shutdown_bootstrap methods within the ChipWorker class. Comprehensive hardware and simulation tests are also provided. Review feedback highlights a contradiction between the documentation and implementation regarding null communicators, a potential crash when handling zero-sized staging buffers, and the need for more robust handle cleanup in the shutdown process.

Comment thread python/simpler/task_interface.py Outdated
Comment thread python/simpler/task_interface.py
Comment thread python/simpler/task_interface.py Outdated
@ChaoWao ChaoWao force-pushed the pr-571-l5-bootstrap-context branch 2 times, most recently from 2b0a492 to f6066bf, April 21, 2026 01:24
Wraps L1b's ChipWorker.comm_* + copy_to + set_device and L2's
ChipBootstrapChannel into a single ChipWorker.bootstrap_context(device_id,
cfg, channel=None) entry point that:

  1. Sets the NPU device (set_device already wires ACL bring-up per L1b).
  2. Brings up the communicator (comm_init + comm_alloc_windows +
     comm_get_local_window_base + comm_get_window_size) when
     cfg.comm is non-None.  Skips the whole step when cfg.comm is None.
  3. Carves the per-rank window sequentially into the ChipBufferSpec[]
     list, validating placement=="window" and that the cumulative
     nbytes does not overflow the actual (possibly rounded-up) window
     size returned by the backend.  buffer_ptrs is 1:1 aligned with
     cfg.buffers so L6's ChipContext can build its name->ptr dict by zip.
  4. For every ChipBufferSpec with load_from_host=True, attaches to the
     matching HostBufferStaging POSIX shm (parent is expected to have
     created + filled it pre-fork), copies the bytes into the device
     window slice via ChipWorker.copy_to, and closes the local mapping.
  5. Publishes ChipBootstrapResult(device_ctx, local_window_base,
     actual_window_size, buffer_ptrs) via channel.write_success when
     a channel is provided; on any exception, publishes
     channel.write_error(1, "<ExceptionType>: <message>") first and
     re-raises.
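
Step 3 can be sketched as a pure function. This is an illustration under assumptions, not the code in task_interface.py: `carve_window` is a hypothetical name, buffer specs are reduced to `(placement, nbytes)` tuples, and buffers are packed back to back with no alignment padding.

```python
def carve_window(local_window_base, actual_window_size, buffers):
    """Assign each (placement, nbytes) spec a pointer, packed sequentially."""
    ptrs = []
    offset = 0
    for placement, nbytes in buffers:
        if placement != "window":
            raise ValueError(f"unsupported placement {placement!r}")
        if offset + nbytes > actual_window_size:
            raise ValueError(
                f"buffers need {offset + nbytes} bytes, "
                f"window has {actual_window_size}"
            )
        ptrs.append(local_window_base + offset)
        offset += nbytes
    return ptrs  # same order and length as `buffers`


ptrs = carve_window(0x1000, 256, [("window", 64), ("window", 128)])
# ptrs == [0x1000, 0x1040]
```

Checking against the actual window size rather than the requested one means a backend that rounds the window up never causes a spurious overflow error.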

Also adds ChipWorker.shutdown_bootstrap(), the matching teardown: it
releases the HCCL comm handle stashed on self._comm_handle by
bootstrap_context inside a try/finally so the zero-handle guard makes
the method truly idempotent even if comm_destroy raises.
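
The guard plus try/finally pattern looks roughly like this. `BootstrapHandleOwner` and `destroy_fn` are hypothetical stand-ins for ChipWorker and comm_destroy; only `_comm_handle` and `shutdown_bootstrap` are names from this PR.

```python
class BootstrapHandleOwner:
    """Toy owner of a comm handle; destroy_fn stands in for comm_destroy."""

    def __init__(self, handle, destroy_fn):
        self._comm_handle = handle
        self._destroy_fn = destroy_fn

    def shutdown_bootstrap(self):
        if self._comm_handle == 0:  # zero-handle guard: nothing to release
            return
        try:
            self._destroy_fn(self._comm_handle)
        finally:
            self._comm_handle = 0  # cleared even if destroy raised


destroyed = []


def flaky_destroy(handle):
    destroyed.append(handle)
    raise RuntimeError("comm_destroy failed")


owner = BootstrapHandleOwner(0xBEEF, flaky_destroy)
try:
    owner.shutdown_bootstrap()
except RuntimeError:
    pass
owner.shutdown_bootstrap()  # no-op: the handle was already cleared
```

Clearing the field in `finally` trades a possible leak for a guarantee that a retry never destroys the same handle twice.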

Design decisions (4):

1. Channel parameter is Optional[ChipBootstrapChannel], not required.
   L5 unit tests -- especially the sim path where the child process
   consumes the return value directly -- must be able to drive
   bootstrap_context without allocating a per-chip mailbox.  The
   channel is the L6 publish hook for the parent-to-child handshake,
   not a structural component of L5 itself.  When channel=None,
   exceptions still propagate normally; the only thing skipped is
   the write_success/write_error side effect.

2. New dataclasses live in python/simpler/task_interface.py, not in
   worker.py.  ChipWorker is a task_interface module type and its
   one-shot config -- ChipCommBootstrapConfig, ChipBufferSpec,
   HostBufferStaging, ChipBootstrapConfig, ChipBootstrapResult --
   belongs alongside it.  worker.py describes L3+ Worker concerns
   (scheduler, ring, mailbox), which L5 does not touch.

3. Failure mode collapses all exceptions to code=1 with a
   "<ExceptionType>: <message>" body before rethrowing.  The single
   exit point wraps everything from set_device through the final
   channel.write_success, so callers never need to distinguish
   "before" vs "after" the communicator came up.  code=1 matches the
   L4 convention so downstream consumers that already multiplex on
   the mailbox error_code do not see a new value.  When channel is
   None, the exception is simply re-raised; there is no mailbox
   write path to skip.

4. Comm handle lifecycle is explicit.  On successful comm_init,
   bootstrap_context stashes the handle at self._comm_handle.
   shutdown_bootstrap() is the matching release: it comm_destroys
   the handle inside a try/finally and clears the field to zero, so
   a double call is a no-op -- and so is a retry after comm_destroy
   itself raises.  finalize() is intentionally NOT wired to this
   method; ChipWorker.finalize keeps its pre-L5 semantics and the
   teardown order (shutdown_bootstrap then finalize) is L6's
   orchestration concern.  Tests verify this order explicitly.

Tests:

- tests/ut/py/test_worker/test_bootstrap_context_sim.py (no hardware):
  * happy path: 2-rank fork on a2a3sim; each rank's ChipBootstrapResult
    has non-zero local_window_base, actual_window_size>=requested, and
    buffer_ptrs == [local_window_base].
  * load_from_host: parent stages 64 bytes in POSIX shm, child 0 runs
    bootstrap_context with load_from_host=True, then copy_from reads
    the device window back to host and asserts the payload round-
    tripped unchanged.
  * channel integration: parent allocates one mailbox shm per rank,
    children publish via ChipBootstrapChannel; parent verifies
    state==SUCCESS and every field matches the return value.
  * error path: single-process fork with placement="bogus" raises
    ValueError; parent reads ERROR state with error_code=1 and
    error_message starting "ValueError: " and containing "bogus".
- tests/ut/py/test_worker/test_bootstrap_context_hw.py (hardware):
  * 2-rank tensormap_and_ringbuffer bootstrap on a2a3 devices.
    Asserts device_ctx!=0, local_window_base!=0,
    actual_window_size>=requested, buffer_ptrs == [local_window_base].
    Deliberately does NOT call comm_barrier, so the known HCCL 507018
    failure path (already documented in L1b's test_platform_comm.py)
    cannot regress this test.
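
For reference, the load_from_host round trip can be approximated on the host alone with Python's `multiprocessing.shared_memory`. In this sketch a `bytearray` stands in for the device window and a slice assignment stands in for ChipWorker.copy_to; both are assumptions, and the real test forks a child instead of attaching from the same process.

```python
from multiprocessing import shared_memory

payload = bytes(range(64))
n = len(payload)

# Parent side: create and fill the staging segment (pre-fork in the real flow).
staging = shared_memory.SharedMemory(create=True, size=n)
staging.buf[:n] = payload

# Child side: attach by name, copy into the window slice, close the mapping.
device_window = bytearray(256)  # stand-in for device memory
offset = 0
attached = shared_memory.SharedMemory(name=staging.name)
try:
    # bytes(...) copies out of the mapping so no view outlives close().
    device_window[offset:offset + n] = bytes(attached.buf[:n])
finally:
    attached.close()

# Parent cleanup once the child has consumed the data.
staging.close()
staging.unlink()

round_tripped = bytes(device_window[offset:offset + n])
```

Slices use `n` rather than the segment's reported size because some platforms round the segment up to a page multiple.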

Incidental fix: src/common/platform_comm/comm_sim.cpp:make_shm_name
shortened so the worst-case shm name fits macOS's PSHMNAMLEN=31 limit.
The prior format `/simpler_comm_<pid>_<hash64>` reached 36 characters
and failed shm_open with ENAMETOOLONG on darwin, which the
new L5 sim tests exercise for the first time in CI (the older
test_platform_comm.py is requires_hardware and so never ran on macOS).
The new format `/simpler_<pidhex>_<hash32>` is <= 26 characters and
works on both macOS and Linux; 32 bits of rootinfo-path hash is still
collision-resistant for the "one driver spawns N ranks" launch
pattern this backend is designed for.
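
A quick length check of the two formats, rendered in Python for illustration (the real code is C++ in comm_sim.cpp). Assumptions: the old format printed the pid in decimal and the hash as 16 hex digits, the new one prints the pid as up to 8 hex digits and the hash as 8 zero-padded hex digits, and `zlib.crc32` stands in for whatever 32-bit hash the backend uses.

```python
import zlib

PSHMNAMLEN = 31  # macOS limit cited above


def make_shm_name(pid, rootinfo_path):
    """Python rendering of the `/simpler_<pidhex>_<hash32>` format."""
    # crc32 is an assumed placeholder for the backend's 32-bit hash.
    hash32 = zlib.crc32(rootinfo_path.encode()) & 0xFFFFFFFF
    return f"/simpler_{pid & 0xFFFFFFFF:x}_{hash32:08x}"


# Worst case for the new format: an 8-hex-digit pid.
worst_case = make_shm_name(0xFFFFFFFF, "/tmp/rootinfo")

# Old format with a 5-digit pid and a full 64-bit hash (assumed rendering).
old_name = f"/simpler_comm_{99998}_{0xFFFFFFFFFFFFFFFF:016x}"

print(len(worst_case), len(old_name))
```

The fixed-width hash keeps the upper bound independent of the rootinfo path, so only the pid width varies.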

Scope:

- python/simpler/task_interface.py: new dataclasses +
  bootstrap_context + shutdown_bootstrap + re-exports of
  CHIP_BOOTSTRAP_MAILBOX_SIZE, ChipBootstrapChannel, and
  ChipBootstrapMailboxState.
- src/common/platform_comm/comm_sim.cpp: shm name length fix above.
- Does not touch worker.py, nanobind bindings, or any runtime code --
  L5 is otherwise purely a Python composition layer over the
  L1a/L1b/L2 surfaces already merged upstream.
- L6 (parent-side Worker.init fork orchestration) is deliberately not
  addressed here; that is a separate PR that builds on this API.

An audit of the existing ChipWorker signatures, and of the sim backend's
ready-count-barrier constraint that forces the sim tests to fork
N rank children, lives in .docs/l5-audit.md (local-only per repo
.gitignore convention).

Verified locally: tests/ut/py/test_worker (macOS arm64, Python 3.14)
  59 passed, 2 skipped (HCCL hardware + test_platform_comm).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ChaoWao ChaoWao force-pushed the pr-571-l5-bootstrap-context branch from f6066bf to 36c3fc8, April 21, 2026 01:40
@ChaoWao ChaoWao merged commit bb7965f into hw-native-sys:main Apr 21, 2026
14 checks passed
@ChaoWao ChaoWao deleted the pr-571-l5-bootstrap-context branch April 21, 2026 01:53
ChaoWao added a commit to PKUZHOU/simpler that referenced this pull request Apr 21, 2026
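Exercises, end to end, the distributed stack assembled from
hw-native-sys#592 hw-native-sys#597 hw-native-sys#605 hw-native-sys#608 hw-native-sys#609 hw-native-sys#610 hw-native-sys#613.
Using Worker(level=3, chip_bootstrap_configs=...), each of the two chips
sums every rank's input via cross-rank MTE2 through CommRemotePtr, writes
the result back to its own output, and the output is read back with
worker.copy_from for verification.

Files:
- kernels/aiv/allreduce_kernel.cpp: carried over directly from
  hw-native-sys#307 (PKUZHOU / echo_stone), with a single include path
  changed ("common/comm_context.h" → "platform_comm/comm_context.h") to
  match the header location after the L1b move.
- kernels/orchestration/allreduce_orch.cpp: passes the 5 scalars in
  ChipStorageTaskArgs (input_ptr, output_ptr, nranks, root, device_ctx)
  through to the AIV task unchanged, bypassing Tensor wrapping (the Tensor
  path would rewrite the pointers).
- main.py: 2-chip harness. Per-rank input is delivered into the window
  during bootstrap via SharedMemory + HostBufferStaging, and the shm is
  unlinked after init; orch_fn calls add_scalar × 5 per chip and submits
  to submit_next_level; copy_from reads the output back for verification.
- tests/st/workers_l3/test_allreduce_distributed_hw.py: marked with
  device_count(2) + platforms(["a2a3"]) so that st-onboard-a2a3 launches
  main() automatically.

WIP: only static checks have been done locally (AST parse + import name
check); the code has not been compiled or run. Next step is to bring it up
on a 2-chip a2a3 environment; the known points to verify are listed in the
PR body.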

Co-authored-by: echo_stone <liulei281@huawei.com>