Merged
21 changes: 12 additions & 9 deletions examples/workers/l3/allreduce_distributed/main.py
@@ -11,15 +11,18 @@

The kernel (ported verbatim from #307) reads every rank's contribution out of
the HCCL window via CommRemotePtr and sums them into each rank's own window
-slot. This example exercises the full L1a..L6 stack:
-
-L1a HCCL backend comm_init / comm_alloc_windows
-L1b ChipWorker.comm_* wrappers host-side bootstrap of the communicator
-L2 ChipBootstrapChannel chip child publishes SUCCESS to the parent
-L3 mailbox atomics parent/child sync without torn reads
-L4 error propagation bootstrap failures raise from Worker.init()
-L5 ChipWorker.bootstrap_context one-shot per-chip bring-up
-L6 Worker(chip_bootstrap_configs=[...]) Worker-level orchestration
+slot. The distributed bring-up stack this exercises, bottom up:
+
+- HCCL backend comm_init / comm_alloc_windows
+- ChipWorker.comm_* wrappers host-side bootstrap of the communicator
+- ChipBootstrapChannel chip child publishes SUCCESS to the parent
+- mailbox atomics parent/child sync without torn reads
+- error propagation bootstrap failures raise from Worker.init()
+- ChipWorker.bootstrap_context one-shot per-chip bring-up
+- Worker(chip_bootstrap_configs=...) Worker-level orchestration
+
+These are the components that compose the bring-up — not framework hierarchy
+levels (see docs/hierarchical_level_runtime.md for the L0–L6 topology).

Hardware only. The sim backend's CommRemotePtr uses a different addressing
scheme; sim support is out of scope for this demo.
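
For readers unfamiliar with the kernel's semantics, the docstring above can be modelled in plain Python. This is only a sketch of the described behaviour (each rank reads every rank's contribution and sums them into its own slot); `allreduce_sum` and the list-of-lists "windows" are hypothetical stand-ins, not the kernel from #307 and not the HCCL window API.

```python
from typing import List


def allreduce_sum(contributions: List[List[float]]) -> List[List[float]]:
    """Model of a sum-allreduce: one contribution list per rank.

    Returns one result list per rank; every rank ends up with the same
    element-wise sum. Hypothetical helper for illustration only.
    """
    nranks = len(contributions)
    if nranks == 0:
        return []
    length = len(contributions[0])
    if any(len(c) != length for c in contributions):
        raise ValueError("every rank must contribute the same number of elements")

    results = []
    for _rank in range(nranks):
        acc = [0.0] * length
        for peer in range(nranks):
            # Stands in for reading the peer's window through a remote pointer.
            remote = contributions[peer]
            for i, value in enumerate(remote):
                acc[i] += value
        results.append(acc)
    return results


if __name__ == "__main__":
    print(allreduce_sum([[1.0, 2.0], [10.0, 20.0]]))
```

The model reads from an immutable snapshot of the inputs, so it sidesteps the question of how the real kernel orders remote reads against local writes.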
4 changes: 2 additions & 2 deletions src/a2a3/platform/onboard/host/device_runner.cpp
@@ -313,13 +313,13 @@ int DeviceRunner::destroy_comm_stream(void *stream) {
if (stream == nullptr) return 0;

// Best-effort teardown. HcclBarrier submits async work on the stream;
-// if the caller never blocked for completion (or hit the L1a 507018
+// if the caller never blocked for completion (or hit the HCCL 507018
// barrier regression), aclrtDestroyStream will refuse with 507901
// ("stream still has pending tasks"). We try to drain first, then
// destroy anyway, and log failures without propagating them — leaking
// a stream at teardown is strictly better than failing the teardown
// itself, which would block device finalization. This matches the
-// cleanup behavior of the L1a C++ hardware UT.
+// cleanup behavior of the HCCL C++ hardware UT.
aclError sync_rc = aclrtSynchronizeStream(static_cast<aclrtStream>(stream));
if (sync_rc != ACL_SUCCESS) {
LOG_ERROR("aclrtSynchronizeStream during stream teardown failed: %d", static_cast<int>(sync_rc));
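
The best-effort policy described in the comment above (drain, destroy anyway, log but never propagate) can be sketched language-neutrally in Python. The `synchronize` and `destroy` callables are hypothetical stand-ins for the ACL calls and are assumed to return 0 on success; this is not the `DeviceRunner` implementation.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("teardown_sketch")


def best_effort_destroy_stream(
    stream: Optional[object],
    synchronize: Callable[[object], int],
    destroy: Callable[[object], int],
) -> int:
    """Drain then destroy a stream, logging failures instead of raising.

    Always returns 0 so that a failed stream teardown cannot block the
    rest of device finalization (assumption based on the comment above).
    """
    if stream is None:
        return 0

    sync_rc = synchronize(stream)
    if sync_rc != 0:
        logger.error("synchronize during stream teardown failed: %d", sync_rc)

    # Destroy even if the drain failed; leaking is preferred to aborting.
    destroy_rc = destroy(stream)
    if destroy_rc != 0:
        logger.error("destroy during stream teardown failed: %d", destroy_rc)

    return 0
```

The design choice is that the return code reflects "teardown may continue", not "the stream was released", so callers that need the latter would have to inspect the log.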
2 changes: 1 addition & 1 deletion src/common/platform_comm/comm_sim.cpp
@@ -19,7 +19,7 @@
* Shared memory layout (page-aligned header + per-rank windows):
* [ SharedHeader (4096 bytes) ][ rank-0 window ][ rank-1 window ] ...
*
- * L1a contract alignment notes:
+ * HCCL backend contract alignment notes:
* - comm_init takes (int rank, int nranks, void *stream, const char *rootinfo_path).
* The sim backend ignores `stream` (no ACL/device in simulation).
* - nranks is bounds-checked against COMM_MAX_RANK_NUM (64) because the
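
The layout comment in this hunk implies simple offset arithmetic for each rank's window. The sketch below assumes equally sized, contiguous windows with no extra padding and assumes valid `nranks` values are 1 through `COMM_MAX_RANK_NUM`; the helper names are hypothetical and not part of `comm_sim.cpp`.

```python
HEADER_BYTES = 4096        # SharedHeader size stated in the layout comment
COMM_MAX_RANK_NUM = 64     # upper bound stated in the contract notes


def _check(nranks: int, window_size: int) -> None:
    # Assumed validity rules; the real checks live in the sim backend.
    if not 0 < nranks <= COMM_MAX_RANK_NUM:
        raise ValueError(f"nranks={nranks} outside 1..{COMM_MAX_RANK_NUM}")
    if window_size <= 0:
        raise ValueError("window_size must be positive")


def window_offset(rank: int, nranks: int, window_size: int) -> int:
    """Byte offset of a rank's window from the start of the shared segment."""
    _check(nranks, window_size)
    if not 0 <= rank < nranks:
        raise ValueError(f"rank={rank} outside 0..{nranks - 1}")
    return HEADER_BYTES + rank * window_size


def segment_size(nranks: int, window_size: int) -> int:
    """Total bytes needed for the header plus all rank windows."""
    _check(nranks, window_size)
    return HEADER_BYTES + nranks * window_size
```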
2 changes: 1 addition & 1 deletion tests/ut/py/test_worker/test_bootstrap_channel.py
@@ -6,7 +6,7 @@
# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
# See LICENSE in the root of the software repository for the full text of the License.
# -----------------------------------------------------------------------------------------------------------
-"""Unit tests for ChipBootstrapChannel (L2 bootstrap mailbox).
+"""Unit tests for ChipBootstrapChannel (per-chip bootstrap mailbox).

All tests run without hardware — pure shared-memory / in-process.
"""
17 changes: 9 additions & 8 deletions tests/ut/py/test_worker/test_bootstrap_context_hw.py
@@ -7,9 +7,9 @@
# See LICENSE in the root of the software repository for the full text of the License.
# -----------------------------------------------------------------------------------------------------------
# ruff: noqa: PLC0415
-"""Hardware smoke test for ``ChipWorker.bootstrap_context`` (L5).
+"""Hardware smoke test for ``ChipWorker.bootstrap_context``.

-Drives the L5 one-shot bring-up against the real ``tensormap_and_ringbuffer``
+Drives the one-shot bring-up against the real ``tensormap_and_ringbuffer``
runtime on 2 Ascend devices. The critical assertions are:

1. ``bootstrap_context`` returns a non-null ``device_ctx`` and
@@ -18,13 +18,13 @@
3. A single ``ChipBufferSpec`` slices the window so
``buffer_ptrs[0] == local_window_base``.

-Deliberately **no** ``comm_barrier``. The paired L1b UT
+Deliberately **no** ``comm_barrier``. The paired ``comm_*`` UT
(``test_platform_comm.py``) already shows the known HCCL 507018 path fails
after ~52 s on some CANN builds; ``bootstrap_context`` does not issue a
barrier, so this test completes on any build. Cross-rank synchronization
between the two ranks is already enforced inside
-``HcclCommInitRootInfo`` / the L1a root-info handshake that ``comm_init``
-performs, so the non-barrier invariants above are enough to prove the L5
+``HcclCommInitRootInfo`` / the root-info handshake that ``comm_init``
+performs, so the non-barrier invariants above are enough to prove the
bring-up crossed both ranks.
"""

@@ -89,8 +89,9 @@ def _bootstrap_rank_entry( # noqa: PLR0913
result["actual_window_size"] = int(res.actual_window_size)
result["buffer_ptrs"] = list(res.buffer_ptrs)

-# Teardown mirrors the L6 ordering: shutdown_bootstrap (releases the
-# HCCL comm handle) then finalize (releases ACL / unloads runtime).
+# Teardown mirrors the Worker bootstrap loop ordering: shutdown_bootstrap
+# (releases the HCCL comm handle) then finalize (releases ACL / unloads
+# runtime).
worker.shutdown_bootstrap()
worker.finalize()
result["ok"] = True
@@ -173,7 +174,7 @@ def test_two_rank_bootstrap_context(st_device_ids):
assert r["actual_window_size"] >= window_size, (
f"rank {rank}: actual_window_size={r['actual_window_size']} < requested {window_size}"
)
-# 1:1 buffer-to-spec invariant — the contract L6's ChipContext relies on.
+# 1:1 buffer-to-spec invariant — the contract ChipContext relies on.
assert r["buffer_ptrs"] == [r["local_window_base"]], (
f"rank {rank}: buffer_ptrs={r['buffer_ptrs']} != [{r['local_window_base']}]"
)
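
The 1:1 buffer-to-spec invariant asserted above can be illustrated with a small slicing model. It assumes buffers are packed back to back from the window base with no alignment padding, which the diff does not confirm for more than one spec; `slice_window` and its byte-size list are hypothetical, not the `ChipBufferSpec` API.

```python
from typing import List


def slice_window(
    local_window_base: int,
    actual_window_size: int,
    spec_nbytes: List[int],
) -> List[int]:
    """Return one buffer pointer per spec, packed from the window base."""
    ptrs = []
    offset = 0
    for nbytes in spec_nbytes:
        if nbytes < 0:
            raise ValueError("buffer size must be non-negative")
        if offset + nbytes > actual_window_size:
            raise ValueError("buffer specs exceed the window size")
        ptrs.append(local_window_base + offset)
        offset += nbytes
    return ptrs


if __name__ == "__main__":
    # A single spec lands exactly at the window base.
    print(slice_window(0x1000, 4096, [4096]))
```

Under this model a single spec always yields `[local_window_base]`, which is the condition both the hardware and sim tests check.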
9 changes: 5 additions & 4 deletions tests/ut/py/test_worker/test_bootstrap_context_sim.py
@@ -7,7 +7,7 @@
# See LICENSE in the root of the software repository for the full text of the License.
# -----------------------------------------------------------------------------------------------------------
# ruff: noqa: PLC0415
-"""Simulation-backend tests for ``ChipWorker.bootstrap_context`` (L5).
+"""Simulation-backend tests for ``ChipWorker.bootstrap_context``.

These tests run without any Ascend NPU. They drive the sim backend of the
``tensormap_and_ringbuffer`` runtime, whose ``comm_*`` lifecycle is backed by
@@ -126,8 +126,9 @@ def _rank_entry( # noqa: PLR0913
worker.copy_from(ctypes.addressof(host_buf), res.buffer_ptrs[0], readback_nbytes)
result["readback"] = bytes(host_buf)

-# shutdown_bootstrap + finalize — matches the L6 teardown order
-# and leaves the sim shm segment clean for the next test.
+# shutdown_bootstrap + finalize — matches the Worker bootstrap
+# loop's teardown order and leaves the sim shm segment clean for
+# the next test.
worker.shutdown_bootstrap()
worker.finalize()
result["ok"] = True
@@ -227,7 +228,7 @@ def test_two_rank_no_host_inputs(self):
assert r is not None and r.get("ok"), f"rank {rank} failed: {r and r.get('error')}"
assert r["local_window_base"] != 0, f"rank {rank} local_window_base is 0"
assert r["actual_window_size"] >= 4096
-# Single buffer at window base — the 1:1 contract L6 relies on.
+# Single buffer at window base — the 1:1 contract ChipContext relies on.
assert r["buffer_ptrs"] == [r["local_window_base"]]


12 changes: 6 additions & 6 deletions tests/ut/py/test_worker/test_platform_comm.py
@@ -7,7 +7,7 @@
# See LICENSE in the root of the software repository for the full text of the License.
# -----------------------------------------------------------------------------------------------------------
# ruff: noqa: PLC0415
-"""Hardware UT for ChipWorker.comm_* wrappers (Python surface of the L1a HCCL backend).
+"""Hardware UT for ChipWorker.comm_* wrappers (Python surface of the HCCL backend).

This is the Python twin of tests/ut/cpp/test_hccl_comm.cpp. It drives the
full comm lifecycle entirely through ChipWorker's public Python API:
@@ -28,7 +28,7 @@
per rank. The parent only waits on exit codes plus a small result queue used
to surface CommContext field values.

-Known issue inherited from L1a (HCCL 507018): on certain CANN builds
+Known issue inherited from the HCCL backend (HCCL 507018): on certain CANN builds
`HcclBarrier` + `aclrtSynchronizeStream` report 507018 after ~52s of timeout.
That is a CANN-coupling bug tracked separately; this test treats a barrier
failure as a warning and still asserts the non-barrier invariants (init/alloc
@@ -143,10 +143,10 @@ def _rank_entry(
result["rank_id"] = int(host_ctx.rankId)
result["rank_num"] = int(host_ctx.rankNum)

-# Barrier. L1a observed CANN error 507018 here on some builds; that
-# bug is tracked independently. Surface the failure to the parent as
-# a warning and continue with teardown so the non-barrier invariants
-# above still gate this test.
+# Barrier. The C++ HCCL UT observed CANN error 507018 here on some
+# builds; that bug is tracked independently. Surface the failure to
+# the parent as a warning and continue with teardown so the
+# non-barrier invariants above still gate this test.
try:
worker.comm_barrier(comm)
result["barrier_ok"] = True
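
The hunk is cut off before the `except` branch, so the following is only a guess at the "barrier failure as warning" pattern the comment describes. The `barrier_warning` key and the `run_barrier_as_warning` helper are hypothetical; only `barrier_ok` appears in the diff.

```python
from typing import Any, Callable, Dict


def run_barrier_as_warning(
    barrier: Callable[[], None],
    result: Dict[str, Any],
) -> Dict[str, Any]:
    """Run a barrier and record failure as a warning instead of raising.

    The caller can then continue with teardown and still assert the
    non-barrier invariants it collected earlier.
    """
    try:
        barrier()
        result["barrier_ok"] = True
    except Exception as exc:  # broad on purpose: any barrier failure is a warning
        result["barrier_ok"] = False
        result["barrier_warning"] = f"{type(exc).__name__}: {exc}"
    return result
```

A parent process using this shape would gate the test on fields such as the rank id and rank count, and only report `barrier_warning` rather than fail on it.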