From ea53c23a882c031abef316d01d93006fa5ce4fa2 Mon Sep 17 00:00:00 2001
From: wcwxy <26245345+ChaoWao@users.noreply.github.com>
Date: Wed, 22 Apr 2026 11:37:27 +0800
Subject: [PATCH] Docs: drop PR-split L1a/L1b/L2/L5/L6 tags from distributed bring-up code
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The distributed bring-up series (#608 / #610 / #613) used `(L2)` /
`(L5)` / `(L6)` as split-step labels in PR titles and commit messages.
These tags leaked into docstrings, comments, and the
allreduce_distributed example where they collide with the framework
hierarchy in `docs/hierarchical_level_runtime.md` — which also calls
its levels L0–L6, but means CORE / DIE / CHIP / HOST / POD / SuperNode
/ Cluster. Readers who encounter "L6 teardown order" or "the L1a
root-info handshake" naturally reach for the topology definition and
find the two numberings disagree (framework L6 = cluster, split-step
L6 = `Worker` bootstrap loop).

PR #613's body already announced the cleanup for new and touched code;
this commit finishes it across the files that were left behind:

- `examples/workers/l3/allreduce_distributed/main.py`: replace the fake
  "L1a..L6 stack" table with an unlabeled component list and a pointer
  to the real topology doc.
- `tests/ut/py/test_worker/test_bootstrap_context_{sim,hw}.py`: drop
  `(L5)` from the module titles and rewrite `L6 teardown order` /
  `L6's ChipContext` / `L5 one-shot bring-up` /
  `L1a root-info handshake` / `paired L1b UT` to name the thing
  (`Worker bootstrap loop`, `ChipContext`, `bring-up`, etc.).
- `tests/ut/py/test_worker/test_bootstrap_channel.py`:
  `(L2 bootstrap mailbox)` → `(per-chip bootstrap mailbox)` — the old
  tag was ambiguous with framework L2 (CHIP).
- `tests/ut/py/test_worker/test_platform_comm.py`: `L1a HCCL backend` →
  `HCCL backend`; `Known issue inherited from L1a` →
  `... from the HCCL backend`; `L1a observed CANN error 507018` →
  `The C++ HCCL UT observed ...`.
  The two `L2-boundary contract` references are left alone — those
  correctly refer to the framework L2 boundary documented in
  `hierarchical_level_runtime.md`.
- `src/common/platform_comm/comm_sim.cpp`:
  `L1a contract alignment notes` →
  `HCCL backend contract alignment notes`.
- `src/a2a3/platform/onboard/host/device_runner.cpp`: `L1a 507018` →
  `HCCL 507018`; `L1a C++ hardware UT` → `HCCL C++ hardware UT`.

No runtime behavior change — comments and docstrings only.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../workers/l3/allreduce_distributed/main.py  | 21 +++++++++++--------
 .../platform/onboard/host/device_runner.cpp   |  4 ++--
 src/common/platform_comm/comm_sim.cpp         |  2 +-
 .../py/test_worker/test_bootstrap_channel.py  |  2 +-
 .../test_worker/test_bootstrap_context_hw.py  | 17 ++++++++-------
 .../test_worker/test_bootstrap_context_sim.py |  9 ++++----
 tests/ut/py/test_worker/test_platform_comm.py | 12 +++++------
 7 files changed, 36 insertions(+), 31 deletions(-)

diff --git a/examples/workers/l3/allreduce_distributed/main.py b/examples/workers/l3/allreduce_distributed/main.py
index 8cb17a81c..d0e960dd9 100644
--- a/examples/workers/l3/allreduce_distributed/main.py
+++ b/examples/workers/l3/allreduce_distributed/main.py
@@ -11,15 +11,18 @@
 
 The kernel (ported verbatim from #307) reads every rank's contribution out of
 the HCCL window via CommRemotePtr and sums them into each rank's own window
-slot. This example exercises the full L1a..L6 stack:
-
-    L1a  HCCL backend                          comm_init / comm_alloc_windows
-    L1b  ChipWorker.comm_* wrappers            host-side bootstrap of the communicator
-    L2   ChipBootstrapChannel                  chip child publishes SUCCESS to the parent
-    L3   mailbox atomics                       parent/child sync without torn reads
-    L4   error propagation                     bootstrap failures raise from Worker.init()
-    L5   ChipWorker.bootstrap_context          one-shot per-chip bring-up
-    L6   Worker(chip_bootstrap_configs=[...])  Worker-level orchestration
+slot. The distributed bring-up stack this exercises, bottom up:
+
+    - HCCL backend                         comm_init / comm_alloc_windows
+    - ChipWorker.comm_* wrappers           host-side bootstrap of the communicator
+    - ChipBootstrapChannel                 chip child publishes SUCCESS to the parent
+    - mailbox atomics                      parent/child sync without torn reads
+    - error propagation                    bootstrap failures raise from Worker.init()
+    - ChipWorker.bootstrap_context         one-shot per-chip bring-up
+    - Worker(chip_bootstrap_configs=...)   Worker-level orchestration
+
+These are the components that compose the bring-up — not framework hierarchy
+levels (see docs/hierarchical_level_runtime.md for the L0–L6 topology).
 
 Hardware only. The sim backend's CommRemotePtr uses a different addressing
 scheme; sim support is out of scope for this demo.
diff --git a/src/a2a3/platform/onboard/host/device_runner.cpp b/src/a2a3/platform/onboard/host/device_runner.cpp
index ed348d818..e3296f023 100644
--- a/src/a2a3/platform/onboard/host/device_runner.cpp
+++ b/src/a2a3/platform/onboard/host/device_runner.cpp
@@ -313,13 +313,13 @@ int DeviceRunner::destroy_comm_stream(void *stream) {
     if (stream == nullptr) return 0;
 
     // Best-effort teardown. HcclBarrier submits async work on the stream;
-    // if the caller never blocked for completion (or hit the L1a 507018
+    // if the caller never blocked for completion (or hit the HCCL 507018
     // barrier regression), aclrtDestroyStream will refuse with 507901
     // ("stream still has pending tasks"). We try to drain first, then
     // destroy anyway, and log failures without propagating them — leaking
     // a stream at teardown is strictly better than failing the teardown
     // itself, which would block device finalization. This matches the
-    // cleanup behavior of the L1a C++ hardware UT.
+    // cleanup behavior of the HCCL C++ hardware UT.
     aclError sync_rc = aclrtSynchronizeStream(static_cast<aclrtStream>(stream));
     if (sync_rc != ACL_SUCCESS) {
         LOG_ERROR("aclrtSynchronizeStream during stream teardown failed: %d", static_cast<int>(sync_rc));
diff --git a/src/common/platform_comm/comm_sim.cpp b/src/common/platform_comm/comm_sim.cpp
index be26259ab..fdbc1bc92 100644
--- a/src/common/platform_comm/comm_sim.cpp
+++ b/src/common/platform_comm/comm_sim.cpp
@@ -19,7 +19,7 @@
  * Shared memory layout (page-aligned header + per-rank windows):
  *   [ SharedHeader (4096 bytes) ][ rank-0 window ][ rank-1 window ] ...
  *
- * L1a contract alignment notes:
+ * HCCL backend contract alignment notes:
  *   - comm_init takes (int rank, int nranks, void *stream, const char *rootinfo_path).
  *     The sim backend ignores `stream` (no ACL/device in simulation).
  *   - nranks is bounds-checked against COMM_MAX_RANK_NUM (64) because the
diff --git a/tests/ut/py/test_worker/test_bootstrap_channel.py b/tests/ut/py/test_worker/test_bootstrap_channel.py
index 1712c14c7..e4ed628a2 100644
--- a/tests/ut/py/test_worker/test_bootstrap_channel.py
+++ b/tests/ut/py/test_worker/test_bootstrap_channel.py
@@ -6,7 +6,7 @@
 # INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
 # See LICENSE in the root of the software repository for the full text of the License.
 # -----------------------------------------------------------------------------------------------------------
-"""Unit tests for ChipBootstrapChannel (L2 bootstrap mailbox).
+"""Unit tests for ChipBootstrapChannel (per-chip bootstrap mailbox).
 
 All tests run without hardware — pure shared-memory / in-process.
 """
diff --git a/tests/ut/py/test_worker/test_bootstrap_context_hw.py b/tests/ut/py/test_worker/test_bootstrap_context_hw.py
index 4c20a60c6..c444be645 100644
--- a/tests/ut/py/test_worker/test_bootstrap_context_hw.py
+++ b/tests/ut/py/test_worker/test_bootstrap_context_hw.py
@@ -7,9 +7,9 @@
 # See LICENSE in the root of the software repository for the full text of the License.
 # -----------------------------------------------------------------------------------------------------------
 # ruff: noqa: PLC0415
-"""Hardware smoke test for ``ChipWorker.bootstrap_context`` (L5).
+"""Hardware smoke test for ``ChipWorker.bootstrap_context``.
 
-Drives the L5 one-shot bring-up against the real ``tensormap_and_ringbuffer``
+Drives the one-shot bring-up against the real ``tensormap_and_ringbuffer``
 runtime on 2 Ascend devices. The critical assertions are:
 
 1. ``bootstrap_context`` returns a non-null ``device_ctx`` and
@@ -18,13 +18,13 @@
 3. A single ``ChipBufferSpec`` slices the window so
    ``buffer_ptrs[0] == local_window_base``.
 
-Deliberately **no** ``comm_barrier``. The paired L1b UT
+Deliberately **no** ``comm_barrier``. The paired ``comm_*`` UT
 (``test_platform_comm.py``) already shows the known HCCL 507018 path fails
 after ~52 s on some CANN builds; ``bootstrap_context`` does not issue a
 barrier, so this test completes on any build. Cross-rank synchronization
 between the two ranks is already enforced inside
-``HcclCommInitRootInfo`` / the L1a root-info handshake that ``comm_init``
-performs, so the non-barrier invariants above are enough to prove the L5
+``HcclCommInitRootInfo`` / the root-info handshake that ``comm_init``
+performs, so the non-barrier invariants above are enough to prove the
 bring-up crossed both ranks.
 """
 
@@ -89,8 +89,9 @@ def _bootstrap_rank_entry(  # noqa: PLR0913
         result["actual_window_size"] = int(res.actual_window_size)
         result["buffer_ptrs"] = list(res.buffer_ptrs)
 
-        # Teardown mirrors the L6 ordering: shutdown_bootstrap (releases the
-        # HCCL comm handle) then finalize (releases ACL / unloads runtime).
+        # Teardown mirrors the Worker bootstrap loop ordering: shutdown_bootstrap
+        # (releases the HCCL comm handle) then finalize (releases ACL / unloads
+        # runtime).
         worker.shutdown_bootstrap()
         worker.finalize()
         result["ok"] = True
@@ -173,7 +174,7 @@ def test_two_rank_bootstrap_context(st_device_ids):
         assert r["actual_window_size"] >= window_size, (
             f"rank {rank}: actual_window_size={r['actual_window_size']} < requested {window_size}"
         )
-        # 1:1 buffer-to-spec invariant — the contract L6's ChipContext relies on.
+        # 1:1 buffer-to-spec invariant — the contract ChipContext relies on.
         assert r["buffer_ptrs"] == [r["local_window_base"]], (
             f"rank {rank}: buffer_ptrs={r['buffer_ptrs']} != [{r['local_window_base']}]"
         )
diff --git a/tests/ut/py/test_worker/test_bootstrap_context_sim.py b/tests/ut/py/test_worker/test_bootstrap_context_sim.py
index 04008dbf5..fd766e896 100644
--- a/tests/ut/py/test_worker/test_bootstrap_context_sim.py
+++ b/tests/ut/py/test_worker/test_bootstrap_context_sim.py
@@ -7,7 +7,7 @@
 # See LICENSE in the root of the software repository for the full text of the License.
 # -----------------------------------------------------------------------------------------------------------
 # ruff: noqa: PLC0415
-"""Simulation-backend tests for ``ChipWorker.bootstrap_context`` (L5).
+"""Simulation-backend tests for ``ChipWorker.bootstrap_context``.
 
 These tests run without any Ascend NPU. They drive the sim backend of the
 ``tensormap_and_ringbuffer`` runtime, whose ``comm_*`` lifecycle is backed by
@@ -126,8 +126,9 @@ def _rank_entry(  # noqa: PLR0913
             worker.copy_from(ctypes.addressof(host_buf), res.buffer_ptrs[0], readback_nbytes)
             result["readback"] = bytes(host_buf)
 
-        # shutdown_bootstrap + finalize — matches the L6 teardown order
-        # and leaves the sim shm segment clean for the next test.
+        # shutdown_bootstrap + finalize — matches the Worker bootstrap
+        # loop's teardown order and leaves the sim shm segment clean for
+        # the next test.
         worker.shutdown_bootstrap()
         worker.finalize()
         result["ok"] = True
@@ -227,7 +228,7 @@ def test_two_rank_no_host_inputs(self):
             assert r is not None and r.get("ok"), f"rank {rank} failed: {r and r.get('error')}"
             assert r["local_window_base"] != 0, f"rank {rank} local_window_base is 0"
             assert r["actual_window_size"] >= 4096
-            # Single buffer at window base — the 1:1 contract L6 relies on.
+            # Single buffer at window base — the 1:1 contract ChipContext relies on.
             assert r["buffer_ptrs"] == [r["local_window_base"]]
 
 
diff --git a/tests/ut/py/test_worker/test_platform_comm.py b/tests/ut/py/test_worker/test_platform_comm.py
index 9f13441c4..02fca87fe 100644
--- a/tests/ut/py/test_worker/test_platform_comm.py
+++ b/tests/ut/py/test_worker/test_platform_comm.py
@@ -7,7 +7,7 @@
 # See LICENSE in the root of the software repository for the full text of the License.
 # -----------------------------------------------------------------------------------------------------------
 # ruff: noqa: PLC0415
-"""Hardware UT for ChipWorker.comm_* wrappers (Python surface of the L1a HCCL backend).
+"""Hardware UT for ChipWorker.comm_* wrappers (Python surface of the HCCL backend).
 
 This is the Python twin of tests/ut/cpp/test_hccl_comm.cpp. It drives the full
 comm lifecycle entirely through ChipWorker's public Python API:
@@ -28,7 +28,7 @@
 per rank. The parent only waits on exit codes plus a small result queue used to
 surface CommContext field values.
 
-Known issue inherited from L1a (HCCL 507018): on certain CANN builds
+Known issue inherited from the HCCL backend (HCCL 507018): on certain CANN builds
 `HcclBarrier` + `aclrtSynchronizeStream` report 507018 after ~52s of timeout.
 That is a CANN-coupling bug tracked separately; this test treats a barrier
 failure as a warning and still asserts the non-barrier invariants (init/alloc
@@ -143,10 +143,10 @@ def _rank_entry(
         result["rank_id"] = int(host_ctx.rankId)
         result["rank_num"] = int(host_ctx.rankNum)
 
-        # Barrier. L1a observed CANN error 507018 here on some builds; that
-        # bug is tracked independently. Surface the failure to the parent as
-        # a warning and continue with teardown so the non-barrier invariants
-        # above still gate this test.
+        # Barrier. The C++ HCCL UT observed CANN error 507018 here on some
+        # builds; that bug is tracked independently. Surface the failure to
+        # the parent as a warning and continue with teardown so the
+        # non-barrier invariants above still gate this test.
         try:
             worker.comm_barrier(comm)
             result["barrier_ok"] = True