Add: Worker-level chip bootstrap orchestration for distributed L3 #613
Open
ChaoWao wants to merge 1 commit into hw-native-sys:main
Code Review
This pull request introduces worker-level chip bootstrap orchestration (L6). It adds the ChipContext dataclass and updates the Worker class to support asynchronous bootstrap of chip children via shared-memory mailboxes. Key changes include a new child process loop that executes bootstrap_context, a timeout-based polling mechanism in the parent to collect results, and enhanced cleanup logic to prevent shared-memory leaks on failure. New hardware and simulation tests are also provided. Feedback is provided regarding a potential silent truncation issue when zipping buffer pointers, suggesting an explicit length check to ensure data integrity.
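The length check suggested in the review could look roughly like the sketch below. `BufferSpec` and `build_buffer_ptrs` are hypothetical names used only for illustration; the PR's actual config type and helper are not shown on this page.

```python
from dataclasses import dataclass


@dataclass
class BufferSpec:
    # Hypothetical stand-in for one entry of cfg.buffers; only the name matters here.
    name: str


def build_buffer_ptrs(buffers, ptrs):
    """Map buffer names to pointers, refusing to truncate on a count mismatch."""
    # Plain zip() stops at the shorter input, so check the lengths first.
    if len(buffers) != len(ptrs):
        raise RuntimeError(
            f"buffer count mismatch: parent configured {len(buffers)} buffers, "
            f"child reported {len(ptrs)} pointers"
        )
    return {spec.name: ptr for spec, ptr in zip(buffers, ptrs)}
```

On Python 3.10+, `zip(..., strict=True)` also rejects unequal lengths, but it raises a generic `ValueError`; an explicit check lets the message name both counts.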
- Add ChipContext dataclass in task_interface (device_id/rank/nranks + device_ctx, local_window_base, actual_window_size, buffer_ptrs: dict by name); exposed to L3+ orch code after a successful bring-up
- Wire Worker(level>=3, chip_bootstrap_configs=[...]) so each chip child runs ChipWorker.bootstrap_context before entering the main task / control loop; parent blocks on a per-chip ChipBootstrapChannel until every chip reports SUCCESS, assembles ChipContexts, and fails fast on the first ERROR (best-effort SIGKILL + waitpid for the rest, shms unlinked so init() raises cleanly without leaking state)
- Explicit length check before zipping cfg.buffers with the channel's buffer_ptrs, so a parent/child buffer-count disagreement raises a descriptive RuntimeError instead of silently producing a truncated buffer_ptrs dict in the ChipContext
- Bootstrap mailboxes are allocated pre-fork (SharedMemory zero-fills -> IDLE) and unlinked *after* chip pids are reaped, since chip children touch the channel inside finalize()
- Drop stale split-step labels (L2/L5/L6) from new code and from prior chip_bootstrap docstrings since they collide with the runtime Level 0-6 hierarchy documented in docs/hierarchical_level_runtime.md
- Add sim UT (happy path + error path + validation + chip_contexts-before-init guard) and hardware UT (2-card, no comm_barrier so the HCCL 507018 known-issue stays off the critical path)
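The mailbox lifecycle in the commit message (allocate, start at IDLE, let the child publish a state, unlink last) can be illustrated with the standard library alone. This is a single-process sketch, not the PR's `ChipBootstrapChannel`: the state codes, the 4-byte state field at offset 0, and the helper names are assumptions, and the "child" is simulated by a second handle attached by name instead of a forked process.

```python
import struct
from multiprocessing import shared_memory

MAILBOX_SIZE = 4096              # size quoted in the PR description
IDLE, SUCCESS, ERROR = 0, 1, 2   # hypothetical codes; the PR only implies IDLE == 0


def read_state(shm):
    # Assumed layout: a little-endian int32 state field at offset 0.
    return struct.unpack_from("<i", shm.buf, 0)[0]


def write_state(shm, state):
    struct.pack_into("<i", shm.buf, 0, state)


def demo():
    # Parent side: allocate the mailbox before any child exists.
    parent = shared_memory.SharedMemory(create=True, size=MAILBOX_SIZE)
    try:
        initial = read_state(parent)  # fresh segment, expected to be zero-filled
        # Simulated child side: attach by name and publish a result.
        child = shared_memory.SharedMemory(name=parent.name)
        try:
            write_state(child, SUCCESS)
        finally:
            child.close()
        final = read_state(parent)
    finally:
        # Unlink only after every user of the mailbox is done with it.
        parent.close()
        parent.unlink()
    return initial, final
```

On platforms where a freshly created shared-memory segment is zero-filled, `demo()` should return `(IDLE, SUCCESS)`. In the real flow the unlink would follow `waitpid` on the chip pids, as the commit message describes.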
Summary
Wires `ChipWorker.bootstrap_context` into the `Worker` factory so an L3 `Worker(level>=3, chip_bootstrap_configs=[...])` brings up every chip child's communicator during `init()` and surfaces a `ChipContext` list to orch code before the first `run()`.

- `ChipContext` dataclass in `task_interface`: `device_id / rank / nranks / device_ctx / local_window_base / actual_window_size / buffer_ptrs: dict[str, int]`. The per-buffer dict is built by zipping `cfg.buffers` with the result's `buffer_ptrs`, so orch code addresses a named window slice without tracking list indices. A length check before the zip raises `RuntimeError` on a parent/child buffer-count mismatch instead of silently truncating.
- `ChipBootstrapChannel` mailbox (4096 B shared-memory, zero-filled so state starts IDLE) allocated pre-fork. Parent polls each channel with `time.sleep(0.001)` + 120 s soft timeout; on the first `ERROR` raises `RuntimeError(f"chip {idx} bootstrap failed: {channel.error_message}")` and best-effort SIGKILLs every forked child + unlinks every shm so `init()` raises cleanly without leaking state. `chip_contexts` is a property that raises before `init()`.
- `_chip_process_loop_with_bootstrap` runs `bootstrap_context` first (channel publishes `SUCCESS` / `ERROR`), then enters the same task/control poll loop as `_chip_process_loop`. `try/finally` runs `shutdown_bootstrap` then `finalize` on `SHUTDOWN`. Bootstrap failure returns via `os._exit(0)` so the parent's `waitpid` isn't confused by a non-zero exit code layered on top of the channel's error.
- `_worker.close()` → `SHUTDOWN` → `waitpid` → unlink sub/chip/next-level mailboxes → bootstrap mailboxes unlinked last, because chip children touch their `ChipBootstrapChannel` inside `shutdown_bootstrap()` + `finalize()`.
- `_chip_process_loop` and `_Worker` scheduler wiring are untouched; the bootstrap path is gated on a non-`None` `chip_bootstrap_configs` argument and runs eagerly at `init()` time instead of the usual lazy `_start_hierarchical()` on first `run()`.

Does not extend to level-4+ recursive `Worker` children: the `_next_level_workers` fork path is unchanged; adding distributed bring-up for nested Workers is a follow-up.

Testing

- `tests/ut/py/test_worker/test_worker_distributed_sim.py`: happy path + error path (bogus placement triggers `RuntimeError`) + `chip_contexts`-before-init guard + `__init__` validation (level<3 reject, length-mismatch reject).
- `tests/ut/py/test_worker/test_worker_distributed_hw.py`: 2-card hardware smoke, drives `Worker(level=3, chip_bootstrap_configs=[...])` end-to-end, asserts each rank's `device_ctx != 0`, `local_window_base != 0`, `actual_window_size >= requested`, and `buffer_ptrs == {"x": local_window_base}`. No `comm_barrier`, so HCCL 507018 stays off the critical path. Lives under `tests/ut` so the `ut-a2a3` job picks it up without xdist's per-worker device-slicing (which would break a 2-device request under `tests/st`).
- `pytest tests/ut/py/test_worker` with `chip_bootstrap_configs=None` paths: 59 green, no regression.

Ref: #571 (split), builds on #608, #610.
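For readers who want a concrete picture of the parent-side wait described in the summary, here is a minimal sketch of a poll loop with a soft timeout and fail-fast on the first error. `FakeChannel`, `wait_for_bootstrap`, and the state codes are hypothetical stand-ins; the real `ChipBootstrapChannel` API is not shown on this page, and raising `TimeoutError` on the deadline is an assumption.

```python
import time

IDLE, SUCCESS, ERROR = 0, 1, 2  # hypothetical state codes


class FakeChannel:
    """Hypothetical stand-in for a per-chip bootstrap mailbox."""

    def __init__(self, state=IDLE, error_message=""):
        self.state = state
        self.error_message = error_message


def wait_for_bootstrap(channels, timeout_s=120.0, poll_s=0.001):
    """Block until every channel reports SUCCESS; raise on the first ERROR."""
    deadline = time.monotonic() + timeout_s
    pending = set(range(len(channels)))
    while pending:
        for idx in sorted(pending):  # sorted() copies, so discarding below is safe
            channel = channels[idx]
            if channel.state == ERROR:
                raise RuntimeError(
                    f"chip {idx} bootstrap failed: {channel.error_message}"
                )
            if channel.state == SUCCESS:
                pending.discard(idx)
        if not pending:
            break
        if time.monotonic() > deadline:
            # Assumption: the exception type used on timeout is not stated in the PR.
            raise TimeoutError(f"chips {sorted(pending)} did not finish bootstrap")
        time.sleep(poll_s)
```

In the flow the PR describes, the caller would catch either exception, SIGKILL and reap the forked children, unlink the shared-memory mailboxes, and then re-raise so `init()` fails without leaking state.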