CollectiveX: experimental cross-vendor collective/EP benchmark #1896

Open
Oseltamivir wants to merge 42 commits into main from collectivex
Conversation

Oseltamivir commented Jun 23, 2026

Collaborator

Adds CollectiveX under experimental/CollectiveX/ — a cross-vendor collective / expert-parallel benchmark — plus an orchestration-only workflow.

What it adds

  • Per-SKU launch adapters (launchers/launch_<sku>.sh, the launch_${RUNNER_NAME%%_*}.sh convention) that run any benchmark via a CX_BENCH selector (nccl|deepep|all) through a shared launchers/run_in_container.sh.
  • Benchmarks: run_nccl.py (stock nccl-tests → parsed flat JSON), run_deepep.py (DeepEP dispatch/combine, normal mode), env_capture.py (Layer-0 provenance), plot.py. Every result is correctness-gated and carries a topology-aware comparison_key.
  • Single multi-arch, digest-pinned container for all NVIDIA SKUs (lmsysorg/sglang@sha256:4219…, amd64+arm64); DeepEP via rebuild-deepep. See CONTAINERS.md.
  • .github/workflows/collectivex-experimental.yml: push to collectivex (paths experimental/CollectiveX/**) → GB200 NCCL smoke; workflow_dispatch → chosen sku+benchmark (B200, DeepEP, larger sweeps). Logic stays under experimental/.
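The `launch_${RUNNER_NAME%%_*}.sh` convention strips everything from the first underscore onward in the runner name. A minimal Python sketch of the same resolution, for illustration only (`launcher_for` is a hypothetical helper, not something this PR adds):

```python
def launcher_for(runner_name: str) -> str:
    """Mirror bash's ${RUNNER_NAME%%_*}: drop the longest suffix that starts at an underscore."""
    sku = runner_name.split("_", 1)[0]  # a name with no underscore is kept whole, as in bash
    return f"launch_{sku}.sh"


print(launcher_for("h100-dgxc-slurm_03"))  # launch_h100-dgxc-slurm.sh
```

The runner suffix (`_03` here) is an example value; only the part before the first underscore selects the adapter.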

Validated on hardware

  • NCCL primitives: B200 (8× NVLink island) + GB200 (4× NVL72 MNNVL), 4 ops, correctness-passed, topology-keyed distinctly.
  • DeepEP dispatch/combine on GB200: correctness-gated (token conservation + combine vs DeepEP's own reference), ~154 µs roundtrip, 1.66M tok/s.
  • Local: shellcheck/bash -n, py_compile, actionlint, parser fixtures.

Notes / deferred

  • Result JSONs are gitignored (captured env embeds hostnames/UUIDs); CI uploads them as workflow artifacts. Headline numbers are summarized in CONTAINERS.md.
  • Importing the exact multi-arch digest needs the runner's registry creds (validated on the pre-staged v0.5.11-cu130).
  • Precision axes (NVFP4/MXFP8/…), low-latency EP, MoRI, EPLB, multinode DeepEP, and other collectives are captured as roadmap in plan.md, not built.

Note

Low Risk
Changes are isolated to experimental/CollectiveX/ and a read-only workflow; no production benchmark matrix or serving launchers are modified. Risk is mainly operational (self-hosted GPU time, Slurm/enroot failures) rather than app or security impact.

Overview
Introduces CollectiveX under experimental/CollectiveX/ — an experimental cross-vendor collective and MoE EP benchmark — plus orchestration-only .github/workflows/collectivex-experimental.yml. Production serving paths are untouched.

Benchmark stack: run_nccl.py wraps nccl-tests/rccl-tests into provenance-tagged JSON; run_deepep.py and run_mori.py add correctness-gated DeepEP and AMD MoRI dispatch/combine; env_capture.py, summarize.py, and plot.py handle environment capture, CI summaries, and plots. Results use topology-aware comparison_keys so unlike fabrics are not merged blindly.

Execution: Per-SKU Slurm launchers (launch_b200-dgxc.sh, launch_gb200-nv.sh, launch_b200-dgxc-slurm.sh, launch_mi355x-amds.sh) follow the same launch_${RUNNER_NAME%%_*}.sh pattern as serving, with shared common.sh (enroot squash by tag, optional CX_STAGE_DIR rsync, in-container nccl/rccl builds). CX_BENCH selects nccl, deepep, mori, or all via run_in_container.sh.

CI: Push to collectivex runs MI355X MoRI on mi355x runners; workflow_dispatch picks SKU and benchmark (GB200/B200 NCCL, DeepEP, etc.), writes markdown to the job summary, and uploads gitignored results/*.json as artifacts.

Reviewed by Cursor Bugbot for commit 871086d.

Per-SKU launch adapters (launch_<sku>.sh) that run any benchmark via a CX_BENCH selector through a shared run_in_container.sh; multi-arch digest-pinned sglang container; NCCL-primitive + DeepEP dispatch/combine benchmarks with provenance + correctness gating; and an on:push workflow (GB200 NCCL smoke; workflow_dispatch for B200/DeepEP/larger sweeps).

Validated on hardware: NCCL primitives on B200 (8x NVLink) and GB200 (4x NVL72 MNNVL); DeepEP dispatch/combine on GB200 (correctness-gated).
Comment thread experimental/CollectiveX/launchers/run_in_container.sh Outdated
Comment thread .github/workflows/collectivex-experimental.yml
Comment thread experimental/CollectiveX/run_deepep.py Outdated
Comment thread experimental/CollectiveX/plot.py Fixed
Comment thread experimental/CollectiveX/run_deepep.py Fixed
The GB200 on:push smoke hung 25 min in enroot import: a bare digest ref (repo@sha256:) can't form an anonymous Docker Hub token scope, so enroot prompted for a password and blocked in non-interactive CI. Import by the multi-arch TAG instead (anonymous auth works, same as the serving launchers) and add </dev/null so a missing token fails fast rather than hanging.
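The `</dev/null` guard works because a tool that prompts for credentials sees end-of-file immediately instead of waiting on a terminal. A rough Python equivalent, assuming the caller wants both a closed stdin and an upper time bound (`run_noninteractive` and the stand-in command are hypothetical, not the launcher's code):

```python
import subprocess
import sys


def run_noninteractive(cmd, timeout_s=900):
    """Run cmd with stdin closed so a prompt sees EOF instead of blocking."""
    proc = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,  # same effect as the shell's </dev/null
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired if the command hangs anyway
    )
    return proc.returncode, proc.stdout


# Stand-in for a tool that prompts for a password: it reads one line from stdin
# and exits 3 when nothing arrives.
prompt_like = [sys.executable, "-c", "import sys; sys.exit(0 if sys.stdin.readline() else 3)"]
code, _ = run_noninteractive(prompt_like, timeout_s=30)
print(code)  # 3: the read hit EOF at once, so the command failed fast
```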

Use v0.5.11-cu130 (multi-arch amd64+arm64, index sha256:061fb71f…): v0.5.12-cu130's 62 layers overflow enroot's overlay-based squash creation on these nodes (failed to mount overlay … Invalid argument). v0.5.11-cu130 imports cleanly and is pre-staged on GB200.
Comment thread .github/workflows/collectivex-experimental.yml
Comment thread experimental/CollectiveX/run_nccl.py Outdated
On the GB200 Actions path, CX_STAGE_DIR makes the launcher rsync the tree to compute-visible Lustre and the container writes results/ there; upload-artifact reads the checkout's results/ (empty), so the green smoke produced no artifact. Add cx_collect_results to copy result JSONs from the stage dir back to the checkout after the run (no-op when no staging was used).
Comment thread experimental/CollectiveX/run_deepep.py Outdated
Comment thread experimental/CollectiveX/launchers/launch_gb200-nv.sh Outdated
Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate. Fix review findings: benchmark failures/skipped-deepep now fail the job instead of reporting green (#1); DeepEP nodes from SLURM_NNODES not world_size//8 (#3); apply Buffer.set_num_sms so num_comm_sms is real (#8); nccl-tests -c 1 with a missing check footer is now invalid (#7); use context managers for file reads (#4,#5); launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9); trim workflow_dispatch sku options to launcher-backed pools (#2). Artifact-path finding (#6) already fixed via cx_collect_results.
Comment thread experimental/CollectiveX/run_deepep.py Outdated
    is_token_in_rank=is_token_in_rank,
    num_tokens_per_expert=num_tokens_per_expert,
)
combined_x, _, _ = buffer.combine(recv_x, handle, topk_weights=recv_topk_weights)

Dispatch dtype not applied

Medium Severity

The --dispatch-dtype / CX_DISPATCH_DTYPE value is stored in result metadata but never used when building inputs or calling buffer.dispatch. Runs always use bfloat16 token tensors regardless of fp8 vs bf16, so provenance and comparison keys can describe a different shape than what was measured.

Reviewed by Cursor Bugbot for commit b384171.

summarize.py --markdown emits GitHub-flavored markdown tables (NCCL + DeepEP); a per-job 'Results summary' workflow step appends it to $GITHUB_STEP_SUMMARY so the run page shows a rendered table (per the GitHub job-summaries feature). Plain-text mode still drives the in-container result gate.
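A minimal sketch of how a results table could be rendered as GitHub-flavored markdown and appended to the job summary. The column names and row shape are assumptions for illustration, and `to_markdown_table` is not the actual summarize.py code; the example value is the GB200 all_reduce peak quoted elsewhere in this PR.

```python
import os


def to_markdown_table(rows, columns):
    """Render a list of dicts as a GitHub-flavored markdown table."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


# Hypothetical row shape; the real result schema may differ.
rows = [{"op": "all_reduce", "peak_busbw_GBps": 822.8}]
table = to_markdown_table(rows, ["op", "peak_busbw_GBps"])
print(table)

# GitHub Actions exposes the summary file path in GITHUB_STEP_SUMMARY; append when present.
summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
if summary_path:
    with open(summary_path, "a", encoding="utf-8") as fh:
        fh.write(table + "\n")
```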
--timestamp "$TS" || cx_log "WARN: parse $op failed"
done

cx_log "done — JSON artifacts under $CX_DIR/results/"

Multinode launcher ignores failures

High Severity

The B200 multinode adapter logs warnings when srun or run_nccl.py fail but always exits successfully. Unlike run_in_container.sh, it never runs summarize.py as a non-zero gate, so workflow_dispatch on b200-multinode can finish green with no valid NCCL results.

Reviewed by Cursor Bugbot for commit f48daed.

        run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"
      - name: Results summary
        if: always()
        run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY"

Workflow skips result failure gate

Medium Severity

Both jobs only run summarize.py --markdown, which is documented to always exit 0. The workflow never runs the plain summarize.py gate on the checkout’s results/ after launch, so a successful Launch step can stay green when the checkout has no valid JSON (e.g. staged runs where copy-back failed).

Reviewed by Cursor Bugbot for commit f48daed.
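The gate this comment asks for could look roughly like the sketch below: exit non-zero when the checkout's results directory is empty or holds any result that is not valid. The `status` field and its `valid` value are assumptions about the result schema, and `gate` is a hypothetical name rather than the plain-mode logic in summarize.py.

```python
import glob
import json
import os


def gate(results_dir):
    """Return an exit code: 0 only if at least one result exists and all are valid."""
    paths = sorted(glob.glob(os.path.join(results_dir, "*.json")))
    if not paths:
        print(f"FAIL: no result JSON under {results_dir}")
        return 1
    bad = []
    for path in paths:
        try:
            with open(path, encoding="utf-8") as fh:
                doc = json.load(fh)
        except (OSError, json.JSONDecodeError):
            bad.append(path)  # unreadable or truncated file counts as a failure
            continue
        # "status" == "valid" is an assumed field; adapt to the real schema.
        if not isinstance(doc, dict) or doc.get("status") != "valid":
            bad.append(path)
    for path in bad:
        print(f"FAIL: invalid result {path}")
    return 1 if bad else 0
```

A workflow step would then end with something like `sys.exit(gate("experimental/CollectiveX/results"))`, so an empty copy-back turns the job red.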

dst="$repo_root/experimental/CollectiveX/results"
mkdir -p "$dst"
cp "$mount_src/experimental/CollectiveX/results/"*.json "$dst/" 2>/dev/null || true
cx_log "copied results from stage dir -> $dst (for artifact upload)"

Result copy errors ignored

Medium Severity

cx_collect_results wraps the staged-to-checkout cp in 2>/dev/null || true and always logs success, so a failed or empty copy does not affect the launcher exit code and the workflow can pass without uploadable JSON.

Reviewed by Cursor Bugbot for commit f48daed.
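One way to make the copy-back observable is to count what was copied and let copy errors propagate instead of swallowing them. This is a sketch under that assumption; `collect_results` is a hypothetical Python stand-in for the shell function, not its implementation.

```python
import glob
import os
import shutil


def collect_results(stage_results, checkout_results):
    """Copy result JSONs from the stage dir to the checkout and return how many were copied.

    Errors from the copy are raised to the caller (no `|| true` equivalent).
    """
    os.makedirs(checkout_results, exist_ok=True)
    copied = 0
    for src in sorted(glob.glob(os.path.join(stage_results, "*.json"))):
        shutil.copy2(src, checkout_results)
        copied += 1
    return copied
```

The caller can then decide what an empty copy means: a no-op when no staging was used, a failure when staging was requested and zero files came back.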

First AMD / cross-vendor reach, scaffolded ahead of Milestone 1:

- run_mori.py: MoRI dispatch+combine (normal mode), correctness-gated,
  mirroring ROCm/mori's dispatch_combine example — int32 routing indices,
  (n,0) fp8 scales, the zero-copy registered-combine-input-buffer staging
  step, and expected = input x (#unique destination ranks). Emits the same
  flat JSON shape (family=moe, backend=mori) with CUDA-event timing.
- launchers/launch_mi355x-amds.sh: AMD adapter — partition compute, no
  account, --cpus-per-task=128, node-local /var/lib/squash imported via srun
  on the allocated node, --container-writable --container-remap-root, forces
  CX_BENCH=mori, mounts the (compute-visible) checkout at /ix.
- launchers/run_in_container.sh: run_mori_suite + mori case (nccl|deepep|mori|all).
- launchers/common.sh: ROCm MoRI image (rocm/sgl-dev:...-mori-0227-2) in
  cx_default_image for mi355x*/mi350x*/mi325x*/mi300x*.
- workflow: mi355x sku + mori benchmark options for workflow_dispatch.
- docs: CONTAINERS.md AMD section, README files/run/risks, plan.md status.

Not yet hardware-validated (no MI355X access) — MoRI's Python API is
version-sensitive (marked ADAPT HERE); the first runner job is the
validation, as GB200 was for DeepEP. The ROCm image isn't digest-pinned yet.
Comment thread experimental/CollectiveX/run_mori.py Fixed
- workflow: replace the on:push GB200 NCCL smoke with the MI355X MoRI
  dispatch/combine run (runs-on: mi355x, CX_BENCH=mori), and name the job
  "CollectiveX Experimental" (no longer "smoke"). GB200/B200 NCCL + DeepEP
  remain on workflow_dispatch.
- launch_mi355x-amds.sh: adapt more faithfully to runners/launch_mi355x-amds.sh
  — squeue by job-name only (no -u), flock -w 600, and clear ROCm gpucore.*
  dumps after the run so the next checkout is clean. Bump default CX_TIME to 60
  for a cold ROCm-image import.
- summarize.py: drop the "N/N results valid." footer from both the job-summary
  (markdown) and plain output; the failure gate still reports invalid results.
  Relabel the MoE section "MoE dispatch+combine (DeepEP / MoRI)".
- docs: README/plan describe push -> MI355X MoRI.
rm -f \"$SQUASH_FILE\"
enroot import -o \"$SQUASH_FILE\" \"docker://$IMAGE\" </dev/null
fi
"

MI355X import errors ignored

High Severity

The node-local enroot import runs inside an srun bash snippet without set -e and with no check after import. A failed import still yields exit 0 from that snippet, so the job continues into pyxis with a missing or corrupt squash file.

Reviewed by Cursor Bugbot for commit d8ee9bf.

      - name: Launch ${{ inputs.sku }} / ${{ inputs.benchmark }}
        env:
          RUNNER_NAME: ${{ runner.name }}
        run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"

Workflow skips multinode staging

Medium Severity

CX_STAGE_DIR is set only when inputs.sku is gb200. The b200-multinode dispatch target uses launch_b200-dgxc-slurm.sh, which documents the same compute-visible checkout requirement but leaves staging unset, so Slurm jobs may not see the repo mount.

Reviewed by Cursor Bugbot for commit d8ee9bf.

… default)

First MI355X run reached the MoRI dispatch kernel — salloc, ROCm-image import,
mount, torchrun, 8-rank Gloo + shmem init, and EpDispatchCombineConfig/op/dispatch
all worked, confirming the API signatures. It OOM'd MoRI's default 2 GiB static
symmetric heap (hidden=7168 dispatch/combine buffers across 8 ranks request
~0.9 GiB each).

run_mori.py now sets MORI_SHMEM_HEAP_SIZE before `import mori` (default 16 GiB,
override CX_MORI_HEAP_BYTES). Docstring + CONTAINERS.md record the finding;
correctness/timing validated by the heap-sized re-run.

salloc --partition="$PARTITION" --exclude="$EXCLUDE_NODES" --gres=gpu:"$NGPUS" \
--exclusive --cpus-per-task=128 --time="$TIME_MIN" --no-shell --job-name="$RUNNER_NAME"
JOB_ID="$(squeue --name="$RUNNER_NAME" -h -o %A | head -n1)"

Slurm job ID not scoped

Medium Severity

launch_mi355x-amds.sh resolves JOB_ID with squeue --name="$RUNNER_NAME" and no -u "$USER", while the other CollectiveX NVIDIA launchers filter by user. On a shared cluster, the first matching job name may belong to another account, so subsequent srun/scancel can target the wrong allocation.

Reviewed by Cursor Bugbot for commit ac3f1b9.

The heap-bump run cleared the 2 GiB OOM but then failed registering the 16 GiB
symmetric heap as an RDMA memory region (errno 22 EINVAL, size=17179869184).
ROCm/mori's reference test uses MORI_SHMEM_HEAP_SIZE="6G" single-node — big
enough for the hidden=7168 dispatch/combine buffers, small enough to register.

Match it: default "6G" (override CX_MORI_HEAP_SIZE). The rest of the config
already matches the reference (max_num_inp_token_per_rank=4096, hidden=7168,
backend cpu:gloo,cuda:nccl), so this lands on the proven single-node setup.
Drove run_mori.py to a correct run on 8x MI355X (on-node via salloc+srun):
dispatch+combine numerically correct (combine within tol, max_rel ~2e-3),
~85us round-trip at the decode shape. The first runs surfaced five issues,
all fixed and re-validated:

- RDMA MR ceiling: MoRI registers the WHOLE symmetric heap as one RDMA MR at
  init (even single-node; no disable-RDMA knob). The ionic_rdma NICs cap GPU
  MRs at ~4 GiB — a 6 GiB heap fails (RegisterRdmaMemoryRegion errno 22), 2 GiB
  registers. Hold heap at MORI_SHMEM_HEAP_SIZE=2G (override CX_MORI_HEAP_SIZE).
- Buffer sizing: max_num_inp_token_per_rank 4096 -> max(512, n) so the buffers
  fit the 2 GiB heap (4096 was inherited from the reference test).
- Correctness shape: combine returns the full max-token buffer; compare only
  combined[:n] against expected.
- recv count: read total_recv BEFORE combine (combine resets recv_num, which
  made recv_nonzero a false negative).
- Teardown: MoRI's shmem teardown asserts (CheckStatusValid -> SIGABRT) when the
  op is destroyed after shmem_finalize(); hard-exit after writing results.

Docs (README/plan/CONTAINERS) updated from "scaffolded" to validated, with the
fabric constraints recorded.
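A small sketch of the heap and buffer sizing described in the list above, assuming the environment variable has to be in place before `import mori` runs. The helper names are hypothetical; only `MORI_SHMEM_HEAP_SIZE`, `CX_MORI_HEAP_SIZE`, the `2G` default and `max(512, n)` come from the commit message.

```python
import os


def configure_mori_heap(env=os.environ, default="2G"):
    """Set MORI_SHMEM_HEAP_SIZE; CX_MORI_HEAP_SIZE overrides the default."""
    env["MORI_SHMEM_HEAP_SIZE"] = env.get("CX_MORI_HEAP_SIZE", default)
    return env["MORI_SHMEM_HEAP_SIZE"]


def max_tokens_per_rank(n):
    """Size buffers to the sweep point rather than the 4096 inherited from the reference test."""
    return max(512, n)


configure_mori_heap()
# import mori  # would follow here, after the environment is set (not imported in this sketch)
```

Passing a plain dict as `env` keeps the helper testable without touching the real process environment.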
Comment thread experimental/CollectiveX/run_mori.py Fixed
Comment thread experimental/CollectiveX/run_mori.py Fixed
…CH=nccl)

Adds the AMD collective-primitive path so all_reduce/reduce_scatter/all_gather/
alltoall run on MI355X, not just MoRI:

- common.sh: cx_build_rccl_tests — clones ROCm/rccl-tests and builds with `make`
  against /opt/rocm (amdclang++/librccl). It's a nccl-tests fork producing the
  same <op>_perf binaries and output format, so run_nccl.py parses it unchanged.
  Validated building + running all 4 ops in-container on MI355X (correctness OK).
- run_in_container.sh: run_nccl_suite picks rccl-tests on ROCm (/opt/rocm or
  hipcc), nccl-tests otherwise; identical op loop + run_nccl.py invocation.
- launch_mi355x-amds.sh: honor CX_BENCH (mori default | nccl) instead of forcing
  mori; same -g N single-node 8-GPU launch.
- docs: README/CONTAINERS note the rccl path.

B200 already has the nccl path; this makes primitives available on all three
SKUs via workflow_dispatch.
Comment thread experimental/CollectiveX/launchers/launch_mi355x-amds.sh
        if name:
            devices.append(name)
    elif _run(["ibstat", "-l"]):
        devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]

ibstat fallback may crash capture

Low Severity

In _rdma, the ibstat -l branch calls _run twice. If the first call succeeds but the second returns None, None.splitlines() raises and env_capture.py aborts before writing provenance JSON for that run.

Reviewed by Cursor Bugbot for commit 2b23573.
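A possible shape for the fix is to call the command once and guard the result. The `_run` below is a stand-in written for this sketch (env_capture.py's own `_run` is not shown in the thread), and `ibstat_devices` is a hypothetical name.

```python
import subprocess


def _run(cmd):
    """Return stdout of cmd, or None if it is missing, fails, or prints nothing."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return None
    return out.stdout if out.returncode == 0 and out.stdout.strip() else None


def ibstat_devices(run=_run):
    """List device names from `ibstat -l`, calling the command exactly once."""
    listing = run(["ibstat", "-l"])
    if not listing:
        return []  # no second call, so nothing can turn into None.splitlines()
    return [d.strip() for d in listing.splitlines() if d.strip()]
```

Injecting `run` makes the None path easy to exercise without InfiniBand tooling installed.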

…on-node

launch_gb200-nv.sh now branches on CX_NODES: 1 (default) keeps the single-tray
4-GPU dispatcher path; >1 runs across the NVL72 NVLink fabric (e.g. CX_NODES=2
= 8 GPU) by building nccl-tests MPI=1, running each op across WORLD ranks via
`srun --mpi=pmix` (1 GPU/rank) with the MNNVL env, and parsing on the login node
— mirroring launch_b200-dgxc-slurm but staying on NVLink instead of IB.

Validated on GB200 (2x watchtower-navy trays, 8 GPU): all 4 ops valid, peak
busbw all_reduce 822.8 / reduce_scatter 670.6 / all_gather 651.2 / alltoall
625.0 GB/s — ~30% over single-tray and on par with B200 8-GPU NVLink, i.e.
MNNVL engaged (not an IB fallback).

- common.sh: cx_build_nccl_tests auto-detects MPI_HOME for MPI=1 (Debian OpenMPI
  headers live under /usr/lib/<arch>/openmpi/include; MPI_HOME=/usr fails). Works
  x86_64 + aarch64.
- launch_b200-dgxc-slurm.sh: fix BUILD_IN_CTR path (.nccl-tests/nccl-tests/build).
- workflow: add `nodes` dispatch input -> CX_NODES.
… pin)

The MI355X MoRI jobs failed in CI when they landed on cold nodes: the squash
lock was created next to the squash in /var/lib/squash, which is root/admin-owned
on some nodes (flock -> "Bad file descriptor"), and nodes without the node-local
squash need a slow cold import that also hits lock/cache permissions.

- launch_mi355x-amds.sh: put the import lock in a guaranteed-writable per-node
  dir (CX_LOCK_DIR, default /tmp), not beside the squash; add CX_NODELIST to pin
  the allocation to nodes that already hold the squash.
- workflow: pin MI355X jobs (push + dispatch) to the warm-squash nodes
  (mia1-p01-g10,g15). Widen once the squash is staged cluster-wide.

The EP sweep itself is already hardware-validated (MoRI decode + prefill); this
only fixes squash setup so the jobs reach it in CI.
…an-out, comm-only timing

Address an expert review of the bring-up artifact: it measured a backend-specific,
non-deterministic, fan-out-1 workload with backend-specific staging in the timed region.
This reworks the EP harness into a defensible cross-vendor measurement.

- tests/routing.py (new): ONE deterministic routing trace, seed-fixed and indexed by
  global token id, identical on every SKU; adapters materialize their slice (no RNG in
  adapters — MoRI now honors routing). Real trace classes with PUBLISHED fan-out:
  * uniform (new default) — random distinct top-k, realistic fan-out ≈5.3 over EP8;
  * balanced — load-equalized, one-expert-per-rank (fan-out = ep_size);
  * balanced-rank-local — the old degenerate (i*topk+j)%E, fan-out 1, honestly named;
  * zipf. Each point records mean/max fan-out, fan-out histogram, routed copies,
    expert-load min/mean/max, and a routing-trace hash.
- tests/ep_harness.py: per-iteration cross-rank MAX then percentile (median_i(max_r),
  not max_r(median_i)); comm-only-v1 contract (staging untimed both backends); SERIAL =
  dispatch + combine sum (renamed from "round-trip"; not an independent chained op);
  fabric/clock warm-up before the timed sweep; provenance gate (fail on unknown);
  bandwidth = total routed bytes across ranks / latency; dropped num_ep_groups.
- tests/ep_deepep.py / ep_mori.py: materialize the shared-trace slice; single DeepEP
  NVL buffer size for all points (fixes the decode/prefill T=128 mismatch); honest
  per-backend resource provenance (DeepEP num_sms; MoRI block/warps; device SM/CU).
- launchers/launch_h200.sh (new): H200 EP8 adapter (open scheduler, NFS home, image
  imported on first use) — unblocks the NVIDIA EP8 side without B200/GB200 contention.
- plot_ep.py: v2 schema, separate EP panels, fan-out in hover, "selected stack /
  not resource-normalized / serial-is-a-sum" caption; summarize.py: matching columns.

Validated on hardware (EP8, identical deterministic trace, comm-only): H200 DeepEP
(fan-out ≈5.3, T=512 routed ≈312 MB, all points correct) and MI355X MoRI (decode +
prefill). Selected-stack at each backend's default budget — NOT yet resource-normalized
or best-available (DeepEP V2 / MoRI auto-tuned). Replaces the bring-up routing/timing.
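To make the "indexed by global token id" idea concrete, here is a stdlib-only sketch of a uniform trace, its per-token fan-out, and a trace hash. The expert count (256), top-k (8) and the per-token seeding scheme are assumptions chosen for illustration; the PR does not state them here, and this is not the code in tests/routing.py.

```python
import hashlib
import random
import struct


def route_token(token_id, num_experts, topk, seed=0):
    """Top-k distinct experts for one global token id, independent of which rank asks."""
    rng = random.Random(seed * 1_000_003 + token_id)
    return rng.sample(range(num_experts), topk)


def fan_out(experts, experts_per_rank):
    """Number of distinct destination ranks for one token."""
    return len({e // experts_per_rank for e in experts})


def trace_hash(tokens, num_experts, topk, seed=0):
    """SHA-256 over the routing indices of the given global token ids."""
    h = hashlib.sha256()
    for t in tokens:
        h.update(struct.pack(f"<{topk}i", *route_token(t, num_experts, topk, seed)))
    return h.hexdigest()


# Assumed shape: 256 experts over EP8 (32 per rank), top-8 routing.
E, EP, K = 256, 8, 8
tokens = range(4096)
mean_fan_out = sum(fan_out(route_token(t, E, K), E // EP) for t in tokens) / len(tokens)
print(round(mean_fan_out, 2))
```

Under these assumed parameters the mean lands close to the ≈5.3 fan-out quoted above. Because each token's experts depend only on the seed and the global id, any rank can materialize its own slice and still agree with every other rank on the hash.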
series = []
for path in sorted(glob.glob(os.path.join(results_dir, "**", "*.json"), recursive=True)):
    try:
        d = json.load(open(path))
Comment thread experimental/CollectiveX/tests/ep_harness.py Fixed
Comment thread experimental/CollectiveX/tests/ep_deepep.py Fixed
Review: the cross-vendor curves used each backend's default budget (DeepEP 24 SMs vs
MoRI 80 blocks) — neither normalized nor tuned. Add an explicit --resource-mode and
record the applied budget + device fraction in provenance and the comparison_key (so
normalized / tuned / default are distinct lines):

- normalized (new default): restrict comms to ~--sm-fraction of each device's units —
  DeepEP set_num_sms(round(frac·SMs)); MoRI block_num≈round(frac·CUs). Fraction-based,
  recorded; an approximate apples-to-apples (architectural occupancy differs), not a
  claim of identical work. Validated on H200: 0.18 → 24/132 SMs.
- tuned: each backend's own recommended budget. DeepEP uses its analytic default
  Buffer.num_sms (=20 on 1.2.1; get_dispatch_config exists but doesn't expose num_sms
  to Python, and the default already reflects it); MoRI uses its default 80 (the
  0227-2 build has no launch auto-tuning API — labeled tuned_source). Validated on H200.
- default: the bring-up budget (DeepEP --num-sms, MoRI 80).

Honest scope: this is resource-FRACTION-normalized for the installed stacks; it is not
yet "best-available" (DeepEP V2 ElasticBuffer / MoRI launch auto-tuning would need newer
builds). Provenance records resource_mode, num_sms/block_num, device SMs/CUs, fraction,
tuned_source.
… hang)

Realistic (fan-out≈5.3) routing on MoRI/mori-0227-2 WEDGES at T>=32 — a push-triggered
job hung ~1 h before the 90-min job timeout. DeepEP/H200 is unaffected; the old
fan-out-1 routing completed, so the fan-out fix exposed this MoRI behavior. Two guards:

- run_in_container.sh: wrap the torchrun in `timeout -k 30 ${CX_RUN_TIMEOUT:-900}` so a
  wedged collective FAILS FAST instead of burning the whole job timeout.
- workflow push job: MoRI smoke capped to T<=16 (the known-good range) + CX_RUN_TIMEOUT=600,
  decode only. The full sweep stays on workflow_dispatch. Remove the cap once the MoRI
  T>=32 fan-out hang is root-caused/fixed.
…below ~80

Root-caused the MoRI T>=32 hang (the 1 h push-job stall): it is NOT the realistic
fan-out routing (uniform at the default 80 blocks completes T=32/64 cleanly, fan-out
≈5.3, validated on MI355X g15) — it is the NORMALIZED block_num. Reducing MoRI's comm
blocks toward the device fraction (0.18·256≈46) wedges dispatch/combine at T>=32; 80
works. MoRI needs more launch parallelism than DeepEP and cannot be normalized to
DeepEP's 18%.

Fix: floor MoRI's normalized block_num at a functional minimum (CX_MORI_MIN_BLOCKS,
default 80) and record block_num_target / block_num_floored in provenance. So the
"normalized" regime is DeepEP at the target fraction vs MoRI at its functional floor
(documented as NOT a matched fraction — MoRI deadlocks lower). The fail-fast timeout
guard (prior commit) plus this floor mean normalized runs complete instead of hanging.
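The fraction-then-floor arithmetic can be sketched as follows. `normalized_budget` is a hypothetical helper; the inputs (0.18, 132 SMs, 256 CUs, floor 80) are the figures quoted in these commit messages.

```python
def normalized_budget(fraction, device_units, min_units=0):
    """Return (target, applied, floored) comm units for a device."""
    target = round(fraction * device_units)
    applied = max(target, min_units)  # never go below the functional minimum
    return target, applied, applied > target


print(normalized_budget(0.18, 132))      # (24, 24, False)  H200 SMs for DeepEP
print(normalized_budget(0.18, 256, 80))  # (46, 80, True)   MI355X CUs for MoRI, floored
```

Recording both the target and the applied value keeps the provenance honest about the MoRI side not being a matched fraction.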
…mework

DeepEP backend gains a real FP8 normal-mode path (per-token block-128 cast,
untimed -> fp8_in_timing=False) and low-latency mode (low_latency_dispatch/
combine; in-kernel fp8 cast -> fp8_in_timing=True; 3D expert-major recv;
re-dispatch per combine sample). Capabilities are declared per backend and
run_ep.py REJECTS anything outside them BEFORE construction (no silent fallback
or mislabel): DeepEP = {bf16,fp8}x{normal,ll}, MoRI = {bf16}x{normal},
num_ep_groups>1 refused.

Harness: dtype-aware correctness tolerance (fp8 1.25e-1 vs bf16 5e-2, recorded);
combine bytes counted at real dtype (fp8 dispatch 1B + bf16 combine 2B => 1.5x
round-trip, not 2x); reproduction block (command, image, image_digest, seed,
warmup, iters, dispatch_dtype, mode, fp8_quant_in_timing).
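The 1.5x figure follows from element sizes alone. A minimal sketch, assuming payload bytes are routed copies times hidden size times bytes per element and ignoring fp8 scale tensors (`logical_bytes` and the copy count are illustrative, not harness code):

```python
BYTES_PER_ELEM = {"bf16": 2, "fp8": 1}


def logical_bytes(routed_copies, hidden, dtype):
    """Payload bytes for one direction: copies x hidden x element size (scales excluded)."""
    return routed_copies * hidden * BYTES_PER_ELEM[dtype]


copies, hidden = 1000, 7168  # copies is an arbitrary illustrative count
bf16_one_way = logical_bytes(copies, hidden, "bf16")
round_trip = logical_bytes(copies, hidden, "fp8") + logical_bytes(copies, hidden, "bf16")
print(round_trip / bf16_one_way)  # 1.5
```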

Capability probes (probe_deepep_caps/ll, probe_mori_caps) document the API
surface + the runtime feasibility checks that gate the caps.

Launchers: launch_b300.sh (B300/batch_1/account benchmark, /data shared FS);
launch_h200.sh fixed for the hpc-gpu-1 partition + /mnt/nfs compute-visible
share (login /home is not compute-visible). _validate_*/_mi355x_orchestrate
are the SSH tight-loop validation drivers.

SSH-validated 8-GPU: H100 full matrix (bf16/fp8 x normal/ll, decode+prefill,
tuned+normalized) all valid; B300 normal bf16/fp8 valid; MI355X MoRI bf16 valid.


if __name__ == "__main__":
    sys.exit(main())


if __name__ == "__main__":
    raise SystemExit(main())
Comment thread experimental/CollectiveX/tests/probe_mori_caps.py Fixed
… + LL/fp8 modes

B300 perf was a warmup artifact, not a kernel problem: deep_ep 1.2.1 ships native
sm_100 (Blackwell) cubins, but at warmup=8 B300's dispatch read ~1787us (cold GPU
clocks / unestablished NVLink) vs ~85us steady-state (3-run spread 2.5%, faster than
H100). Raise the harness --warmup default 20->32 and add a sustained clock-ramp burst
(CX_FABRIC_WARM_BURST=60) at the largest warm shape so every timed point — including
small T — is steady-state. No deep_ep rebuild needed.

Workflow: add sku options h100-dgxc + b300, add mode (normal|ll) + resource_mode
(normalized|tuned|default) inputs, thread as CX_MODE/CX_RESOURCE_MODE (run_in_container
already forwards them to run_ep.py). Launchers: launch_h100-dgxc-slurm.sh (DGX Cloud
conventions, matches the h100-dgxc-slurm_NN runner) and launch_b300-nv.sh (shim to
launch_b300.sh for the b300-nv_NN runner).

summarize.py headline gains mode / dtype / resource columns so LL-vs-normal,
fp8-vs-bf16, and normalized-vs-tuned variants are distinguishable; dtype marker
fp8*/fp8+ shows whether the fp8 cast is inside the timed window (LL) or untimed (normal).
… configs

h100-dgxc-slurm GH runner is the SAME cluster validated over SSH (hpc-gpu-1): set
partition hpc-gpu-1, account customer, exclude hpc-gpu-1-7 (from
runners/launch_h100-dgxc-slurm.sh). Squash dir -> /mnt/nfs/sa-shared/containers
(compute-visible; /home is login-local here, so the prior gpu-2 default also pointed
the squash at a node-invisible path). b300: exclude b300-018 (known-bad, per
runners/launch_b300-nv.sh).
…ility driver

concurrency cancel-in-progress true -> false: same-SKU workflow_dispatch runs now
QUEUE instead of cancelling, so a 3-run reproducibility sweep on one SKU runs all
three (previously the later dispatches silently cancelled the earlier ones).

_repro.sh runs the acceptance points (decode T=64, prefill T=512) three times each
in a single allocation and prints per-run dispatch/serial p50 + the (max-min)/min
spread so the <=10% bar is directly checkable.
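The spread check itself is a one-liner. A sketch with made-up latencies (the values are illustrative, not measurements from this PR):

```python
def spread(values):
    """(max - min) / min over repeated runs of one acceptance point."""
    lo, hi = min(values), max(values)
    return (hi - lo) / lo


runs_us = [85.0, 86.0, 87.0]  # illustrative p50 latencies, not measured data
print(f"{spread(runs_us):.3f}", spread(runs_us) <= 0.10)  # 0.024 True
```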
…B300 cold sweep

The one-time warm-up burst had two problems: (1) it WEDGED MoRI (sustained dispatch/
combine bursts deadlock it), and (2) on Blackwell it only warmed the first point — the
tiny small-T points then let clocks drop, so a mid-sweep T=64 still read ~20x cold.

Replace it with a PER-POINT burst inside the timed loop, re-ramping clocks at each
shape so every point is steady-state regardless of sweep position, gated by
backend.wants_warm_burst (DeepEP=True; MoRI=False — it wedges and is already steady).
torch.cuda.synchronize()
try:
    dist.barrier()
except Exception:
…ovenance gate)

The provenance gate correctly rejects MoRI runs with mori_commit=unknown; the repro
orchestrator must export COLLECTIVEX_IMAGE so the commit pins to the image tag. _repro.sh
now logs torchrun output per run (was /dev/null, hiding the gate rejection).
…RI iters

MoRI's needs_gradual_ramp expands a single-point ladder to [1..T], so rows[0] was T=1
not T=64; pick the row whose tokens_per_rank==T. MoRI also wedges under 200 iters at
T>=32 (the validated count is 40), so the MoRI repro runs WARMUP=8 ITERS=40.
The single-point _repro.sh path wedges MoRI mid-ramp on the contended MI355X cluster
(unkillable D-state procs). _mori_repro.sh mirrors the proven _validate_mori.sh
invocation (full ladders, warmup 8, iters 40) with a short per-run timeout, run 3x,
extracting T=64/T=512. Orchestrator excludes poisoned nodes via CX_EXCLUDE_NODES.
…p99, routing identity

Addresses review #3 methodology critiques (schema_version 3):

- Explicit measurement contracts (#4): adapters declare SUPPORTED_CONTRACTS and conform,
  rather than each choosing its own timing boundary. layout-and-dispatch-v1 times
  get_dispatch_layout INSIDE dispatch (the only contract MoRI can honor — its layout is
  computed in-kernel); cached-layout-comm-only-v1 hoists layout out (DeepEP normal) so
  dispatch is pure comm. run_ep.py rejects unsupported contract / ll+cached-layout. The
  misleading "comm-only-v1" label is gone.

- Pooled-trial percentiles (#9, #2): N trials (default 3) x iters, token-order randomized
  per trial (seeded => identical across ranks; MoRI keeps ascending to avoid cold-jump
  wedge), per-iteration cross-rank-MAX samples POOLED, then p50/p90/p99 (p99 headline).
  p99 from ~50 samples was just the max. (#2 aggregation was already Q_p(max_r); verified.)

- Routing identity proof (#3): routing_hash now SHA-256 of topk_idx AND gate weights;
  cross-rank trace-signature MIN==MAX check proves every rank (NVIDIA + AMD) built the
  identical trace, else status=invalid. Added per-dest-rank send histogram.

- Separated logical bytes (#6): dispatch_logical_bytes + combine_logical_bytes recorded at
  their real dtypes with byte_contract; serial bandwidth removed. serial relabeled "sum of
  isolated medians". Correctness scope tagged roundtrip-reconstruction-smoke-v1 (#8 honesty).

- Run linkage (#1): artifacts record GHA run_id/attempt/source SHA when present.
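The pooled-trial aggregation can be sketched in a few lines: take the cross-rank maximum per iteration, pool those samples across trials, then read percentiles off the pool. The nearest-rank percentile definition is an assumption of this sketch (the harness may interpolate), and the function names are hypothetical.

```python
import math


def cross_rank_max(per_rank_samples):
    """per_rank_samples[r][i] -> one value per iteration i: the max over ranks r."""
    return [max(col) for col in zip(*per_rank_samples)]


def nearest_rank(samples, pct):
    """Nearest-rank percentile: the ceil(pct/100 * n)-th smallest sample."""
    ordered = sorted(samples)
    k = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[k - 1]


def pooled_percentiles(trials, pcts=(50, 90, 99)):
    """trials[t][r][i]: pool the per-iteration cross-rank maxima of every trial."""
    pool = [s for trial in trials for s in cross_rank_max(trial)]
    return {p: nearest_rank(pool, p) for p in pcts}


print(nearest_rank(list(range(1, 51)), 99))  # 50, i.e. the max of 50 samples
```

With this definition, p99 over 50 samples is simply the largest one, which is the weakness the pooling is meant to address.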
…tract/run metadata

- capability.py (stdlib): static table mirroring adapter SUPPORTED_* sets; resolves
  (sku->vendor, backend, mode, dtype, contract) -> valid/why. Workflow runs it as a
  fail-fast "Validate capability" gate BEFORE consuming a runner (review #3 #2).
- NCCL/RCCL phase-dedup: matrix collapses to a single 'na' job for collective backends
  (phase is meaningless for nccl/rccl — was running identical work twice).
- contract input + CX_MEASUREMENT_CONTRACT threaded through run_in_container -> run_ep;
  CX_TRIALS too. COLLECTIVEX_SOURCE_SHA + GHA run id/attempt reach the artifact (run
  linkage, review #3 #1). run_ep reads GITHUB_SHA as the source-sha fallback.
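
A stdlib-only sketch of how a static capability table could resolve (sku, backend, mode, dtype, contract) to valid/why before a runner is consumed. The table contents are an illustrative subset: the mode and dtype sets and the `resolve`/`Verdict` names are assumptions, and only the contract rules (MoRI honors just layout-and-dispatch-v1, ll+cached-layout is rejected) come from the notes above.

```python
from typing import NamedTuple

LAYOUT_AND_DISPATCH = "layout-and-dispatch-v1"
CACHED_LAYOUT = "cached-layout-comm-only-v1"

# Illustrative subset; a real table would mirror each adapter's SUPPORTED_* sets.
SKU_VENDOR = {"h100": "nvidia", "mi355x": "amd"}
BACKENDS = {
    "deepep": {
        "vendors": {"nvidia"},
        "modes": {"normal", "ll"},
        "dtypes": {"bf16", "fp8"},  # assumed
        "contracts": {LAYOUT_AND_DISPATCH, CACHED_LAYOUT},
    },
    "mori": {
        "vendors": {"amd"},
        "modes": {"normal"},        # assumed
        "dtypes": {"bf16", "fp8"},  # assumed
        "contracts": {LAYOUT_AND_DISPATCH},  # layout is computed in-kernel
    },
}


class Verdict(NamedTuple):
    valid: bool
    why: str


def resolve(sku, backend, mode, dtype, contract):
    vendor = SKU_VENDOR.get(sku)
    if vendor is None:
        return Verdict(False, f"unknown sku {sku!r}")
    caps = BACKENDS.get(backend)
    if caps is None:
        return Verdict(False, f"unknown backend {backend!r}")
    if vendor not in caps["vendors"]:
        return Verdict(False, f"{backend} does not run on {vendor}")
    for field, value in (("modes", mode), ("dtypes", dtype), ("contracts", contract)):
        if value not in caps[field]:
            return Verdict(False, f"{backend} does not support {value!r}")
    if mode == "ll" and contract == CACHED_LAYOUT:
        return Verdict(False, "ll cannot be combined with a cached layout")
    return Verdict(True, "ok")


if __name__ == "__main__":
    print(resolve("h100", "deepep", "normal", "fp8", CACHED_LAYOUT))
    print(resolve("mi355x", "mori", "normal", "bf16", CACHED_LAYOUT))
    print(resolve("h100", "deepep", "ll", "bf16", CACHED_LAYOUT))
```

Returning a reason string alongside the boolean lets a fail-fast gate print why a combination was rejected instead of only exiting non-zero.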
… rate, run links

Addresses review #3 frontend critiques (backward-compatible with v2 docs):
- Percentile selector p50/p90/p99 (p99 default); reads pooled-trial percentiles.
- Suite selector backend-default vs resource-constrained — kept distinct, never read as
  one fair contest (#5). dtype/mode/resource/contract are all in the per-line label +
  hover; lines are uniquely colored (SKU family) + dashed-fp8 (#10).
- Bandwidth axis renamed "Logical routed payload rate" using SEPARATE dispatch/combine
  bytes; serial bandwidth removed; serial relabeled "Σ isolated medians" (#6,#7).
- Hover shows p50/p90/p99, contract, suite, and the WORKFLOW RUN (run id + sha) that
  produced the point (#1). Provenance text no longer claims a single dtype (the
  "bf16 while fp8 shown" bug); states routing-identity-proven, pooled-sample count,
  logical-rate caveat, suite-separation, and correctness-is-smoke (#9 fix).
…ing-identity on HW)

Confirms on 8xH100: schema 3, routing_consistent=True (identical trace_sig+idx_hash across
configs), pooled p50/p90/p99 (120 samples), BOTH contracts (cached-layout is ~14% faster —
the get_dispatch_layout cost it hoists out, now explicit), and separated logical bytes
(fp8 dispatch 19.5MB vs bf16 combine 39MB).
…h, CX_DRIVER orchestrator)

Headline v3 matrix per SKU: trials=3 pooled p50/p90/p99, both contracts (normal), routing-identity
gate. Used for the H100/H200/GB300/MI355X v3 re-run. GB300 runs EP4 (4 GPU/node) normal-only.
… n~600)

p99 from ~600 pooled samples is the ~6th-largest value -> high-variance, jagged across
token counts (compounded by the cross-rank MAX). p50 is the stable/representative headline;
p90 is the steadier tail read; p99 still in the selector. A smooth p99 needs ~thousands of
iters (config bump). Default view changed p99 -> p50.
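
The sample-count argument can be checked with a small sketch. It assumes a nearest-rank percentile (the harness may use a different estimator) and synthetic latencies; `pooled_samples`, `nearest_rank` and `tail_count` are hypothetical names.

```python
import random


def pooled_samples(trials):
    """trials[t][i] is the list of per-rank latencies for iteration i of trial t.
    Take the cross-rank MAX of each iteration, then pool every trial together."""
    return [max(per_rank) for trial in trials for per_rank in trial]


def rank_index(n, p):
    """1-based nearest-rank position ceil(p/100 * n), in integer arithmetic."""
    return max(1, (p * n + 99) // 100)


def nearest_rank(samples, p):
    ordered = sorted(samples)
    return ordered[rank_index(len(ordered), p) - 1]


def tail_count(n, p):
    """How many pooled samples lie strictly above the p-th percentile."""
    return n - rank_index(n, p)


rng = random.Random(0)
# Synthetic data: 3 trials x 200 iterations x 8 ranks = 600 pooled samples.
trials = [[[rng.gauss(150.0, 5.0) for _ in range(8)] for _ in range(200)]
          for _ in range(3)]
pooled = pooled_samples(trials)

for p in (50, 90, 99):
    print(f"p{p}: {nearest_rank(pooled, p):.1f} us, "
          f"{tail_count(len(pooled), p)} samples above")
print("samples above p99 with n=50:", tail_count(50, 99))
```

With 600 samples only six lie above p99, so the estimate rests on a handful of outliers; with 50 samples none do and p99 collapses to the maximum.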
X scale is now a selector like Y scale (defaults Log — the sweep is geometric — with Linear
available). Grid panels re-render on toggle changes (pct/suite/x-scale/y-scale) so they stay
in sync with the explorer; panel header shows the active scale (e.g. log-log / lin-log).
…mplete prefill panels)

DeepEP prefill ladder is [128,256,512] but MoRI's gradual ramp expands its prefill to
[1..512], so DeepEP lines looked 'incomplete' (clustered at the right) next to MoRI in the
prefill panel. load_series now prepends each config's decode-range (T<min-prefill-T) points
to its prefill curve — decode+prefill are the same kernel at different token regimes, one
continuous latency-vs-T curve. Idempotent; boundary join verified within ~2% (reproducible).
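
A minimal sketch of the decode-to-prefill stitch, assuming each curve is a list of `(T, latency_us)` points. `stitch_prefill` and the sample values are hypothetical; this is not the `load_series` implementation.

```python
def stitch_prefill(decode, prefill):
    """Prepend decode points with T below the smallest prefill T.

    Points already present in the prefill curve win on equal T, which also
    makes a second application a no-op.
    """
    if not prefill:
        return []
    t_min = min(t for t, _ in prefill)
    have = {t for t, _ in prefill}
    head = [(t, lat) for t, lat in decode if t < t_min and t not in have]
    return sorted(head + list(prefill))


# Made-up numbers: a decode ladder and a DeepEP-style prefill ladder [128, 256, 512].
decode = [(1, 10.0), (8, 11.0), (64, 14.0), (128, 20.0)]
prefill = [(128, 20.3), (256, 33.0), (512, 60.0)]

stitched = stitch_prefill(decode, prefill)
print(stitched)
print("idempotent:", stitch_prefill(decode, stitched) == stitched)
```

The strict `T < min-prefill-T` comparison keeps the measured prefill point at the boundary (T=128 here) rather than the decode one, so the join can be compared against the decode value at the same T.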
GB300 was EP4-only (single tray); add EP8 across 2 trays. A read-only probe (tests/_gb300_ep_probe.py) settled the topology: DeepEP treats <=8 ranks on the NVL72 MNNVL fabric as one NVLink domain, so the intranode Buffer(group,nvl,0) path works UNCHANGED across 2 nodes -- no internode/NVSHMEM/rebuild (internode-normal is asserted out until >8 ranks). launchers/_gb300_ep8.sh runs the v3 matrix at EP8 via srun --ntasks=8 (per-rank RANK/LOCAL_RANK from SLURM_*, no torchrun); all 8 configs valid, correctness-gated, routing-consistent, fanout~5.3. LL runs too but regresses vs normal over the inter-tray hop.
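
As a sketch of deriving per-rank identity from Slurm instead of torchrun: the commit only says RANK/LOCAL_RANK come from SLURM_*, so the specific variables used here (`SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NTASKS`) and the extra `WORLD_SIZE` key are assumptions, and `rank_env_from_slurm` is a hypothetical helper.

```python
import os


def rank_env_from_slurm(env=None):
    """Map the per-task variables srun exports onto torchrun-style names."""
    env = os.environ if env is None else env
    return {
        "RANK": env["SLURM_PROCID"],         # global task index, 0..ntasks-1
        "LOCAL_RANK": env["SLURM_LOCALID"],  # task index within its node
        "WORLD_SIZE": env["SLURM_NTASKS"],   # total tasks in the job step
    }


# Simulated environment for task 5 of an 8-task step on 2 nodes x 4 GPUs.
fake = {"SLURM_PROCID": "5", "SLURM_LOCALID": "1", "SLURM_NTASKS": "8"}
print(rank_env_from_slurm(fake))

# Inside a real srun task one would typically export these before initialising
# the process group, e.g. os.environ.update(rank_env_from_slurm()).
```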

plot_ep.py: a SKU can now span EP degrees (GB300 EP4+EP8), so key the decode->prefill stitch on (ckey,ep,phase) not (ckey,phase), and put EP in the label + ckey so EP4/EP8 are distinct in the all-EP overlay. EP8 panels now overlay gb300+h100+h200+mi355x.

_v3_mori.sh: MoRI timeout tuning (trials 2 x iters 40, CX_RUN_TIMEOUT=1100) to fit the combine-redispatch ramp under the wall clock.
buf = Buffer(group, 1 * 1024**3, 0)
try:
Buffer.set_num_sms(24)
except Exception:
buf = Buffer(group, 1 * 1024**3, 1 * 1024**3)
try:
Buffer.set_num_sms(24)
except Exception:
log(f"FAIL(non0): {exc!r}")
try:
dist.barrier()
except Exception:
finally:
try:
dist.destroy_process_group()
except Exception:
…oad balancer

Adds the balanced-vs-unbalanced-vs-EPLB comparison. tests/eplb.py is a DeepSeek-style EPLB: greedy replicate-hot-experts-by-load + equal-cardinality balanced packing, numbered RANK-MAJOR so the contiguous expert->rank placement reproduces the balanced placement -- a pure routing-trace transform, no adapter change. --eplb/--num-redundant-experts in the harness (256 logical -> 288 physical); run_ep.py sizes the backend for the physical count, ep_harness build_trace() remaps the logical trace; the doc records the per-rank load imbalance EPLB removes (4.92x->1.00x). plot_ep.py gains a Routing selector (uniform/balanced/zipf/zipf+eplb) with routing in label/ckey.
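
A toy sketch of the two EPLB steps named above (greedy replication of hot experts, then equal-cardinality packing with rank-major numbering). It is not the code in tests/eplb.py: the function names, the tie-breaking and the 8-expert load vector are assumptions, and the logical-to-physical trace remap is left out.

```python
import heapq


def replicate(loads, num_redundant):
    """Give each redundant slot to the expert with the highest per-replica load."""
    counts = [1] * len(loads)
    heap = [(-float(load), e) for e, load in enumerate(loads)]
    heapq.heapify(heap)
    for _ in range(num_redundant):
        _, e = heapq.heappop(heap)
        counts[e] += 1
        heapq.heappush(heap, (-loads[e] / counts[e], e))
    return counts


def pack_rank_major(loads, counts, num_ranks):
    """Place replicas heaviest-first on the least-loaded rank that has room.

    Returns (phys_to_logical, rank_loads). Physical ids are numbered rank by
    rank, so physical expert p lives on rank p // experts_per_rank.
    """
    replicas = sorted(((loads[e] / counts[e], e)
                       for e in range(len(loads)) for _ in range(counts[e])),
                      reverse=True)
    assert len(replicas) % num_ranks == 0, "physical count must divide evenly"
    per_rank = len(replicas) // num_ranks
    bins = [[] for _ in range(num_ranks)]
    rank_loads = [0.0] * num_ranks
    for load, e in replicas:
        open_ranks = [r for r in range(num_ranks) if len(bins[r]) < per_rank]
        r = min(open_ranks, key=lambda i: rank_loads[i])
        bins[r].append(e)
        rank_loads[r] += load
    return [e for b in bins for e in b], rank_loads


def imbalance(rank_loads):
    """Max per-rank load over the mean per-rank load (1.0 is perfectly even)."""
    return max(rank_loads) / (sum(rank_loads) / len(rank_loads))


loads = [80, 40, 20, 10, 5, 5, 4, 4]  # skewed per-expert token counts (made up)
num_ranks, num_redundant = 4, 4       # 8 logical -> 12 physical experts

baseline = [sum(loads[r * 2:(r + 1) * 2]) for r in range(num_ranks)]
counts = replicate(loads, num_redundant)
phys_to_logical, rank_loads = pack_rank_major(loads, counts, num_ranks)

print("replica counts:", counts)
print("physical -> logical:", phys_to_logical)
print(f"imbalance before {imbalance(baseline):.2f}x, after {imbalance(rank_loads):.2f}x")
```

Because the physical ids come out rank-major, a backend that places experts contiguously needs no change; only the routing trace has to be rewritten to target physical ids.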

Validated on H100/H200/GB300 EP8 (balanced+zipf+zipf+eplb, decode+prefill, all valid/routing-consistent). EPLB rebalances load everywhere (recv_max 4094->2751 @t512) but the latency payoff is fabric/regime dependent: H100/H200 (flat NVLink) win +9%/+14% p50 at large prefill T; GB300 (2-tray MNNVL) wins at decode (+7%) but regresses at large prefill T as hot-expert replicas spread across trays. balanced (fan-out 8) > zipf (fan-out ~3.4) latency at large T (data moved dominates).

Drivers: _routing_rerun.sh (single-node torchrun), _gb300_routing.sh (2-node srun), _singlenode_orchestrate.sh, _routing_mori.sh.