Open
64 commits
83761d0
Add CollectiveX experimental cross-vendor collective/EP benchmark
Oseltamivir Jun 23, 2026
b7ed913
CollectiveX: import container by multi-arch tag, fix CI import hang
Oseltamivir Jun 23, 2026
e6fdd84
Merge branch 'main' into collectivex
Oseltamivir Jun 23, 2026
ccfae8e
CollectiveX: copy staged results back to checkout for artifact upload
Oseltamivir Jun 23, 2026
b384171
CollectiveX: per-job summary table + address PR review findings
Oseltamivir Jun 23, 2026
f48daed
CollectiveX: render results as a GitHub Actions job summary
Oseltamivir Jun 23, 2026
be9cc91
CollectiveX: add MI355X / MoRI EP path (dispatch+combine)
Oseltamivir Jun 23, 2026
d8ee9bf
CollectiveX: run MI355X MoRI on push; align launcher with serving script
Oseltamivir Jun 23, 2026
ac3f1b9
CollectiveX: size MoRI symmetric heap (first MI355X run hit the 2 GiB…
Oseltamivir Jun 23, 2026
46208f2
CollectiveX: set MoRI heap to 6G (16 GiB failed RDMA MR registration)
Oseltamivir Jun 23, 2026
b62de99
CollectiveX: MoRI MI355X validated on hardware; fix heap/buffer/teardown
Oseltamivir Jun 23, 2026
481ef59
CollectiveX: wire rccl-tests collective primitives for MI355X (CX_BEN…
Oseltamivir Jun 23, 2026
78322de
CollectiveX: key dispatch concurrency by SKU so B200/MI355X runs don'…
Oseltamivir Jun 23, 2026
2b23573
CollectiveX: render busbw & latency vs bytes/rank sweep tables in the…
Oseltamivir Jun 23, 2026
a3a492c
CollectiveX: GB200 8-GPU multi-node MNNVL path (CX_NODES), validated …
Oseltamivir Jun 23, 2026
871086d
CollectiveX: fix multi-node build cache (MPI=0 vs MPI=1) + gate all-z…
Oseltamivir Jun 23, 2026
368cfbc
CollectiveX: EP dispatch/combine token sweep with separated timing (t…
Oseltamivir Jun 24, 2026
e2717a3
CollectiveX: make MI355X launcher CI-robust (writable lock dir + node…
Oseltamivir Jun 24, 2026
5c7b273
CollectiveX: fair-comparison EP rebuild — deterministic trace, real f…
Oseltamivir Jun 24, 2026
0052b11
CollectiveX: resource-normalized + tuned regimes for the EP comparison
Oseltamivir Jun 24, 2026
3a872a9
CollectiveX: fail-fast timeout guard + cap the MoRI push smoke (T>=32…
Oseltamivir Jun 24, 2026
5876ea0
CollectiveX: floor MoRI normalized block_num — it deadlocks at T>=32 …
Oseltamivir Jun 24, 2026
353c8ee
CollectiveX: FP8 dispatch + low-latency mode + reject-unsupported fra…
Oseltamivir Jun 24, 2026
3bc941c
CollectiveX: fix B300 warmup artifact + GHA matrix for h100-dgxc/b300…
Oseltamivir Jun 24, 2026
9f85d05
CollectiveX: fix h100-dgxc + b300 launcher slurm/storage from serving…
Oseltamivir Jun 24, 2026
c596882
CollectiveX: serialize same-SKU GHA dispatches + add 3-run reproducib…
Oseltamivir Jun 24, 2026
e71ef3c
CollectiveX: per-point clock-ramp burst (gated) — fixes MoRI wedge + …
Oseltamivir Jun 24, 2026
4e217f9
CollectiveX: MoRI repro/validation drivers pass COLLECTIVEX_IMAGE (pr…
Oseltamivir Jun 24, 2026
7a2f94f
CollectiveX: repro driver — match the T row (MoRI ramp-safe) + cap Mo…
Oseltamivir Jun 24, 2026
bbe0578
CollectiveX: dedicated MoRI repro driver (validation-exact invocation)
Oseltamivir Jun 24, 2026
f7b9d35
CollectiveX v3 measurement: explicit contracts, pooled-trial p50/p90/…
Oseltamivir Jun 25, 2026
1afd268
CollectiveX v3 workflow: capability resolver + NCCL phase-dedup + con…
Oseltamivir Jun 25, 2026
6122acb
CollectiveX v3 plotter: percentile + suite selectors, logical-payload…
Oseltamivir Jun 25, 2026
c136ec5
CollectiveX: v3 harness smoke driver (validates contracts/trials/rout…
Oseltamivir Jun 25, 2026
cf34cb3
CollectiveX: MoRI repro driver iters knob (MORI_ITERS, tighter fast-o…
Oseltamivir Jun 25, 2026
82ec864
CollectiveX: v3 re-run drivers (deepep _v3_rerun.sh + mori _v3_mori.s…
Oseltamivir Jun 25, 2026
cad380a
CollectiveX plotter: default to p50 (p99 too noisy a tail estimate at…
Oseltamivir Jun 25, 2026
81cddca
CollectiveX plotter: X-axis Log/Linear toggle (was hardcoded log)
Oseltamivir Jun 25, 2026
e97bc8b
CollectiveX plotter: auto-stitch decode range into prefill curves (co…
Oseltamivir Jun 25, 2026
6a3a185
chore: dispatch CollectiveX snapshot updates [skip ci]
Oseltamivir Jun 25, 2026
270b7b4
CollectiveX: GB300 EP8 across 2 NVL72 trays + EP-degree-aware plotter
Oseltamivir Jun 25, 2026
a6812dc
CollectiveX: routing axis (balanced/zipf) + EPLB expert-replication l…
Oseltamivir Jun 25, 2026
45c4570
CollectiveX v4 (goal Part 1 + scaffolding): workload identity, measur…
Oseltamivir Jun 25, 2026
600e909
CollectiveX: analyze_ep.py — operating-envelope analysis (skew penalt…
Oseltamivir Jun 25, 2026
171c7d1
CollectiveX: --workload-dir canonical-trace consumption + make_worklo…
Oseltamivir Jun 25, 2026
6dba193
CollectiveX: failure taxonomy (classify hang/OOM/registration/deadloc…
Oseltamivir Jun 25, 2026
8ff23bd
CollectiveX plotter: coverage table (publication status per measured …
Oseltamivir Jun 25, 2026
9e52693
CollectiveX: provenance enrichment (GitHub ref/job/artifact, image ar…
Oseltamivir Jun 25, 2026
82c6130
CollectiveX: structured placement metadata + routing locality fractio…
Oseltamivir Jun 25, 2026
e273009
CollectiveX: scaling efficiency (strong/weak from EP4/EP8) + regressi…
Oseltamivir Jun 25, 2026
978d338
CollectiveX: MI355X cross-vendor canonical-workload consume driver (D…
Oseltamivir Jun 25, 2026
a413de2
CollectiveX plotter: fix grid 'undefined' panel title (stale 'serial'…
Oseltamivir Jun 26, 2026
d799e0f
CollectiveX plotter: prefill panels show only the real prefill range …
Oseltamivir Jun 26, 2026
1622dff
CollectiveX plotter: --legacy {all,exclude,only} — v4-only main plot …
Oseltamivir Jun 26, 2026
f5df0ea
CollectiveX GHA: add routing/eplb inputs + h200/gb300 SKUs; wire CX_E…
Oseltamivir Jun 26, 2026
bb296c4
CollectiveX: launch_gb300-nv.sh — GHA launcher for GB300 (EP4 via run…
Oseltamivir Jun 26, 2026
73da67b
CollectiveX GHA: per-(SKU+config) concurrency group so a multi-config…
Oseltamivir Jun 26, 2026
0df55e8
CollectiveX: per-runner stage dir (fix concurrent-dispatch stale-hand…
Oseltamivir Jun 26, 2026
13f0a0f
CollectiveX: fix H200 GHA launcher FS (/home/sa-shared, not /mnt/nfs)
Oseltamivir Jun 26, 2026
9fb6e5d
CollectiveX: H200 partition main (not hpc-gpu-1)
Oseltamivir Jun 26, 2026
2b5e26c
CollectiveX: GB300 launcher uses docker tag, not squash path
Oseltamivir Jun 26, 2026
d2433e3
CollectiveX: pin h200 dispatch to the h200-dgxc runner pool
Oseltamivir Jun 26, 2026
156bf44
CollectiveX: GHA campaign tooling — collector + matrix dry-label fix
Oseltamivir Jun 26, 2026
59a05e0
CollectiveX: gitignore _ssh_v4_archive/ (superseded SSH result JSONs)
Oseltamivir Jun 26, 2026
254 changes: 254 additions & 0 deletions .github/workflows/collectivex-experimental.yml
@@ -0,0 +1,254 @@
name: CollectiveX Experimental

# Orchestration only — all benchmark logic lives in experimental/CollectiveX/.
# Push to the feature branch runs the MI355X MoRI dispatch/combine benchmark (no
# merge to main needed); workflow_dispatch runs a chosen SKU + benchmark (the lane
# for GB200/B200 NCCL, DeepEP, and larger sweeps). Each job lands on the SKU's
# self-hosted runner and invokes that SKU's launch script — the same
# launch_${RUNNER_NAME%%_*}.sh convention the serving benchmarks use.

on:
  push:
    branches:
      - collectivex
    paths:
      - 'experimental/CollectiveX/**'
      - '.github/workflows/collectivex-experimental.yml'
  workflow_dispatch:
    inputs:
      sku:
        # Only SKUs with a matching launchers/launch_<prefix>.sh are offered —
        # runner.name's prefix selects the script, so an SKU without one fails.
        description: Self-hosted runner pool (must have a CollectiveX launcher)
        type: choice
        default: gb200
        options: [gb200, b200-dgxc, b200-multinode, mi355x, h100-dgxc, h200, b300, gb300]
      benchmark:
        # mori runs only on mi355x; nccl/deepep/all on the NVIDIA SKUs.
        description: Which benchmark to run
        type: choice
        default: nccl
        options: [nccl, deepep, mori, all]
      ops:
        description: NCCL ops (space-separated); blank = default set
        type: string
        default: ''
      min_bytes:
        description: nccl-tests min message size
        type: string
        default: '8'
      max_bytes:
        description: nccl-tests max message size
        type: string
        default: '8G'
      ngpus:
        description: GPUs per node (blank = SKU default)
        type: string
        default: ''
      nodes:
        description: Node count (gb200 multi-node MNNVL; 2 = 8 GPU). Blank/1 = single node.
        type: string
        default: ''
      phase:
        # EP only. 'both' fans out to one job per phase (decode + prefill).
        description: EP phase — decode (small T) / prefill (large T); 'both' = a job each
        type: choice
        default: both
        options: [both, decode, prefill]
      tokens_ladder:
        description: EP source-tokens-per-rank sweep (space/comma sep); blank = phase default
        type: string
        default: ''
      dispatch_dtype:
        description: EP dispatch payload precision
        type: choice
        default: bf16
        options: [bf16, fp8]
      mode:
        # normal = high-throughput kernels (decode+prefill); ll = DeepEP low-latency
        # (decode-shaped, fp8 cast in-kernel). LL is rejected on backends without it
        # (MoRI) and aborts on fabrics that lack it (B300) — run only where supported.
        description: EP kernel path — normal or low-latency (LL)
        type: choice
        default: normal
        options: [normal, ll]
      resource_mode:
        # normalized = ~sm_fraction of device units (cross-vendor apples-to-apples);
        # tuned = each backend's own recommended/default launch config.
        description: Comm resource regime
        type: choice
        default: normalized
        options: [normalized, tuned, default]
      contract:
        # layout-and-dispatch-v1 = dispatch timing includes routing-layout gen (the only
        # contract MoRI honors; use for cross-vendor). cached-layout-comm-only-v1 = layout
        # hoisted out, pure-comm dispatch (DeepEP normal only).
        description: Measurement contract (timing boundary)
        type: choice
        default: layout-and-dispatch-v1
        options: [layout-and-dispatch-v1, cached-layout-comm-only-v1]
      routing:
        # Routing distribution of the shared trace. uniform=realistic; balanced=load-equalized;
        # zipf*=skewed; hotspot-single=one hot expert. The skew + EPLB sweep lives here.
        description: EP routing distribution
        type: choice
        default: uniform
        options: [uniform, balanced, zipf, zipf-mild, zipf-moderate, zipf-heavy, hotspot-single]
      eplb:
        # EPLB = replicate hot experts + balanced-place (the remedy for skewed routing). A pure
        # routing-trace transform; experts -> num_logical+redundant. Meaningful with zipf*.
        description: Apply EPLB expert replication/placement
        type: boolean
        default: false

concurrency:
  # Group per (SKU + FULL config): GitHub keeps only one running + one pending per group and
  # cancels the rest, so a coarse per-SKU group made a fan-out of many configs on one SKU
  # self-cancel down to ~2. Including dtype/mode/contract/routing/eplb/phase gives each config
  # its OWN group -> all configs survive; they queue only on the runner's own capacity, not on
  # GitHub concurrency. cancel-in-progress FALSE so a re-dispatch of the SAME config queues.
  group: collectivex-${{ github.ref }}-${{ github.event_name }}-${{ inputs.sku || 'push' }}-${{ inputs.dispatch_dtype }}-${{ inputs.mode }}-${{ inputs.contract }}-${{ inputs.routing }}-${{ inputs.eplb }}-${{ inputs.phase }}
  cancel-in-progress: false

permissions:
  contents: read

jobs:
  # Push -> MI355X MoRI dispatch/combine. Lands on a free mi355x-amds runner and
  # runs launch_mi355x-amds.sh (CX_BENCH=mori). The AMD workspace is compute-
  # visible, so no CX_STAGE_DIR; the launcher defaults to 8 GPUs.
  experimental:
    name: CollectiveX Experimental (${{ matrix.phase }})
    if: github.event_name == 'push'
    runs-on: mi355x
    timeout-minutes: 90
    strategy:
      fail-fast: false
      matrix:
        # Push = a fast MoRI SMOKE only (decode). The full sweep is workflow_dispatch.
        phase: [decode]
    env:
      CX_BENCH: mori
      CX_PHASE: ${{ matrix.phase }}
      # SMOKE ladder capped at T<=16: MoRI + realistic (fan-out≈5.3) routing currently
      # WEDGES at T>=32 (under investigation; DeepEP is fine), and an unguarded run hung
      # ~1 h before the job timeout. Keep the push smoke in the known-good range; run the
      # full sweep via workflow_dispatch (timeout-guarded). Remove the cap once fixed.
      CX_TOKENS_LADDER: "1 2 4 8 16"
      CX_RUN_TIMEOUT: "600"
      # Pin to the MI355X nodes that hold the node-local squash and have a writable
      # /var/lib/squash; other nodes need a slow cold import that can fail on lock/
      # cache permissions. Widen once the squash is staged cluster-wide.
      CX_NODELIST: mia1-p01-g10,mia1-p01-g15
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0
        with: { clean: true }
      - name: Launch MI355X MoRI (${{ matrix.phase }})
        env:
          RUNNER_NAME: ${{ runner.name }}
        run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"
      - name: Results summary
        if: always()
        run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY"


Workflow skips result failure gate

Medium Severity

Both jobs only run summarize.py --markdown, which is documented to always exit 0. The workflow never runs the plain summarize.py gate on the checkout’s results/ after launch, so a successful Launch step can stay green when the checkout has no valid JSON (e.g. staged runs where copy-back failed).


Reviewed by Cursor Bugbot for commit f48daed.
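
A minimal sketch of the kind of gate this comment asks for, assuming a hypothetical standalone helper. `gate_results` is not part of the PR, and the real non-markdown mode of `summarize.py` may check different things. The idea is only that the step returns a non-zero status when the checkout's results directory has no result JSON, or has a file that does not parse.

```python
import json
import sys
from pathlib import Path


def gate_results(results_dir: str) -> int:
    """Return 0 when every *.json in results_dir parses and at least one exists, else 1.

    Hypothetical helper for illustration; it does not inspect benchmark fields.
    """
    files = sorted(Path(results_dir).glob("*.json"))
    if not files:
        # Covers the staged-run case where copy-back to the checkout failed.
        print(f"no result JSON found in {results_dir}", file=sys.stderr)
        return 1

    bad = []
    for path in files:
        try:
            json.loads(path.read_text())
        except (OSError, json.JSONDecodeError) as exc:
            bad.append(f"{path.name}: {exc}")

    for entry in bad:
        print(f"invalid result file {entry}", file=sys.stderr)
    return 1 if bad else 0
```

A workflow step could pass the return value to `sys.exit` so that the job turns red independently of the always-green markdown summary.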

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
        with:
          name: collectivex_mi355x_mori_${{ matrix.phase }}_${{ github.run_id }}
          path: experimental/CollectiveX/results/*.json
          if-no-files-found: warn
Comment thread
cursor[bot] marked this conversation as resolved.

  # Manual dispatch -> chosen SKU + benchmark. Lands on the inputs.sku runner.
  dispatch:
    if: github.event_name == 'workflow_dispatch'
    # The bare `h200` label spans TWO clusters: 14 h200-dgxc runners (login-0; the EP
    # path is validated there) and 2 h200-cw (CoreWeave) runners that have no
    # launch_h200-cw.sh and die exit 127. Pin h200 to the h200-dgxc pool so every
    # dispatch lands where the launcher + FS + partition are known-good. Other SKUs are
    # single-pool, so pass the sku through unchanged.
    runs-on: ${{ inputs.sku == 'h200' && 'h200-dgxc' || inputs.sku }}
    timeout-minutes: 120
    strategy:
      fail-fast: false
      matrix:
        # nccl/rccl are collective primitives — phase is meaningless, so run ONE job (not
        # the same work twice). EP backends: 'both' -> decode + prefill; else a single job.
        phase: ${{ fromJSON((inputs.benchmark == 'nccl' || inputs.benchmark == 'rccl') && '["na"]' || (inputs.phase == 'both' && '["decode","prefill"]' || format('["{0}"]', inputs.phase))) }}
    env:
      CX_BENCH: ${{ inputs.benchmark }}
      CX_OPS: ${{ inputs.ops }}
      CX_MIN_BYTES: ${{ inputs.min_bytes }}
      CX_MAX_BYTES: ${{ inputs.max_bytes }}
      CX_NGPUS: ${{ inputs.ngpus }}


Workflow ngpus env ignored

Medium Severity

The dispatch job sets CX_NGPUS from the ngpus input, but GB200 and B200 multi-node launchers read CX_GPUS_PER_NODE (defaults 4 and 8) and never CX_NGPUS. Changing ngpus in the workflow does not affect Slurm allocation or world size on those SKUs.


Reviewed by Cursor Bugbot for commit a3a492c.
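
One possible precedence that would make the `ngpus` input take effect, sketched in Python purely for illustration. The launchers themselves are shell scripts, and `resolve_gpus_per_node` is a hypothetical name, not something in the PR. The assumption is that an explicit `CX_GPUS_PER_NODE` should win, then the workflow's `CX_NGPUS`, then the SKU default (4 or 8 per the comment above).

```python
def resolve_gpus_per_node(env: dict, sku_default: int) -> int:
    """Pick GPUs per node: CX_GPUS_PER_NODE, then CX_NGPUS, then the SKU default.

    Blank values are treated as unset, because the workflow passes an empty
    string when the `ngpus` input is left blank.
    """
    for key in ("CX_GPUS_PER_NODE", "CX_NGPUS"):
        raw = env.get(key, "").strip()
        if raw:
            value = int(raw)  # raises ValueError on non-numeric input
            if value <= 0:
                raise ValueError(f"{key} must be positive, got {value}")
            return value
    return sku_default
```

With this ordering, `resolve_gpus_per_node({"CX_NGPUS": "2"}, 4)` yields 2, while a blank `CX_NGPUS` falls back to the SKU default.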

      CX_NODES: ${{ inputs.nodes }}
      CX_PHASE: ${{ matrix.phase }}
      CX_TOKENS_LADDER: ${{ inputs.tokens_ladder }}
      CX_DISPATCH_DTYPE: ${{ inputs.dispatch_dtype }}
      CX_MODE: ${{ inputs.mode }}
      CX_RESOURCE_MODE: ${{ inputs.resource_mode }}
      CX_MEASUREMENT_CONTRACT: ${{ inputs.contract }}
      CX_ROUTING: ${{ inputs.routing }}
      CX_EPLB: ${{ inputs.eplb && '1' || '' }}
      # GHA run provenance: run_ep records git_run (repo/run/attempt/ref/sha/job) -> a GHA result
      # is provenance_complete (publication_status >= comparable-experimental, official w/ canonical).
      COLLECTIVEX_SOURCE_SHA: ${{ github.sha }}
      COLLECTIVEX_ARTIFACT_NAME: collectivex_${{ inputs.sku }}_${{ inputs.benchmark }}_${{ matrix.phase }}_${{ github.run_id }}
      # GB200/watchtower needs a compute-visible workspace; harmless elsewhere.
      CX_STAGE_DIR: ${{ inputs.sku == 'gb200' && '/mnt/lustre01/users-public/sa-shared/cx-stage' || '' }}
      # MI355X: pin to the warm-squash, writable nodes (see the push job).
      CX_NODELIST: ${{ inputs.sku == 'mi355x' && 'mia1-p01-g10,mia1-p01-g15' || '' }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0
        with: { clean: true }
      # Reject an unsupported backend/SKU/mode/dtype/contract BEFORE consuming the runner
      # (review #3): fail fast on the login node, not after a salloc. 'all' fans out per
      # vendor in-container, so skip the single-combo check for it.
      - name: Validate capability
        if: inputs.benchmark != 'all'
        run: |
          python3 experimental/CollectiveX/tests/capability.py \
            --sku "${{ inputs.sku }}" --backend "${{ inputs.benchmark }}" \
            --mode "${{ inputs.mode }}" --dtype "${{ inputs.dispatch_dtype }}" \
            --contract "${{ inputs.contract }}"
      - name: Launch ${{ inputs.sku }} / ${{ inputs.benchmark }} (${{ matrix.phase }})
        env:
          RUNNER_NAME: ${{ runner.name }}
        run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"
Comment thread
cursor[bot] marked this conversation as resolved.


Workflow skips multinode staging

Medium Severity

CX_STAGE_DIR is set only when inputs.sku is gb200. The b200-multinode dispatch target uses launch_b200-dgxc-slurm.sh, which documents the same compute-visible checkout requirement but leaves staging unset, so Slurm jobs may not see the repo mount.


Reviewed by Cursor Bugbot for commit d8ee9bf.
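
The staging choice in the workflow is a single `sku == 'gb200'` test. A per-SKU table is one way to extend it, sketched below in Python as an illustration only. The `gb200` path is the one in the workflow; the `b200-multinode` entry is a hypothetical placeholder, since the PR does not say which compute-visible path that cluster would use, and `stage_dir_for` is an invented name.

```python
# Per-SKU compute-visible staging directories.
STAGE_DIRS = {
    # Taken from the workflow's CX_STAGE_DIR expression.
    "gb200": "/mnt/lustre01/users-public/sa-shared/cx-stage",
    # Hypothetical placeholder: the real path is not given in the PR.
    "b200-multinode": "/path/to/compute-visible/cx-stage",
}


def stage_dir_for(sku: str, stage_dirs: dict = STAGE_DIRS) -> str:
    """Return the staging directory for a SKU, or '' when no staging is needed."""
    return stage_dirs.get(sku, "")
```

Returning an empty string for unlisted SKUs mirrors the workflow's current `|| ''` fallback, so SKUs whose workspace is already compute-visible keep running from the checkout.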

      - name: Results summary
        if: always()
        run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY"
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
        with:
          name: collectivex_${{ inputs.sku }}_${{ inputs.benchmark }}_${{ matrix.phase }}_${{ github.run_id }}
          path: experimental/CollectiveX/results/*.json
          if-no-files-found: warn

  update-frontend-snapshot:
    name: Update InferenceX-app snapshot
    needs: [experimental, dispatch]
    if: >-
      always() &&
      (
        (github.event_name == 'push' && needs.experimental.result == 'success') ||
        (github.event_name == 'workflow_dispatch' && needs.dispatch.result == 'success')
      )
    runs-on: ubuntu-latest
    steps:
      - name: Trigger CollectiveX snapshot update
        env:
          FRONTEND_PAT: ${{ secrets.INFX_FRONTEND_PAT }}
        run: |
          set -euo pipefail
          curl -sSf -X POST \
            -H "Authorization: Bearer $FRONTEND_PAT" \
            -H "Accept: application/vnd.github+json" \
            -H "X-GitHub-Api-Version: 2022-11-28" \
            https://api.github.com/repos/SemiAnalysisAI/InferenceX-app/dispatches \
            -d '{
              "event_type": "update-collectivex-data",
              "client_payload": {
                "source_run_id": "${{ github.run_id }}"
              }
            }'
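
The dispatch job in the workflow above packs three selection rules into expressions: the `matrix.phase` fan-out, the `runs-on` pool pin, and the `launch_${RUNNER_NAME%%_*}.sh` lookup. The Python below restates them as a reading aid, not as code from the PR. The function names are invented, and the runner name `mi355x-amds_3` in the usage note is a hypothetical example of the `<prefix>_<suffix>` shape the shell expansion assumes.

```python
def phase_matrix(benchmark: str, phase: str) -> list:
    """Mirror the matrix.phase expression: one 'na' job for nccl/rccl, else per-phase jobs."""
    if benchmark in ("nccl", "rccl"):
        return ["na"]
    if phase == "both":
        return ["decode", "prefill"]
    return [phase]


def runner_pool(sku: str) -> str:
    """Mirror runs-on: pin the bare h200 label to h200-dgxc, pass other SKUs through."""
    return "h200-dgxc" if sku == "h200" else sku


def launcher_script(runner_name: str) -> str:
    """Mirror ${RUNNER_NAME%%_*}: drop everything from the first underscore onward."""
    prefix = runner_name.split("_", 1)[0]
    return f"experimental/CollectiveX/launchers/launch_{prefix}.sh"
```

For instance, `launcher_script("mi355x-amds_3")` resolves to `launch_mi355x-amds.sh`, and a runner whose prefix has no matching script is what produces the exit 127 mentioned for the h200-cw pool. The `A && B || C` idiom in the workflow behaves like a ternary only because every `B` here is a non-empty string.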
18 changes: 18 additions & 0 deletions experimental/CollectiveX/.gitignore
@@ -0,0 +1,18 @@
# in-container nccl-tests build cache
.nccl-tests/
# python
__pycache__/
*.pyc
# generated run artifacts: captured env embeds hostnames / GPU UUIDs / NIC GUIDs,
# so keep results out of git (CI uploads them as workflow artifacts instead).
# Sanitized headline numbers live in CONTAINERS.md.
results/*.json
results/plots/
results/raw_*.txt
results/raw_*.txt.stderr
# superseded SSH-provenance result JSONs moved aside so plot_ep's recursive glob
# won't double-load them; same hostname/UUID sensitivity as results/.
_ssh_v4_archive/
# running local-only reflection log (not a committed artifact)
notes.md
goal.md