[AMD] Add MiniMax-M3-MXFP4 MI355X vLLM disagg recipe by Duyi-Wang · Pull Request #1914 · SemiAnalysisAI/InferenceX

Duyi-Wang · 2026-06-24T06:22:39Z

Disaggregated (prefill/decode) vLLM recipe for amd/MiniMax-M3-MXFP4 on MI355X over the MoRI-IO KV connector.

Recipe

benchmarks/multi_node/minimaxm3_fp4_mi355x_vllm-disagg.sh: launcher.
models_vllm.yaml: MiniMax-M3-MXFP4 entry. --block-size 128 (MSA), TRITON_ATTN, --language-model-only, AITER MoE, minimax_m3 parsers, --max-num-seqs 512.
amd-master.yaml: minimaxm3-fp4-mi355x-vllm-disagg config, 8k1k, two TP4 layouts (1P1D and 2P1D), conc 1..512.

Supporting fixes to the shared vllm-disagg path

server_vllm.sh: count prefill/decode GPUs from the per-worker TP size (PREFILL_TP_SIZE*xP / DECODE_TP_SIZE*yD) instead of GPUS_PER_NODE*xP. With TP < node GPU count (e.g. TP4 on an 8-GPU node) the old expression over-counted, corrupting PREFILL_GPUS/DECODE_GPUS and halving tput_per_gpu.
env.sh / job.slurm: set the MoRI-IO RDMA QP knobs (MORI_IO_QP_MAX_SEND_WR etc.) for the vllm-disagg path. They were only set in the SGLang branch, so vllm-disagg ran at the default send-queue depth and stalled at high concurrency ("SQ full"). Injected via docker -e so they reach the vLLM worker processes.
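The GPU-count fix can be illustrated with a few lines of shell arithmetic. This is a minimal sketch, not the actual server_vllm.sh code: the variable names `old_prefill_gpus`, `new_prefill_gpus`, and `new_decode_gpus` are hypothetical, and the values model the 1P1D TP4 layout on an 8-GPU node mentioned above.

```bash
# Illustrative values: one prefill and one decode worker, TP4, 8-GPU nodes.
GPUS_PER_NODE=8
PREFILL_TP_SIZE=4
DECODE_TP_SIZE=4
xP=1   # number of prefill workers
yD=1   # number of decode workers

# Old expression: treats every prefill worker as a full node.
old_prefill_gpus=$(( GPUS_PER_NODE * xP ))

# New expressions: each worker occupies exactly its TP size in GPUs.
new_prefill_gpus=$(( PREFILL_TP_SIZE * xP ))
new_decode_gpus=$(( DECODE_TP_SIZE * yD ))

echo "old PREFILL_GPUS=${old_prefill_gpus}"
echo "new PREFILL_GPUS=${new_prefill_gpus} DECODE_GPUS=${new_decode_gpus}"
```

With these values the old expression reports 8 prefill GPUs where only 4 are in use, which is the kind of over-count that deflates any per-GPU throughput figure derived from it.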

github-actions · 2026-06-24T06:22:47Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-06-24T06:46:51Z

+    export MORI_IO_SQ_BACKOFF_TIMEOUT_US=50000
+    export MORI_IO_QP_MAX_SEND_WR=16384
+    export MORI_IO_QP_MAX_CQE=32768
+    export MORI_IO_QP_MAX_SGE=2
+    export MORI_IO_TC_DISABLE=0


🔴 The new vllm-disagg branch in env.sh (lines 121-125) sets MORI_IO_TC_DISABLE=0 (enables traffic-class steering) but never sets MORI_IO_TC — so MoRI-IO RDMA falls back to its built-in default (TC=0) instead of the cluster's PFC no-drop class. The SGLang branch carefully detects this via nicctl/hostname (TC=96 on smci355-ccs-aus*/GPU*, TC=104 on mia1*) and sets both MORI_IO_TC and MORI_IO_TC_DISABLE=0; the vllm-disagg port appears to be incomplete. Fix is straightforward: either replicate the SGLang QoS-detection block, or alias export MORI_IO_TC=$UCX_IB_TRAFFIC_CLASS after the existing UCX detection runs (lines 76-116 in the same branch already compute the right TC for UCX). Secondary: the five MORI_IO_* lines use bare export VAR=value rather than export VAR=${VAR:-value}, which defeats the docker -e VAR=${VAR:-default} override pattern job.slurm (lines 406-410) adds; this matches the SGLang branch's existing convention, so it's lower-stakes.

Extended reasoning...

Primary finding: missing MORI_IO_TC

The vllm-disagg branch's MoRI-IO block is a partial port of the SGLang knobs. Compare the two branches in env.sh:

vllm-disagg branch (new, lines 121-125):

export MORI_IO_SQ_BACKOFF_TIMEOUT_US=50000
export MORI_IO_QP_MAX_SEND_WR=16384
export MORI_IO_QP_MAX_CQE=32768
export MORI_IO_QP_MAX_SGE=2
export MORI_IO_TC_DISABLE=0   # <-- enables TC usage
                              # <-- no MORI_IO_TC ever set

SGLang branch (lines 140-237): sets the same QP knobs, plus a full nicctl/hostname QoS detection block that exports both MORI_IO_TC=<value> (96 or 104 depending on cluster) and MORI_IO_TC_DISABLE=0 together (lines 199-205 for nicctl, lines 210-217 / 225-232 for hostname fallback).

Why this matters

server_vllm.sh configures kv_connector: MoRIIOConnector in the --kv-transfer-config JSON, so the actual KV-cache RDMA goes through the MoRI-IO connector. MoRI-IO reads the MORI_IO_* family of env vars (not UCX_IB_*). The UCX side-channel detection at env.sh lines 76-116 correctly sets UCX_IB_TRAFFIC_CLASS for the Nixl handshake path, but does not configure MoRI-IO.

With MORI_IO_TC_DISABLE=0 enabling the TC path and MORI_IO_TC unset, MoRI-IO uses its built-in default (typically TC=0, best-effort) instead of the cluster's lossless priority. On the two named target clusters this means:

smci355-ccs-aus* / GPU* nodes: should be TC=96, will get TC=0

mia1* nodes: should be TC=104, will get TC=0

Step-by-step proof

Concrete scenario on an smci355-ccs-aus node running the new minimaxm3-fp4-mi355x-vllm-disagg recipe:

1. job.slurm brings up the container and the entrypoint reaches server_vllm.sh.

2. server_vllm.sh line ~52 does source $WS_PATH/env.sh.

3. env.sh takes the ENGINE == vllm-disagg branch (line 64).

4. nicctl is available, so the UCX block (lines 87-95) computes UCX_IB_TRAFFIC_CLASS=96 (PFC no-drop priority). Good.

5. Control reaches lines 121-125. MORI_IO_TC_DISABLE=0 is exported, which means "please use a traffic class". MORI_IO_TC remains unset.

6. vLLM starts and instantiates MoRIIOConnector, which reads MORI_IO_TC. It is unset, so the built-in default kicks in (TC=0).

7. KV-cache RDMA traffic now flows on best-effort priority 0 instead of the PFC no-drop priority 6 (DSCP 24, TC = 4×24 = 96).

8. Under high-concurrency load (the conc=512 sweep in amd-master.yaml) PFC backpressure never engages for KV traffic; on a congested fabric you can get packet drops + recovery latency instead of the lossless behavior the cluster is configured for.

Equivalently: job.slurm lines 406-410 propagate MORI_IO_TC_DISABLE through docker -e but contain no -e MORI_IO_TC=… line, so even if a host-side MORI_IO_TC existed it wouldn't reach the worker container.
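The "TC = 4×24 = 96" relation in the proof above follows from the layout of the traffic-class byte, where the DSCP occupies the upper six bits and ECN the lower two. A small sketch of the conversion, with hypothetical helper names `dscp_to_tc` and `tc_to_dscp`:

```bash
# Traffic-class byte = DSCP in the upper six bits, ECN in the lower two,
# so with ECN bits at zero the TC value is DSCP shifted left by two (x4).
dscp_to_tc() { echo $(( $1 << 2 )); }
tc_to_dscp() { echo $(( $1 >> 2 )); }

dscp_to_tc 24    # prints 96  (the smci355-ccs-aus* / GPU* value)
tc_to_dscp 104   # prints 26  (DSCP implied by the mia1* value)
```

The DSCP of 26 for the mia1* clusters is derived here by arithmetic only; the review itself states just the TC value of 104.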

Tension with the PR's stated intent

The description says these knobs are added because "vllm-disagg ran at the default send-queue depth and stalled at high concurrency ('SQ full')". Bumping MORI_IO_QP_MAX_SEND_WR etc. addresses one half of that; getting RDMA traffic onto the PFC no-drop priority is the other half. Skipping MORI_IO_TC while turning on MORI_IO_TC_DISABLE=0 is the worst combination — TC steering is enabled but pointed at the wrong class.

Secondary observation: defeated override path

job.slurm lines 406-410 use the canonical docker -e VAR=${VAR:-default} idiom — the :-default form strongly signals "pre-set values are honored, otherwise use the default". But env.sh lines 121-125 use unconditional export VAR=value, which clobbers whatever docker -e propagated, since server_vllm.sh sources env.sh after the container starts. A user setting MORI_IO_QP_MAX_SEND_WR=32768 before sbatch to investigate SQ stalls would see job.slurm correctly forward 32768, then env.sh silently reset it back to 16384.

Counter-argument acknowledged: the SGLang branch (lines 142-146) uses the same unconditional pattern, so this matches existing convention rather than introducing new divergence. Defaults match on both sides, so the bare-default path works correctly out of the box. The docker -e block as written effectively becomes dead code with respect to override semantics. Worth flagging because the ${VAR:-default} idiom in job.slurm is the canonical override-support pattern and reads as such; either align both files on ${VAR:-default} or remove the docker -e block for these vars.
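The difference between the two export styles can be reproduced in isolation. The sketch below simulates a value forwarded by docker -e and then applies each style in a subshell; the function names `bare` and `defaulted` are hypothetical, and the numbers are the ones used in the example above.

```bash
# Simulate a value that job.slurm forwarded into the container.
MORI_IO_QP_MAX_SEND_WR=32768

# Bare export (the style used in env.sh): overwrites the forwarded value.
bare() (
    export MORI_IO_QP_MAX_SEND_WR=16384
    echo "$MORI_IO_QP_MAX_SEND_WR"
)

# Defaulting export: keeps a pre-set value, falls back to 16384 otherwise.
defaulted() (
    export MORI_IO_QP_MAX_SEND_WR="${MORI_IO_QP_MAX_SEND_WR:-16384}"
    echo "$MORI_IO_QP_MAX_SEND_WR"
)

bare        # prints 16384
defaulted   # prints 32768
```

Each function body runs in its own subshell, so neither call changes the value seen by the other.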

How to fix

Easiest: add a single line after the UCX detection block (around line 119), reusing the value already detected for UCX:

export MORI_IO_TC=${MORI_IO_TC:-${UCX_IB_TRAFFIC_CLASS:-}}

(skip the export entirely when UCX_IB_TRAFFIC_CLASS is empty, so MoRI-IO still picks up its default rather than MORI_IO_TC="").
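A guarded form of that one-liner might look like the following. This is a sketch under the assumption, taken from the review text, that MoRI-IO falls back to its own default when MORI_IO_TC is unset; the function name `set_mori_io_tc` is hypothetical.

```bash
# Export MORI_IO_TC only when a value is available; otherwise leave it unset
# so that an empty string is never passed to the library.
set_mori_io_tc() {
    local tc="${MORI_IO_TC:-${UCX_IB_TRAFFIC_CLASS:-}}"
    if [ -n "$tc" ]; then
        export MORI_IO_TC="$tc"
    else
        unset MORI_IO_TC
    fi
}
```

A pre-set MORI_IO_TC wins over the UCX-derived value, which keeps the override path open for debugging.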

Cleaner: replicate the nicctl/hostname QoS detection block from the SGLang branch (MORI_IO_TC=$TC + MORI_IO_SL=$ND_PRIO). Also add -e MORI_IO_TC=${MORI_IO_TC:-} to the docker -e block in job.slurm so the value reaches the worker container.
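The hostname fallback of that detection could be sketched as below. This is not the SGLang branch's actual code: the helper `tc_for_host` is hypothetical, the nicctl probe and MORI_IO_SL are left out because their details are not shown in this review, and only the hostname patterns and TC values named above are used.

```bash
# Hypothetical helper: map a hostname to the TC values named in the review.
tc_for_host() {
    case "$1" in
        smci355-ccs-aus*|GPU*) echo 96 ;;
        mia1*)                 echo 104 ;;
        *)                     return 1 ;;   # unknown cluster: no TC
    esac
}

# Only export when the host is recognized; a pre-set MORI_IO_TC is honored.
if tc=$(tc_for_host "$(uname -n)"); then
    export MORI_IO_TC="${MORI_IO_TC:-$tc}"
fi
```

On an unrecognized host the function returns non-zero and MORI_IO_TC is left untouched.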

github-actions · 2026-06-24T07:34:27Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28082170121
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28082170121

github-actions · 2026-06-24T07:42:12Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28082773877
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28082773877

github-actions · 2026-06-24T09:09:51Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28083152210
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28083152210

github-actions · 2026-06-24T12:04:44Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28088896308
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28088896308

github-actions · 2026-06-24T13:40:31Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28097926713
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28097926713

github-actions · 2026-06-24T14:05:45Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28104118667
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28104118667

github-actions · 2026-06-24T15:47:02Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28104424231
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28104424231

github-actions · 2026-06-24T15:53:44Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28104424231
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28104424231

github-actions · 2026-06-24T18:33:01Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28104424231
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28104424231

Disaggregated (prefill/decode) vLLM recipe for amd/MiniMax-M3-MXFP4 on MI355X over the MoRI-IO KV connector.

Recipe:

- benchmarks/multi_node/minimaxm3_fp4_mi355x_vllm-disagg.sh: launcher.
- models_vllm.yaml: MiniMax-M3-MXFP4 entry. block-size 128 (MSA), TRITON_ATTN, --language-model-only, AITER MoE, minimax_m3 parsers. No --kv-cache-dtype fp8 (the checkpoint ships no calibrated FP8 KV scales).
- amd-master.yaml: minimaxm3-fp4-mi355x-vllm-disagg config, 8k1k, two layouts (1P1D TP4 and 2P1D TP4), conc 1..512.

Supporting fixes to the shared vllm-disagg path:

- server_vllm.sh: count prefill/decode GPUs from the per-worker TP size (PREFILL_TP_SIZE*xP / DECODE_TP_SIZE*yD) instead of GPUS_PER_NODE*xP. With TP < node GPU count (e.g. TP4 on an 8-GPU node) the old expression over-counted, corrupting PREFILL_GPUS/DECODE_GPUS and halving tput_per_gpu.
- env.sh / job.slurm: set the MoRI-IO RDMA QP knobs (MORI_IO_QP_MAX_SEND_WR etc.) for the vllm-disagg path. They were only set in the SGLang branch, so vllm-disagg ran at the default send-queue depth and stalled at high concurrency ("SQ full").

Mirror server_sglang.sh / server_atom.sh so the bench.sh GPU count never resolves to 0 if submit.sh did not export the per-worker TP size.
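A minimal sketch of such a default, assuming an 8-GPU node when GPUS_PER_NODE is not set; the function name `default_tp_sizes` is hypothetical and the real scripts may structure this differently.

```bash
# Fall back to a full node when the per-worker TP size was not exported.
default_tp_sizes() {
    : "${GPUS_PER_NODE:=8}"                  # assumed node size for this sketch
    : "${PREFILL_TP_SIZE:=$GPUS_PER_NODE}"
    : "${DECODE_TP_SIZE:=$GPUS_PER_NODE}"
}

default_tp_sizes
echo "PREFILL_TP_SIZE=${PREFILL_TP_SIZE} DECODE_TP_SIZE=${DECODE_TP_SIZE}"
```

The `:=` expansion assigns only when the variable is unset or empty, so values exported by the submit script pass through unchanged and a later GPU count can never be computed from an empty TP size.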

env.sh exports the MORI_IO_* QP knobs but they do not propagate to the vLLM worker processes, so inject them into the container base env via docker -e. MORI_IO_TC_DISABLE is intentionally omitted: the TC value is detected per-node in env.sh and cannot reach the workers, so enabling TC steering without a TC value would just fall back to TC=0; leave MoRI-IO at its library default instead.
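The injection described here can be sketched as a bash array of -e flags. This is an illustration, not the job.slurm code: the function and array names are hypothetical, the defaults are the values from the env.sh diff, and the command is printed rather than executed, with `<image>` as a placeholder.

```bash
# Collect docker -e flags for the MoRI-IO QP knobs, honoring pre-set values.
# MORI_IO_TC_DISABLE is deliberately left out, as the commit message explains.
build_mori_io_env_args() {
    docker_env_args=(
        -e "MORI_IO_SQ_BACKOFF_TIMEOUT_US=${MORI_IO_SQ_BACKOFF_TIMEOUT_US:-50000}"
        -e "MORI_IO_QP_MAX_SEND_WR=${MORI_IO_QP_MAX_SEND_WR:-16384}"
        -e "MORI_IO_QP_MAX_CQE=${MORI_IO_QP_MAX_CQE:-32768}"
        -e "MORI_IO_QP_MAX_SGE=${MORI_IO_QP_MAX_SGE:-2}"
    )
}

build_mori_io_env_args
# Print the command instead of running it.
echo docker run "${docker_env_args[@]}" "<image>"
```

Keeping the flags in an array avoids word-splitting surprises if a value ever contains whitespace.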

…P1D only

- amd-master.yaml: bump image to rocm/sgl-dev:vllm-0.23.1-rocm723-mi35x-mori-0624-2; drop the 2P1D layout and cap the 1P1D conc sweep at 256.
- server_vllm.sh: set read_mode:true on the MoRIIOConnector kv-transfer-config.
- job.slurm: drop the MORI_IO_* docker -e injection (handled by the image / env.sh).

…tokens 32768

- models_vllm.yaml: add prefill_env (VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4, VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB=2048); bump prefill+decode --max-num-batched-tokens 16384 -> 32768.
- server_vllm.sh: add a prefill_env / PREFILL_MODEL_ENVS channel (mirrors the existing decode_env path) so prefill-only env reaches every prefill rank.

…mbing Remove the prefill-only VLLM_ROCM_QUICK_REDUCE_* env from the MiniMax-M3-MXFP4 model entry, and the now-unused prefill_env / PREFILL_MODEL_ENVS channel in server_vllm.sh.

… 2P1D

- image -> rocm/vllm-dev:vllm-0.23.1-rocm723-mi35x-mori-0625
- 1P1D TP4 conc sweep back to 1..512
- re-add 2P1D TP4 layout at conc 128/256/512

github-actions · 2026-06-25T10:46:02Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28149795726
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28149795726

github-actions · 2026-06-25T11:36:41Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28149795726
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28149795726

github-actions · 2026-06-25T12:21:35Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28149795726
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28149795726

Duyi-Wang requested a review from a team June 24, 2026 06:22

Duyi-Wang requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners June 24, 2026 06:22

github-project-automation Bot added this to InferenceMAX Board Jun 24, 2026

Duyi-Wang mentioned this pull request Jun 24, 2026

[AMD] Add MiniMax-M3-MXFP4 MI355X vLLM disagg recipe #1912

Closed

Duyi-Wang force-pushed the feat/minimaxm3-fp4-mi355x-vllm-disagg branch from 3298bcf to 6c0e812 Compare June 24, 2026 06:31

claude Bot reviewed Jun 24, 2026

chunfangamd requested changes Jun 24, 2026

Comment thread benchmarks/multi_node/amd_utils/env.sh Outdated

Comment thread benchmarks/multi_node/amd_utils/server_vllm.sh

Duyi-Wang force-pushed the feat/minimaxm3-fp4-mi355x-vllm-disagg branch from be92334 to ad44d70 Compare June 24, 2026 07:10

chunfangamd added the full-sweep-enabled label Jun 24, 2026

Duyi-Wang force-pushed the feat/minimaxm3-fp4-mi355x-vllm-disagg branch from ad44d70 to c699600 Compare June 24, 2026 07:33

functionstackx requested changes Jun 24, 2026

Comment thread .github/configs/amd-master.yaml Outdated

Duyi-Wang force-pushed the feat/minimaxm3-fp4-mi355x-vllm-disagg branch from 9124ea7 to 8d135b2 Compare June 25, 2026 05:46

Duyi-Wang added 4 commits June 25, 2026 05:49

[AMD] perf-changelog: add MiniMax-M3-MXFP4 MI355X vLLM disagg entry

7b3ed23

[AMD] server_vllm.sh: default PREFILL/DECODE_TP_SIZE to a full node

6592e7f

Duyi-Wang added 5 commits June 25, 2026 05:49

[AMD] minimaxm3-fp4 disagg: cap 1P1D conc sweep at 128

2eb5331

[AMD] minimaxm3-fp4: drop prefill_env (INT4 quick-reduce) and its plu…

5dc09e4

[AMD] minimaxm3-fp4 disagg: mori-0625 image, 1P1D conc 1..512, re-add…

f1e314d

Duyi-Wang force-pushed the feat/minimaxm3-fp4-mi355x-vllm-disagg branch from 8d135b2 to f1e314d Compare June 25, 2026 05:49
