[AMD] Add MiniMax-M3-MXFP4 MI355X vLLM disagg recipe #1914

Open
Duyi-Wang wants to merge 9 commits into main from feat/minimaxm3-fp4-mi355x-vllm-disagg
Conversation

@Duyi-Wang

Collaborator

Disaggregated (prefill/decode) vLLM recipe for amd/MiniMax-M3-MXFP4 on MI355X over the MoRI-IO KV connector.

Recipe

  • benchmarks/multi_node/minimaxm3_fp4_mi355x_vllm-disagg.sh: launcher.
  • models_vllm.yaml: MiniMax-M3-MXFP4 entry. --block-size 128 (MSA), TRITON_ATTN, --language-model-only, AITER MoE, minimax_m3 parsers, --max-num-seqs 512.
  • amd-master.yaml: minimaxm3-fp4-mi355x-vllm-disagg config, 8k1k, two TP4 layouts (1P1D and 2P1D), conc 1..512.

Supporting fixes to the shared vllm-disagg path

  • server_vllm.sh: count prefill/decode GPUs from the per-worker TP size (PREFILL_TP_SIZE*xP / DECODE_TP_SIZE*yD) instead of GPUS_PER_NODE*xP. With TP < node GPU count (e.g. TP4 on an 8-GPU node) the old expression over-counted, corrupting PREFILL_GPUS/DECODE_GPUS and halving tput_per_gpu.
  • env.sh / job.slurm: set the MoRI-IO RDMA QP knobs (MORI_IO_QP_MAX_SEND_WR etc.) for the vllm-disagg path. They were only set in the SGLang branch, so vllm-disagg ran at the default send-queue depth and stalled at high concurrency ("SQ full"). Injected via docker -e so they reach the vLLM worker processes.
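
The GPU-count fix above can be illustrated with a minimal sketch. This is not the code from server_vllm.sh: the helper `count_gpus` is hypothetical, and `xP` / `yD` are assumed here to hold the number of prefill and decode workers. With TP4 on an 8-GPU node and one prefill worker, the old expression reports 8 GPUs where only 4 are in use, which is what halves tput_per_gpu.

```shell
#!/usr/bin/env bash
# Hypothetical helper: GPUs used by one role = per-worker TP size * worker count.
count_gpus() {
    local tp_size=$1 workers=$2
    echo $(( tp_size * workers ))
}

GPUS_PER_NODE=8
PREFILL_TP_SIZE=4
DECODE_TP_SIZE=4
xP=1   # assumed: number of prefill workers (1P1D layout)
yD=1   # assumed: number of decode workers

# Old expression: assumes every worker occupies a whole node.
OLD_PREFILL_GPUS=$(( GPUS_PER_NODE * xP ))

# New expression: counts only the GPUs each worker actually uses.
PREFILL_GPUS=$(count_gpus "$PREFILL_TP_SIZE" "$xP")
DECODE_GPUS=$(count_gpus "$DECODE_TP_SIZE" "$yD")

echo "old prefill count: $OLD_PREFILL_GPUS, new: $PREFILL_GPUS, decode: $DECODE_GPUS"
```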

@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


Comment thread benchmarks/multi_node/amd_utils/env.sh Outdated
Comment on lines +121 to +125
export MORI_IO_SQ_BACKOFF_TIMEOUT_US=50000
export MORI_IO_QP_MAX_SEND_WR=16384
export MORI_IO_QP_MAX_CQE=32768
export MORI_IO_QP_MAX_SGE=2
export MORI_IO_TC_DISABLE=0

🔴 The new vllm-disagg branch in env.sh (lines 121-125) sets MORI_IO_TC_DISABLE=0 (enables traffic-class steering) but never sets MORI_IO_TC — so MoRI-IO RDMA falls back to its built-in default (TC=0) instead of the cluster's PFC no-drop class. The SGLang branch carefully detects this via nicctl/hostname (TC=96 on smci355-ccs-aus*/GPU*, TC=104 on mia1*) and sets both MORI_IO_TC and MORI_IO_TC_DISABLE=0; the vllm-disagg port appears to be incomplete. Fix is straightforward: either replicate the SGLang QoS-detection block, or alias export MORI_IO_TC=$UCX_IB_TRAFFIC_CLASS after the existing UCX detection runs (lines 76-116 in the same branch already compute the right TC for UCX). Secondary: the five MORI_IO_* lines use bare export VAR=value rather than export VAR=${VAR:-value}, which defeats the docker -e VAR=${VAR:-default} override pattern job.slurm (lines 406-410) adds; this matches the SGLang branch's existing convention, so it's lower-stakes.

Extended reasoning...

Primary finding: missing MORI_IO_TC

The vllm-disagg branch's MoRI-IO block is a partial port of the SGLang knobs. Compare the two branches in env.sh:

vllm-disagg branch (new, lines 121-125):

export MORI_IO_SQ_BACKOFF_TIMEOUT_US=50000
export MORI_IO_QP_MAX_SEND_WR=16384
export MORI_IO_QP_MAX_CQE=32768
export MORI_IO_QP_MAX_SGE=2
export MORI_IO_TC_DISABLE=0       # <-- enables TC usage
                                  # <-- no MORI_IO_TC ever set

SGLang branch (lines 140-237): sets the same QP knobs, plus a full nicctl/hostname QoS detection block that exports both MORI_IO_TC=<value> (96 or 104 depending on cluster) and MORI_IO_TC_DISABLE=0 together (lines 199-205 for nicctl, lines 210-217 / 225-232 for hostname fallback).

Why this matters

server_vllm.sh configures kv_connector: MoRIIOConnector in the --kv-transfer-config JSON, so the actual KV-cache RDMA goes through the MoRI-IO connector. MoRI-IO reads the MORI_IO_* family of env vars (not UCX_IB_*). The UCX side-channel detection at env.sh lines 76-116 correctly sets UCX_IB_TRAFFIC_CLASS for the Nixl handshake path, but does not configure MoRI-IO.

With MORI_IO_TC_DISABLE=0 enabling the TC path and MORI_IO_TC unset, MoRI-IO uses its built-in default (typically TC=0, best-effort) instead of the cluster's lossless priority. On the two named target clusters this means:

  • smci355-ccs-aus* / GPU* nodes: should be TC=96, will get TC=0
  • mia1* nodes: should be TC=104, will get TC=0

Step-by-step proof

Concrete scenario on an smci355-ccs-aus node running the new minimaxm3-fp4-mi355x-vllm-disagg recipe:

  1. job.slurm brings up the container and the entrypoint reaches server_vllm.sh.
  2. server_vllm.sh line ~52 does source $WS_PATH/env.sh.
  3. env.sh takes the ENGINE == vllm-disagg branch (line 64).
  4. nicctl is available, so the UCX block (lines 87-95) computes UCX_IB_TRAFFIC_CLASS=96 (PFC no-drop priority). Good.
  5. Control reaches lines 121-125. MORI_IO_TC_DISABLE=0 is exported — "please use a traffic class". MORI_IO_TC remains unset.
  6. vLLM starts, instantiates MoRIIOConnector, which reads MORI_IO_TC. It's unset → built-in default kicks in (TC=0).
  7. KV-cache RDMA traffic now flows on best-effort priority 0 instead of the PFC no-drop priority 6 (DSCP 24, TC = 4×24 = 96).
  8. Under high-concurrency load (the conc=512 sweep in amd-master.yaml) PFC backpressure never engages for KV traffic; on a congested fabric you can get packet drops + recovery latency instead of the lossless behavior the cluster is configured for.
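
The TC arithmetic in step 7 follows from the layout of the IP traffic-class byte: the DSCP occupies the upper six bits and ECN the lower two, so TC = DSCP × 4. A minimal sketch of the conversion, with hypothetical helper names `dscp_to_tc` and `tc_to_dscp`; the mapping from DSCP to a PFC priority is cluster configuration and is not derived here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: convert between a DSCP value and the traffic-class byte.
# The traffic-class byte is DSCP (upper 6 bits) followed by ECN (lower 2 bits).
dscp_to_tc() { echo $(( $1 << 2 )); }
tc_to_dscp() { echo $(( $1 >> 2 )); }

dscp_to_tc 24    # prints 96, the TC quoted for smci355-ccs-aus* / GPU* nodes
tc_to_dscp 104   # prints 26, the DSCP behind the TC quoted for mia1* nodes
```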

Equivalently: job.slurm lines 406-410 propagate MORI_IO_TC_DISABLE through docker -e but contain no -e MORI_IO_TC=… line, so even if a host-side MORI_IO_TC existed it wouldn't reach the worker container.

Tension with the PR's stated intent

The description says these knobs are added because "vllm-disagg ran at the default send-queue depth and stalled at high concurrency ('SQ full')". Bumping MORI_IO_QP_MAX_SEND_WR etc. addresses one half of that; getting RDMA traffic onto the PFC no-drop priority is the other half. Skipping MORI_IO_TC while turning on MORI_IO_TC_DISABLE=0 is the worst combination — TC steering is enabled but pointed at the wrong class.

Secondary observation: defeated override path

job.slurm lines 406-410 use the canonical docker -e VAR=${VAR:-default} idiom — the :-default form strongly signals "pre-set values are honored, otherwise use the default". But env.sh lines 121-125 use unconditional export VAR=value, which clobbers whatever docker -e propagated, since server_vllm.sh sources env.sh after the container starts. A user setting MORI_IO_QP_MAX_SEND_WR=32768 before sbatch to investigate SQ stalls would see job.slurm correctly forward 32768, then env.sh silently reset it back to 16384.

Counter-argument acknowledged: the SGLang branch (lines 142-146) uses the same unconditional pattern, so this matches existing convention rather than introducing new divergence. Defaults match on both sides, so the bare-default path works correctly out of the box. The docker -e block as written effectively becomes dead code with respect to override semantics. Worth flagging because the ${VAR:-default} idiom in job.slurm is the canonical override-support pattern and reads as such; either align both files on ${VAR:-default} or remove the docker -e block for these vars.
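
The difference between the two export styles can be reproduced in isolation. The sketch below is illustrative only: `apply_unconditional` and `apply_defaulted` are hypothetical stand-ins for the env.sh lines, each run in a subshell so the caller's environment is left alone.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the current env.sh style: always overwrite.
apply_unconditional() (
    export MORI_IO_QP_MAX_SEND_WR=16384
    echo "$MORI_IO_QP_MAX_SEND_WR"
)

# Hypothetical stand-in for the override-friendly style: keep a pre-set value.
apply_defaulted() (
    export MORI_IO_QP_MAX_SEND_WR="${MORI_IO_QP_MAX_SEND_WR:-16384}"
    echo "$MORI_IO_QP_MAX_SEND_WR"
)

# Simulate a value set by the user before sbatch and forwarded via docker -e.
export MORI_IO_QP_MAX_SEND_WR=32768
clobbered=$(apply_unconditional)   # the user's value is lost
preserved=$(apply_defaulted)       # the user's value survives

# With nothing pre-set, the defaulted form falls back to the same default.
unset MORI_IO_QP_MAX_SEND_WR
fallback=$(apply_defaulted)

echo "unconditional: $clobbered, defaulted: $preserved, no override: $fallback"
```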

How to fix

Easiest: add a single line after the UCX detection block (around line 119), reusing the value already detected for UCX:

export MORI_IO_TC=${MORI_IO_TC:-${UCX_IB_TRAFFIC_CLASS:-}}

(skip the export entirely when UCX_IB_TRAFFIC_CLASS is empty, so MoRI-IO still picks up its default rather than MORI_IO_TC="").

Cleaner: replicate the nicctl/hostname QoS detection block from the SGLang branch (MORI_IO_TC=$TC + MORI_IO_SL=$ND_PRIO). Also add -e MORI_IO_TC=${MORI_IO_TC:-} to the docker -e block in job.slurm so the value reaches the worker container.
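
A minimal sketch of the "easiest" variant that also honors the note about skipping the export when no UCX value was detected. The function name `set_mori_io_tc` is hypothetical; only the variable names come from the discussion above, and the behavior of MoRI-IO when MORI_IO_TC is unset is assumed rather than verified.

```shell
#!/usr/bin/env bash
# Hypothetical helper: export MORI_IO_TC only when there is a value to export.
set_mori_io_tc() {
    # Honor a pre-set MORI_IO_TC; otherwise reuse the TC detected for UCX.
    # If neither exists, leave MORI_IO_TC unset instead of exporting "".
    if [ -z "${MORI_IO_TC:-}" ] && [ -n "${UCX_IB_TRAFFIC_CLASS:-}" ]; then
        export MORI_IO_TC="$UCX_IB_TRAFFIC_CLASS"
    fi
}

unset MORI_IO_TC UCX_IB_TRAFFIC_CLASS
set_mori_io_tc
echo "no UCX value:  MORI_IO_TC is ${MORI_IO_TC-<unset>}"

UCX_IB_TRAFFIC_CLASS=96
set_mori_io_tc
echo "UCX value 96:  MORI_IO_TC is ${MORI_IO_TC-<unset>}"
```

Unlike the one-line form, this never exports an empty MORI_IO_TC, so the library default still applies on nodes where detection found nothing.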

Comment thread benchmarks/multi_node/amd_utils/env.sh Outdated
Comment thread benchmarks/multi_node/amd_utils/server_vllm.sh
@Duyi-Wang Duyi-Wang force-pushed the feat/minimaxm3-fp4-mi355x-vllm-disagg branch from be92334 to ad44d70 Compare June 24, 2026 07:10
@Duyi-Wang Duyi-Wang force-pushed the feat/minimaxm3-fp4-mi355x-vllm-disagg branch from ad44d70 to c699600 Compare June 24, 2026 07:33
Comment thread .github/configs/amd-master.yaml Outdated
@Duyi-Wang Duyi-Wang force-pushed the feat/minimaxm3-fp4-mi355x-vllm-disagg branch from 9124ea7 to 8d135b2 Compare June 25, 2026 05:46
Disaggregated (prefill/decode) vLLM recipe for amd/MiniMax-M3-MXFP4 on MI355X
over the MoRI-IO KV connector.

Recipe:
- benchmarks/multi_node/minimaxm3_fp4_mi355x_vllm-disagg.sh: launcher.
- models_vllm.yaml: MiniMax-M3-MXFP4 entry. block-size 128 (MSA), TRITON_ATTN,
  --language-model-only, AITER MoE, minimax_m3 parsers. No --kv-cache-dtype fp8
  (the checkpoint ships no calibrated FP8 KV scales).
- amd-master.yaml: minimaxm3-fp4-mi355x-vllm-disagg config, 8k1k, two layouts
  (1P1D TP4 and 2P1D TP4), conc 1..512.

Supporting fixes to the shared vllm-disagg path:
- server_vllm.sh: count prefill/decode GPUs from the per-worker TP size
  (PREFILL_TP_SIZE*xP / DECODE_TP_SIZE*yD) instead of GPUS_PER_NODE*xP. With
  TP < node GPU count (e.g. TP4 on an 8-GPU node) the old expression
  over-counted, corrupting PREFILL_GPUS/DECODE_GPUS and halving tput_per_gpu.
- env.sh / job.slurm: set the MoRI-IO RDMA QP knobs (MORI_IO_QP_MAX_SEND_WR
  etc.) for the vllm-disagg path. They were only set in the SGLang branch, so
  vllm-disagg ran at the default send-queue depth and stalled at high
  concurrency ("SQ full").
Mirror server_sglang.sh / server_atom.sh so the bench.sh GPU count never
resolves to 0 if submit.sh did not export the per-worker TP size.
env.sh exports the MORI_IO_* QP knobs but they do not propagate to the vLLM
worker processes, so inject them into the container base env via docker -e.
MORI_IO_TC_DISABLE is intentionally omitted: the TC value is detected per-node
in env.sh and cannot reach the workers, so enabling TC steering without a TC
value would just fall back to TC=0; leave MoRI-IO at its library default instead.
…P1D only

- amd-master.yaml: bump image to rocm/sgl-dev:vllm-0.23.1-rocm723-mi35x-mori-0624-2;
  drop the 2P1D layout and cap the 1P1D conc sweep at 256.
- server_vllm.sh: set read_mode:true on the MoRIIOConnector kv-transfer-config.
- job.slurm: drop the MORI_IO_* docker -e injection (handled by the image / env.sh).
…tokens 32768

- models_vllm.yaml: add prefill_env (VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4,
  VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB=2048); bump prefill+decode
  --max-num-batched-tokens 16384 -> 32768.
- server_vllm.sh: add a prefill_env / PREFILL_MODEL_ENVS channel (mirrors the
  existing decode_env path) so prefill-only env reaches every prefill rank.
…mbing

Remove the prefill-only VLLM_ROCM_QUICK_REDUCE_* env from the MiniMax-M3-MXFP4
model entry, and the now-unused prefill_env / PREFILL_MODEL_ENVS channel in
server_vllm.sh.
… 2P1D

- image -> rocm/vllm-dev:vllm-0.23.1-rocm723-mi35x-mori-0625
- 1P1D TP4 conc sweep back to 1..512
- re-add 2P1D TP4 layout at conc 128/256/512
@Duyi-Wang Duyi-Wang force-pushed the feat/minimaxm3-fp4-mi355x-vllm-disagg branch from 8d135b2 to f1e314d Compare June 25, 2026 05:49