[AMD] Add MiniMax-M3-MXFP4 MI355X vLLM disagg recipe#1914
Thanks for the contribution! For vLLM and SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring that they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
3298bcf to 6c0e812
export MORI_IO_SQ_BACKOFF_TIMEOUT_US=50000
export MORI_IO_QP_MAX_SEND_WR=16384
export MORI_IO_QP_MAX_CQE=32768
export MORI_IO_QP_MAX_SGE=2
export MORI_IO_TC_DISABLE=0
🔴 The new vllm-disagg branch in env.sh (lines 121-125) sets MORI_IO_TC_DISABLE=0 (enables traffic-class steering) but never sets MORI_IO_TC — so MoRI-IO RDMA falls back to its built-in default (TC=0) instead of the cluster's PFC no-drop class. The SGLang branch carefully detects this via nicctl/hostname (TC=96 on smci355-ccs-aus*/GPU*, TC=104 on mia1*) and sets both MORI_IO_TC and MORI_IO_TC_DISABLE=0; the vllm-disagg port appears to be incomplete. Fix is straightforward: either replicate the SGLang QoS-detection block, or alias export MORI_IO_TC=$UCX_IB_TRAFFIC_CLASS after the existing UCX detection runs (lines 76-116 in the same branch already compute the right TC for UCX). Secondary: the five MORI_IO_* lines use bare export VAR=value rather than export VAR=${VAR:-value}, which defeats the docker -e VAR=${VAR:-default} override pattern job.slurm (lines 406-410) adds; this matches the SGLang branch's existing convention, so it's lower-stakes.
Extended reasoning...
Primary finding: missing MORI_IO_TC
The vllm-disagg branch's MoRI-IO block is a partial port of the SGLang knobs. Compare the two branches in env.sh:
vllm-disagg branch (new, lines 121-125):
export MORI_IO_SQ_BACKOFF_TIMEOUT_US=50000
export MORI_IO_QP_MAX_SEND_WR=16384
export MORI_IO_QP_MAX_CQE=32768
export MORI_IO_QP_MAX_SGE=2
export MORI_IO_TC_DISABLE=0 # <-- enables TC usage
# <-- no MORI_IO_TC ever set
SGLang branch (lines 140-237): sets the same QP knobs, plus a full nicctl/hostname QoS detection block that exports both MORI_IO_TC=<value> (96 or 104 depending on cluster) and MORI_IO_TC_DISABLE=0 together (lines 199-205 for nicctl, lines 210-217 / 225-232 for hostname fallback).
Why this matters
server_vllm.sh configures kv_connector: MoRIIOConnector in the --kv-transfer-config JSON, so the actual KV-cache RDMA goes through the MoRI-IO connector. MoRI-IO reads the MORI_IO_* family of env vars (not UCX_IB_*). The UCX side-channel detection at env.sh lines 76-116 correctly sets UCX_IB_TRAFFIC_CLASS for the Nixl handshake path, but does not configure MoRI-IO.
With MORI_IO_TC_DISABLE=0 enabling the TC path and MORI_IO_TC unset, MoRI-IO uses its built-in default (typically TC=0, best-effort) instead of the cluster's lossless priority. On the two named target clusters this means:
- smci355-ccs-aus*/GPU* nodes: should be TC=96, will get TC=0
- mia1* nodes: should be TC=104, will get TC=0
Step-by-step proof
Concrete scenario on an smci355-ccs-aus node running the new minimaxm3-fp4-mi355x-vllm-disagg recipe:
- job.slurm brings up the container and the entrypoint reaches server_vllm.sh.
- server_vllm.sh line ~52 does source $WS_PATH/env.sh.
- env.sh takes the ENGINE == vllm-disagg branch (line 64).
- nicctl is available, so the UCX block (lines 87-95) computes UCX_IB_TRAFFIC_CLASS=96 (PFC no-drop priority). Good.
- Control reaches lines 121-125. MORI_IO_TC_DISABLE=0 is exported ("please use a traffic class"). MORI_IO_TC remains unset.
- vLLM starts and instantiates MoRIIOConnector, which reads MORI_IO_TC. It is unset, so the built-in default kicks in (TC=0).
- KV-cache RDMA traffic now flows on best-effort priority 0 instead of the PFC no-drop priority 6 (DSCP 24, TC = 4×24 = 96).
- Under high-concurrency load (the conc=512 sweep in amd-master.yaml) PFC backpressure never engages for KV traffic; on a congested fabric you can get packet drops + recovery latency instead of the lossless behavior the cluster is configured for.
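The TC arithmetic used in the steps above (DSCP 24 giving TC 96) follows from DSCP occupying the upper six bits of the 8-bit traffic-class byte, so TC = DSCP × 4. A minimal sketch of that conversion; the helper name dscp_to_tc is hypothetical and does not appear in env.sh:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of env.sh): convert a DSCP value to the
# 8-bit traffic-class value. DSCP sits in the upper 6 bits, so TC = DSCP << 2.
dscp_to_tc() {
  echo $(( $1 << 2 ))
}

dscp_to_tc 24   # prints 96
dscp_to_tc 26   # prints 104
```

By the same arithmetic, the TC=104 quoted for the mia1* nodes corresponds to DSCP 26.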
Equivalently: job.slurm lines 406-410 propagate MORI_IO_TC_DISABLE through docker -e but contain no -e MORI_IO_TC=… line, so even if a host-side MORI_IO_TC existed it wouldn't reach the worker container.
Tension with the PR's stated intent
The description says these knobs are added because "vllm-disagg ran at the default send-queue depth and stalled at high concurrency ('SQ full')". Bumping MORI_IO_QP_MAX_SEND_WR etc. addresses one half of that; getting RDMA traffic onto the PFC no-drop priority is the other half. Skipping MORI_IO_TC while turning on MORI_IO_TC_DISABLE=0 is the worst combination — TC steering is enabled but pointed at the wrong class.
Secondary observation: defeated override path
job.slurm lines 406-410 use the canonical docker -e VAR=${VAR:-default} idiom — the :-default form strongly signals "pre-set values are honored, otherwise use the default". But env.sh lines 121-125 use unconditional export VAR=value, which clobbers whatever docker -e propagated, since server_vllm.sh sources env.sh after the container starts. A user setting MORI_IO_QP_MAX_SEND_WR=32768 before sbatch to investigate SQ stalls would see job.slurm correctly forward 32768, then env.sh silently reset it back to 16384.
Counter-argument acknowledged: the SGLang branch (lines 142-146) uses the same unconditional pattern, so this matches existing convention rather than introducing new divergence. Defaults match on both sides, so the bare-default path works correctly out of the box. The docker -e block as written effectively becomes dead code with respect to override semantics. Worth flagging because the ${VAR:-default} idiom in job.slurm is the canonical override-support pattern and reads as such; either align both files on ${VAR:-default} or remove the docker -e block for these vars.
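The difference between the two export forms can be reproduced in isolation. The sketch below simulates a value forwarded by docker -e and then applies each form in turn; the variable names preserved and clobbered are illustrative only:

```shell
#!/usr/bin/env bash
# Simulate a value that docker -e forwarded into the container.
export MORI_IO_QP_MAX_SEND_WR=32768

# Default-preserving form: a pre-set value wins, 16384 is only a fallback.
export MORI_IO_QP_MAX_SEND_WR=${MORI_IO_QP_MAX_SEND_WR:-16384}
preserved=$MORI_IO_QP_MAX_SEND_WR

# Unconditional form (as described for env.sh lines 121-125): always overwrites.
export MORI_IO_QP_MAX_SEND_WR=16384
clobbered=$MORI_IO_QP_MAX_SEND_WR

echo "preserved=$preserved clobbered=$clobbered"
# prints: preserved=32768 clobbered=16384
```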
How to fix
Easiest: add a single line after the UCX detection block (around line 119), reusing the value already detected for UCX:
export MORI_IO_TC=${MORI_IO_TC:-${UCX_IB_TRAFFIC_CLASS:-}}
(Skip the export entirely when UCX_IB_TRAFFIC_CLASS is empty, so MoRI-IO still picks up its default rather than MORI_IO_TC="".)
Cleaner: replicate the nicctl/hostname QoS detection block from the SGLang branch (MORI_IO_TC=$TC + MORI_IO_SL=$ND_PRIO). Also add -e MORI_IO_TC=${MORI_IO_TC:-} to the docker -e block in job.slurm so the value reaches the worker container.
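A possible shape for the "easiest" option, including the skip-when-empty guard, is sketched below. The function name set_mori_io_tc is hypothetical, and the sketch assumes (as the comment above does) that MoRI-IO treats an unset MORI_IO_TC differently from an empty one:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in env.sh): export MORI_IO_TC only when a value exists.
set_mori_io_tc() {
  # Prefer a pre-set MORI_IO_TC, then fall back to the TC detected for UCX.
  local tc="${MORI_IO_TC:-${UCX_IB_TRAFFIC_CLASS:-}}"
  if [ -n "$tc" ]; then
    export MORI_IO_TC="$tc"
  else
    # Nothing detected: leave MORI_IO_TC unset rather than exporting "".
    unset MORI_IO_TC
  fi
}

# Example: UCX detection found TC=96 and MORI_IO_TC was not pre-set.
unset MORI_IO_TC
UCX_IB_TRAFFIC_CLASS=96
set_mori_io_tc
echo "MORI_IO_TC=${MORI_IO_TC:-<unset>}"   # prints MORI_IO_TC=96
```

A pre-set MORI_IO_TC still takes precedence over the UCX-derived value, which keeps the override path open.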
be92334 to ad44d70
ad44d70 to c699600
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28082170121
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28082773877
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28083152210
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28088896308
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28097926713
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28104118667
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28104424231
9124ea7 to 8d135b2
Disaggregated (prefill/decode) vLLM recipe for amd/MiniMax-M3-MXFP4 on MI355X
over the MoRI-IO KV connector.
Recipe:
- benchmarks/multi_node/minimaxm3_fp4_mi355x_vllm-disagg.sh: launcher.
- models_vllm.yaml: MiniMax-M3-MXFP4 entry. block-size 128 (MSA), TRITON_ATTN,
--language-model-only, AITER MoE, minimax_m3 parsers. No --kv-cache-dtype fp8
(the checkpoint ships no calibrated FP8 KV scales).
- amd-master.yaml: minimaxm3-fp4-mi355x-vllm-disagg config, 8k1k, two layouts
(1P1D TP4 and 2P1D TP4), conc 1..512.
Supporting fixes to the shared vllm-disagg path:
- server_vllm.sh: count prefill/decode GPUs from the per-worker TP size
(PREFILL_TP_SIZE*xP / DECODE_TP_SIZE*yD) instead of GPUS_PER_NODE*xP. With
TP < node GPU count (e.g. TP4 on an 8-GPU node) the old expression
over-counted, corrupting PREFILL_GPUS/DECODE_GPUS and halving tput_per_gpu.
- env.sh / job.slurm: set the MoRI-IO RDMA QP knobs (MORI_IO_QP_MAX_SEND_WR
etc.) for the vllm-disagg path. They were only set in the SGLang branch, so
vllm-disagg ran at the default send-queue depth and stalled at high
concurrency ("SQ full").
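The GPU-count fix described for server_vllm.sh can be checked with a small worked example. The values below (TP4 workers on 8-GPU nodes, one prefill and one decode worker) are illustrative, and the variable old_prefill_gpus is introduced here only for comparison:

```shell
#!/usr/bin/env bash
# Illustrative values: TP4 workers on 8-GPU nodes, 1 prefill + 1 decode worker (1P1D).
GPUS_PER_NODE=8
PREFILL_TP_SIZE=4
DECODE_TP_SIZE=4
xP=1
yD=1

# Old expression: assumes every prefill worker occupies a whole node.
old_prefill_gpus=$(( GPUS_PER_NODE * xP ))

# New expressions: count GPUs from the per-worker TP size.
PREFILL_GPUS=$(( PREFILL_TP_SIZE * xP ))
DECODE_GPUS=$(( DECODE_TP_SIZE * yD ))

echo "old prefill=$old_prefill_gpus new prefill=$PREFILL_GPUS decode=$DECODE_GPUS"
# prints: old prefill=8 new prefill=4 decode=4
```

With these values the old expression reports twice as many prefill GPUs as are actually in use, which is consistent with the halved tput_per_gpu noted above.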
Mirror server_sglang.sh / server_atom.sh so the bench.sh GPU count never resolves to 0 if submit.sh did not export the per-worker TP size.
env.sh exports the MORI_IO_* QP knobs but they do not propagate to the vLLM worker processes, so inject them into the container base env via docker -e. MORI_IO_TC_DISABLE is intentionally omitted: the TC value is detected per-node in env.sh and cannot reach the workers, so enabling TC steering without a TC value would just fall back to TC=0; leave MoRI-IO at its library default instead.
…P1D only
- amd-master.yaml: bump image to rocm/sgl-dev:vllm-0.23.1-rocm723-mi35x-mori-0624-2; drop the 2P1D layout and cap the 1P1D conc sweep at 256.
- server_vllm.sh: set read_mode:true on the MoRIIOConnector kv-transfer-config.
- job.slurm: drop the MORI_IO_* docker -e injection (handled by the image / env.sh).
…tokens 32768
- models_vllm.yaml: add prefill_env (VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4, VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB=2048); bump prefill+decode --max-num-batched-tokens 16384 -> 32768.
- server_vllm.sh: add a prefill_env / PREFILL_MODEL_ENVS channel (mirrors the existing decode_env path) so prefill-only env reaches every prefill rank.
…mbing
Remove the prefill-only VLLM_ROCM_QUICK_REDUCE_* env from the MiniMax-M3-MXFP4 model entry, and the now-unused prefill_env / PREFILL_MODEL_ENVS channel in server_vllm.sh.
… 2P1D
- image -> rocm/vllm-dev:vllm-0.23.1-rocm723-mi35x-mori-0625
- 1P1D TP4 conc sweep back to 1..512
- re-add 2P1D TP4 layout at conc 128/256/512
8d135b2 to f1e314d
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28149795726
Disaggregated (prefill/decode) vLLM recipe for amd/MiniMax-M3-MXFP4 on MI355X over the MoRI-IO KV connector.
Recipe
- benchmarks/multi_node/minimaxm3_fp4_mi355x_vllm-disagg.sh: launcher.
- models_vllm.yaml: MiniMax-M3-MXFP4 entry. --block-size 128 (MSA), TRITON_ATTN, --language-model-only, AITER MoE, minimax_m3 parsers, --max-num-seqs 512.
- amd-master.yaml: minimaxm3-fp4-mi355x-vllm-disagg config, 8k1k, two TP4 layouts (1P1D and 2P1D), conc 1..512.
Supporting fixes to the shared vllm-disagg path
- server_vllm.sh: count prefill/decode GPUs from the per-worker TP size (PREFILL_TP_SIZE*xP / DECODE_TP_SIZE*yD) instead of GPUS_PER_NODE*xP. With TP < node GPU count (e.g. TP4 on an 8-GPU node) the old expression over-counted, corrupting PREFILL_GPUS/DECODE_GPUS and halving tput_per_gpu.
- env.sh / job.slurm: set the MoRI-IO RDMA QP knobs (MORI_IO_QP_MAX_SEND_WR etc.) for the vllm-disagg path. They were only set in the SGLang branch, so vllm-disagg ran at the default send-queue depth and stalled at high concurrency ("SQ full"). Injected via docker -e so they reach the vLLM worker processes.