[AMD] Add MiniMax-M3-MXFP4 MI355X vLLM disagg recipe #1912
Closed
Duyi-Wang wants to merge 2 commits into
Disaggregated (prefill/decode) vLLM recipe for amd/MiniMax-M3-MXFP4 on MI355X
over the MoRI-IO KV connector.
Recipe:
- benchmarks/multi_node/minimaxm3_fp4_mi355x_vllm-disagg.sh: launcher.
- models_vllm.yaml: MiniMax-M3-MXFP4 entry. block-size 128 (MSA), TRITON_ATTN,
--language-model-only, AITER MoE, minimax_m3 parsers. No --kv-cache-dtype fp8
(the checkpoint ships no calibrated FP8 KV scales).
- amd-master.yaml: minimaxm3-fp4-mi355x-vllm-disagg config, 8k1k, two layouts
(1P1D TP4 and 2P1D TP4), conc 1..512.
Supporting fixes to the shared vllm-disagg path:
- server_vllm.sh: count prefill/decode GPUs from the per-worker TP size
(PREFILL_TP_SIZE*xP / DECODE_TP_SIZE*yD) instead of GPUS_PER_NODE*xP. With
TP < node GPU count (e.g. TP4 on an 8-GPU node) the old expression
over-counted, corrupting PREFILL_GPUS/DECODE_GPUS and halving tput_per_gpu.
- env.sh / job.slurm: set the MoRI-IO RDMA QP knobs (MORI_IO_QP_MAX_SEND_WR
etc.) for the vllm-disagg path. They were only set in the SGLang branch, so
vllm-disagg ran at the default send-queue depth and stalled at high
concurrency ("SQ full").
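The GPU-count fix can be illustrated with a small sketch. This is not the code from server_vllm.sh; the helper name gpus_for_role is hypothetical, and xP / yD are assumed here to be the prefill and decode worker counts. It uses the 1P1D TP4 layout on an 8-GPU node as the worked case.

```shell
#!/bin/sh
# Hypothetical helper: GPUs used by one role = per-worker TP size * worker count.
gpus_for_role() {
    # $1 = tensor-parallel size per worker, $2 = number of workers
    echo $(( $1 * $2 ))
}

GPUS_PER_NODE=8      # MI355X node with 8 GPUs
PREFILL_TP_SIZE=4
DECODE_TP_SIZE=4
xP=1                 # assumed meaning: number of prefill workers
yD=1                 # assumed meaning: number of decode workers

# Old expression: treats every prefill worker as if it filled a whole node.
OLD_PREFILL_GPUS=$(( GPUS_PER_NODE * xP ))

# Fixed expressions: count only the GPUs each worker actually occupies.
PREFILL_GPUS=$(gpus_for_role "$PREFILL_TP_SIZE" "$xP")
DECODE_GPUS=$(gpus_for_role "$DECODE_TP_SIZE" "$yD")

echo "old prefill count: $OLD_PREFILL_GPUS"
echo "new prefill count: $PREFILL_GPUS"
echo "new decode count:  $DECODE_GPUS"
# Any per-GPU metric divided by the old count is too small by this factor.
echo "over-count factor: $(( OLD_PREFILL_GPUS / PREFILL_GPUS ))"
```

With TP4 on an 8-GPU node the old expression counts 8 GPUs where 4 are in use, so a per-GPU figure such as tput_per_gpu comes out at half its real value.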
Superseded by #1914 (re-opened from a branch in this repo instead of the fork).
Disaggregated (prefill/decode) vLLM recipe for amd/MiniMax-M3-MXFP4 on MI355X
over the MoRI-IO KV connector.
Recipe:
- benchmarks/multi_node/minimaxm3_fp4_mi355x_vllm-disagg.sh: launcher.
- models_vllm.yaml: MiniMax-M3-MXFP4 entry. --block-size 128 (MSA), TRITON_ATTN,
--language-model-only, AITER MoE, minimax_m3 parsers, --max-num-seqs 512.
- amd-master.yaml: minimaxm3-fp4-mi355x-vllm-disagg config, 8k1k, two TP4
layouts (1P1D and 2P1D), conc 1..512.
Supporting fixes to the shared vllm-disagg path:
- server_vllm.sh: count prefill/decode GPUs from the per-worker TP size
(PREFILL_TP_SIZE*xP / DECODE_TP_SIZE*yD) instead of GPUS_PER_NODE*xP. With
TP < node GPU count (e.g. TP4 on an 8-GPU node) the old expression
over-counted, corrupting PREFILL_GPUS/DECODE_GPUS and halving tput_per_gpu.
- env.sh / job.slurm: set the MoRI-IO RDMA QP knobs (MORI_IO_QP_MAX_SEND_WR
etc.) for the vllm-disagg path. They were only set in the SGLang branch, so
vllm-disagg ran at the default send-queue depth and stalled at high
concurrency ("SQ full"). Injected via docker -e so they reach the vLLM worker
processes.
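As a rough sketch of how host-side knobs can be forwarded into a container: `docker run -e NAME` (with no value) copies NAME from the calling environment, and skips it if it is not exported there. The helper docker_env_flags below is hypothetical and is not taken from env.sh or job.slurm; the value 4096 is a placeholder for illustration only, not the setting used in this PR.

```shell
#!/bin/sh
# Hypothetical helper: emit one "-e NAME" pair per variable name given.
# `docker run -e NAME` forwards the caller's value of NAME into the container.
docker_env_flags() {
    for name in "$@"; do
        printf -- '-e %s ' "$name"
    done
}

# Placeholder value for illustration only; the PR does not state the real one.
MORI_IO_QP_MAX_SEND_WR=4096
export MORI_IO_QP_MAX_SEND_WR

# Only the knob named in the PR text is listed; the others ("etc.") are unknown.
ENV_FLAGS=$(docker_env_flags MORI_IO_QP_MAX_SEND_WR)

# Print the command instead of running it, so the sketch needs no Docker daemon.
# "<image>" and "<command>" stand for whatever the launcher actually runs.
echo "docker run $ENV_FLAGS<image> <command>"
```

Forwarding by name rather than by `NAME=value` keeps the values defined in one place on the host, which fits the description of setting them once for the vllm-disagg path.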