[AMD] Add MiniMax-M3-MXFP4 MI355X vLLM disagg recipe #1912

Closed
Duyi-Wang wants to merge 2 commits into SemiAnalysisAI:main from Duyi-Wang:feat/minimaxm3-fp4-mi355x-vllm-disagg

@Duyi-Wang Duyi-Wang commented Jun 24, 2026

Collaborator

Disaggregated (prefill/decode) vLLM recipe for amd/MiniMax-M3-MXFP4 on MI355X over the MoRI-IO KV connector.

Recipe

  • benchmarks/multi_node/minimaxm3_fp4_mi355x_vllm-disagg.sh: launcher.
  • models_vllm.yaml: MiniMax-M3-MXFP4 entry. --block-size 128 (MSA), TRITON_ATTN, --language-model-only, AITER MoE, minimax_m3 parsers, --max-num-seqs 512.
  • amd-master.yaml: minimaxm3-fp4-mi355x-vllm-disagg config, 8k1k, two TP4 layouts (1P1D and 2P1D), conc 1..512.

Supporting fixes to the shared vllm-disagg path

  • server_vllm.sh: count prefill/decode GPUs from the per-worker TP size (PREFILL_TP_SIZE*xP / DECODE_TP_SIZE*yD) instead of GPUS_PER_NODE*xP. With TP < node GPU count (e.g. TP4 on an 8-GPU node) the old expression over-counted, corrupting PREFILL_GPUS/DECODE_GPUS and halving tput_per_gpu.
  • env.sh / job.slurm: set the MoRI-IO RDMA QP knobs (MORI_IO_QP_MAX_SEND_WR etc.) for the vllm-disagg path. They were only set in the SGLang branch, so vllm-disagg ran at the default send-queue depth and stalled at high concurrency ("SQ full"). Injected via docker -e so they reach the vLLM worker processes.
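The GPU-count fix can be illustrated with a minimal sketch. It is not the actual `server_vllm.sh` code: the helper name `count_role_gpus` is hypothetical, and the variable names simply mirror those mentioned above. It assumes the 2P1D TP4 layout on 8-GPU nodes.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): GPUs used by one role are the
# per-worker TP size multiplied by the number of workers for that role.
count_role_gpus() {
  # $1 = tensor-parallel size per worker, $2 = number of workers
  echo $(( $1 * $2 ))
}

# Assumed example: 2P1D layout, TP4 workers, 8-GPU nodes.
GPUS_PER_NODE=8
PREFILL_TP_SIZE=4; xP=2
DECODE_TP_SIZE=4;  yD=1

PREFILL_GPUS=$(count_role_gpus "$PREFILL_TP_SIZE" "$xP")
DECODE_GPUS=$(count_role_gpus "$DECODE_TP_SIZE" "$yD")

# The old expression, shown only for comparison.
OLD_PREFILL_GPUS=$(( GPUS_PER_NODE * xP ))

echo "prefill=$PREFILL_GPUS decode=$DECODE_GPUS old_prefill=$OLD_PREFILL_GPUS"
# prints: prefill=8 decode=4 old_prefill=16
```

With TP4 workers the old expression reports twice as many prefill GPUs as are actually in use, which matches the halved per-GPU throughput described above.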

Disaggregated (prefill/decode) vLLM recipe for amd/MiniMax-M3-MXFP4 on MI355X
over the MoRI-IO KV connector.

Recipe:
- benchmarks/multi_node/minimaxm3_fp4_mi355x_vllm-disagg.sh: launcher.
- models_vllm.yaml: MiniMax-M3-MXFP4 entry. block-size 128 (MSA), TRITON_ATTN,
  --language-model-only, AITER MoE, minimax_m3 parsers. No --kv-cache-dtype fp8
  (the checkpoint ships no calibrated FP8 KV scales).
- amd-master.yaml: minimaxm3-fp4-mi355x-vllm-disagg config, 8k1k, two layouts
  (1P1D TP4 and 2P1D TP4), conc 1..512.

Supporting fixes to the shared vllm-disagg path:
- server_vllm.sh: count prefill/decode GPUs from the per-worker TP size
  (PREFILL_TP_SIZE*xP / DECODE_TP_SIZE*yD) instead of GPUS_PER_NODE*xP. With
  TP < node GPU count (e.g. TP4 on an 8-GPU node) the old expression
  over-counted, corrupting PREFILL_GPUS/DECODE_GPUS and halving tput_per_gpu.
- env.sh / job.slurm: set the MoRI-IO RDMA QP knobs (MORI_IO_QP_MAX_SEND_WR
  etc.) for the vllm-disagg path. They were only set in the SGLang branch, so
  vllm-disagg ran at the default send-queue depth and stalled at high
  concurrency ("SQ full").
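The PR description says the QP knobs are injected via `docker -e` so they reach the vLLM worker processes. One way to do this is sketched below, as an illustration rather than the PR's actual implementation: the helper `build_docker_env_args` is hypothetical, the value `1024` is a placeholder not taken from the PR, and only `MORI_IO_QP_MAX_SEND_WR` is a name that appears in the source. The sketch relies on `docker run -e NAME` (no value) forwarding the host's value of `NAME` into the container.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): emit "-e NAME" for every
# environment variable whose name starts with the given prefix.
build_docker_env_args() {
  prefix=$1
  args=""
  # List variable names from the environment that match the prefix.
  for name in $(env | sed -n "s/^\(${prefix}[A-Za-z0-9_]*\)=.*/\1/p"); do
    args="$args -e $name"
  done
  # Strip the leading space before printing.
  echo "${args# }"
}

# Placeholder value for illustration only.
export MORI_IO_QP_MAX_SEND_WR=1024

DOCKER_ENV_ARGS=$(build_docker_env_args "MORI_IO_QP_")
echo "docker run $DOCKER_ENV_ARGS <image> ..."
```

Matching on a prefix means any further `MORI_IO_QP_*` knobs exported by `env.sh` would be forwarded without listing each one by hand.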

@claude claude Bot left a comment

Contributor

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@Duyi-Wang

Collaborator Author

Superseded by #1914 (re-opened from a branch in this repo instead of the fork).

@Duyi-Wang Duyi-Wang closed this Jun 24, 2026
@Duyi-Wang Duyi-Wang deleted the feat/minimaxm3-fp4-mi355x-vllm-disagg branch June 24, 2026 06:24