Skip to content

[AMD] Add MiniMax-M3-FP4 MI355X ATOMESH update 0623#1930

Open
seungrokj wants to merge 4 commits into
mainfrom
amd/m3_atom_pd_fp8_0623
Open

[AMD] Add MiniMax-M3-FP4 MI355X ATOMESH update 0623#1930
seungrokj wants to merge 4 commits into
mainfrom
amd/m3_atom_pd_fp8_0623

Conversation

@seungrokj

@seungrokj seungrokj commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Eliminate all hardcoded MODEL_NAME == "DeepSeek-V4-Pro" / per-model checks from server_atom.sh
  • All model-specific configuration (env vars, parallel flags, MTP flags, KV cache flags, HF overrides) now driven from models_atom.yaml using the same python3 yaml.safe_load pattern as server_vllm.sh
  • Add MiniMax-M3-MXFP4 and MiniMax-M3-MXFP8 entries to models_atom.yaml with EAGLE3 MTP flags
  • Image bump for minimaxm3-fp8-mi355x-atom-disagg: rocm/atom-dev:MiniMax-M3-20260622rocm/atom-dev:MiniMax-M3-20260623

Fields added to models_atom.yaml

Field Purpose
env Space-separated KEY=VALUE pairs exported unconditionally
tp_dp_flags Parallel flags for TP+DPA mode
tp_dp_env Env vars exported only in TP+DPA mode
ep_dp_flags Parallel flags for EP+DPA mode
ep_dp_env Env vars exported only in EP+DPA mode
mtp_flags Flags prepended to SPEC_ARGS before $DECODE_MTP_SIZE
kv_cache_flags Full --kv_cache_dtype flag string
hf_overrides JSON string passed to --hf-overrides

PR Review Checklist

  • Verified that as of the moment of typing this, this is the latest version of PR_REVIEW_CHECKLIST.md
  • Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
  • Verified that this PR has passed PR validation. Please link to GitHub Action workflow that shows this.
  • Verified that this PR passes evals. Please link to GitHub Action workflow that shows this.
  • Verified that speculative decoding PRs uses chat templates to align the AL distribution to real world
  • If a company claims that they support vLLM/SGLang as first class LLM inference engines on their hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
  • Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:
    • If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:
  • If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

🤖 Generated with Claude Code

seungrokj and others added 2 commits June 25, 2026 14:39
…els_atom.yaml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…om.yaml-driven)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@seungrokj seungrokj changed the title [AMD] refactor server_atom.sh: drive model-specific config from models_atom.yaml [AMD] Add MiniMax-M3-FP4 MI355X ATOMESH update 0623 Jun 25, 2026
@seungrokj seungrokj added AMD all-evals Expand eval selection to every fixed-sequence config evals-only Suppress throughput and run only eval jobs; combine with all-evals to expand selection full-sweep-enabled labels Jun 25, 2026
Comment thread benchmarks/multi_node/amd_utils/models_atom.yaml
Comment on lines +80 to +94
# =============================================================================
# Model-Specific Configuration from YAML
# =============================================================================
# Load model-specific config from YAML (single parse for all fields)
eval "$(python3 -c "
import yaml
with open('${ATOM_WS_PATH}/models_atom.yaml') as f:
m = yaml.safe_load(f).get('${MODEL_NAME}', {})
print(f'MODEL_ENVS=\"{m.get(\"env\", \"\")}\"')
print(f'MODEL_TP_DP_FLAGS=\"{m.get(\"tp_dp_flags\", \"\")}\"')
print(f'MODEL_EP_DP_FLAGS=\"{m.get(\"ep_dp_flags\", \"\")}\"')
print(f'MODEL_TP_DP_ENV=\"{m.get(\"tp_dp_env\", \"\")}\"')
print(f'MODEL_EP_DP_ENV=\"{m.get(\"ep_dp_env\", \"\")}\"')
print(f'MODEL_MTP_FLAGS=\"{m.get(\"mtp_flags\", \"\")}\"')
print(f'MODEL_KV_ARG=\"{m.get(\"kv_cache_flags\", \"\")}\"')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 hf_overrides JSON quotes stripped by bash eval — breaks DeepSeek-V4-Pro startup. The Python f-string print(f'_HF_OVERRIDES="{m.get("hf_overrides", "")}"') interpolates the raw YAML value ({"use_index_cache":true,"index_topk_freq":4}) into an outer double-quoted bash assignment, so the inner " characters terminate the outer string. After eval, _HF_OVERRIDES holds {use_index_cache:true,index_topk_freq:4} — unquoted keys — and --hf-overrides '{use_index_cache:true,index_topk_freq:4}' is rejected as invalid JSON, blocking the documented Verify server_atom.sh launches correctly for DeepSeek-V4-Pro test plan item. Fix: emit the value with shlex.quote() (or json.dumps) in the Python so the JSON survives the bash eval intact.

Extended reasoning...

What the bug is

The new YAML-driven loader at server_atom.sh:80-94 runs a Python snippet whose output is captured and eval'd as bash. The hf_overrides field for DeepSeek-V4-Pro in models_atom.yaml is the JSON string {"use_index_cache":true,"index_topk_freq":4}. The emitter line is:

print(f'_HF_OVERRIDES="{m.get("hf_overrides", "")}"')

For DeepSeek-V4-Pro this prints literally:

_HF_OVERRIDES="{"use_index_cache":true,"index_topk_freq":4}"

The inner unescaped " characters terminate the outer double-quoted region. Bash treats the line as a concatenation of quoted and unquoted word segments — "{" + use_index_cache + ":true," + index_topk_freq + ":4}" — and assembles them into a single word with all the quotes stripped.

Step-by-step proof

Reproduced locally with the exact Python emitter and bash eval:

$ python3 -c 'm={"hf_overrides": "{\"use_index_cache\":true,\"index_topk_freq\":4}"}; print(f"_HF_OVERRIDES=\"{m.get(\"hf_overrides\", \"\")}\"")'
_HF_OVERRIDES="{"use_index_cache":true,"index_topk_freq":4}"

$ eval '_HF_OVERRIDES="{"use_index_cache":true,"index_topk_freq":4}"' && echo "[$_HF_OVERRIDES]"
[{use_index_cache:true,index_topk_freq:4}]

$ python3 -c 'import json; json.loads("{use_index_cache:true,index_topk_freq:4}")'
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

After the eval, the downstream line

HF_OVERRIDES_ARG="--hf-overrides '${_HF_OVERRIDES}'"

produces --hf-overrides '{use_index_cache:true,index_topk_freq:4}' — an invalid JSON literal with unquoted keys. The atom server's argparse / json.loads on --hf-overrides will reject this at startup.

Why existing code doesn't prevent it

The pre-PR code hard-coded the value as a bash string with backslash-escaped inner quotes:
HF_OVERRIDES_ARG="--hf-overrides '{\"use_index_cache\":true,\"index_topk_freq\":4}'"
That escaping is exactly what survives bash parsing, and it is what the new YAML-driven path loses. DeepSeek-V4-Pro is the only model in models_atom.yaml with a non-empty hf_overrides (the other models' YAML fields contain no " characters, so they are unaffected); the other emitted assignments (env, tp_dp_flags, etc.) are safe.

Impact

This regresses the dsv4-fp4-mi355x-atom-disagg recipe in amd-master.yaml (which sets MODEL_NAME=DeepSeek-V4-Pro and routes through server_atom.sh). The server will fail at startup when atom's argparse calls json.loads on the --hf-overrides argument — and this is precisely the path the PR's own test plan flags (Verify server_atom.sh launches correctly for DeepSeek-V4-Pro).

Fix

Quote the value in the Python emitter so the bash eval sees a properly-escaped literal. Either:

import shlex
print(f'_HF_OVERRIDES={shlex.quote(m.get("hf_overrides", ""))}')

(produces _HF_OVERRIDES='{"use_index_cache":true,"index_topk_freq":4}', which bash parses correctly), or write each value to a NUL-delimited side channel that bash reads with read -d '' instead of evaling arbitrary Python output.

Comment on lines 193 to +195
MTP : method=mtp num_speculative_tokens=${DECODE_MTP_SIZE}
xP/yD : ${xP} / ${yD}
KV cache : dtype=${KV_CACHE_DTYPE:-auto} block_size=${BLOCK_SIZE} mem_frac=${MEM_FRAC_STATIC}
KV cache : ${KV_CACHE_ARG:-none} block_size=${BLOCK_SIZE} mem_frac=${MEM_FRAC_STATIC}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: line 193's INFO banner prints the literal string MTP : method=mtp num_speculative_tokens=${DECODE_MTP_SIZE}, but mtp_flags is now YAML-driven and MiniMax-M3-MXFP4/MXFP8 use --method eagle3 --draft-model Inferact/MiniMax-M3-EAGLE3. When those models run with SPEC_DECODING=mtp, the banner will misleadingly claim method=mtp. Pure log/cosmetic — the Spec args : ${SPEC_ARGS[*]} line immediately below prints the actual flags. Suggest dropping the hardcoded method=mtp (the Spec args line already covers it) or replacing with ${MODEL_MTP_FLAGS}.

Extended reasoning...

What\n\nIn benchmarks/multi_node/amd_utils/server_atom.sh the === Configuration === heredoc still contains a hardcoded line:\n\n\nMTP : method=mtp num_speculative_tokens=${DECODE_MTP_SIZE}\n\n\nThis line predates the YAML-driven refactor in this PR. With the new mtp_flags field, MODEL_MTP_FLAGS can be anything — for the two new MiniMax-M3 entries it is --method eagle3 --draft-model Inferact/MiniMax-M3-EAGLE3 --num-speculative-tokens. So the banner can advertise method=mtp even when the server is actually being launched with EAGLE3 flags.\n\n### Step-by-step proof\n\n1. Set MODEL_NAME=MiniMax-M3-MXFP8, SPEC_DECODING=mtp, DECODE_MTP_SIZE=2.\n2. The YAML block (models_atom.yaml:34-39) supplies mtp_flags: --method eagle3 --draft-model Inferact/MiniMax-M3-EAGLE3 --num-speculative-tokens, so MODEL_MTP_FLAGS is the EAGLE3 string.\n3. server_atom.sh:174-176 builds SPEC_ARGS=(${MODEL_MTP_FLAGS} "$DECODE_MTP_SIZE") → the actual server is launched with --method eagle3 --draft-model … --num-speculative-tokens 2.\n4. server_atom.sh:193 still prints MTP : method=mtp num_speculative_tokens=2 — factually wrong about method.\n5. server_atom.sh:199 immediately below prints Spec args : --method eagle3 --draft-model Inferact/MiniMax-M3-EAGLE3 --num-speculative-tokens 2, which is correct.\n\n### Why this doesn't break anything\n\nThe SPEC_ARGS array (the actual flags passed to python3 -m atom.entrypoints.openai_server) is built correctly from MODEL_MTP_FLAGS; the bug is exclusively in the banner. The Spec args line two lines below shows the truth, so an operator inspecting the log can see the real flags. Also note: the new minimaxm3-fp8-mi355x-atom-disagg recipe sets SPEC_DECODING=none and DECODE_MTP_SIZE=0, so the misleading line is not exercised by anything this PR enables today — it's a latent issue for whenever a MiniMax-M3 atom-disagg recipe is added with MTP on.\n\n### How to fix\n\nReplace line 193 with one of:\n\n\nMTP : ${MODEL_MTP_FLAGS} ${DECODE_MTP_SIZE}\n\n\nor simply drop the line — Spec args : ${SPEC_ARGS[*]} directly below already covers the same information accurately. Severity is nit because it is purely cosmetic and the correct values are visible one line lower.

@github-actions

Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

all-evals Expand eval selection to every fixed-sequence config AMD evals-only Suppress throughput and run only eval jobs; combine with all-evals to expand selection full-sweep-enabled

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant