`.claude/skills/deployment/SKILL.md` (4 additions, 0 deletions)
| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors |
| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` |

## Unsupported Models

If the model is not in the validated support matrix (`references/support-matrix.md`), deployment may fail due to weight key mismatches, missing architecture mappings, or quantized/unquantized layer confusion. Read `references/unsupported-models.md` for the iterative debug loop: **run → read error → diagnose → patch framework source → re-run**. For kernel-level issues, escalate to the framework team rather than attempting fixes.

## Success Criteria

1. Server process is running and healthy (`/health` returns 200)
`.claude/skills/deployment/references/unsupported-models.md` (63 additions, 0 deletions, new file)
# Deploying Unsupported Models

When deploying a model not in the validated support matrix (`support-matrix.md`), expect failures. This guide covers the iterative debug loop for getting unsupported models running on vLLM, SGLang, or TRT-LLM.

## Step 1 — Run and collect the error

Submit the deployment job. When it fails, read the full log — focus on the **first** error traceback (not "See root cause above" wrappers). Identify the file and line number in the framework source.
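
A minimal sketch for isolating that first traceback from a captured log (the log path, guard, and traceback format are assumptions; adapt them to your job runner):

```shell
# Print only the first Python traceback in the log, stopping at the
# first blank line after it; later "See root cause above" wrappers are skipped
LOG="deploy.log"   # illustrative path to the captured job log
[ -f "$LOG" ] && awk '/Traceback \(most recent call last\)/ {found=1} found {print; if (/^$/) exit}' "$LOG" || echo "no log captured yet"
```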

## Step 2 — Diagnose the root cause

Fetch the framework source at the failing line (use `gh api` for the tagged version, or `find` inside the container). Common error categories:

| Category | Symptoms | Examples |
|----------|----------|----------|
| **Weight key mismatch** | `KeyError`, `Unexpected key`, `Missing key` during weight loading | Checkpoint uses `model.language_model.layers.*` but framework expects `model.layers.*`. See [vllm#39406](https://github.com/vllm-project/vllm/pull/39406) |
| **Quantized/unquantized layer confusion** | Wrong layer type loaded, dtype errors, shape mismatches | Framework tries to load unquantized layers with FP4 kernel due to overly broad `quantization_config.ignore` patterns or missing ignore entries. See [sglang#18937](https://github.com/sgl-project/sglang/pull/18937) |
| **Missing architecture support** | `NoneType is not iterable`, `KeyError` on model type, unknown architecture | Framework's model handler doesn't recognize the text backbone type (e.g., `ministral3` not handled in vLLM's `mistral3.py` init). Fix: extend the model type mapping |
| **Transformers version mismatch** | `ImportError`, `KeyError` on config fields | Framework ships with an older transformers version that doesn't recognize the model type. Fix: upgrade transformers after installing the framework |
| **Kernel-level issues** | CUDA errors, `triton` import failures, unsupported ops | Framework lacks kernel support for this model + quantization combo |
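
The source-fetching step above can be sketched as follows; the version tag, repository path, and search root are illustrative (the `application/vnd.github.raw` Accept header is standard GitHub REST behavior for raw file content):

```shell
# Illustrative version tag and file path; match them to your container
TAG="v0.8.0"
FILE="vllm/model_executor/models/mistral3.py"

# Option A: fetch the tagged source from GitHub (requires gh auth)
if command -v gh >/dev/null 2>&1; then
  gh api -H "Accept: application/vnd.github.raw" \
    "repos/vllm-project/vllm/contents/${FILE}?ref=${TAG}" | sed -n '1,40p'
fi

# Option B: read the installed copy inside the container
find /usr/local/lib -path "*/${FILE}" 2>/dev/null | head -1
```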

## Step 3 — Apply a targeted fix

Focus on **small, targeted patches** to the framework source. Do not modify `config.json` or the checkpoint — fix the framework's handling instead.

### Weight key mismatches and architecture mapping gaps

Patch the framework source in the run script using `sed` or a Python one-liner. Keep patches minimal — change only what's needed to unblock the current error.

```bash
# Example: extend model type mapping in vLLM mistral3.py
FRAMEWORK_FILE=$(find /usr/local/lib -path "*/vllm/model_executor/models/mistral3.py" 2>/dev/null | head -1)
sed -i 's/old_pattern/new_pattern/' "${FRAMEWORK_FILE}"
```
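
Where a `sed` one-liner gets awkward (tuples, multi-line edits), the same patch can be applied from a small Python snippet; the mapping tuple shown here is hypothetical, standing in for whatever structure the failing line actually uses:

```shell
# Locate the installed source file, then patch it in place if found
FRAMEWORK_FILE=$(find /usr/local/lib -path "*/vllm/model_executor/models/mistral3.py" 2>/dev/null | head -1)
if [ -n "$FRAMEWORK_FILE" ]; then
  python3 - "$FRAMEWORK_FILE" <<'EOF'
import pathlib, sys

path = pathlib.Path(sys.argv[1])
src = path.read_text()
# Hypothetical mapping: teach the handler to treat "ministral3" like "mistral3"
src = src.replace('("mistral3",)', '("mistral3", "ministral3")')
path.write_text(src)
EOF
fi
```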

> **Tip**: when locating framework source files inside containers, use `find` instead of Python import — some frameworks print log messages to stdout during import that can corrupt captured paths.

### Quantized/unquantized layer confusion

Check `hf_quant_config.json` ignore patterns against the framework's weight loading logic. The framework may try to load layers listed in `ignore` with quantized kernels, or vice versa. Fix by adjusting the framework's layer filtering logic.
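
One way to start that comparison is to dump the exclusion patterns so they can be checked against the layer names in the loading error; the `ignore`/`exclude_modules` key names here are assumptions, since the exact schema varies by quantizer:

```shell
# Print each layer pattern the quant config excludes from quantization
CFG="hf_quant_config.json"   # path inside the checkpoint directory
if [ -f "$CFG" ]; then
  python3 - "$CFG" <<'EOF'
import json, sys

with open(sys.argv[1]) as f:
    cfg = json.load(f)
# Key names are assumptions; adjust to the schema your quantizer emits
patterns = cfg.get("ignore") or cfg.get("quantization", {}).get("exclude_modules") or []
for p in patterns:
    print(p)
EOF
fi
```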

### Kernel-level issues

These require framework kernel team involvement. Do NOT attempt to patch kernels. Instead:

1. Document the exact error (model, format, framework version, GPU type)
2. Inform the user: *"This model + quantization combination requires kernel support that isn't available in {framework} v{version}. I'd suggest reaching out to the {framework} kernel team or trying a different framework."*
3. Suggest trying an alternative framework (vLLM → SGLang → TRT-LLM)

## Step 4 — Re-run and iterate

After applying a fix, resubmit the job. Each iteration may reveal a new error (e.g., fixing the init error exposes a weight loading error). Continue the loop: **run → read error → diagnose → patch → re-run**.

Typical iteration count: 1-3 for straightforward fixes, 3-5 for models requiring multiple patches.

## Step 5 — Know when to stop

**Stop patching and escalate** when:

- The error is in compiled CUDA kernels or triton ops (not Python-level)
- The fix requires changes to core framework abstractions (not just model handlers)
- You've done 5+ iterations without the server starting

In these cases, inform the user and suggest: trying a different framework, checking for a newer framework version, or filing an issue with the framework team.