98 changes: 98 additions & 0 deletions docs/en/quantization/llm_compressor_fp8.md
@@ -0,0 +1,98 @@
# llm-compressor-fp8 Support

This guide aims to introduce how to use LMDeploy's TurboMind inference engine to run models FP8-quantized by the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) tool.

Currently supported `llm-compressor-fp8` quantization types include:

- AWQ, GPTQ

These quantized models can run via the TurboMind engine on the following NVIDIA GPU architectures:

| Compute Capability | Micro-architecture | GPUs |
| ------------------ | ------------------ | ------------------------------- |
| 7.0 | Volta | V100 |
| 7.2 | Volta | Jetson Xavier |
| 7.5 | Turing | GeForce RTX 20 series, T4 |
| 8.0 | Ampere | A100, A800, A30 |
| 8.6 | Ampere | GeForce RTX 30 series, A40, A10 |
| 8.7 | Ampere | Jetson Orin |
| 8.9 | Ada Lovelace | GeForce RTX 40 series, L40, L20 |
| 9.0 | Hopper | H20, H200, H100, GH200 |
| 12.0 | Blackwell | GeForce RTX 50 series |

LMDeploy will continue to track and expand support for the `llm-compressor-fp8` project.

The remainder of this document consists of the following sections:

<!-- toc -->

- [Model Quantization](#model-quantization)
- [Model Deployment](#model-deployment)
- [Accuracy Evaluation](#accuracy-evaluation)

<!-- tocstop -->

## Model Quantization

`llm-compressor-fp8` provides a wealth of model quantization [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples). Please refer to its tutorials and select a quantization algorithm supported by LMDeploy to quantize your model.

LMDeploy also provides a built-in [script](https://github.com/InternLM/lmdeploy/blob/main/examples/lite/fp8/qwen3_30b_a3b_fp8.py) for FP8 quantization of **Qwen3-30B-A3B** using `llm-compressor-fp8` for your reference:

```shell
# Create conda environment
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy

# Install llm-compressor
pip install llmcompressor

# Clone lmdeploy source code and run the quantization example
git clone https://github.com/InternLM/lmdeploy
cd lmdeploy
python examples/lite/fp8/qwen3_30b_a3b_fp8.py --work-dir ./qwen3_30b_a3b_fp8

```

In the following sections, we will use this quantized model as an example to introduce model deployment and accuracy evaluation methods.

## Model Deployment

### Offline Inference

With the quantized model, offline batch processing can be implemented with just a few lines of code:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig()
with pipeline("./qwen3_30b_a3b_fp8", backend_config=engine_config) as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)
```

For a detailed introduction to the pipeline, please refer to [here](https://lmdeploy.readthedocs.io/en/latest/llm/pipeline.html).
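
Generation parameters can also be customized per request. The snippet below is a minimal sketch that assumes LMDeploy's `GenerationConfig`; the sampling values are illustrative only and not tuned for this model:

```python
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

engine_config = TurbomindEngineConfig()
# Illustrative sampling settings; adjust them for your use case
gen_config = GenerationConfig(max_new_tokens=512, temperature=0.7, top_p=0.95)

with pipeline("./qwen3_30b_a3b_fp8", backend_config=engine_config) as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
    print(response)
```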

### Online Serving

LMDeploy's api_server can launch the model as a service with a single command, exposing RESTful APIs that are compatible with the OpenAI interface. Below is an example of starting the service:

```shell
lmdeploy serve api_server ./qwen3_30b_a3b_fp8 --backend turbomind
```

The default service port is 23333. After the server starts, you can access the service via the OpenAI SDK. For command arguments and methods to access the service, please read [this](https://lmdeploy.readthedocs.io/en/latest/llm/api_server.html) document.
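
For example, once the server is running, a chat request can be sent with the OpenAI Python SDK. This is a minimal sketch that assumes the server is reachable at the default local address and port; the served model name is read back from the `/v1/models` endpoint:

```python
from openai import OpenAI

# The api_server does not validate the key by default; any placeholder works
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Hi, pls intro yourself'}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```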

## Accuracy Evaluation

We deployed FP8-quantized models of Qwen3-8B (Dense) and Qwen3-30B-A3B (MoE) as services via LMDeploy, and evaluated them on several academic datasets using [opencompass](https://github.com/open-compass/opencompass). The results show that the accuracy gap between the FP8-quantized models and the BF16 models is not significant, which is in line with expectations.

| dataset           | Qwen3-8B (bf16) | Qwen3-8B (fp8) | Qwen3-30B-A3B (bf16) | Qwen3-30B-A3B (fp8) |
| ----------------- | --------------- | -------------- | -------------------- | ------------------- |
| ifeval | 85.58 | 87.62 | 86.32 | 86.51 |
| hle | 5.05 | 5.89 | 7.00 | 7.51 |
| gpqa | 59.97 | 59.22 | 61.74 | 60.73 |
| aime2025 | 69.48 | 70.00 | 73.44 | 71.15 |
| mmlu_pro | 73.69 | 73.54 | 77.85 | 77.50 |
| LCBCodeGeneration | 50.86 | 49.81 | 56.67 | 56.86 |

For reproduction methods, please refer to [this](https://lmdeploy.readthedocs.io/en/latest/benchmark/evaluate_with_opencompass.html) document.
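
As a rough sketch of how such an evaluation can be wired up, the config below points opencompass at the running service through its OpenAI-compatible endpoint. The `OpenAISDK` wrapper, its field names, and the values shown are assumptions for illustration; take the exact configuration and dataset selection from the linked document:

```python
# Sketch of an opencompass model config targeting the LMDeploy api_server.
# Field names follow opencompass's OpenAI-compatible wrapper as an assumption.
from opencompass.models import OpenAISDK

models = [
    dict(
        type=OpenAISDK,
        abbr='qwen3-30b-a3b-fp8-lmdeploy',
        path='qwen3_30b_a3b_fp8',                   # served model name
        key='EMPTY',                                # api_server does not check the key by default
        openai_api_base='http://0.0.0.0:23333/v1',
        max_out_len=8192,
        batch_size=8,
        temperature=0.6,
    )
]

# datasets = [...]  # pick datasets (ifeval, gpqa, mmlu_pro, ...) per the linked document
```
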
@@ -1,10 +1,10 @@
# llm-compressor Support
# llm-compressor-int4 Support

This guide aims to introduce how to use LMDeploy's TurboMind inference engine to run models quantized by the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) tool.

Currently supported `llm-compressor` quantization types include:
Currently supported `llm-compressor-int4` quantization types include:

- int4 quantization (e.g., AWQ, GPTQ)
- AWQ, GPTQ

These quantized models can run via the TurboMind engine on the following NVIDIA GPU architectures:

@@ -20,7 +20,7 @@ These quantized models can run via the TurboMind engine on the following NVIDIA
| 9.0 | Hopper | H20, H200, H100, GH200 |
| 12.0 | Blackwell | GeForce RTX 50 series |

LMDeploy will continue to follow up and expand support for the `llm-compressor` project.
LMDeploy will continue to follow up and expand support for the `llm-compressor-int4` project.

The remainder of this document consists of the following sections:

@@ -34,9 +34,9 @@ The remainder of this document consists of the following sections:

## Model Quantization

`llm-compressor` provides a wealth of model quantization [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples). Please refer to its tutorials to select a quantization algorithm supported by LMDeploy to complete your model quantization work.
`llm-compressor-int4` provides a wealth of model quantization [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples). Please refer to its tutorials to select a quantization algorithm supported by LMDeploy to complete your model quantization work.

LMDeploy also provides a built-in [script](https://github.com/InternLM/lmdeploy/blob/main/examples/lite/qwen3_30b_a3b_awq.py) for AWQ quantization of **Qwen3-30B-A3B** using `llm-compressor` for your reference:
LMDeploy also provides a built-in [script](https://github.com/InternLM/lmdeploy/blob/main/examples/lite/int4/qwen3_30b_a3b_awq.py) for AWQ quantization of **Qwen3-30B-A3B** using `llm-compressor-int4` for your reference:

```shell
# Create conda environment
@@ -49,7 +49,8 @@ pip install llmcompressor
# Clone lmdeploy source code and run the quantization example
git clone https://github.com/InternLM/lmdeploy
cd lmdeploy
python examples/lite/qwen3_30b_a3b_awq.py --work-dir ./qwen3_30b_a3b_awq
python examples/lite/int4/qwen3_30b_a3b_awq.py --work-dir ./qwen3_30b_a3b_awq

```

In the following sections, we will use this quantized model as an example to introduce model deployment and accuracy evaluation methods.
@@ -62,7 +63,6 @@ With the quantized model, offline batch processing can be implemented with just

```python
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig()
with pipeline("./qwen3_30b_a3b_4bit", backend_config=engine_config) as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
@@ -83,7 +83,7 @@ The default service port is 23333. After the server starts, you can access the s

## Accuracy Evaluation

Aftering deploying AWQ symmetric/asymmetric quantized models of Qwen3-8B (Dense) and Qwen3-30B-A3B (MoE) as services via LMDeploy, we evaluated their accuracy on several academic datasets using [opencompass](https://github.com/open-compass/opencompass). Results indicate that, for Qwen3-8B, asymmetric quantization generally outperforms symmetric quantization, while Qwen3-30B-A3B shows no substantial difference between symmetric and asymmetric quantization. Compared with BF16, Qwen3-8B shows a smaller accuracy gap under both symmetric and asymmetric quantization than Qwen3-30B-A3B. Compared with BF16, accuracy drops significantly on long-output datasets such as aime2025 (avg 17,635 tokens) and LCB (avg 14,157 tokens), while on medium/short-output datasets like ifeval (avg 1,885 tokens) and mmlu_pro (avg 2,826 tokens), the accuracy is as expected.
We deployed AWQ symmetric/asymmetric quantized models of Qwen3-8B (Dense) and Qwen3-30B-A3B (MoE) as services via LMDeploy, and evaluated their accuracy on several academic datasets using [opencompass](https://github.com/open-compass/opencompass). Results indicate that, for Qwen3-8B, asymmetric quantization generally outperforms symmetric quantization, while Qwen3-30B-A3B shows no substantial difference between symmetric and asymmetric quantization. Compared with BF16, Qwen3-8B shows a smaller accuracy gap under both symmetric and asymmetric quantization than Qwen3-30B-A3B. Compared with BF16, accuracy drops significantly on long-output datasets such as aime2025 (avg 17,635 tokens) and LCB (avg 14,157 tokens), while on medium/short-output datasets like ifeval (avg 1,885 tokens) and mmlu_pro (avg 2,826 tokens), the accuracy is as expected.

| dataset | Qwen3-8B | | | Qwen3-30B-A3B | | |
| ----------------- | -------- | ------- | -------- | ------------- | ------- | -------- |
96 changes: 96 additions & 0 deletions docs/zh_cn/quantization/llm_compressor_fp8.md
@@ -0,0 +1,96 @@
# llm-compressor-fp8 Support

This guide aims to introduce how to use LMDeploy's TurboMind inference engine to run models FP8-quantized by the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) tool.

Currently supported `llm-compressor-fp8` quantization types include:

- AWQ, GPTQ

These quantized models can run via the TurboMind engine on the following NVIDIA GPU architectures:

| Compute Capability | Micro-architecture | GPUs |
| ------------------ | ------------------ | ------------------------------- |
| 7.0 | Volta | V100 |
| 7.2 | Volta | Jetson Xavier |
| 7.5 | Turing | GeForce RTX 20 series, T4 |
| 8.0 | Ampere | A100, A800, A30 |
| 8.6 | Ampere | GeForce RTX 30 series, A40, A10 |
| 8.7 | Ampere | Jetson Orin |
| 8.9 | Ada Lovelace | GeForce RTX 40 series, L40, L20 |
| 9.0 | Hopper | H20, H200, H100, GH200 |
| 12.0 | Blackwell | GeForce RTX 50 series |

LMDeploy will continue to track and expand support for the `llm-compressor-fp8` project.

The remainder of this document consists of the following sections:

<!-- toc -->

- [Model Quantization](#model-quantization)
- [Model Deployment](#model-deployment)
- [Accuracy Evaluation](#accuracy-evaluation)

<!-- tocstop -->

## Model Quantization

`llm-compressor-fp8` provides a wealth of model quantization [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples). Please refer to its tutorials and select a quantization algorithm supported by LMDeploy to quantize your model.

LMDeploy also provides a built-in [script](https://github.com/InternLM/lmdeploy/blob/main/examples/lite/fp8/qwen3_30b_a3b_fp8.py) for FP8 quantization of Qwen3-30B-A3B using `llm-compressor-fp8` for your reference:

```shell
# Create conda environment
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy

# Install llm-compressor
pip install llmcompressor

# Clone lmdeploy source code and run the quantization example
git clone https://github.com/InternLM/lmdeploy
cd lmdeploy
python examples/lite/fp8/qwen3_30b_a3b_fp8.py --work-dir ./qwen3_30b_a3b_fp8

```

In the following sections, we will use this quantized model as an example to introduce model deployment and accuracy evaluation.

## Model Deployment

### Offline Inference

With the quantized model, offline batch processing can be implemented with just a few lines of code:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig()
with pipeline("./qwen3_30b_a3b_fp8", backend_config=engine_config) as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)
```

For a detailed introduction to the pipeline, please refer to [here](https://lmdeploy.readthedocs.io/zh-cn/latest/llm/pipeline.html).
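
Streaming output is supported by the pipeline as well. The snippet below is a small sketch assuming the pipeline's `stream_infer` interface; each yielded item carries the incremental text of one request:

```python
from lmdeploy import TurbomindEngineConfig, pipeline

engine_config = TurbomindEngineConfig()
with pipeline("./qwen3_30b_a3b_fp8", backend_config=engine_config) as pipe:
    # Print text as it is generated instead of waiting for the full response
    for item in pipe.stream_infer(["Hi, pls intro yourself"]):
        print(item.text, end="", flush=True)
```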

### Online Serving

LMDeploy's api_server can launch the model as a service with a single command, exposing RESTful APIs that are compatible with the OpenAI interface. Below is an example of starting the service:

```shell
lmdeploy serve api_server ./qwen3_30b_a3b_fp8 --backend turbomind
```

The default service port is 23333. After the server starts, you can access the service via the OpenAI SDK. For the command arguments and ways to access the service, please read [this](https://lmdeploy.readthedocs.io/zh-cn/latest/llm/api_server.html) document.
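
The service can also be called without the OpenAI SDK, since it exposes standard HTTP endpoints. The sketch below assumes the default local address and uses `requests` against the OpenAI-compatible `/v1/chat/completions` route:

```python
import requests

base_url = 'http://0.0.0.0:23333/v1'
# The served model name can be read back from the /v1/models endpoint
model_name = requests.get(f'{base_url}/models').json()['data'][0]['id']

payload = {
    'model': model_name,
    'messages': [{'role': 'user', 'content': 'Hi, pls intro yourself'}],
}
resp = requests.post(f'{base_url}/chat/completions', json=payload)
print(resp.json()['choices'][0]['message']['content'])
```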

## Accuracy Evaluation

We deployed the FP8-quantized models of Qwen3-8B (Dense) and Qwen3-30B-A3B (MoE) as services via LMDeploy and evaluated them on several academic datasets using [opencompass](https://github.com/open-compass/opencompass). The results show that the accuracy gap between the FP8-quantized models and the BF16 models is not significant, which is in line with expectations.

| dataset           | Qwen3-8B (bf16) | Qwen3-8B (fp8) | Qwen3-30B-A3B (bf16) | Qwen3-30B-A3B (fp8) |
| ----------------- | --------------- | -------------- | -------------------- | ------------------- |
| ifeval | 85.58 | 87.62 | 86.32 | 86.51 |
| hle | 5.05 | 5.89 | 7.00 | 7.51 |
| gpqa | 59.97 | 59.22 | 61.74 | 60.73 |
| aime2025 | 69.48 | 70.00 | 73.44 | 71.15 |
| mmlu_pro | 73.69 | 73.54 | 77.85 | 77.50 |
| LCBCodeGeneration | 50.86 | 49.81 | 56.67 | 56.86 |

For reproduction methods, please refer to [this](https://lmdeploy.readthedocs.io/zh-cn/latest/benchmark/evaluate_with_opencompass.html) document.
@@ -1,9 +1,9 @@
# llm-compressor Support
# llm-compressor-int4 Support

This guide aims to introduce how to use LMDeploy's TurboMind inference engine to run models quantized by the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) tool.
Currently supported `llm-compressor` quantization types include:
This guide aims to introduce how to use LMDeploy's TurboMind inference engine to run models int4-quantized by the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) tool.
Currently supported `llm-compressor-int4` quantization types include:

- int4 quantization (e.g., AWQ, GPTQ)
- AWQ, GPTQ

These quantized models can run via the TurboMind engine on the following NVIDIA GPU architectures:

@@ -19,7 +19,7 @@
| 9.0 | Hopper | H20, H200, H100, GH200 |
| 12.0 | Blackwell | GeForce RTX 50 series |

LMDeploy will continue to follow up and expand support for the `llm-compressor` project.
LMDeploy will continue to follow up and expand support for the `llm-compressor-int4` project.

The remainder of this document consists of the following sections:

@@ -33,8 +33,8 @@ LMDeploy will continue to follow up and expand support for the `llm-compressor` project.

## Model Quantization

`llm-compressor` provides a wealth of model quantization [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples). Please refer to its tutorials to select a quantization algorithm supported by LMDeploy to complete your model quantization work.
LMDeploy also provides a built-in [script](https://github.com/InternLM/lmdeploy/blob/main/examples/lite/qwen3_30b_a3b_awq.py) for AWQ quantization of Qwen3-30B-A3B using `llm-compressor` for your reference:
`llm-compressor-int4` provides a wealth of model quantization [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples). Please refer to its tutorials to select a quantization algorithm supported by LMDeploy to complete your model quantization work.
LMDeploy also provides a built-in [script](https://github.com/InternLM/lmdeploy/blob/main/examples/lite/int4/qwen3_30b_a3b_awq.py) for AWQ quantization of Qwen3-30B-A3B using `llm-compressor-int4` for your reference:

```shell
# Create conda environment
@@ -47,7 +47,7 @@ pip install llmcompressor
# Clone lmdeploy source code and run the quantization example
git clone https://github.com/InternLM/lmdeploy
cd lmdeploy
python examples/lite/qwen3_30b_a3b_awq.py --work-dir ./qwen3_30b_a3b_awq
python examples/lite/int4/qwen3_30b_a3b_awq.py --work-dir ./qwen3_30b_a3b_awq

```

68 changes: 68 additions & 0 deletions examples/lite/fp8/qwen3_30b_a3b_fp8.py
@@ -0,0 +1,68 @@
import argparse

from compressed_tensors.offload import dispatch_model
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer


def parse_args():
    parser = argparse.ArgumentParser(description='Run FP8 quantization for Qwen3 model')

    parser.add_argument('--work-dir',
                        type=str,
                        default='./qwen3_30b_a3b_fp8',
                        required=True,
                        help='The directory to save the quantized model')

    parser.add_argument('--model-id',
                        type=str,
                        default='Qwen/Qwen3-30B-A3B',
                        help='The Hugging Face model ID to quantize')
    return parser.parse_args()

def main():
    # 1. Parse command-line arguments
    args = parse_args()
    MODEL_ID = args.model_id
    SAVE_DIR = args.work_dir

    print(f'Loading model: {MODEL_ID}')
    print(f'Saving to: {SAVE_DIR}')

    # 2. Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype='auto', device_map='auto', trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

    # 3. Configure the quantization algorithm and scheme.
    # In this case, the FP8_BLOCK scheme is used to:
    #   * quantize the weights to fp8 block-wise via PTQ
    #   * quantize the activations to fp8 dynamically at runtime
    recipe = QuantizationModifier(
        targets='Linear',
        scheme='FP8_BLOCK',
        ignore=['lm_head', 're:.*mlp.gate$'],
    )

    # 4. Run quantization
    print('Starting quantization...')
    oneshot(model=model, recipe=recipe)

    # 5. Confirm generations of the quantized model look sane
    print('========== SAMPLE GENERATION ==============')
    dispatch_model(model)
    input_ids = tokenizer('Hello my name is', return_tensors='pt').input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=20)
    print(tokenizer.decode(output[0]))
    print('==========================================')

    # 6. Save quantized model
    print('Saving model...')
    model.save_pretrained(SAVE_DIR)
    tokenizer.save_pretrained(SAVE_DIR)


if __name__ == '__main__':
    main()
3 changes: 1 addition & 2 deletions lmdeploy/lite/apis/auto_awq.py
@@ -8,12 +8,11 @@
import torch
from torch import nn

from lmdeploy.lite.apis.calibrate import LAYER_TYPE_MAP, calibrate
from lmdeploy.lite.quantization.awq import FC_FCS_MAP, NORM_FCS_MAP, awq_layers, quant_weights, smooth_layers
from lmdeploy.lite.utils import collect_target_modules
from lmdeploy.utils import try_import_deeplink

from .calibrate import LAYER_TYPE_MAP, calibrate


def save_vl_model(vl_model, model_path, dst_path):
vl_model.save_pretrained(dst_path, safe_serialization=True)