98 changes: 98 additions & 0 deletions docs/en/quantization/llm_compressor_fp8.md
@@ -0,0 +1,98 @@
# llm-compressor-fp8 Support

This guide aims to introduce how to use LMDeploy's TurboMind inference engine to run models FP8-quantized by the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) tool.

Currently supported `llm-compressor-fp8` quantization types include:

- AWQ, GPTQ

These quantized models can run via the TurboMind engine on the following NVIDIA GPU architectures:

| Compute Capability | Micro-architecture | GPUs |
| ------------------ | ------------------ | ------------------------------- |
| 7.0 | Volta | V100 |
| 7.2 | Volta | Jetson Xavier |
| 7.5 | Turing | GeForce RTX 20 series, T4 |
| 8.0 | Ampere | A100, A800, A30 |
| 8.6 | Ampere | GeForce RTX 30 series, A40, A10 |
| 8.7 | Ampere | Jetson Orin |
| 8.9 | Ada Lovelace | GeForce RTX 40 series, L40, L20 |
| 9.0 | Hopper | H20, H200, H100, GH200 |
| 12.0 | Blackwell | GeForce RTX 50 series |

LMDeploy will continue to track and expand support for the `llm-compressor-fp8` project.

The remainder of this document consists of the following sections:

<!-- toc -->

- [Model Quantization](#model-quantization)
- [Model Deployment](#model-deployment)
- [Accuracy Evaluation](#accuracy-evaluation)

<!-- tocstop -->

## Model Quantization

`llm-compressor-fp8` provides a wealth of model quantization [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples). Please refer to its tutorials and select a quantization algorithm supported by LMDeploy to quantize your model.

LMDeploy also provides a built-in [script](https://github.com/InternLM/lmdeploy/blob/main/examples/lite/fp8/qwen3_30b_a3b_fp8.py) for FP8 quantization of **Qwen3-30B-A3B** using `llm-compressor-fp8` for your reference:

```shell
# Create conda environment
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy

# Install llm-compressor
pip install llmcompressor

# Clone lmdeploy source code and run the quantization example
git clone https://github.com/InternLM/lmdeploy
cd lmdeploy
python examples/lite/fp8/qwen3_30b_a3b_fp8.py --work-dir ./qwen3_30b_a3b_fp8

```

In the following sections, we will use this quantized model as an example to introduce model deployment and accuracy evaluation methods.

## Model Deployment

### Offline Inference

With the quantized model, offline batch processing can be implemented with just a few lines of code:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig()
with pipeline("./qwen3_30b_a3b_fp8", backend_config=engine_config) as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)
```

For a detailed introduction to the pipeline, please refer to [here](https://lmdeploy.readthedocs.io/en/latest/llm/pipeline.html).
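
Generation parameters can also be customized per request. The snippet below is a minimal sketch that assumes LMDeploy's `GenerationConfig`; the sampling values are illustrative only and not tuned for this model:

```python
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

engine_config = TurbomindEngineConfig()
# Illustrative sampling settings; adjust them for your use case
gen_config = GenerationConfig(max_new_tokens=512, temperature=0.7, top_p=0.95)

with pipeline("./qwen3_30b_a3b_fp8", backend_config=engine_config) as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
    print(response)
```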

### Online Serving

LMDeploy's api_server can launch the model as a service with a single command, exposing RESTful APIs that are compatible with the OpenAI interface. Below is an example of starting the service:

```shell
lmdeploy serve api_server ./qwen3_30b_a3b_fp8 --backend turbomind
```

The default service port is 23333. After the server starts, you can access the service via the OpenAI SDK. For command arguments and methods to access the service, please read [this](https://lmdeploy.readthedocs.io/en/latest/llm/api_server.html) document.
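
For example, once the server is running, a chat request can be sent with the OpenAI Python SDK. This is a minimal sketch that assumes the server is reachable at the default local address and port; the served model name is read back from the `/v1/models` endpoint:

```python
from openai import OpenAI

# The api_server does not validate the key by default; any placeholder works
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Hi, pls intro yourself'}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```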

## Accuracy Evaluation

We deployed FP8-quantized models of Qwen3-8B (Dense) and Qwen3-30B-A3B (MoE) as services via LMDeploy, and evaluated them on several academic datasets using [opencompass](https://github.com/open-compass/opencompass). The results show that the accuracy gap between the FP8-quantized models and the BF16 models is not significant, which is in line with expectations.

| dataset           | Qwen3-8B (bf16) | Qwen3-8B (fp8) | Qwen3-30B-A3B (bf16) | Qwen3-30B-A3B (fp8) |
| ----------------- | --------------- | -------------- | -------------------- | ------------------- |
| ifeval | 85.58 | 87.62 | 86.32 | 86.51 |
| hle | 5.05 | 5.89 | 7.00 | 7.51 |
| gpqa | 59.97 | 59.22 | 61.74 | 60.73 |
| aime2025 | 69.48 | 70.00 | 73.44 | 71.15 |
| mmlu_pro | 73.69 | 73.54 | 77.85 | 77.50 |
| LCBCodeGeneration | 50.86 | 49.81 | 56.67 | 56.86 |

For reproduction methods, please refer to [this](https://lmdeploy.readthedocs.io/en/latest/benchmark/evaluate_with_opencompass.html) document.
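
As a rough sketch of how such an evaluation can be wired up, the config below points opencompass at the running service through its OpenAI-compatible endpoint. The `OpenAISDK` wrapper, its field names, and the values shown are assumptions for illustration; take the exact configuration and dataset selection from the linked document:

```python
# Sketch of an opencompass model config targeting the LMDeploy api_server.
# Field names follow opencompass's OpenAI-compatible wrapper as an assumption.
from opencompass.models import OpenAISDK

models = [
    dict(
        type=OpenAISDK,
        abbr='qwen3-30b-a3b-fp8-lmdeploy',
        path='qwen3_30b_a3b_fp8',                   # served model name
        key='EMPTY',                                # api_server does not check the key by default
        openai_api_base='http://0.0.0.0:23333/v1',
        max_out_len=8192,
        batch_size=8,
        temperature=0.6,
    )
]

# datasets = [...]  # pick datasets (ifeval, gpqa, mmlu_pro, ...) per the linked document
```
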
@@ -1,10 +1,10 @@
# llm-compressor Support
# llm-compressor-int4 Support

This guide aims to introduce how to use LMDeploy's TurboMind inference engine to run models quantized by the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) tool.

Currently supported `llm-compressor` quantization types include:
Currently supported `llm-compressor-int4` quantization types include:

- int4 quantization (e.g., AWQ, GPTQ)
- AWQ, GPTQ

These quantized models can run via the TurboMind engine on the following NVIDIA GPU architectures:

@@ -20,7 +20,7 @@ These quantized models can run via the TurboMind engine on the following NVIDIA
| 9.0 | Hopper | H20, H200, H100, GH200 |
| 12.0 | Blackwell | GeForce RTX 50 series |

LMDeploy will continue to follow up and expand support for the `llm-compressor` project.
LMDeploy will continue to follow up and expand support for the `llm-compressor-int4` project.

The remainder of this document consists of the following sections:

@@ -34,9 +34,9 @@ The remainder of this document consists of the following sections:

## Model Quantization

`llm-compressor` provides a wealth of model quantization [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples). Please refer to its tutorials to select a quantization algorithm supported by LMDeploy to complete your model quantization work.
`llm-compressor-int4` provides a wealth of model quantization [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples). Please refer to its tutorials to select a quantization algorithm supported by LMDeploy to complete your model quantization work.

LMDeploy also provides a built-in [script](https://github.com/InternLM/lmdeploy/blob/main/examples/lite/qwen3_30b_a3b_awq.py) for AWQ quantization of **Qwen3-30B-A3B** using `llm-compressor` for your reference:
LMDeploy also provides a built-in [script](https://github.com/InternLM/lmdeploy/blob/main/examples/lite/int4/qwen3_30b_a3b_awq.py) for AWQ quantization of **Qwen3-30B-A3B** using `llm-compressor-int4` for your reference:

```shell
# Create conda environment
@@ -49,7 +49,8 @@ pip install llmcompressor
# Clone lmdeploy source code and run the quantization example
git clone https://github.com/InternLM/lmdeploy
cd lmdeploy
python examples/lite/qwen3_30b_a3b_awq.py --work-dir ./qwen3_30b_a3b_awq
python examples/lite/int4/qwen3_30b_a3b_awq.py --work-dir ./qwen3_30b_a3b_awq

```

In the following sections, we will use this quantized model as an example to introduce model deployment and accuracy evaluation methods.
@@ -62,7 +63,6 @@ With the quantized model, offline batch processing can be implemented with just

```python
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig()
with pipeline("./qwen3_30b_a3b_4bit", backend_config=engine_config) as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
@@ -83,7 +83,7 @@ The default service port is 23333. After the server starts, you can access the s

## Accuracy Evaluation

Aftering deploying AWQ symmetric/asymmetric quantized models of Qwen3-8B (Dense) and Qwen3-30B-A3B (MoE) as services via LMDeploy, we evaluated their accuracy on several academic datasets using [opencompass](https://github.com/open-compass/opencompass). Results indicate that, for Qwen3-8B, asymmetric quantization generally outperforms symmetric quantization, while Qwen3-30B-A3B shows no substantial difference between symmetric and asymmetric quantization. Compared with BF16, Qwen3-8B shows a smaller accuracy gap under both symmetric and asymmetric quantization than Qwen3-30B-A3B. Compared with BF16, accuracy drops significantly on long-output datasets such as aime2025 (avg 17,635 tokens) and LCB (avg 14,157 tokens), while on medium/short-output datasets like ifeval (avg 1,885 tokens) and mmlu_pro (avg 2,826 tokens), the accuracy is as expected.
We deployed AWQ symmetric/asymmetric quantized models of Qwen3-8B (Dense) and Qwen3-30B-A3B (MoE) as services via LMDeploy, and evaluated their accuracy on several academic datasets using [opencompass](https://github.com/open-compass/opencompass). Results indicate that, for Qwen3-8B, asymmetric quantization generally outperforms symmetric quantization, while Qwen3-30B-A3B shows no substantial difference between symmetric and asymmetric quantization. Compared with BF16, Qwen3-8B shows a smaller accuracy gap under both symmetric and asymmetric quantization than Qwen3-30B-A3B. Compared with BF16, accuracy drops significantly on long-output datasets such as aime2025 (avg 17,635 tokens) and LCB (avg 14,157 tokens), while on medium/short-output datasets like ifeval (avg 1,885 tokens) and mmlu_pro (avg 2,826 tokens), the accuracy is as expected.

| dataset | Qwen3-8B | | | Qwen3-30B-A3B | | |
| ----------------- | -------- | ------- | -------- | ------------- | ------- | -------- |
96 changes: 96 additions & 0 deletions docs/zh_cn/quantization/llm_compressor_fp8.md
@@ -0,0 +1,96 @@
# llm-compressor-fp8 Support

This guide aims to introduce how to use LMDeploy's TurboMind inference engine to run models FP8-quantized by the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) tool.

Currently supported `llm-compressor-fp8` quantization types include:

- AWQ, GPTQ

These quantized models can run via the TurboMind engine on the following NVIDIA GPU architectures:

| Compute Capability | Micro-architecture | GPUs |
| ------------------ | ------------------ | ------------------------------- |
| 7.0 | Volta | V100 |
| 7.2 | Volta | Jetson Xavier |
| 7.5 | Turing | GeForce RTX 20 series, T4 |
| 8.0 | Ampere | A100, A800, A30 |
| 8.6 | Ampere | GeForce RTX 30 series, A40, A10 |
| 8.7 | Ampere | Jetson Orin |
| 8.9 | Ada Lovelace | GeForce RTX 40 series, L40, L20 |
| 9.0 | Hopper | H20, H200, H100, GH200 |
| 12.0 | Blackwell | GeForce RTX 50 series |

LMDeploy will continue to track and expand support for the `llm-compressor-fp8` project.

The remainder of this document consists of the following sections:

<!-- toc -->

- [Model Quantization](#model-quantization)
- [Model Deployment](#model-deployment)
- [Accuracy Evaluation](#accuracy-evaluation)

<!-- tocstop -->

## Model Quantization

`llm-compressor-fp8` provides a wealth of model quantization [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples). Please refer to its tutorials and select a quantization algorithm supported by LMDeploy to quantize your model.

LMDeploy also provides a built-in [script](https://github.com/InternLM/lmdeploy/blob/main/examples/lite/fp8/qwen3_30b_a3b_fp8.py) for FP8 quantization of Qwen3-30B-A3B using `llm-compressor-fp8` for your reference:

```shell
# Create conda environment
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy

# Install llm-compressor
pip install llmcompressor

# Clone lmdeploy source code and run the quantization example
git clone https://github.com/InternLM/lmdeploy
cd lmdeploy
python examples/lite/fp8/qwen3_30b_a3b_fp8.py --work-dir ./qwen3_30b_a3b_fp8

```

In the following sections, we will use this quantized model as an example to introduce model deployment and accuracy evaluation.

## Model Deployment

### Offline Inference

With the quantized model, offline batch processing can be implemented with just a few lines of code:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig()
with pipeline("./qwen3_30b_a3b_fp8", backend_config=engine_config) as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)
```

For a detailed introduction to the pipeline, please refer to [here](https://lmdeploy.readthedocs.io/zh-cn/latest/llm/pipeline.html).
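
Streaming output is supported by the pipeline as well. The snippet below is a small sketch assuming the pipeline's `stream_infer` interface; each yielded item carries the incremental text of one request:

```python
from lmdeploy import TurbomindEngineConfig, pipeline

engine_config = TurbomindEngineConfig()
with pipeline("./qwen3_30b_a3b_fp8", backend_config=engine_config) as pipe:
    # Print text as it is generated instead of waiting for the full response
    for item in pipe.stream_infer(["Hi, pls intro yourself"]):
        print(item.text, end="", flush=True)
```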

### Online Serving

LMDeploy's api_server can launch the model as a service with a single command, exposing RESTful APIs that are compatible with the OpenAI interface. Below is an example of starting the service:

```shell
lmdeploy serve api_server ./qwen3_30b_a3b_fp8 --backend turbomind
```

The default service port is 23333. After the server starts, you can access the service via the OpenAI SDK. For the command arguments and ways to access the service, please read [this](https://lmdeploy.readthedocs.io/zh-cn/latest/llm/api_server.html) document.
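
The service can also be called without the OpenAI SDK, since it exposes standard HTTP endpoints. The sketch below assumes the default local address and uses `requests` against the OpenAI-compatible `/v1/chat/completions` route:

```python
import requests

base_url = 'http://0.0.0.0:23333/v1'
# The served model name can be read back from the /v1/models endpoint
model_name = requests.get(f'{base_url}/models').json()['data'][0]['id']

payload = {
    'model': model_name,
    'messages': [{'role': 'user', 'content': 'Hi, pls intro yourself'}],
}
resp = requests.post(f'{base_url}/chat/completions', json=payload)
print(resp.json()['choices'][0]['message']['content'])
```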

## Accuracy Evaluation

We deployed the FP8-quantized models of Qwen3-8B (Dense) and Qwen3-30B-A3B (MoE) as services via LMDeploy and evaluated them on several academic datasets using [opencompass](https://github.com/open-compass/opencompass). The results show that the accuracy gap between the FP8-quantized models and the BF16 models is not significant, which is in line with expectations.

| dataset           | Qwen3-8B (bf16) | Qwen3-8B (fp8) | Qwen3-30B-A3B (bf16) | Qwen3-30B-A3B (fp8) |
| ----------------- | --------------- | -------------- | -------------------- | ------------------- |
| ifeval | 85.58 | 87.62 | 86.32 | 86.51 |
| hle | 5.05 | 5.89 | 7.00 | 7.51 |
| gpqa | 59.97 | 59.22 | 61.74 | 60.73 |
| aime2025 | 69.48 | 70.00 | 73.44 | 71.15 |
| mmlu_pro | 73.69 | 73.54 | 77.85 | 77.50 |
| LCBCodeGeneration | 50.86 | 49.81 | 56.67 | 56.86 |

For reproduction methods, please refer to [this](https://lmdeploy.readthedocs.io/zh-cn/latest/benchmark/evaluate_with_opencompass.html) document.
@@ -1,9 +1,9 @@
# llm-compressor Support
# llm-compressor-int4 Support

This guide aims to introduce how to use LMDeploy's TurboMind inference engine to run models quantized by the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) tool.
Currently supported `llm-compressor` quantization types include:
This guide aims to introduce how to use LMDeploy's TurboMind inference engine to run models int4-quantized by the [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) tool.
Currently supported `llm-compressor-int4` quantization types include:

- int4 quantization (e.g., AWQ, GPTQ)
- AWQ, GPTQ

These quantized models can run via the TurboMind engine on the following NVIDIA GPU architectures:

@@ -19,7 +19,7 @@
| 9.0 | Hopper | H20, H200, H100, GH200 |
| 12.0 | Blackwell | GeForce RTX 50 series |

LMDeploy will continue to follow up and expand support for the `llm-compressor` project.
LMDeploy will continue to follow up and expand support for the `llm-compressor-int4` project.

The remainder of this document consists of the following sections:

@@ -33,8 +33,8 @@ LMDeploy will continue to follow up and expand support for the `llm-compressor` project.

## Model Quantization

`llm-compressor` provides a wealth of model quantization [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples). Please refer to its tutorials to select a quantization algorithm supported by LMDeploy to complete your model quantization work.
LMDeploy also provides a built-in [script](https://github.com/InternLM/lmdeploy/blob/main/examples/lite/qwen3_30b_a3b_awq.py) for AWQ quantization of Qwen3-30B-A3B using `llm-compressor` for your reference:
`llm-compressor-int4` provides a wealth of model quantization [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples). Please refer to its tutorials to select a quantization algorithm supported by LMDeploy to complete your model quantization work.
LMDeploy also provides a built-in [script](https://github.com/InternLM/lmdeploy/blob/main/examples/lite/int4/qwen3_30b_a3b_awq.py) for AWQ quantization of Qwen3-30B-A3B using `llm-compressor-int4` for your reference:

```shell
# Create conda environment
@@ -47,7 +47,7 @@ pip install llmcompressor
# Clone lmdeploy source code and run the quantization example
git clone https://github.com/InternLM/lmdeploy
cd lmdeploy
python examples/lite/qwen3_30b_a3b_awq.py --work-dir ./qwen3_30b_a3b_awq
python examples/lite/int4/qwen3_30b_a3b_awq.py --work-dir ./qwen3_30b_a3b_awq

```

68 changes: 68 additions & 0 deletions examples/lite/fp8/qwen3_30b_a3b_fp8.py
@@ -0,0 +1,68 @@
import argparse

from compressed_tensors.offload import dispatch_model
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer


def parse_args():
    parser = argparse.ArgumentParser(description='Run FP8 quantization for Qwen3 model')

    parser.add_argument('--work-dir',
                        type=str,
                        default='./qwen3_30b_a3b_fp8',
                        required=True,
                        help='The directory to save the quantized model')

    parser.add_argument('--model-id',
                        type=str,
                        default='Qwen/Qwen3-30B-A3B',
                        help='The Hugging Face model ID to quantize')
    return parser.parse_args()

def main():
    # 1. Parse command-line arguments
    args = parse_args()
    MODEL_ID = args.model_id
    SAVE_DIR = args.work_dir

    print(f'Loading model: {MODEL_ID}')
    print(f'Saving to: {SAVE_DIR}')

    # 2. Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype='auto', device_map='auto', trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

    # 3. Configure the quantization algorithm and scheme.
    # In this case, the FP8_BLOCK scheme is used to:
    #   * quantize the weights to fp8 block-wise via PTQ
    #   * quantize the activations to fp8 dynamically at runtime
    recipe = QuantizationModifier(
        targets='Linear',
        scheme='FP8_BLOCK',
        ignore=['lm_head', 're:.*mlp.gate$'],
    )

    # 4. Run quantization
    print('Starting quantization...')
    oneshot(model=model, recipe=recipe)

    # 5. Confirm generations of the quantized model look sane
    print('========== SAMPLE GENERATION ==============')
    dispatch_model(model)
    input_ids = tokenizer('Hello my name is', return_tensors='pt').input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=20)
    print(tokenizer.decode(output[0]))
    print('==========================================')

    # 6. Save quantized model
    print('Saving model...')
    model.save_pretrained(SAVE_DIR)
    tokenizer.save_pretrained(SAVE_DIR)


if __name__ == '__main__':
    main()
3 changes: 1 addition & 2 deletions lmdeploy/lite/apis/auto_awq.py
@@ -8,12 +8,11 @@
import torch
from torch import nn

from lmdeploy.lite.apis.calibrate import LAYER_TYPE_MAP, calibrate
from lmdeploy.lite.quantization.awq import FC_FCS_MAP, NORM_FCS_MAP, awq_layers, quant_weights, smooth_layers
from lmdeploy.lite.utils import collect_target_modules
from lmdeploy.utils import try_import_deeplink

from .calibrate import LAYER_TYPE_MAP, calibrate


def save_vl_model(vl_model, model_path, dst_path):
vl_model.save_pretrained(dst_path, safe_serialization=True)