This repository was archived by the owner on May 27, 2026. It is now read-only.
update readme for vLLM 0.17.0 release on Intel GPU #971
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,254 @@ | ||
| # Optimize LLM Serving with vLLM on Intel® GPUs | ||
|
|
||
| vLLM is a fast and easy-to-use library for LLM inference and serving. It has grown into a community-driven project with contributions from both academia and industry. Intel, as an active community contributor, continues to improve vLLM performance and usability on Intel® platforms, including Intel® Xeon® Scalable Processors, Intel® discrete GPUs, and Intel® Gaudi® AI accelerators. This document focuses on Intel® discrete GPUs and provides the information needed to run these workloads effectively on Intel® graphics cards. | ||
|
|
||
| This release is the first to switch to the optimized kernel library [vllm-xpu-kernels](https://github.com/vllm-project/vllm-xpu-kernels) for Intel® GPUs. The vLLM build included in this container uses the same code base as [v0.17.0](https://github.com/vllm-project/vllm/tree/v0.17.0) and has been validated on [Intel® Arc™ Pro B-Series Graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/workstations/b-series/overview.html) cards. The following bill of materials was used for validation: | ||
|
|
||
| | Ingredients | Version | | ||
| | --- | --- | | ||
| | Host OS | Ubuntu 25.04 | | ||
| | Python | 3.12 | | ||
| | KMD Driver | 6.14.0 | | ||
| | oneAPI | 2025.3.2.4 with hotfix | | ||
| | PyTorch | 2.10 | | ||
| | vllm-xpu-kernels | 0.14.0 | | ||
| | oneCCL | 2021.15.7.8 | | ||
|
|
||
| ## 1. What's Supported? | ||
|
|
||
| This release supports core vLLM serving capabilities on Intel® GPUs, including online FP8 quantization, multimodal models, pooling models, and multi-GPU scaling strategies. In addition to dense-model serving, it also includes experimental expert parallelism and validated support for MoE models. | ||
|
|
||
| | Feature | Description | Note | | ||
| | --- | --- | --- | | ||
| | FP8 Online Quantization | vLLM supports weight-only online dynamic quantization with FP8, enabling up to a 2x reduction in model memory requirements and up to a 1.6x throughput improvement with minimal accuracy impact. Models in BF16 or FP16 can be quantized dynamically to FP8 without calibration data. | See the [example](https://docs.vllm.ai/en/stable/features/quantization/fp8/?h=online+dynamic#online-dynamic-quantization). | | ||
| | Multi-Modality Support | Most popular multimodal models in the upstream [list](https://docs.vllm.ai/en/stable/models/supported_models/#list-of-multimodal-language-models) are supported, such as the Qwen VL series, the InternVL series, whisper-large-v3, DeepSeek-OCR, and PaddleOCR-VL. | For example, `Qwen/Qwen2.5-VL-32B-Instruct` can be launched on 4 Intel® Arc™ Pro B60 Graphics cards for multimodal processing. | ||
| | Pooling Models Support | vLLM supports pooling models such as embedding, classification, and reward models. All of these models are now supported on Intel® GPUs. | For detailed usage, refer to the [guide](https://docs.vllm.ai/en/latest/models/pooling_models.html). | | ||
| | Pipeline Parallelism | Pipeline parallelism distributes model layers across multiple GPUs, with each GPU processing a different stage of the model in sequence. | On Intel® GPUs, this is supported on a single node with `mp` as the backend. | | ||
| | Data Parallelism | vLLM supports [Data Parallelism](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html), where model weights are replicated across separate instances or GPUs to process independent request batches. | Supports both dense and MoE models. | | ||
| | Expert Parallelism | Experimental support for [Expert Parallelism](https://docs.vllm.ai/en/stable/serving/expert_parallel_deployment), which allows experts in Mixture-of-Experts (MoE) models to be deployed across separate GPUs. | In this release, `TP+DP+EP` is supported. | | ||
|
|
||
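The parallelism rows in the table above map onto `vllm serve` options. The following is a minimal sketch that only builds and prints candidate command lines rather than launching anything; the GPU counts (2 pipeline stages, 2 data-parallel replicas, TP=2 with DP=2) and the choice of `Qwen/Qwen3-30B-A3B` are illustrative assumptions, not validated configurations from this release.

```bash
# MoE model taken from the validated list; any supported model could be used.
MODEL="Qwen/Qwen3-30B-A3B"

# Pipeline parallelism: 2 stages on a single node with the mp backend.
pp_cmd=(vllm serve "$MODEL" --pipeline-parallel-size 2 --distributed-executor-backend mp)

# Data parallelism: 2 replicas of the model weights.
dp_cmd=(vllm serve "$MODEL" --data-parallel-size 2)

# TP+DP+EP: tensor parallel 2 x data parallel 2, with expert parallelism enabled.
ep_cmd=(vllm serve "$MODEL" --tensor-parallel-size 2 --data-parallel-size 2 --enable-expert-parallel)

# Print the commands for review instead of running them.
printf '%s\n' "${pp_cmd[*]}" "${dp_cmd[*]}" "${ep_cmd[*]}"
```

To launch one of them, expand the array directly, for example `"${pp_cmd[@]}"`, and add the other serving options your deployment needs.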
| In addition, features such as [reasoning_outputs](https://docs.vllm.ai/en/latest/features/reasoning_outputs.html), [structured_outputs](https://docs.vllm.ai/en/latest/features/structured_outputs.html), and [tool calling](https://docs.vllm.ai/en/latest/features/tool_calling.html) are supported. The following experimental features are also available: | ||
|
|
||
| * **torch.compile**: Can be enabled for the FP16/BF16 path. | ||
| * **speculative decoding**: Supports the `n-gram`, `EAGLE`, `EAGLE3`, `medusa`, and `suffix` methods. For detailed usage, refer to the [documentation](https://docs.vllm.ai/en/stable/features/speculative_decoding/). | ||
|
Contributor
refer -> refer to |
||
| * **async scheduling**: Disabled by default; can be enabled with `--async-scheduling`. This may reduce CPU overhead, leading to better latency and throughput. | ||
|
Contributor
Async scheduling is not supported in this release.
Contributor
Author
It's disabled by default but user can explicitly set it. It doesn't fail in all cases so I think we can call it experimental. |
||
|
|
||
| ## 2. Supported Models | ||
|
|
||
| The following tables list only the models verified by Intel. vLLM support on Intel® GPUs extends to a wider range of models. | ||
|
|
||
| ### Text Generation Models | ||
|
|
||
| These models primarily support the `LLM.generate` API. Chat/Instruct models additionally support the `LLM.chat` API. | ||
|
|
||
| | Model (company/model name) | BF16/FP16 | Dynamic Online FP8 | MXFP4 | | ||
| |-------------------------------------------| --- | --- | -- | | ||
| | openai/gpt-oss-20b | | |✅︎| | ||
| | openai/gpt-oss-120b | | |✅︎| | ||
| | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |✅︎|✅︎| | | ||
| | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B |✅︎|✅︎| | | ||
| | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |✅︎|✅︎| | | ||
| | deepseek-ai/DeepSeek-R1-Distill-Llama-70B |✅︎|✅︎| | | ||
| | Qwen/Qwen2.5-72B-Instruct |✅︎|✅︎| | | ||
| | Qwen/Qwen3-14B |✅︎|✅︎| | | ||
| | Qwen/Qwen3-32B |✅︎|✅︎| | | ||
| | Qwen/Qwen3-30B-A3B |✅︎|✅︎| | | ||
| | Qwen/Qwen3-30B-A3B-GPTQ-Int4 |✅︎|✅︎| | | ||
| | Qwen/Qwen3-coder-30B-A3B-Instruct |✅︎|✅︎| | | ||
| | Qwen/QwQ-32B |✅︎|✅︎| | | ||
| | openbmb/MiniCPM-V-4 |✅︎|✅︎| | | ||
| | deepseek-ai/DeepSeek-V2-Lite |✅︎|✅︎| | | ||
| | meta-llama/Llama-3.1-8B-Instruct |✅︎|✅︎| | | ||
| | THUDM/GLM-4-9B-chat |✅︎|✅︎| | | ||
| | THUDM/GLM-4v-9B-chat |✅︎|✅︎| | | ||
| | THUDM/CodeGeex4-All-9B |✅︎|✅︎| | | ||
| | chuhac/TeleChat2-35B |✅︎|✅︎| | | ||
| | 01-ai/Yi1.5-34B-Chat |✅︎|✅︎| | | ||
| | deepseek-ai/DeepSeek-Coder-33B-base |✅︎|✅︎| | | ||
| | meta-llama/Llama-2-13b-chat-hf |✅︎|✅︎| | | ||
| | Qwen/Qwen1.5-14B-Chat |✅︎|✅︎| | | ||
| | Qwen/Qwen1.5-32B-Chat |✅︎|✅︎| | | ||
|
|
||
| ### Multimodal Models | ||
|
|
||
| The supported modalities (text, image, video, audio) depend on the model: | ||
|
|
||
| | Model (company/model name) | BF16/FP16 | Dynamic Online FP8 | Text | Image | Video | Audio | | ||
| |-------------------------------------------| --- | --- | -- | -- | -- | -- | | ||
| | openai/whisper-large-v3 |✅︎| | | | |✅︎| | ||
| | deepseek-ai/DeepSeek-OCR |✅︎|✅︎|✅︎|✅︎| | | | ||
| | PaddlePaddle/PaddleOCR-VL |✅︎|✅︎|✅︎|✅︎| | | | ||
| | Qwen/Qwen2-VL-7B-Instruct |✅︎|✅︎|✅︎|✅︎|✅︎| | | ||
| | Qwen/Qwen2.5-VL-72B-Instruct |✅︎|✅︎|✅︎|✅︎|✅︎| | | ||
| | Qwen/Qwen2.5-VL-32B-Instruct |✅︎|✅︎|✅︎|✅︎|✅︎| | | ||
| | OpenGVLab/InternVL3_5-8B |✅︎|✅︎|✅︎|✅︎|✅︎| | | ||
| | OpenGVLab/InternVL3_5-14B |✅︎|✅︎|✅︎|✅︎|✅︎| | | ||
| | OpenGVLab/InternVL3_5-38B |✅︎|✅︎|✅︎|✅︎|✅︎| | | ||
| | OpenGVLab/InternVL3_5-30B-A3B |✅︎|✅︎|✅︎|✅︎|✅︎| | | ||
| | THUDM/GLM-4v-9B |✅︎|✅︎|✅︎|✅︎| | | | ||
| | openbmb/MiniCPM-V-4 |✅︎|✅︎|✅︎|✅︎|✅︎| | | ||
|
|
||
| ### Pooling Models | ||
|
|
||
| These models primarily support the `LLM.embed` API. The following table lists the models tested on XPU. | ||
|
|
||
| | Model Type | Model (company/model name) | BF16 | Dynamic Online FP8 | | ||
| |-----------------|-------------------------------------------| --- | --- | | ||
| | Embedding Model | Qwen/Qwen3-Embedding-8B |✅︎|✅︎| | ||
| | Reranker Model | Qwen/Qwen3-Reranker-8B |✅︎|✅︎| | ||
|
|
||
| ## 3. Limitations | ||
|
|
||
| Some vLLM features still require additional enablement or refinement and are not included in the current release, such as LoRA (Low-Rank Adaptation), pipeline parallelism on Ray, and MLA (Multi-head Latent Attention). CPU KV-cache offloading also needs further refinement because of the kernel migration. | ||
|
|
||
| The following items are also known issues: | ||
|
|
||
| * There is an approximately 20% performance gap compared with previous IPEX releases. | ||
| * Set the `SYCL_UR_USE_LEVEL_ZERO_V2=0` environment variable to avoid unexpected OOM errors during inference. | ||
| * Set the block size to `64` (`--block-size 64`) for better accuracy. | ||
| * For `Qwen/Qwen3-30B-A3B` in FP16/BF16, set `PYTORCH_ALLOC_CONF=expandable_segments:True` or `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to enable expandable blocks in the cache allocator. | ||
| * W8A8 quantized models generated with `llm_compressor`, such as `RedHatAI/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic`, are not supported yet. | ||
|
|
||
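The environment-variable workarounds listed above could be applied in the server terminal before launching `vllm serve`. This is a minimal sketch, assuming a Bash shell inside the container; the variable names and values are the ones given in the list, and the allocator setting is only called for when serving `Qwen/Qwen3-30B-A3B` in FP16/BF16.

```bash
# Avoid unexpected OOM errors during inference (known issue listed above).
export SYCL_UR_USE_LEVEL_ZERO_V2=0

# Enable expandable blocks in the cache allocator; listed above for
# Qwen/Qwen3-30B-A3B in FP16/BF16.
export PYTORCH_ALLOC_CONF=expandable_segments:True

# Show the values that a server started from this shell will inherit.
env | grep -E '^(SYCL_UR_USE_LEVEL_ZERO_V2|PYTORCH_ALLOC_CONF)='
```

Because the variables are exported, they apply only to processes started from the same shell session, so set them in the terminal that runs the server.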
| ## 4. How to Get Started | ||
|
|
||
| ### 4.1. Prerequisites | ||
|
|
||
| | OS | Hardware | | ||
| | ---------- | ---------- | | ||
| | Ubuntu 25.04 | Intel® Arc™ B-Series | | ||
|
|
||
| ### 4.2. Prepare a Serving Environment | ||
|
|
||
| 1. Pull the released Docker image: | ||
|
|
||
| ```bash | ||
| docker pull intel/vllm:0.17.0-xpu | ||
| ``` | ||
|
|
||
| 2. Start a container: | ||
|
|
||
| ```bash | ||
| docker run -t -d --shm-size 10g --net=host --ipc=host --privileged \ | ||
| -v /dev/dri/by-path:/dev/dri/by-path --name=vllm-test \ | ||
| --device /dev/dri:/dev/dri --entrypoint= intel/vllm:0.17.0-xpu /bin/bash | ||
| ``` | ||
|
|
||
| 3. Open two terminals and run `docker exec -it vllm-test bash` in both of them. Use one terminal for the server and the other for the client. | ||
|
|
||
| From this point on, all commands are expected to be run inside the Docker container unless noted otherwise. | ||
|
|
||
| In both environments, you may want to set the `HUGGING_FACE_HUB_TOKEN` environment variable to ensure that required files can be downloaded from Hugging Face. | ||
|
|
||
| ```bash | ||
| export HUGGING_FACE_HUB_TOKEN=xxxxxx | ||
| ``` | ||
|
|
||
| ### 4.3. Launch Workloads | ||
|
|
||
| #### 4.3.1. Launch Server in the Server Environment | ||
|
|
||
| Command: | ||
|
|
||
| ```bash | ||
| VLLM_WORKER_MULTIPROC_METHOD=spawn vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \ | ||
| --dtype=float16 \ | ||
| --enforce-eager \ | ||
| --port 8000 \ | ||
| --block-size 64 \ | ||
| --gpu-memory-utilization 0.9 \ | ||
| --no-enable-prefix-caching \ | ||
| --trust-remote-code \ | ||
| --disable-sliding-window \ | ||
| --disable-log-requests \ | ||
| --max-num-batched-tokens=8192 \ | ||
| --max-model-len 4096 \ | ||
| -tp=4 \ | ||
| --quantization fp8 | ||
| ``` | ||
|
|
||
| Expected output: | ||
|
|
||
| ```bash | ||
| INFO 03-20 03:20:29 api_server.py:937] Starting vLLM API server on http://0.0.0.0:8000 | ||
| INFO 03-20 03:20:29 launcher.py:23] Available routes are: | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /docs, Methods: HEAD, GET | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /docs/oauth2-redirect, Methods: HEAD, GET | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /redoc, Methods: HEAD, GET | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /health, Methods: GET | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /ping, Methods: POST, GET | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /tokenize, Methods: POST | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /detokenize, Methods: POST | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /v1/models, Methods: GET | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /version, Methods: GET | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /v1/chat/completions, Methods: POST | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /v1/completions, Methods: POST | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /v1/embeddings, Methods: POST | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /pooling, Methods: POST | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /score, Methods: POST | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /v1/score, Methods: POST | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /rerank, Methods: POST | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /v1/rerank, Methods: POST | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /v2/rerank, Methods: POST | ||
| INFO 03-20 03:20:29 launcher.py:31] Route: /invocations, Methods: POST | ||
| INFO: Started server process [1636943] | ||
| INFO: Waiting for application startup. | ||
| INFO: Application startup complete. | ||
| ``` | ||
|
|
||
| Startup may take some time. When `INFO: Application startup complete.` appears, the server is ready. | ||
|
|
||
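Instead of watching the log, the client terminal could poll the `/health` route that appears in the route list above. The following is a sketch under assumptions: the function name `wait_for_server`, the default URL, the retry count, and the delay are all illustrative choices, and `curl` is assumed to be available in the container.

```bash
# Poll a health URL until it answers successfully or the attempts run out.
# Arguments (all optional): URL, number of attempts, delay in seconds.
wait_for_server() {
  local url="${1:-http://localhost:8000/health}"
  local attempts="${2:-60}"
  local delay="${3:-5}"
  local i
  for ((i = 1; i <= attempts; i++)); do
    # --fail makes curl return non-zero on HTTP error responses.
    if curl --silent --fail --output /dev/null "$url"; then
      echo "server is ready"
      return 0
    fi
    sleep "$delay"
  done
  echo "server did not become ready after ${attempts} attempts" >&2
  return 1
}
```

Calling `wait_for_server` with no arguments waits up to about five minutes for the server on port 8000. Once it returns successfully, a request to the `/v1/models` route is a quick way to confirm which model is being served.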
| #### 4.3.2. Send Benchmark Requests from the Client Environment | ||
|
|
||
| Use the following command to send benchmark requests: | ||
|
|
||
| ```bash | ||
| vllm bench serve \ | ||
| --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \ | ||
| --dataset-name random \ | ||
| --random-input-len=1024 \ | ||
| --random-output-len=1024 \ | ||
| --ignore-eos \ | ||
| --num-prompts 16 \ | ||
| --max-concurrency 16 \ | ||
| --request-rate inf \ | ||
| --backend vllm \ | ||
| --port=8000 \ | ||
| --host 0.0.0.0 \ | ||
| --ready-check-timeout-sec 1 | ||
| ``` | ||
|
|
||
| This command uses the `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B` model. Both the input and output token lengths are set to `1024`, and up to `16` requests are processed concurrently by the server. | ||
|
|
||
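As a rough sanity check on the benchmark size, the token totals implied by these settings can be worked out in the shell. This sketch assumes that all 16 requests succeed and that each one uses exactly the configured input and output lengths; the totals reported by the benchmark may differ.

```bash
# Settings taken from the benchmark command above.
num_prompts=16
input_len=1024
output_len=1024

# Totals across all requests.
total_input=$((num_prompts * input_len))
total_output=$((num_prompts * output_len))
total_tokens=$((total_input + total_output))

echo "input tokens:  ${total_input}"
echo "output tokens: ${total_output}"
echo "all tokens:    ${total_tokens}"
```

Under these assumptions the run covers 16384 input tokens and 16384 generated tokens, 32768 tokens in total.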
| Expected output: | ||
|
|
||
| ```bash | ||
| Maximum request concurrency: 16 | ||
| ============ Serving Benchmark Result ============ | ||
| Successful requests: 1 | ||
| Benchmark duration (s): xxx | ||
| Total input tokens: 1024 | ||
| Total generated tokens: 1024 | ||
| Request throughput (req/s): xxx | ||
| Output token throughput (tok/s): xxx | ||
| Total Token throughput (tok/s): xxx | ||
| ---------------Time to First Token---------------- | ||
| Mean TTFT (ms): xxx | ||
| Median TTFT (ms): xxx | ||
| P99 TTFT (ms): xxx | ||
| -----Time per Output Token (excl. 1st token)------ | ||
| Mean TPOT (ms): xxx | ||
| Median TPOT (ms): xxx | ||
| P99 TPOT (ms): xxx | ||
| ---------------Inter-token Latency---------------- | ||
| Mean ITL (ms): xxx | ||
| Median ITL (ms): xxx | ||
| P99 ITL (ms): xxx | ||
| ================================================== | ||
| ``` | ||
|
|
||
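To compare runs, individual metrics can be pulled out of the saved benchmark output. The helper below is a hypothetical sketch, not part of vLLM: `extract_metric` and the `bench.log` file name are assumptions, and it relies on the `label: value` layout shown in the expected output above.

```bash
# Print the value of the line that starts with the given metric label.
# Reads benchmark output on stdin; splits on the colon and any spaces after it.
extract_metric() {
  awk -F': *' -v key="$1" 'index($0, key) == 1 { print $2 }'
}
```

For example, after saving a run with `vllm bench serve ... | tee bench.log` (other options elided), `extract_metric 'Output token throughput (tok/s)' < bench.log` prints only that throughput value.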
| ## 5. Need Assistance? | ||
|
|
||
| If you encounter any issues or have questions, please submit an issue at [vLLM GitHub Issues](https://github.com/vllm-project/vllm/issues). Include the text `[Intel GPU]` in the issue title so that it gets noticed. | ||