diff --git a/vllm/0.14.1-xpu.md b/vllm/0.14.1-xpu.md
index 74e0460d..6751afe5 100755
--- a/vllm/0.14.1-xpu.md
+++ b/vllm/0.14.1-xpu.md
@@ -9,7 +9,7 @@ The vLLM used in this docker image has same code base as [v0.14.1](https://githu
 | Host OS   | Ubuntu 25.04 |
 | Python   | 3.12 |
 | KMD Driver | 6.14.0 |
-| OneAPI   | 2025.3.2.4 with hotfix |
+| OneAPI   | 2025.3.2 with hotfix |
 | PyTorch   | PyTorch 2.10 |
 | IPEX   | 2.10.10 |
 | OneCCL   | 2021.15.7.8 |
diff --git a/vllm/0.17.0-xpu.md b/vllm/0.17.0-xpu.md
index 4bdbb6cb..026a023f 100755
--- a/vllm/0.17.0-xpu.md
+++ b/vllm/0.17.0-xpu.md
@@ -9,7 +9,7 @@ This release is the first to switch to the optimized kernel library [vllm-xpu-ke
 | Host OS | Ubuntu 25.04 |
 | Python | 3.12 |
 | KMD Driver | 6.14.0 |
-| oneAPI | 2025.3.2.4 with hotfix |
+| oneAPI | 2025.3.2 with hotfix |
 | PyTorch | 2.10 |
 | vllm-xpu-kernels | 0.1.4 |
 | oneCCL | 2021.15.7.8 |
@@ -130,7 +130,7 @@ The following items are also known issues:
    ```bash
    docker run -t -d --shm-size 10g --net=host --ipc=host --privileged \
      -v /dev/dri/by-path:/dev/dri/by-path --name=vllm-test \
-     --device /dev/dri:/dev/dri --entrypoint= intel/vllm:0.17.0-xpu /bin/bash
+     --device /dev/dri:/dev/dri --entrypoint=/bin/bash intel/vllm:0.17.0-xpu
    ```
 
 3. Open two terminals and run `docker exec -it vllm-test bash` in both of them. Use one terminal for the server and the other for the client.
diff --git a/vllm/xpu.md b/vllm/xpu.md
index 74e0460d..026a023f 100644
--- a/vllm/xpu.md
+++ b/vllm/xpu.md
@@ -1,82 +1,39 @@
 # Optimize LLM Serving with vLLM on Intel® GPUs
 
-vLLM is a fast and easy-to-use library for LLM inference and serving. It has evolved into a community-driven project with contributions from both academia and industry. Intel, as one of the community contributors, is working actively to bring satisfying performance with vLLM on Intel® platforms, including Intel® Xeon® Scalable Processors, Intel® discrete GPUs, as well as Intel® Gaudi® AI accelerators. This readme focuses on Intel® discrete GPUs at this time and brings you the necessary information to get the workloads running well on your Intel® graphics cards.
+vLLM is a fast and easy-to-use library for LLM inference and serving. It has grown into a community-driven project with contributions from both academia and industry. Intel, as an active community contributor, continues to improve vLLM performance and usability on Intel® platforms, including Intel® Xeon® Scalable Processors, Intel® discrete GPUs, and Intel® Gaudi® AI accelerators. This document focuses on Intel® discrete GPUs and provides the information needed to run these workloads effectively on Intel® graphics cards.
 
-The vLLM used in this docker image has same code base as [v0.14.1](https://github.com/vllm-project/vllm/tree/v0.14.1) and validated on [Intel® Arc™ Pro B-Series Graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/workstations/b-series/overview.html) Cards. It uses following BKC:
+This release is the first to switch to the optimized kernel library [vllm-xpu-kernels](https://github.com/vllm-project/vllm-xpu-kernels) for Intel® GPUs. The vLLM build included in this container uses the same code base as [v0.17.0](https://github.com/vllm-project/vllm/tree/v0.17.0) and has been validated on [Intel® Arc™ Pro B-Series Graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/workstations/b-series/overview.html) cards. The following bill of materials was used for validation:
 
-| Ingredients | Version |
-|-------------|------------------|
-| Host OS   | Ubuntu 25.04 |
-| Python   | 3.12 |
-| KMD Driver | 6.14.0 |
-| OneAPI   | 2025.3.2.4 with hotfix |
-| PyTorch   | PyTorch 2.10 |
-| IPEX   | 2.10.10 |
-| OneCCL   | 2021.15.7.8 |
+| Ingredients | Version |
+| --- | --- |
+| Host OS | Ubuntu 25.04 |
+| Python | 3.12 |
+| KMD Driver | 6.14.0 |
+| oneAPI | 2025.3.2 with hotfix |
+| PyTorch | 2.10 |
+| vllm-xpu-kernels | 0.1.4 |
+| oneCCL | 2021.15.7.8 |
 
-## 1. What's New in This Release?
+## 1. What's Supported?
 
-* INT4 oneDNN optimizations are integrated, delivering geomean about 4.5% improvement in end-to-end throughput.
-* The oneAPI is uplifted to version 2025.3 with official support for UR adaptor v2.
+This release supports core vLLM serving capabilities on Intel® GPUs, including online FP8 quantization, multimodal models, pooling models, and multi-GPU scaling strategies. In addition to dense-model serving, it also includes experimental expert parallelism and validated support for MoE models.
 
-Note that this release is the last to include IPEX. The next release will migrate vLLM kernels from IPEX to vllm-xpu-kernels, removing the IPEX dependency going forward.
+| Feature | Description | Note |
+| --- | --- | --- |
+| FP8 Online Quantization | vLLM supports weight-only online dynamic quantization with FP8, enabling up to a 2x reduction in model memory requirements and up to a 1.6x throughput improvement with minimal accuracy impact. Models in BF16 or FP16 can be quantized dynamically to FP8 without calibration data. | See the [example](https://docs.vllm.ai/en/stable/features/quantization/fp8/?h=online+dynamic#online-dynamic-quantization). |
+| Multi-Modality Support | We support most of the popular multimodal models in upstream's [list](https://docs.vllm.ai/en/stable/models/supported_models/#list-of-multimodal-language-models), such as Qwen VL series, InternVL series, whisper-large-v3, DeepSeek-OCR, and PaddleOCR-VL. | For example, `Qwen/Qwen2.5-VL-32B-Instruct` can be launched on 4 Intel® Arc™ Pro B60 Graphics cards for multimodal processing. |
+| Pooling Models Support | vLLM supports pooling models such as embedding, classification, and reward models. All of these models are now supported on Intel® GPUs. | For detailed usage, refer to the [guide](https://docs.vllm.ai/en/latest/models/pooling_models.html). |
+| Pipeline Parallelism | Pipeline parallelism distributes model layers across multiple GPUs, with each GPU processing a different stage of the model in sequence. | On Intel® GPUs, this is supported on a single node with `mp` as the backend. |
+| Data Parallelism | vLLM supports [Data Parallelism](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html), where model weights are replicated across separate instances or GPUs to process independent request batches. | Supports both dense and MoE models. |
+| Expert Parallelism | Experimental support for [Expert Parallelism](https://docs.vllm.ai/en/stable/serving/expert_parallel_deployment), which allows experts in Mixture-of-Experts (MoE) models to be deployed across separate GPUs. | In this release, `TP+DP+EP` is supported. |
 
-## 2. What's Supported?
+In addition, features such as [reasoning_outputs](https://docs.vllm.ai/en/latest/features/reasoning_outputs.html), [structured_outputs](https://docs.vllm.ai/en/latest/features/structured_outputs.html), and [tool calling](https://docs.vllm.ai/en/latest/features/tool_calling.html) are supported. The following experimental features are also available:
 
-Following up vLLM V1 design, corresponding optimized kernels and features are implemented for Intel GPUs.
-
-* Chunked prefill:
-
-  Chunked prefill is an optimization feature in vLLM that allows large prefill requests to be divided into small chunks and batched together with decode requests. This approach prioritizes decode requests, improving inter-token latency (ITL) and GPU utilization by combining compute-bound (prefill) and memory-bound (decode) requests in the same batch. vLLM v1 engine is built on this feature and in this release, it's also supported on intel GPUs by leveraging corresponding kernel from Intel® Extension for PyTorch\* for model execution.
-
-* FP8 W8A16 MatMul:
-
-  vLLM supports FP8 (8-bit floating point) weight using hardware acceleration on GPUs. We support weight-only online dynamic quantization with FP8, which allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.
-
-  Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying `--quantization="fp8"` in the command line or setting `quantization="fp8"` in the LLM constructor.
-
-  Besides, the FP8 types typically supported in hardware have two distinct representations, each useful in different scenarios:
-
-  * **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
-  * **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
-
-  We support both representations through ENV variable `VLLM_XPU_FP8_DTYPE` with default value `E5M2`.
-
-  :::{warning}
-  Currently, by default we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model. To avoid this, adding `VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1` can allow offloading weights to cpu before quantization and quantized weights will be kept in device.
-  :::
-
-* Multi-Modality Support
-
-  We support most of the popular multi-modality models in upstream's [list](https://docs.vllm.ai/en/latest/models/supported_models/?h=multim#list-of-multimodal-language-models), such as Qwen VL series, InternVL series, whisper-large-v3, DeepSeek-OCR and PaddleOCR-VL, etc. For example, the Qwen/Qwen2.5-VL-32B-Instruct model can be launched on 4 Intel® Arc™ Pro B60 Graphics cards for the multi modality process.
-
-* Pooling Models Support
-
-  vLLM supports pooling models such as embedding, classification and reward models. All of these models are now supported on Intel® GPUs. For detailed usage, refer to [guide](https://docs.vllm.ai/en/latest/models/pooling_models.html).
-
-* Pipeline Parallelism
-
-  Pipeline parallelism distributes model layers across multiple GPUs. Each GPU processes different parts of the model in sequence. For Intel® GPUs, we support it on single node with `mp` as the backend.
-
-* Data Parallelism
-
-  vLLM supports [Data Parallelism](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html) deployment, where model weights are replicated across separate instances/GPUs to process independent batches of requests. This will work with both dense and MoE models.
-
-* Expert Parallelism
-
-  Experimental support for [Expert Parallelism](https://docs.vllm.ai/en/stable/serving/expert_parallel_deployment), which allows experts in Mixture-of-Experts (MoE) models to be deployed on separate GPUs. It typically works together with Data Parallelism. In this release, we support TP+EP and DP+EP scenarios.
-
-* MoE models
-
-  Models with MoE structure like GPT-OSS 20B/120B in MXFP4 format, Deepseek-v2-lite, Qwen/Qwen3-30B-A3B and Qwen3-30B-A3B-GPTQ-Int4 are supported. Qwen3-NEXT-80B-A3B-Instruct and Qwen3-NEXT-80B-A3B-Thinking are also supported using online fp8 quantization.
-
-Besides features like [reasoning_outputs](https://docs.vllm.ai/en/latest/features/reasoning_outputs.html), [structured_outputs](https://docs.vllm.ai/en/latest/features/structured_outputs.html) and [tool calling](https://docs.vllm.ai/en/latest/features/tool_calling.html), cpu kv cache offloading is also supported to better handle preemptions. We also have some experimental features supported, including:
-
-* **torch.compile**: Can be enabled for fp16/bf16 path.
-* **speculative decoding**: Supports methods `n-gram`, `EAGLE`, `EAGLE3`, `medusa` and `suffix`.
+* **torch.compile**: Can be enabled for the FP16/BF16 path.
+* **speculative decoding**: Supports methods `n-gram`, `EAGLE`, `EAGLE3`, `medusa` and `suffix`. For detailed usage, refer to [document](https://docs.vllm.ai/en/stable/features/speculative_decoding/).
 * **async scheduling**: Can be enabled by `--async-scheduling`. This may help reduce the CPU overheads, leading to better latency and throughput.
 
-## Supported Models
+## 2. Supported Models
 
 Please note that the following table contains only the models verified by Intel. Support on Intel® GPUs through vLLM extends to a wider array of models.
 
@@ -86,8 +43,6 @@ These models primarily accept the LLM.generate API. Chat/Instruct models additio
 
 | Model (company/model name) | BF16/FP16 | Dynamic Online FP8 | MXFP4 |
 |-------------------------------------------| --- | --- | -- |
-| Qwen/Qwen3-Next-80B-A3B-Instruct | |✅︎| |
-| Qwen/Qwen3-Next-80B-A3B-Thinking | |✅︎| |
 | openai/gpt-oss-20b | | |✅︎|
 | openai/gpt-oss-120b | | |✅︎|
 | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |✅︎|✅︎| |
@@ -104,7 +59,6 @@ These models primarily accept the LLM.generate API. Chat/Instruct models additio
 | openbmb/MiniCPM-V-4 |✅︎|✅︎| |
 | deepseek-ai/DeepSeek-V2-Lite |✅︎|✅︎| |
 | meta-llama/Llama-3.1-8B-Instruct |✅︎|✅︎| |
-| baichuan-inc/Baichuan2-13B-Chat |✅︎|✅︎| |
 | THUDM/GLM-4-9B-chat |✅︎|✅︎| |
 | THUDM/GLM-4v-9B-chat |✅︎|✅︎| |
 | THUDM/CodeGeex4-All-9B |✅︎|✅︎| |
@@ -112,7 +66,6 @@ These models primarily accept the LLM.generate API. Chat/Instruct models additio
 | 01-ai/Yi1.5-34B-Chat |✅︎|✅︎| |
 | THUDM/CodeGeex4-All-9B |✅︎|✅︎| |
 | deepseek-ai/DeepSeek-Coder-33B-base |✅︎|✅︎| |
-| baichuan-inc/Baichuan2-13B-Chat |✅︎|✅︎| |
 | meta-llama/Llama-2-13b-chat-hf |✅︎|✅︎| |
 | Qwen/Qwen1.5-14B-Chat |✅︎|✅︎| |
 | Qwen/Qwen1.5-32B-Chat |✅︎|✅︎| |
@@ -133,7 +86,6 @@ The modalities(text, image, video, audio) are supported depending on the model:
 | OpenGVLab/InternVL3_5-14B |✅︎|✅︎|✅︎|✅︎|✅︎| |
 | OpenGVLab/InternVL3_5-38B |✅︎|✅︎|✅︎|✅︎|✅︎| |
 | OpenGVLab/InternVL3_5-30B-A3B |✅︎|✅︎|✅︎|✅︎|✅︎| |
-| THUDM/GLM-4v-9B |✅︎|✅︎|✅︎|✅︎| | |
 | openbmb/MiniCPM-V-4 |✅︎|✅︎|✅︎|✅︎|✅︎| |
 
 ### Pooling Models
@@ -147,13 +99,15 @@ These models primarily support the LLM.embed API. The following table lists thos
 
 ## 3. Limitations
 
-Some of vLLM V1 features may need extra support, including LoRA(Low-Rank Adaptation), pipeline parallel on Ray and MLA(Multi-head Latent Attention).
+Some vLLM features still require additional enablement or refinement and are not included in current release, like LoRA (Low-Rank Adaptation), pipeline parallelism on Ray, and MLA (Multi-head Latent Attention). CPU KV-cache offloading also needs further refinement due to kernel migration.
 
-The following are known issues:
+The following items are also known issues:
 
-* Qwen/Qwen3-30B-A3B FP16/BF16 needs set `PYTORCH_ALLOC_CONF=expandable_segments:True` or `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to leverage expandable blocks in cache allocator.
-* W8A8 quantized models through llm_compressor are not supported yet, like RedHatAI/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic.
-* The gpt-oss models have performance drop due to the UMD changes. The current UMD driver disabled LSC compression for all accesses to global memory because PCIE remote access doesn't support it.
+* Certain workloads may show lower performance than the 0.14.1 release, as this release focuses on establishing a solid functional baseline with vLLM XPU kernels and removing IPEX dependencies. Performance optimizations will continue in future releases.
+* Set the `SYCL_UR_USE_LEVEL_ZERO_V2=0` environment variable to avoid unexpected OOM errors during inference.
+* Set block size to `64` for better accuracy.
+* For `Qwen/Qwen3-30B-A3B` in FP16/BF16, set `PYTORCH_ALLOC_CONF=expandable_segments:True` or `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to enable expandable blocks in the cache allocator.
+* W8A8 quantized models generated with `llm_compressor` are not supported yet, such as `RedHatAI/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic`.
 
 ## 4. How to Get Started
 
@@ -165,13 +119,25 @@ The following are known issues:
 
 ### 4.2. Prepare a Serving Environment
 
-1. Get the released docker image with command `docker pull intel/vllm:0.14.1-xpu`
-2. Instantiate a docker container with command `docker run -t -d --shm-size 10g --net=host --ipc=host --privileged -v /dev/dri/by-path:/dev/dri/by-path --name=vllm-test --device /dev/dri:/dev/dri --entrypoint= intel/vllm:0.14.1-xpu /bin/bash`
-3. Run command `docker exec -it vllm-test bash` in 2 separate terminals to enter container environments for the server and the client respectively.
+1. Pull the released Docker image:
+
+   ```bash
+   docker pull intel/vllm:0.17.0-xpu
+   ```
+
+2. Start a container:
 
-\* Starting from here, all commands are expected to be run inside the docker container, if not explicitly noted.
+   ```bash
+   docker run -t -d --shm-size 10g --net=host --ipc=host --privileged \
+     -v /dev/dri/by-path:/dev/dri/by-path --name=vllm-test \
+     --device /dev/dri:/dev/dri --entrypoint=/bin/bash intel/vllm:0.17.0-xpu
+   ```
 
-In both environments, you may then wish to set a `HUGGING_FACE_HUB_TOKEN` environment variable to make sure necessary files can be downloaded from the HuggingFace website.
+3. Open two terminals and run `docker exec -it vllm-test bash` in both of them. Use one terminal for the server and the other for the client.
+
+From this point on, all commands are expected to be run inside the Docker container unless noted otherwise.
+
+In both environments, you may want to set the `HUGGING_FACE_HUB_TOKEN` environment variable to ensure that required files can be downloaded from Hugging Face.
 
 ```bash
 export HUGGING_FACE_HUB_TOKEN=xxxxxx
@@ -193,57 +159,66 @@ VLLM_WORKER_MULTIPROC_METHOD=spawn vllm serve deepseek-ai/DeepSeek-R1-Distill-Qw
   --no-enable-prefix-caching \
   --trust-remote-code \
   --disable-sliding-window \
-  --disable-log-requests \
   --max-num-batched-tokens=8192 \
   --max-model-len 4096 \
   -tp=4 \
   --quantization fp8
 ```
 
-Note that by default fp8 online quantization will use `e5m2` and you can switch to use `e4m3` by explicitly add env `VLLM_XPU_FP8_DTYPE=e4m3`. If there is not enough memory to hold the whole model before quantization to fp8, you can use `VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1` to offload weights to CPU first.
-
 Expected output:
 
 ```bash
-INFO 02-20 03:20:29 api_server.py:937] Starting vLLM API server on http://0.0.0.0:8000
-INFO 02-20 03:20:29 launcher.py:23] Available routes are:
-INFO 02-20 03:20:29 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET
-INFO 02-20 03:20:29 launcher.py:31] Route: /docs, Methods: HEAD, GET
-INFO 02-20 03:20:29 launcher.py:31] Route: /docs/oauth2-redirect, Methods: HEAD, GET
-INFO 02-20 03:20:29 launcher.py:31] Route: /redoc, Methods: HEAD, GET
-INFO 02-20 03:20:29 launcher.py:31] Route: /health, Methods: GET
-INFO 02-20 03:20:29 launcher.py:31] Route: /ping, Methods: POST, GET
-INFO 02-20 03:20:29 launcher.py:31] Route: /tokenize, Methods: POST
-INFO 02-20 03:20:29 launcher.py:31] Route: /detokenize, Methods: POST
-INFO 02-20 03:20:29 launcher.py:31] Route: /v1/models, Methods: GET
-INFO 02-20 03:20:29 launcher.py:31] Route: /version, Methods: GET
-INFO 02-20 03:20:29 launcher.py:31] Route: /v1/chat/completions, Methods: POST
-INFO 02-20 03:20:29 launcher.py:31] Route: /v1/completions, Methods: POST
-INFO 02-20 03:20:29 launcher.py:31] Route: /v1/embeddings, Methods: POST
-INFO 02-20 03:20:29 launcher.py:31] Route: /pooling, Methods: POST
-INFO 02-20 03:20:29 launcher.py:31] Route: /score, Methods: POST
-INFO 02-20 03:20:29 launcher.py:31] Route: /v1/score, Methods: POST
-INFO 02-20 03:20:29 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
-INFO 02-20 03:20:29 launcher.py:31] Route: /rerank, Methods: POST
-INFO 02-20 03:20:29 launcher.py:31] Route: /v1/rerank, Methods: POST
-INFO 02-20 03:20:29 launcher.py:31] Route: /v2/rerank, Methods: POST
-INFO 02-20 03:20:29 launcher.py:31] Route: /invocations, Methods: POST
+INFO 03-20 03:20:29 api_server.py:937] Starting vLLM API server on http://0.0.0.0:8000
+INFO 03-20 03:20:29 launcher.py:23] Available routes are:
+INFO 03-20 03:20:29 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET
+INFO 03-20 03:20:29 launcher.py:31] Route: /docs, Methods: HEAD, GET
+INFO 03-20 03:20:29 launcher.py:31] Route: /docs/oauth2-redirect, Methods: HEAD, GET
+INFO 03-20 03:20:29 launcher.py:31] Route: /redoc, Methods: HEAD, GET
+INFO 03-20 03:20:29 launcher.py:31] Route: /health, Methods: GET
+INFO 03-20 03:20:29 launcher.py:31] Route: /ping, Methods: POST, GET
+INFO 03-20 03:20:29 launcher.py:31] Route: /tokenize, Methods: POST
+INFO 03-20 03:20:29 launcher.py:31] Route: /detokenize, Methods: POST
+INFO 03-20 03:20:29 launcher.py:31] Route: /v1/models, Methods: GET
+INFO 03-20 03:20:29 launcher.py:31] Route: /version, Methods: GET
+INFO 03-20 03:20:29 launcher.py:31] Route: /v1/chat/completions, Methods: POST
+INFO 03-20 03:20:29 launcher.py:31] Route: /v1/completions, Methods: POST
+INFO 03-20 03:20:29 launcher.py:31] Route: /v1/embeddings, Methods: POST
+INFO 03-20 03:20:29 launcher.py:31] Route: /pooling, Methods: POST
+INFO 03-20 03:20:29 launcher.py:31] Route: /score, Methods: POST
+INFO 03-20 03:20:29 launcher.py:31] Route: /v1/score, Methods: POST
+INFO 03-20 03:20:29 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
+INFO 03-20 03:20:29 launcher.py:31] Route: /rerank, Methods: POST
+INFO 03-20 03:20:29 launcher.py:31] Route: /v1/rerank, Methods: POST
+INFO 03-20 03:20:29 launcher.py:31] Route: /v2/rerank, Methods: POST
+INFO 03-20 03:20:29 launcher.py:31] Route: /invocations, Methods: POST
 INFO: Started server process [1636943]
 INFO: Waiting for application startup.
 INFO: Application startup complete.
 ```
 
-It may take some time. Showing `INFO: Application startup complete.` indicates that the server is ready.
+Startup may take some time. When `INFO: Application startup complete.` appears, the server is ready.
 
 #### 4.3.2. Raise Requests for Benchmarking in the Client Environment
 
-Use the command below to shoot serving requests:
+Use the following command to send benchmark requests:
 
 ```bash
-vllm bench serve --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dataset-name random --random-input-len=1024 --random-output-len=1024 --ignore-eos --num-prompt 1 --max-concurrency 16 --request-rate inf --backend vllm --port=8000 --host 0.0.0.0 --ready-check-timeout-sec 1
+vllm bench serve \
+  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
+  --dataset-name random \
+  --random-input-len=1024 \
+  --random-output-len=1024 \
+  --ignore-eos \
+  --num-prompt 16 \
+  --max-concurrency 16 \
+  --request-rate inf \
+  --backend vllm \
+  --port=8000 \
+  --host 0.0.0.0 \
+  --ready-check-timeout-sec 1
 ```
 
-The command uses model `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`. Both input and output token sizes are set to `1024`. Maximally `16` requests are processed concurrently in the server.
+This command uses the `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B` model. Both the input and output token lengths are set to `1024`, and up to `16` requests are processed concurrently by the server.
 
 Expected output: