diff --git a/README.md b/README.md
index 6f45a32e4..b29eb3d43 100644
--- a/README.md
+++ b/README.md
@@ -26,7 +26,7 @@ KTransformers is a research project focused on efficient inference and fine-tuni
* **Dec 22, 2025**: Support RL-DPO fine-tuning with LLaMA-Factory. ([Tutorial](./doc/en/SFT/DPO_tutorial.md))
* **Dec 5, 2025**: Support Native Kimi-K2-Thinking inference ([Tutorial](./doc/en/kt-kernel/Kimi-K2-Thinking-Native.md))
-* **Nov 6, 2025**: Support Kimi-K2-Thinking inference ([Tutorial](./doc/en/Kimi-K2-Thinking.md)) and fine-tune ([Tutorial](./doc/en/SFT_Installation_Guide_KimiK2.md))
+* **Nov 6, 2025**: Support Kimi-K2-Thinking inference ([Tutorial](./doc/en/Kimi-K2-Thinking.md)) and fine-tuning ([Tutorial](./doc/en/SFT_Installation_Guide_KimiK2.md))
-* **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory Integration. ([Tutorial](./doc/en/KTransformers-Fine-Tuning_User-Guide.md))
+* **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory Integration. ([Tutorial](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md))
* **Oct 27, 2025**: Support Ascend NPU. ([Tutorial](./doc/zh/DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md))
* **Oct 10, 2025**: Integrating into SGLang. ([Roadmap](https://github.com/sgl-project/sglang/issues/11425), [Blog](https://lmsys.org/blog/2025-10-22-KTransformers/))
* **Sept 11, 2025**: Support Qwen3-Next. ([Tutorial](./doc/en/Qwen3-Next.md))
@@ -87,7 +87,7 @@ pip install .

---

-### 🎓 [kt-sft](./kt-sft/) - Fine-Tuning Framework
+### 🎓 [kt-sft](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md) - Fine-Tuning Framework

KTransformers × LLaMA-Factory integration for ultra-large MoE model fine-tuning.

@@ -109,12 +109,15 @@ KTransformers × LLaMA-Factory integration for ultra-large MoE model fine-tuning
**Quick Start:**

```bash
-cd kt-sft
-# Install environment following kt-sft/README.md
-USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml
+cd /path/to/LLaMA-Factory
+pip install -e .
+pip install "ktransformers[sft]"
+USE_KT=1 ACCELERATE_USE_KT=true \
+  accelerate launch --config_file examples/ktransformers/accelerate/fsdp2_kt_bf16.yaml \
+  -m llamafactory.cli train examples/ktransformers/train_lora/deepseek_v3_lora_sft_kt.yaml
```

-👉 **[Full Documentation →](./kt-sft/README.md)**
+👉 **[Full Documentation →](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md)**

---

diff --git a/README_ZH.md b/README_ZH.md
index e60183aee..e1f73c857 100644
--- a/README_ZH.md
+++ b/README_ZH.md
@@ -13,13 +13,13 @@

## 🎯 概览

-KTransformers 是一个专注于通过 CPU-GPU 异构计算实现大语言模型高效推理和微调的研究项目。该项目已发展为**两个核心模块**:[kt-kernel](./kt-kernel/) 和 [kt-sft](./kt-sft/)。
+KTransformers 是一个专注于通过 CPU-GPU 异构计算实现大语言模型高效推理和微调的研究项目。该项目已发展为**两个核心模块**:[kt-kernel](./kt-kernel/) 和 [kt-sft](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md)。

## 🔥 更新

-* **2025 年 12 月 5 日**:支持原生 Kimi-K2-Thinking 推理([教程](./doc/en/Kimi-K2-Thinking-Native.md))
+* **2025 年 12 月 5 日**:支持原生 Kimi-K2-Thinking 推理([教程](./doc/en/kt-kernel/Kimi-K2-Thinking-Native.md))
* **2025 年 11 月 6 日**:支持 Kimi-K2-Thinking 推理([教程](./doc/en/Kimi-K2-Thinking.md))和微调([教程](./doc/en/SFT_Installation_Guide_KimiK2.md))
-* **2025 年 11 月 4 日**:KTransformers 微调 × LLaMA-Factory 集成([教程](./doc/en/KTransformers-Fine-Tuning_User-Guide.md))
+* **2025 年 11 月 4 日**:KTransformers 微调 × LLaMA-Factory 集成([教程](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md))
* **2025 年 10 月 27 日**:支持昇腾 NPU([教程](./doc/zh/DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md))
* **2025 年 10 月 10 日**:集成到 SGLang([路线图](https://github.com/sgl-project/sglang/issues/11425),[博客](https://lmsys.org/blog/2025-10-22-KTransformers/))
* **2025 年 9 月 11 日**:支持 Qwen3-Next([教程](./doc/en/Qwen3-Next.md))
@@ -79,7 +79,7 @@ pip install .

---

-### 🎓 [kt-sft](./kt-sft/) - 微调框架
+### 🎓 [kt-sft](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md) - 微调框架

KTransformers × LLaMA-Factory 集成,用于超大型 MoE 模型微调。

@@ -101,12 +101,15 @@ KTransformers × LLaMA-Factory 集成,用于超大型 MoE 模型微调。
**快速开始:**

```bash
-cd kt-sft
-# 按照 kt-sft/README.md 安装环境
-USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml
+cd /path/to/LLaMA-Factory
+pip install -e .
+pip install "ktransformers[sft]"
+USE_KT=1 ACCELERATE_USE_KT=true \
+  accelerate launch --config_file examples/ktransformers/accelerate/fsdp2_kt_bf16.yaml \
+  -m llamafactory.cli train examples/ktransformers/train_lora/deepseek_v3_lora_sft_kt.yaml
```

-👉 **[完整文档 →](./kt-sft/README.md)**
+👉 **[完整文档 →](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md)**

---
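Both READMEs now assume a pure pip workflow for the fine-tuning Quick Start. Before the first `accelerate launch`, a quick sanity check such as the following can confirm the pieces are in place. This is a minimal sketch: it assumes the `ktransformers` package is importable under that name and that the LLaMA-Factory editable install puts `llamafactory-cli` on `PATH`.

```bash
# Verify the pip-based Quick Start environment before launching training.
python -c "import ktransformers" && echo "ktransformers import OK"
llamafactory-cli version   # entry point from `pip install -e .` in LLaMA-Factory
accelerate env             # shows the accelerate setup the launch command will use
```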
diff --git a/doc/en/Kimi-K2.5.md b/doc/en/Kimi-K2.5.md
index f75017f2f..364055f5d 100644
--- a/doc/en/Kimi-K2.5.md
+++ b/doc/en/Kimi-K2.5.md
@@ -39,7 +39,7 @@ cd kt-kernel && ./install.sh
./install.sh

# Option B: pip install
-pip install sglang-kt
+pip install kt-kernel sglang-kt
```

> Note: You may need to reinstall cudnn: `pip install nvidia-cudnn-cu12==9.16.0.29`

diff --git a/doc/en/MiniMax-M2.5.md b/doc/en/MiniMax-M2.5.md
index fc5c7d14c..e7d7bdaee 100644
--- a/doc/en/MiniMax-M2.5.md
+++ b/doc/en/MiniMax-M2.5.md
@@ -37,7 +37,7 @@ cd kt-kernel && ./install.sh
./install.sh

# Option B: pip install
-pip install sglang-kt
+pip install kt-kernel sglang-kt
```

> Note: You may need to reinstall cudnn: `pip install nvidia-cudnn-cu12==9.16.0.29`

diff --git a/doc/en/Qwen3.5.md b/doc/en/Qwen3.5.md
index 2ddbaf5a0..59b27998e 100644
--- a/doc/en/Qwen3.5.md
+++ b/doc/en/Qwen3.5.md
@@ -43,7 +43,7 @@ cd kt-kernel && ./install.sh
./install.sh

# Option B: pip install
-pip install sglang-kt
+pip install kt-kernel sglang-kt
```

> Note: You may need to reinstall cudnn: `pip install nvidia-cudnn-cu12==9.16.0.29`
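The same two-package install (`pip install kt-kernel sglang-kt`) now appears in all three model guides. After Option B completes, a check like the following confirms both distributions resolved; it is a sketch that relies only on pip metadata and on sglang exposing the usual `__version__` attribute.

```bash
# Confirm both wheels from Option B are installed.
pip show kt-kernel sglang-kt | grep -E "^(Name|Version):"
python -c "import sglang; print('sglang', sglang.__version__)"
```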
diff --git a/doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md b/doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md
index 298e2aa30..7913021e4 100644
--- a/doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md
+++ b/doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md
@@ -95,26 +95,26 @@ This section shows how to install and use **LLaMA-Factory + KTransformers** for
### Environment Setup

-According to the following example, install both the **KTransformers** and **LLaMA-Factory** environments simultaneously.
+Following the example below, install the **KTransformers** and **LLaMA-Factory** environments together.
- This time, to simplify the installation process of KTransformers, we have specially packaged a wheel file to avoid local compilation.
+ To simplify the installation process of KTransformers, use the PyPI packages to avoid local compilation.
The detailed installation steps are as follows:
- (Note: Make sure your local **Python version**, **Torch version**, **CUDA version**, and the **KTransformers wheel filename** correspond correctly.)
+ (Note: Make sure your local **Python version**, **Torch version**, and **CUDA version** are compatible with the installed packages.)

```shell
# 1. Create a conda environment
-conda create -n Kllama python=3.12 # choose from : [3.10, 3.11, 3.12, 3.13]
+conda create -n Kllama python=3.12 # choose from: [3.11, 3.12, 3.13]
conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime

# 2. Install the LLaMA-Factory environment
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
-pip install -e ".[torch,metrics]" --no-build-isolation
+pip install -e .

-# 3. Install the KTransformers wheel that matches your Torch and Python versions, from https://github.com/kvcache-ai/ktransformers/releases/tag/v0.4.1 (Note: The CUDA version can differ from that in the wheel filename.)
-pip install ktransformers-0.4.1+cu128torch27fancy-cp312-cp312-linux_x86_64.whl
+# 3. Install the KTransformers SFT packages
+pip install "ktransformers[sft]"

-# 4. Install flash-attention, download the corresponding file based on your Python and Torch versions from: https://github.com/Dao-AILab/flash-attention/releases
-pip install flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
+# 4. Install flash-attention (prebuilt wheels for specific Python/Torch versions: https://github.com/Dao-AILab/flash-attention/releases)
+pip install flash-attn --no-build-isolation
-# abi=True/False can find from below
+# You can check whether your build needs abi=True or abi=False as follows:
# import torch
# print(torch._C._GLIBCXX_USE_CXX11_ABI)
@@ -128,7 +128,7 @@ pip install custom_flashinfer/

### Core Feature 1: Use KTransformers backend to fine-tune ultra-large MoE models

-Run the command: `USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml`.
+Run the command: `USE_KT=1 ACCELERATE_USE_KT=true accelerate launch --config_file examples/ktransformers/accelerate/fsdp2_kt_bf16.yaml -m llamafactory.cli train examples/ktransformers/train_lora/deepseek_v3_lora_sft_kt.yaml`.

Note: You **must** provide a **BF16** model. DeepSeek-V3-671B is released in FP8 by default; convert with [DeepSeek-V3/inference/fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py).

@@ -213,7 +213,7 @@ Outputs go to `output_dir` in safetensors format plus adapter metadata for later

### Core Feature 2: Chat with the fine-tuned model (base + LoRA adapter)

-Run the command: `llamafactory-cli chat examples/inference/deepseek3_lora_sft_kt.yaml`.
+Run the command: `llamafactory-cli chat examples/inference/qwen3_lora_sft.yaml`.

Use the safetensors adapter trained with KT for inference.

@@ -238,7 +238,7 @@ During loading, LLaMA-Factory maps layer names to KT’s naming. You’ll see lo

### Core Feature 3: Batch inference + metrics (base + LoRA adapter)

-Run the command: `API_PORT=8000 llamafactory-cli api examples/inference/deepseek3_lora_sft_kt.yaml`.
+Run the command: `API_PORT=8000 llamafactory-cli api examples/inference/qwen3_lora_sft.yaml`.

-Invoke the KT fine-tuned adapter to provide the API; the usage logic of other APIs is consistent with the native LLaMA-Factory approach.
+Serve the KT fine-tuned adapter through this API; all other API usage is the same as in native LLaMA-Factory.

```yaml
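For Core Feature 3 above, the server started by `API_PORT=8000 llamafactory-cli api examples/inference/qwen3_lora_sft.yaml` can be exercised with a plain OpenAI-style request. A minimal sketch, assuming LLaMA-Factory's OpenAI-compatible API surface; the host, port, and `model` value are placeholders.

```bash
# Smoke-test the OpenAI-compatible endpoint served by `llamafactory-cli api`.
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3_lora_sft",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```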
diff --git a/doc/en/kt-kernel/deepseek-v3.2-sglang-tutorial.md b/doc/en/kt-kernel/deepseek-v3.2-sglang-tutorial.md
index 9cffe0d79..4d6851e8b 100644
--- a/doc/en/kt-kernel/deepseek-v3.2-sglang-tutorial.md
+++ b/doc/en/kt-kernel/deepseek-v3.2-sglang-tutorial.md
@@ -30,7 +30,7 @@ This tutorial demonstrates how to run DeepSeek V3.2 model inference using SGLang
Before starting, ensure you have:

1. **KT-Kernel installed** - Follow the [installation guide](./kt-kernel_intro.md#installation)
-2. **SGLang installed** - Install the kvcache-ai fork: `pip install sglang-kt` or run `./install.sh` from the ktransformers root
+2. **SGLang installed** - Install the kvcache-ai fork: `pip install kt-kernel sglang-kt` or run `./install.sh` from the ktransformers root
3. **CUDA toolkit** - Compatible with your GPU (CUDA 11.8+ recommended)
4. **Hugging Face CLI** - For downloading models:
   ```bash
diff --git a/kt-kernel/README.md b/kt-kernel/README.md
index a04d9b601..24e3cb21c 100644
--- a/kt-kernel/README.md
+++ b/kt-kernel/README.md
@@ -75,7 +75,7 @@ pip install kt-kernel
For NVIDIA GPU-accelerated inference:

```bash
-pip install kt-kernel-cuda
+pip install kt-kernel
```

**Features:**

@@ -269,7 +269,7 @@ Install the kvcache-ai fork of SGLang (required for kt-kernel support):
./install.sh

# Option B: pip install
-pip install sglang-kt
+pip install kt-kernel sglang-kt

# Option C: From source (editable mode)
git clone --recursive https://github.com/kvcache-ai/ktransformers.git
@@ -512,7 +512,7 @@ python -m sglang.launch_server \
- **`kt-method`**: Choose based on your CPU and weight format:
-  - `AMXINT4`: Best performance on AMX CPUs with INT4 quantized weights (May cause huge accuracy drop for some models, e.g., Qwen3-30B-A3B)
+  - `AMXINT4`: Best performance on AMX CPUs with INT4 quantized weights (may cause a large accuracy drop for some models, e.g., Qwen3-30B-A3B)
  - `AMXINT8`: Higher accuracy with INT8 quantized weights on AMX CPUs
-  - `RAWINT4`: Native INT4 weights shared by CPU and GPU (currently supports Kimi-K2-Thinking model). See [Kimi-K2-Thinking Native Tutorial](../doc/en/Kimi-K2-Thinking-Native.md) for details.
+  - `RAWINT4`: Native INT4 weights shared by CPU and GPU (currently supports Kimi-K2-Thinking model). See [Kimi-K2-Thinking Native Tutorial](../doc/en/kt-kernel/Kimi-K2-Thinking-Native.md) for details.
  - `FP8`, `FP8_PERCHANNEL`: FP8 weights shared by CPU and GPU
  - `BF16`: BF16 weights shared by CPU and GPU
  - `LLAMAFILE`: GGUF-based backend
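For the `kt-method` options documented above, an illustrative launch line follows. This is a sketch only: the model path is a placeholder, and the exact flag spelling should be checked against the kvcache-ai SGLang fork's `--help` output before use.

```bash
# Example: serve a model with native INT4 weights shared by CPU and GPU.
python -m sglang.launch_server \
  --model-path /models/Kimi-K2-Thinking \
  --kt-method RAWINT4
```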