README.md (15 changes: 9 additions & 6 deletions)
@@ -26,7 +26,7 @@ KTransformers is a research project focused on efficient inference and fine-tuning
* **Dec 22, 2025**: Support RL-DPO fine-tuning with LLaMA-Factory. ([Tutorial](./doc/en/SFT/DPO_tutorial.md))
* **Dec 5, 2025**: Support Native Kimi-K2-Thinking inference ([Tutorial](./doc/en/kt-kernel/Kimi-K2-Thinking-Native.md))
* **Nov 6, 2025**: Support Kimi-K2-Thinking inference ([Tutorial](./doc/en/Kimi-K2-Thinking.md)) and fine-tune ([Tutorial](./doc/en/SFT_Installation_Guide_KimiK2.md))
-* **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory Integration. ([Tutorial](./doc/en/KTransformers-Fine-Tuning_User-Guide.md))
+* **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory Integration. ([Tutorial](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md))
* **Oct 27, 2025**: Support Ascend NPU. ([Tutorial](./doc/zh/DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md))
* **Oct 10, 2025**: Integrating into SGLang. ([Roadmap](https://github.com/sgl-project/sglang/issues/11425), [Blog](https://lmsys.org/blog/2025-10-22-KTransformers/))
* **Sept 11, 2025**: Support Qwen3-Next. ([Tutorial](./doc/en/Qwen3-Next.md))
@@ -87,7 +87,7 @@ pip install .

---

-### 🎓 [kt-sft](./kt-sft/) - Fine-Tuning Framework
+### 🎓 [kt-sft](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md) - Fine-Tuning Framework

KTransformers × LLaMA-Factory integration for ultra-large MoE model fine-tuning.

@@ -109,12 +109,12 @@ KTransformers × LLaMA-Factory integration for ultra-large MoE model fine-tuning

**Quick Start:**
```bash
-cd kt-sft
-# Install environment following kt-sft/README.md
-USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml
+cd /path/to/LLaMA-Factory
+pip install -e .
+pip install "ktransformers[sft]"
+USE_KT=1 ACCELERATE_USE_KT=true \
+accelerate launch --config_file examples/ktransformers/accelerate/fsdp2_kt_bf16.yaml \
+-m llamafactory.cli train examples/ktransformers/train_lora/deepseek_v3_lora_sft_kt.yaml
```

> **Reviewer comment (medium)** on `pip install -e .`: Including the `[torch,metrics]` extras when installing LLaMA-Factory is recommended to ensure that all necessary dependencies for training and evaluation are installed, especially since the quick start does not explicitly install them earlier. Suggested change: `pip install -e ".[torch,metrics]"`

-👉 **[Full Documentation →](./kt-sft/README.md)**
+👉 **[Full Documentation →](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md)**

---

README_ZH.md (19 changes: 11 additions & 8 deletions)
@@ -13,13 +13,13 @@

## 🎯 Overview

-KTransformers is a research project focused on efficient inference and fine-tuning of large language models through CPU-GPU heterogeneous computing. The project has grown into **two core modules**: [kt-kernel](./kt-kernel/) and [kt-sft](./kt-sft/).
+KTransformers is a research project focused on efficient inference and fine-tuning of large language models through CPU-GPU heterogeneous computing. The project has grown into **two core modules**: [kt-kernel](./kt-kernel/) and [kt-sft](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md).

## 🔥 Updates

-* **Dec 5, 2025**: Support native Kimi-K2-Thinking inference ([Tutorial](./doc/en/Kimi-K2-Thinking-Native.md))
+* **Dec 5, 2025**: Support native Kimi-K2-Thinking inference ([Tutorial](./doc/en/kt-kernel/Kimi-K2-Thinking-Native.md))
* **Nov 6, 2025**: Support Kimi-K2-Thinking inference ([Tutorial](./doc/en/Kimi-K2-Thinking.md)) and fine-tuning ([Tutorial](./doc/en/SFT_Installation_Guide_KimiK2.md))
-* **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory integration ([Tutorial](./doc/en/KTransformers-Fine-Tuning_User-Guide.md))
+* **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory integration ([Tutorial](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md))
* **Oct 27, 2025**: Support Ascend NPU ([Tutorial](./doc/zh/DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md))
* **Oct 10, 2025**: Integration into SGLang ([Roadmap](https://github.com/sgl-project/sglang/issues/11425), [Blog](https://lmsys.org/blog/2025-10-22-KTransformers/))
* **Sept 11, 2025**: Support Qwen3-Next ([Tutorial](./doc/en/Qwen3-Next.md))
@@ -79,7 +79,7 @@ pip install .

---

-### 🎓 [kt-sft](./kt-sft/) - Fine-Tuning Framework
+### 🎓 [kt-sft](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md) - Fine-Tuning Framework

KTransformers × LLaMA-Factory integration for fine-tuning ultra-large MoE models.

@@ -101,12 +101,12 @@ KTransformers × LLaMA-Factory integration for fine-tuning ultra-large MoE models.

**Quick Start:**
```bash
-cd kt-sft
-# Install the environment following kt-sft/README.md
-USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml
+cd /path/to/LLaMA-Factory
+pip install -e .
+pip install "ktransformers[sft]"
+USE_KT=1 ACCELERATE_USE_KT=true \
+accelerate launch --config_file examples/ktransformers/accelerate/fsdp2_kt_bf16.yaml \
+-m llamafactory.cli train examples/ktransformers/train_lora/deepseek_v3_lora_sft_kt.yaml
```

> **Reviewer comment (medium)** on `pip install -e .`: It is recommended to include the `[torch,metrics]` extras when installing LLaMA-Factory to ensure that all dependencies required for training and evaluation are installed. Suggested change: `pip install -e ".[torch,metrics]"`

-👉 **[Full Documentation →](./kt-sft/README.md)**
+👉 **[Full Documentation →](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md)**

---

doc/en/Kimi-K2.5.md (2 changes: 1 addition & 1 deletion)
@@ -39,7 +39,7 @@ cd kt-kernel && ./install.sh
./install.sh

# Option B: pip install
-pip install sglang-kt
+pip install kt-kernel sglang-kt
```

> Note: You may need to reinstall cudnn: `pip install nvidia-cudnn-cu12==9.16.0.29`
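(Editor's aside: if you suspect a cuDNN mismatch, a quick way to confirm which cuDNN build torch actually loads — assuming a working torch install — is:)

```bash
# Prints the cuDNN version torch is linked against
python -c "import torch; print(torch.backends.cudnn.version())"
```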
doc/en/MiniMax-M2.5.md (2 changes: 1 addition & 1 deletion)
@@ -37,7 +37,7 @@ cd kt-kernel && ./install.sh
./install.sh

# Option B: pip install
-pip install sglang-kt
+pip install kt-kernel sglang-kt
```

> Note: You may need to reinstall cudnn: `pip install nvidia-cudnn-cu12==9.16.0.29`
doc/en/Qwen3.5.md (2 changes: 1 addition & 1 deletion)
@@ -43,7 +43,7 @@ cd kt-kernel && ./install.sh
./install.sh

# Option B: pip install
-pip install sglang-kt
+pip install kt-kernel sglang-kt
```

> Note: You may need to reinstall cudnn: `pip install nvidia-cudnn-cu12==9.16.0.29`
doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md (20 changes: 10 additions & 10 deletions)
@@ -95,26 +95,26 @@ This section shows how to install and use **LLaMA-Factory + KTransformers** for
### Environment Setup

Following the example below, install the **KTransformers** and **LLaMA-Factory** environments together.
-This time, to simplify the installation process of KTransformers, we have specially packaged a wheel file to avoid local compilation.
+This time, to simplify the installation process of KTransformers, use the PyPI packages to avoid local compilation.
The detailed installation steps are as follows:
-(Note: Make sure your local **Python version**, **Torch version**, **CUDA version**, and the **KTransformers wheel filename** correspond correctly.)
+(Note: Make sure your local **Python version**, **Torch version**, and **CUDA version** are compatible with the installed packages.)

```shell
# 1. Create a conda environment
-conda create -n Kllama python=3.12 # choose from : [3.10, 3.11, 3.12, 3.13]
+conda create -n Kllama python=3.12 # choose from : [3.11, 3.12, 3.13]

# Reviewer comment (medium): The suggested Python versions [3.11, 3.12, 3.13] include 3.13,
# but kt-kernel/README.md (lines 63 and 68) indicates that pre-built wheels are currently only
# provided for Python 3.10, 3.11, and 3.12. Suggesting 3.13 may lead to a source build, which
# contradicts the goal of avoiding local compilation stated in line 98. Additionally, Python 3.10
# is missing from the list despite being supported. Suggested change:
#   conda create -n Kllama python=3.12 # choose from : [3.10, 3.11, 3.12]

conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime

# 2. Install the LLaMA-Factory environment
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
pip install -e .
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Removing the [torch,metrics] extras from the LLaMA-Factory installation may cause the training and evaluation steps to fail if these dependencies are not already present in the environment. It is safer to include them to ensure the tutorial works as expected.

Suggested change
pip install -e .
pip install -e ".[torch,metrics]"


-# 3. Install the KTransformers wheel that matches your Torch and Python versions, from https://github.com/kvcache-ai/ktransformers/releases/tag/v0.4.1 (Note: The CUDA version can differ from that in the wheel filename.)
-pip install ktransformers-0.4.1+cu128torch27fancy-cp312-cp312-linux_x86_64.whl
+# 3. Install the KTransformers SFT packages
+pip install "ktransformers[sft]"

# 4. Install flash-attention, download the corresponding file based on your Python and Torch versions from: https://github.com/Dao-AILab/flash-attention/releases
-pip install flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
+pip install flash-attn --no-build-isolation

# Reviewer comment (medium): The command has been updated to use PyPI, but the preceding
# comment (line 116) still instructs the user to download a wheel file from GitHub. Please
# update the comment to be consistent with the new installation method.

# abi=True/False can find from below
# import torch
# print(torch._C._GLIBCXX_USE_CXX11_ABI)
@@ -128,7 +128,7 @@ pip install custom_flashinfer/

### Core Feature 1: Use KTransformers backend to fine-tune ultra-large MoE models

-Run the command: `USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml`.
+Run the command: `USE_KT=1 ACCELERATE_USE_KT=true accelerate launch --config_file examples/ktransformers/accelerate/fsdp2_kt_bf16.yaml -m llamafactory.cli train examples/ktransformers/train_lora/deepseek_v3_lora_sft_kt.yaml`.
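(Editor's aside: for orientation, a minimal sketch of what an FSDP2 `accelerate` config like `fsdp2_kt_bf16.yaml` typically contains — the actual file ships with the LLaMA-Factory examples and its keys may differ:)

```yaml
# Hypothetical sketch of an accelerate FSDP2 config; not the shipped
# examples/ktransformers/accelerate/fsdp2_kt_bf16.yaml, whose contents may differ.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_machines: 1
num_processes: 8          # one process per GPU
fsdp_config:
  fsdp_version: 2         # select FSDP2
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_reshard_after_forward: true
```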

Note: You **must** provide a **BF16** model. DeepSeek-V3-671B is released in FP8 by default; convert with [DeepSeek-V3/inference/fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py).
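(Editor's aside: a conversion invocation might look like the following; the flag names are taken from the DeepSeek-V3 repository's script and the paths are placeholders — verify against the script before running:)

```bash
# Convert DeepSeek-V3 FP8 checkpoints to BF16 (paths are placeholders)
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference
python fp8_cast_bf16.py \
  --input-fp8-hf-path /models/DeepSeek-V3-671B \
  --output-bf16-hf-path /models/DeepSeek-V3-671B-bf16
```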

@@ -213,7 +213,7 @@ Outputs go to `output_dir` in safetensors format plus adapter metadata for later

### Core Feature 2: Chat with the fine-tuned model (base + LoRA adapter)

-Run the command: `llamafactory-cli chat examples/inference/deepseek3_lora_sft_kt.yaml`.
+Run the command: `llamafactory-cli chat examples/inference/qwen3_lora_sft.yaml`.

> **Reviewer comment (medium):** The command uses qwen3_lora_sft.yaml, but the tutorial is focused on DeepSeek-V3 (as seen in the training step at line 131 and the YAML example at line 221). This inconsistency will confuse users who just trained a DeepSeek-V3 model in the previous step. Please update the command to use the appropriate DeepSeek-V3 inference configuration. Suggested change: `llamafactory-cli chat examples/inference/deepseek_v3_lora_sft.yaml`.


Use the safetensors adapter trained with KT for inference.
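(Editor's aside: as an illustration, a LoRA chat config of this kind typically carries fields like the following; the names and paths here are placeholders, not the shipped file:)

```yaml
# Hypothetical inference config for the adapter trained above
model_name_or_path: /models/DeepSeek-V3-671B-bf16   # BF16 base model
adapter_name_or_path: /path/to/output_dir           # LoRA adapter from the training step
template: deepseek3                                 # template name is an assumption; match your model
finetuning_type: lora
trust_remote_code: true
```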

@@ -238,7 +238,7 @@ During loading, LLaMA-Factory maps layer names to KT’s naming. You’ll see lo

### Core Feature 3: Batch inference + metrics (base + LoRA adapter)

-Run the command: `API_PORT=8000 llamafactory-cli api examples/inference/deepseek3_lora_sft_kt.yaml`.
+Run the command: `API_PORT=8000 llamafactory-cli api examples/inference/qwen3_lora_sft.yaml`.

> **Reviewer comment (medium):** This API command uses qwen3_lora_sft.yaml, which is inconsistent with the DeepSeek-V3 model used throughout the rest of the guide. Suggested change: `API_PORT=8000 llamafactory-cli api examples/inference/deepseek_v3_lora_sft.yaml`.

Serve the KT fine-tuned adapter through this API; all other endpoints behave the same as in native LLaMA-Factory.
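(Editor's aside: once the server is up it speaks an OpenAI-style chat API, so a minimal smoke test could look like this; the model name and port are placeholders:)

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-v3-lora",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```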

doc/en/kt-kernel/deepseek-v3.2-sglang-tutorial.md (2 changes: 1 addition & 1 deletion)
@@ -30,7 +30,7 @@ This tutorial demonstrates how to run DeepSeek V3.2 model inference using SGLang
Before starting, ensure you have:

1. **KT-Kernel installed** - Follow the [installation guide](./kt-kernel_intro.md#installation)
-2. **SGLang installed** - Install the kvcache-ai fork: `pip install sglang-kt` or run `./install.sh` from the ktransformers root
+2. **SGLang installed** - Install the kvcache-ai fork: `pip install kt-kernel sglang-kt` or run `./install.sh` from the ktransformers root
3. **CUDA toolkit** - Compatible with your GPU (CUDA 11.8+ recommended)
4. **Hugging Face CLI** - For downloading models:
kt-kernel/README.md (6 changes: 3 additions & 3 deletions)
@@ -75,7 +75,7 @@ pip install kt-kernel
For NVIDIA GPU-accelerated inference:

```bash
-pip install kt-kernel-cuda
+pip install kt-kernel
```

**Features:**
@@ -269,7 +269,7 @@ Install the kvcache-ai fork of SGLang (required for kt-kernel support):
./install.sh

# Option B: pip install
-pip install sglang-kt
+pip install kt-kernel sglang-kt

# Option C: From source (editable mode)
git clone --recursive https://github.com/kvcache-ai/ktransformers.git
@@ -512,7 +512,7 @@ python -m sglang.launch_server \
- **`kt-method`**: Choose based on your CPU and weight format:
- `AMXINT4`: Best performance on AMX CPUs with INT4 quantized weights (May cause huge accuracy drop for some models, e.g., Qwen3-30B-A3B)
- `AMXINT8`: Higher accuracy with INT8 quantized weights on AMX CPUs
-- `RAWINT4`: Native INT4 weights shared by CPU and GPU (currently supports Kimi-K2-Thinking model). See [Kimi-K2-Thinking Native Tutorial](../doc/en/Kimi-K2-Thinking-Native.md) for details.
+- `RAWINT4`: Native INT4 weights shared by CPU and GPU (currently supports Kimi-K2-Thinking model). See [Kimi-K2-Thinking Native Tutorial](../doc/en/kt-kernel/Kimi-K2-Thinking-Native.md) for details.
- `FP8`, `FP8_PERCHANNEL`: FP8 weights shared by CPU and GPU
- `BF16`: BF16 weights shared by CPU and GPU
- `LLAMAFILE`: GGUF-based backend