README.md (15 changes: 9 additions & 6 deletions)
@@ -26,7 +26,7 @@ KTransformers is a research project focused on efficient inference and fine-tuning
* **Dec 22, 2025**: Support RL-DPO fine-tuning with LLaMA-Factory. ([Tutorial](./doc/en/SFT/DPO_tutorial.md))
* **Dec 5, 2025**: Support Native Kimi-K2-Thinking inference ([Tutorial](./doc/en/kt-kernel/Kimi-K2-Thinking-Native.md))
* **Nov 6, 2025**: Support Kimi-K2-Thinking inference ([Tutorial](./doc/en/Kimi-K2-Thinking.md)) and fine-tune ([Tutorial](./doc/en/SFT_Installation_Guide_KimiK2.md))
-* **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory Integration. ([Tutorial](./doc/en/KTransformers-Fine-Tuning_User-Guide.md))
+* **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory Integration. ([Tutorial](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md))
* **Oct 27, 2025**: Support Ascend NPU. ([Tutorial](./doc/zh/DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md))
* **Oct 10, 2025**: Integrating into SGLang. ([Roadmap](https://github.com/sgl-project/sglang/issues/11425), [Blog](https://lmsys.org/blog/2025-10-22-KTransformers/))
* **Sept 11, 2025**: Support Qwen3-Next. ([Tutorial](./doc/en/Qwen3-Next.md))
@@ -87,7 +87,7 @@ pip install .

---

-### 🎓 [kt-sft](./kt-sft/) - Fine-Tuning Framework
+### 🎓 [kt-sft](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md) - Fine-Tuning Framework

KTransformers × LLaMA-Factory integration for ultra-large MoE model fine-tuning.

@@ -109,12 +109,12 @@ KTransformers × LLaMA-Factory integration for ultra-large MoE model fine-tuning

**Quick Start:**
```bash
-cd kt-sft
-# Install environment following kt-sft/README.md
-USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml
+cd /path/to/LLaMA-Factory
+pip install -e .
+pip install "ktransformers[sft]"
+USE_KT=1 ACCELERATE_USE_KT=true \
+accelerate launch --config_file examples/ktransformers/accelerate/fsdp2_kt_bf16.yaml \
+-m llamafactory.cli train examples/ktransformers/train_lora/deepseek_v3_lora_sft_kt.yaml
```

> **Reviewer comment (medium)** on `pip install -e .`: Including the `[torch,metrics]` extras when installing LLaMA-Factory is recommended to ensure that all necessary dependencies for training and evaluation are installed, especially since the quick start does not explicitly install them earlier. Suggested change: `pip install -e ".[torch,metrics]"`

-👉 **[Full Documentation →](./kt-sft/README.md)**
+👉 **[Full Documentation →](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md)**

---

README_ZH.md (19 changes: 11 additions & 8 deletions)
@@ -13,13 +13,13 @@

## 🎯 Overview

-KTransformers is a research project focused on efficient inference and fine-tuning of large language models through CPU-GPU heterogeneous computing. The project has grown into **two core modules**: [kt-kernel](./kt-kernel/) and [kt-sft](./kt-sft/).
+KTransformers is a research project focused on efficient inference and fine-tuning of large language models through CPU-GPU heterogeneous computing. The project has grown into **two core modules**: [kt-kernel](./kt-kernel/) and [kt-sft](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md).

## 🔥 Updates

-* **Dec 5, 2025**: Support native Kimi-K2-Thinking inference ([Tutorial](./doc/en/Kimi-K2-Thinking-Native.md))
+* **Dec 5, 2025**: Support native Kimi-K2-Thinking inference ([Tutorial](./doc/en/kt-kernel/Kimi-K2-Thinking-Native.md))
* **Nov 6, 2025**: Support Kimi-K2-Thinking inference ([Tutorial](./doc/en/Kimi-K2-Thinking.md)) and fine-tuning ([Tutorial](./doc/en/SFT_Installation_Guide_KimiK2.md))
-* **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory integration ([Tutorial](./doc/en/KTransformers-Fine-Tuning_User-Guide.md))
+* **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory integration ([Tutorial](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md))
* **Oct 27, 2025**: Support Ascend NPU ([Tutorial](./doc/zh/DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md))
* **Oct 10, 2025**: Integration into SGLang ([Roadmap](https://github.com/sgl-project/sglang/issues/11425), [Blog](https://lmsys.org/blog/2025-10-22-KTransformers/))
* **Sept 11, 2025**: Support Qwen3-Next ([Tutorial](./doc/en/Qwen3-Next.md))
@@ -79,7 +79,7 @@ pip install .

---

-### 🎓 [kt-sft](./kt-sft/) - Fine-Tuning Framework
+### 🎓 [kt-sft](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md) - Fine-Tuning Framework

KTransformers × LLaMA-Factory integration for fine-tuning ultra-large MoE models.

@@ -101,12 +101,12 @@ KTransformers × LLaMA-Factory integration for fine-tuning ultra-large MoE models.

**Quick Start:**
```bash
-cd kt-sft
-# Install the environment following kt-sft/README.md
-USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml
+cd /path/to/LLaMA-Factory
+pip install -e .
+pip install "ktransformers[sft]"
+USE_KT=1 ACCELERATE_USE_KT=true \
+accelerate launch --config_file examples/ktransformers/accelerate/fsdp2_kt_bf16.yaml \
+-m llamafactory.cli train examples/ktransformers/train_lora/deepseek_v3_lora_sft_kt.yaml
```

> **Reviewer comment (medium)** on `pip install -e .`: It is recommended to include the `[torch,metrics]` extras when installing LLaMA-Factory to ensure that all dependencies required for training and evaluation are installed. Suggested change: `pip install -e ".[torch,metrics]"`

-👉 **[Full Documentation →](./kt-sft/README.md)**
+👉 **[Full Documentation →](./doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md)**

---

doc/en/Kimi-K2.5.md (2 changes: 1 addition & 1 deletion)
@@ -39,7 +39,7 @@ cd kt-kernel && ./install.sh
./install.sh

# Option B: pip install
-pip install sglang-kt
+pip install kt-kernel sglang-kt
```

> Note: You may need to reinstall cudnn: `pip install nvidia-cudnn-cu12==9.16.0.29`
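(Editor's aside: if you suspect a cuDNN mismatch, a quick way to confirm which cuDNN build torch actually loads — assuming a working torch install — is:)

```bash
# Prints the cuDNN version torch is linked against
python -c "import torch; print(torch.backends.cudnn.version())"
```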
doc/en/MiniMax-M2.5.md (2 changes: 1 addition & 1 deletion)
@@ -37,7 +37,7 @@ cd kt-kernel && ./install.sh
./install.sh

# Option B: pip install
-pip install sglang-kt
+pip install kt-kernel sglang-kt
```

> Note: You may need to reinstall cudnn: `pip install nvidia-cudnn-cu12==9.16.0.29`
doc/en/Qwen3.5.md (2 changes: 1 addition & 1 deletion)
@@ -43,7 +43,7 @@ cd kt-kernel && ./install.sh
./install.sh

# Option B: pip install
-pip install sglang-kt
+pip install kt-kernel sglang-kt
```

> Note: You may need to reinstall cudnn: `pip install nvidia-cudnn-cu12==9.16.0.29`
doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md (20 changes: 10 additions & 10 deletions)
@@ -95,26 +95,26 @@ This section shows how to install and use **LLaMA-Factory + KTransformers** for
### Environment Setup

Following the example below, install the **KTransformers** and **LLaMA-Factory** environments together.
-This time, to simplify the installation process of KTransformers, we have specially packaged a wheel file to avoid local compilation.
+This time, to simplify the installation process of KTransformers, use the PyPI packages to avoid local compilation.
The detailed installation steps are as follows:
-(Note: Make sure your local **Python version**, **Torch version**, **CUDA version**, and the **KTransformers wheel filename** correspond correctly.)
+(Note: Make sure your local **Python version**, **Torch version**, and **CUDA version** are compatible with the installed packages.)

```shell
# 1. Create a conda environment
-conda create -n Kllama python=3.12 # choose from : [3.10, 3.11, 3.12, 3.13]
+conda create -n Kllama python=3.12 # choose from : [3.11, 3.12, 3.13]

# Reviewer comment (medium): The suggested Python versions [3.11, 3.12, 3.13] include 3.13,
# but kt-kernel/README.md (lines 63 and 68) indicates that pre-built wheels are currently only
# provided for Python 3.10, 3.11, and 3.12. Suggesting 3.13 may lead to a source build, which
# contradicts the goal of avoiding local compilation stated in line 98. Additionally, Python 3.10
# is missing from the list despite being supported. Suggested change:
#   conda create -n Kllama python=3.12 # choose from : [3.10, 3.11, 3.12]

conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime

# 2. Install the LLaMA-Factory environment
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
pip install -e .
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Removing the [torch,metrics] extras from the LLaMA-Factory installation may cause the training and evaluation steps to fail if these dependencies are not already present in the environment. It is safer to include them to ensure the tutorial works as expected.

Suggested change
pip install -e .
pip install -e ".[torch,metrics]"


-# 3. Install the KTransformers wheel that matches your Torch and Python versions, from https://github.com/kvcache-ai/ktransformers/releases/tag/v0.4.1 (Note: The CUDA version can differ from that in the wheel filename.)
-pip install ktransformers-0.4.1+cu128torch27fancy-cp312-cp312-linux_x86_64.whl
+# 3. Install the KTransformers SFT packages
+pip install "ktransformers[sft]"

# 4. Install flash-attention, download the corresponding file based on your Python and Torch versions from: https://github.com/Dao-AILab/flash-attention/releases
-pip install flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
+pip install flash-attn --no-build-isolation

# Reviewer comment (medium): The command has been updated to use PyPI, but the preceding
# comment (line 116) still instructs the user to download a wheel file from GitHub. Please
# update the comment to be consistent with the new installation method.

# abi=True/False can find from below
# import torch
# print(torch._C._GLIBCXX_USE_CXX11_ABI)
@@ -128,7 +128,7 @@ pip install custom_flashinfer/

### Core Feature 1: Use KTransformers backend to fine-tune ultra-large MoE models

-Run the command: `USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml`.
+Run the command: `USE_KT=1 ACCELERATE_USE_KT=true accelerate launch --config_file examples/ktransformers/accelerate/fsdp2_kt_bf16.yaml -m llamafactory.cli train examples/ktransformers/train_lora/deepseek_v3_lora_sft_kt.yaml`.
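(Editor's aside: for orientation, a minimal sketch of what an FSDP2 `accelerate` config like `fsdp2_kt_bf16.yaml` typically contains — the actual file ships with the LLaMA-Factory examples and its keys may differ:)

```yaml
# Hypothetical sketch of an accelerate FSDP2 config; not the shipped
# examples/ktransformers/accelerate/fsdp2_kt_bf16.yaml, whose contents may differ.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_machines: 1
num_processes: 8          # one process per GPU
fsdp_config:
  fsdp_version: 2         # select FSDP2
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_reshard_after_forward: true
```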

Note: You **must** provide a **BF16** model. DeepSeek-V3-671B is released in FP8 by default; convert with [DeepSeek-V3/inference/fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py).
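(Editor's aside: a conversion invocation might look like the following; the flag names are taken from the DeepSeek-V3 repository's script and the paths are placeholders — verify against the script before running:)

```bash
# Convert DeepSeek-V3 FP8 checkpoints to BF16 (paths are placeholders)
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference
python fp8_cast_bf16.py \
  --input-fp8-hf-path /models/DeepSeek-V3-671B \
  --output-bf16-hf-path /models/DeepSeek-V3-671B-bf16
```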

@@ -213,7 +213,7 @@ Outputs go to `output_dir` in safetensors format plus adapter metadata for later

### Core Feature 2: Chat with the fine-tuned model (base + LoRA adapter)

-Run the command: `llamafactory-cli chat examples/inference/deepseek3_lora_sft_kt.yaml`.
+Run the command: `llamafactory-cli chat examples/inference/qwen3_lora_sft.yaml`.

> **Reviewer comment (medium):** The command uses qwen3_lora_sft.yaml, but the tutorial is focused on DeepSeek-V3 (as seen in the training step at line 131 and the YAML example at line 221). This inconsistency will confuse users who just trained a DeepSeek-V3 model in the previous step. Please update the command to use the appropriate DeepSeek-V3 inference configuration. Suggested change: `llamafactory-cli chat examples/inference/deepseek_v3_lora_sft.yaml`.


Use the safetensors adapter trained with KT for inference.
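(Editor's aside: as an illustration, a LoRA chat config of this kind typically carries fields like the following; the names and paths here are placeholders, not the shipped file:)

```yaml
# Hypothetical inference config for the adapter trained above
model_name_or_path: /models/DeepSeek-V3-671B-bf16   # BF16 base model
adapter_name_or_path: /path/to/output_dir           # LoRA adapter from the training step
template: deepseek3                                 # template name is an assumption; match your model
finetuning_type: lora
trust_remote_code: true
```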

@@ -238,7 +238,7 @@ During loading, LLaMA-Factory maps layer names to KT’s naming. You’ll see lo

### Core Feature 3: Batch inference + metrics (base + LoRA adapter)

-Run the command: `API_PORT=8000 llamafactory-cli api examples/inference/deepseek3_lora_sft_kt.yaml`.
+Run the command: `API_PORT=8000 llamafactory-cli api examples/inference/qwen3_lora_sft.yaml`.

> **Reviewer comment (medium):** This API command uses qwen3_lora_sft.yaml, which is inconsistent with the DeepSeek-V3 model used throughout the rest of the guide. Suggested change: `API_PORT=8000 llamafactory-cli api examples/inference/deepseek_v3_lora_sft.yaml`.

Serve the KT fine-tuned adapter through this API; all other endpoints behave the same as in native LLaMA-Factory.
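(Editor's aside: once the server is up it speaks an OpenAI-style chat API, so a minimal smoke test could look like this; the model name and port are placeholders:)

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-v3-lora",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```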

doc/en/kt-kernel/deepseek-v3.2-sglang-tutorial.md (2 changes: 1 addition & 1 deletion)
@@ -30,7 +30,7 @@ This tutorial demonstrates how to run DeepSeek V3.2 model inference using SGLang
Before starting, ensure you have:

1. **KT-Kernel installed** - Follow the [installation guide](./kt-kernel_intro.md#installation)
-2. **SGLang installed** - Install the kvcache-ai fork: `pip install sglang-kt` or run `./install.sh` from the ktransformers root
+2. **SGLang installed** - Install the kvcache-ai fork: `pip install kt-kernel sglang-kt` or run `./install.sh` from the ktransformers root
3. **CUDA toolkit** - Compatible with your GPU (CUDA 11.8+ recommended)
4. **Hugging Face CLI** - For downloading models:
kt-kernel/README.md (6 changes: 3 additions & 3 deletions)
@@ -75,7 +75,7 @@ pip install kt-kernel
For NVIDIA GPU-accelerated inference:

```bash
-pip install kt-kernel-cuda
+pip install kt-kernel
```

**Features:**
@@ -269,7 +269,7 @@ Install the kvcache-ai fork of SGLang (required for kt-kernel support):
./install.sh

# Option B: pip install
-pip install sglang-kt
+pip install kt-kernel sglang-kt

# Option C: From source (editable mode)
git clone --recursive https://github.com/kvcache-ai/ktransformers.git
@@ -512,7 +512,7 @@ python -m sglang.launch_server \
- **`kt-method`**: Choose based on your CPU and weight format:
- `AMXINT4`: Best performance on AMX CPUs with INT4 quantized weights (May cause huge accuracy drop for some models, e.g., Qwen3-30B-A3B)
- `AMXINT8`: Higher accuracy with INT8 quantized weights on AMX CPUs
-- `RAWINT4`: Native INT4 weights shared by CPU and GPU (currently supports Kimi-K2-Thinking model). See [Kimi-K2-Thinking Native Tutorial](../doc/en/Kimi-K2-Thinking-Native.md) for details.
+- `RAWINT4`: Native INT4 weights shared by CPU and GPU (currently supports Kimi-K2-Thinking model). See [Kimi-K2-Thinking Native Tutorial](../doc/en/kt-kernel/Kimi-K2-Thinking-Native.md) for details.
- `FP8`, `FP8_PERCHANNEL`: FP8 weights shared by CPU and GPU
- `BF16`: BF16 weights shared by CPU and GPU
- `LLAMAFILE`: GGUF-based backend