264 changes: 156 additions & 108 deletions README.md
@@ -1,129 +1,177 @@
# kokoro_inference

Personal fork of `kokoro` focused on inference-time performance experiments for
[Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M).

The main change in this fork is that `KModel` prepares the network for
inference by stripping PyTorch `weight_norm` parametrizations after checkpoint
loading. The normalized weights are materialized once with
`leave_parametrized=True`, so inference no longer pays the parametrization
overhead on every forward pass.

This is an inference optimization project, not a training fork. The goal is to
keep model behavior compatible with upstream Kokoro while making the loaded
model cheaper to run.

## What Changed

- `KModel` strips `weight_norm` by default during initialization.
- `KModel.prepare_for_inference()` can be called manually and returns the number
  of modules whose `weight_norm` parametrization was removed.
- `for_training=True` keeps the original training-time parametrizations.
- `local_repo_dir` support lets `KModel` and `KPipeline` load a repo-shaped local
model directory instead of downloading from Hugging Face.

Only `weight_norm` parametrizations are removed. Normalization layers that are
part of the model computation, such as layer norm and instance norm, are left in
place.

## Install

From this checkout:

```bash
pip install -e .
```

Kokoro uses `misaki` for grapheme-to-phoneme conversion. English support is
included through the package dependency. Some languages and fallback paths also
need `espeak-ng` installed on the system.

macOS:

```bash
brew install espeak-ng
```

Debian/Ubuntu:

```bash
sudo apt-get install espeak-ng
```

## Basic Usage

```python
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code="a")

text = "Kokoro is an open-weight text-to-speech model."
generator = pipeline(text, voice="af_heart")

for index, result in enumerate(generator):
    if result.audio is None:
        continue
    sf.write(f"{index}.wav", result.audio, 24000)
```
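Each yielded result carries one audio segment at 24 kHz. To produce a single file instead of one file per segment, the segments can be collected and concatenated (a sketch using NumPy; the helper name `concat_segments` is illustrative, and it assumes each `result.audio` converts to a 1-D array):

```python
import numpy as np

SAMPLE_RATE = 24_000  # Kokoro outputs 24 kHz audio

def concat_segments(segments):
    """Join per-segment audio arrays into one 1-D waveform."""
    arrays = [np.asarray(seg, dtype=np.float32).reshape(-1) for seg in segments]
    if not arrays:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(arrays)

# e.g. collect result.audio into a list in the loop above, then:
# sf.write("full.wav", concat_segments(collected), SAMPLE_RATE)
```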
The default `KPipeline` path creates a `KModel`, moves it to the selected device,
sets it to eval mode, and runs inference with the stripped weights.

## Manual Inference Preparation

Use this if you want to inspect or control the stripping step directly:

```python
from kokoro import KModel

model = KModel(for_training=True)
removed = model.prepare_for_inference()
model.eval()

print(f"removed {removed} weight_norm parametrizations")
```

For training or fine-tuning experiments, keep the original parametrized modules:

```python
from kokoro import KModel

model = KModel(for_training=True)
```

## Local Model Files

To avoid Hugging Face downloads, provide a local directory containing
`config.json`, the Kokoro checkpoint, and `voices/`:

```python
from kokoro import KPipeline
pipeline = KPipeline(
    lang_code="a",
    local_repo_dir="./checkpoints/Kokoro-82M",
)
```

The directory is expected to look like a model repo:

```text
checkpoints/Kokoro-82M/
    config.json
    kokoro-v1_0.pth
    voices/
```
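A quick way to sanity-check the layout before pointing `local_repo_dir` at it (a stdlib sketch; `check_repo_dir` is a hypothetical helper, and the checkpoint filename must match the entry in `KModel.MODEL_NAMES` for the repo id):

```python
import os

def check_repo_dir(path, checkpoint="kokoro-v1_0.pth"):
    """Return the required entries missing from a repo-shaped directory."""
    required = ["config.json", checkpoint, "voices"]
    root = os.path.expanduser(path)
    return [name for name in required if not os.path.exists(os.path.join(root, name))]

# An empty list means the directory is ready to use as local_repo_dir.
```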

## CLI

```bash
python -m kokoro \
    --text "The quick brown fox jumps over the lazy dog." \
    --voice af_heart \
    --output-file out.wav
```

## Optimization Notes

Kokoro inherits StyleTTS-style modules that wrap many convolution layers with
`torch.nn.utils.parametrizations.weight_norm`. That representation is useful
while training because the effective weight is derived from separate magnitude
and direction parameters.

For inference, the effective weight can be computed once after loading the
checkpoint. This fork does that with:

```python
torch.nn.utils.parametrize.remove_parametrizations(
    module,
    "weight",
    leave_parametrized=True,
)
```

Expected impact:

- lower Python and parametrization overhead during repeated forward passes
- simpler module state for export and deployment experiments
- no intentional checkpoint or architecture change

Actual speedup depends on hardware, backend, text length, batching, and whether
the workload is dominated by model execution or text processing. Benchmark on
your deployment target before treating this as a production optimization.

## Development

Run the test suite:

```bash
pytest
```

Useful files:

- `kokoro/model.py`: model loading and inference preparation
- `kokoro/modules.py`: text encoder and predictor modules
- `kokoro/istftnet.py`: decoder modules using `weight_norm`
- `examples/device_examples.py`: simple device timing example
- `examples/export.py`: ONNX export experiment

## Upstream

This repository is a personal inference-focused fork of upstream Kokoro. For the
original project, model cards, voices, and samples, see:

- https://github.com/hexgrad/kokoro
- https://huggingface.co/hexgrad/Kokoro-82M
52 changes: 47 additions & 5 deletions kokoro/model.py
@@ -1,13 +1,39 @@
from .istftnet import Decoder
from .modules import CustomAlbert, ProsodyPredictor, TextEncoder
from dataclasses import dataclass
from loguru import logger
from torch.nn.utils import parametrize
from transformers import AlbertConfig
from typing import Dict, Optional, Union
import json
import os
import torch


def resolve_repo_file(
    repo_id: str,
    filename: str,
    local_repo_dir: Optional[str] = None,
) -> str:
    if local_repo_dir is not None:
        local_repo_dir = os.path.expanduser(local_repo_dir)
        path = os.path.join(local_repo_dir, filename)
        if not os.path.isfile(path):
            raise FileNotFoundError(path)
        return path
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename)


def strip_weight_norm(module: torch.nn.Module) -> int:
    removed = 0
    for child in module.modules():
        if parametrize.is_parametrized(child, "weight"):
            parametrize.remove_parametrizations(child, "weight", leave_parametrized=True)
            removed += 1
    return removed


class KModel(torch.nn.Module):
    '''
    KModel is a torch.nn.Module with 2 main responsibilities:
@@ -33,7 +59,9 @@ def __init__(
        repo_id: Optional[str] = None,
        config: Union[Dict, str, None] = None,
        model: Optional[str] = None,
        disable_complex: bool = False,
        local_repo_dir: Optional[str] = None,
        for_training: bool = False,
    ):
        super().__init__()
        if repo_id is None:
@@ -42,8 +70,11 @@ def __init__(
        self.repo_id = repo_id
        if not isinstance(config, dict):
            if not config:
                config = resolve_repo_file(
                    repo_id=repo_id,
                    filename='config.json',
                    local_repo_dir=local_repo_dir,
                )
            with open(config, 'r', encoding='utf-8') as r:
                config = json.load(r)
            logger.debug(f"Loaded config: {config}")
@@ -64,7 +95,11 @@ def __init__(
            dim_out=config['n_mels'], disable_complex=disable_complex, **config['istftnet']
        )
        if not model:
            model = resolve_repo_file(
                repo_id=repo_id,
                filename=KModel.MODEL_NAMES[repo_id],
                local_repo_dir=local_repo_dir,
            )
        for key, state_dict in torch.load(model, map_location='cpu', weights_only=True).items():
            assert hasattr(self, key), key
            try:
@@ -73,11 +108,18 @@ def __init__(
                logger.debug(f"Did not load {key} from state_dict")
                state_dict = {k[7:]: v for k, v in state_dict.items()}
                getattr(self, key).load_state_dict(state_dict, strict=False)
        if not for_training:
            self.prepare_for_inference()

    @property
    def device(self):
        return self.bert.device

    def prepare_for_inference(self) -> int:
        removed = strip_weight_norm(self)
        logger.debug(f"Removed weight_norm from {removed} modules")
        return removed

    @dataclass
    class Output:
        audio: torch.FloatTensor