264 changes: 156 additions & 108 deletions README.md
@@ -1,129 +1,177 @@
# kokoro_inference

Personal fork of `kokoro` focused on inference-time performance experiments for
[Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M).

The main change in this fork is that `KModel` prepares the network for
inference by stripping PyTorch `weight_norm` parametrizations after checkpoint
loading. The normalized weights are materialized once with
`leave_parametrized=True`, so inference no longer pays the parametrization
overhead on every forward pass.

This is an inference optimization project, not a training fork. The goal is to
keep model behavior compatible with upstream Kokoro while making the loaded
model cheaper to run.

## What Changed

- `KModel` strips `weight_norm` by default during initialization.
- `KModel.prepare_for_inference()` can be called manually and returns the number
  of modules whose `weight_norm` parametrization was removed.
- `for_training=True` keeps the original training-time parametrizations.
- `local_repo_dir` support lets `KModel` and `KPipeline` load a repo-shaped local
model directory instead of downloading from Hugging Face.

Only `weight_norm` parametrizations are removed. Normalization layers that are
part of the model computation, such as layer norm and instance norm, are left in
place.

## Install

From this checkout:

```bash
pip install -e .
```

Kokoro uses `misaki` for grapheme-to-phoneme conversion. English support is
included through the package dependency. Some languages and fallback paths also
need `espeak-ng` installed on the system.

macOS:

```bash
brew install espeak-ng
```

Debian/Ubuntu:

```bash
sudo apt-get install espeak-ng
```

## Basic Usage

```python
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code="a")

text = "Kokoro is an open-weight text-to-speech model."
generator = pipeline(text, voice="af_heart")

for index, result in enumerate(generator):
    if result.audio is None:
        continue
    sf.write(f"{index}.wav", result.audio, 24000)
```
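Each yielded result carries one audio segment at 24 kHz. To produce a single file instead of one file per segment, the segments can be collected and concatenated (a sketch using NumPy; the helper name `concat_segments` is illustrative, and it assumes each `result.audio` converts to a 1-D array):

```python
import numpy as np

SAMPLE_RATE = 24_000  # Kokoro outputs 24 kHz audio

def concat_segments(segments):
    """Join per-segment audio arrays into one 1-D waveform."""
    arrays = [np.asarray(seg, dtype=np.float32).reshape(-1) for seg in segments]
    if not arrays:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(arrays)

# e.g. collect result.audio into a list in the loop above, then:
# sf.write("full.wav", concat_segments(collected), SAMPLE_RATE)
```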
The default `KPipeline` path creates a `KModel`, moves it to the selected device,
sets it to eval mode, and runs inference with the stripped weights.

## Manual Inference Preparation

Use this if you want to inspect or control the stripping step directly:

```python
from kokoro import KModel

model = KModel(for_training=True)
removed = model.prepare_for_inference()
model.eval()

print(f"removed {removed} weight_norm parametrizations")
```

For training or fine-tuning experiments, keep the original parametrized modules:

```python
from kokoro import KModel

model = KModel(for_training=True)
```

## Local Model Files

To avoid Hugging Face downloads, provide a local directory containing
`config.json`, the Kokoro checkpoint, and `voices/`:

```python
from kokoro import KPipeline
pipeline = KPipeline(
    lang_code="a",
    local_repo_dir="./checkpoints/Kokoro-82M",
)
```

The directory is expected to look like a model repo:

```text
checkpoints/Kokoro-82M/
    config.json
    kokoro-v1_0.pth
    voices/
```
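A quick way to sanity-check the layout before pointing `local_repo_dir` at it (a stdlib sketch; `check_repo_dir` is a hypothetical helper, and the checkpoint filename must match the entry in `KModel.MODEL_NAMES` for the repo id):

```python
import os

def check_repo_dir(path, checkpoint="kokoro-v1_0.pth"):
    """Return the required entries missing from a repo-shaped directory."""
    required = ["config.json", checkpoint, "voices"]
    root = os.path.expanduser(path)
    return [name for name in required if not os.path.exists(os.path.join(root, name))]

# An empty list means the directory is ready to use as local_repo_dir.
```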

## CLI

```bash
python -m kokoro \
    --text "The quick brown fox jumps over the lazy dog." \
    --voice af_heart \
    --output-file out.wav
```

## Optimization Notes

Kokoro inherits StyleTTS-style modules that wrap many convolution layers with
`torch.nn.utils.parametrizations.weight_norm`. That representation is useful
while training because the effective weight is derived from separate magnitude
and direction parameters.

For inference, the effective weight can be computed once after loading the
checkpoint. This fork does that with:

```python
torch.nn.utils.parametrize.remove_parametrizations(
    module,
    "weight",
    leave_parametrized=True,
)
```

Expected impact:

- lower Python and parametrization overhead during repeated forward passes
- simpler module state for export and deployment experiments
- no intentional checkpoint or architecture change

Actual speedup depends on hardware, backend, text length, batching, and whether
the workload is dominated by model execution or text processing. Benchmark on
your deployment target before treating this as a production optimization.

## Development

Run the test suite:

```bash
pytest
```

Useful files:

- `kokoro/model.py`: model loading and inference preparation
- `kokoro/modules.py`: text encoder and predictor modules
- `kokoro/istftnet.py`: decoder modules using `weight_norm`
- `examples/device_examples.py`: simple device timing example
- `examples/export.py`: ONNX export experiment

## Upstream

This repository is a personal inference-focused fork of upstream Kokoro. For the
original project, model cards, voices, and samples, see:

- https://github.com/hexgrad/kokoro
- https://huggingface.co/hexgrad/Kokoro-82M
52 changes: 47 additions & 5 deletions kokoro/model.py
@@ -1,13 +1,39 @@
from .istftnet import Decoder
from .modules import CustomAlbert, ProsodyPredictor, TextEncoder
from dataclasses import dataclass
from loguru import logger
from torch.nn.utils import parametrize
from transformers import AlbertConfig
from typing import Dict, Optional, Union
import json
import os
import torch


def resolve_repo_file(
    repo_id: str,
    filename: str,
    local_repo_dir: Optional[str] = None,
) -> str:
    if local_repo_dir is not None:
        local_repo_dir = os.path.expanduser(local_repo_dir)
        path = os.path.join(local_repo_dir, filename)
        if not os.path.isfile(path):
            raise FileNotFoundError(path)
        return path
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename)


def strip_weight_norm(module: torch.nn.Module) -> int:
    removed = 0
    for child in module.modules():
        if parametrize.is_parametrized(child, "weight"):
            parametrize.remove_parametrizations(child, "weight", leave_parametrized=True)
            removed += 1
    return removed


class KModel(torch.nn.Module):
    '''
    KModel is a torch.nn.Module with 2 main responsibilities:
@@ -33,7 +59,9 @@ def __init__(
        repo_id: Optional[str] = None,
        config: Union[Dict, str, None] = None,
        model: Optional[str] = None,
        disable_complex: bool = False,
        local_repo_dir: Optional[str] = None,
        for_training: bool = False,
    ):
        super().__init__()
        if repo_id is None:
@@ -42,8 +70,11 @@ def __init__(
        self.repo_id = repo_id
        if not isinstance(config, dict):
            if not config:
                config = resolve_repo_file(
                    repo_id=repo_id,
                    filename='config.json',
                    local_repo_dir=local_repo_dir,
                )
            with open(config, 'r', encoding='utf-8') as r:
                config = json.load(r)
            logger.debug(f"Loaded config: {config}")
@@ -64,7 +95,11 @@ def __init__(
            dim_out=config['n_mels'], disable_complex=disable_complex, **config['istftnet']
        )
        if not model:
            model = resolve_repo_file(
                repo_id=repo_id,
                filename=KModel.MODEL_NAMES[repo_id],
                local_repo_dir=local_repo_dir,
            )
        for key, state_dict in torch.load(model, map_location='cpu', weights_only=True).items():
            assert hasattr(self, key), key
            try:
@@ -73,11 +108,18 @@ def __init__(
                logger.debug(f"Did not load {key} from state_dict")
                state_dict = {k[7:]: v for k, v in state_dict.items()}
                getattr(self, key).load_state_dict(state_dict, strict=False)
        if not for_training:
            self.prepare_for_inference()

    @property
    def device(self):
        return self.bert.device

    def prepare_for_inference(self) -> int:
        removed = strip_weight_norm(self)
        logger.debug(f"Removed weight_norm from {removed} modules")
        return removed

    @dataclass
    class Output:
        audio: torch.FloatTensor