feat(_kokoro_tts): add audio format config with remote API support #1692

Open
Draco-Lunaris wants to merge 1 commit into agent0ai:main from Draco-Lunaris:feature/kokoro-tts-audio-format-config

@Draco-Lunaris

Feature Request: Add audio format configuration to Kokoro TTS plugin

Problem

The _kokoro_tts plugin currently hardcodes audio/wav as the output format. WAV files are uncompressed and large — a typical 3-second phrase produces ~65KB of WAV vs ~10KB of MP3 at comparable speech quality. For remote TTS services like Kokoro-FastAPI that support multiple output formats (mp3, wav, opus, flac), users have no way to configure the format without modifying plugin source code.

Additionally, the plugin currently assumes local inference via from kokoro import KPipeline, which requires the kokoro Python package to be installed in the framework's venv. When using a remote Kokoro-FastAPI service, the local import is unnecessary and the plugin should call the remote API instead.

Proposed Solution

Add a response_format config option to the Kokoro TTS plugin, allowing users to choose between wav, mp3, opus, and flac output formats.

Implementation Details

default_config.yaml

voice: af_bella
speed: 1.1
response_format: mp3

helpers/runtime.py: normalize_config()

VALID_FORMATS = {"wav", "mp3", "opus", "flac"}
MIME_TYPES = {
    "wav": "audio/wav",
    "mp3": "audio/mpeg",
    "opus": "audio/opus",
    "flac": "audio/flac",
}

# In normalize_config:
response_format = str(config.get("response_format", normalized["response_format"]) or "").strip().lower()
if response_format in VALID_FORMATS:
    normalized["response_format"] = response_format
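As a minimal, self-contained sketch of the validation above: an unknown, empty, or missing value keeps the default rather than raising. The `DEFAULT_CONFIG` name and the standalone function signature are assumptions for illustration; the plugin's real `normalize_config()` also handles other keys not shown here.

```python
VALID_FORMATS = {"wav", "mp3", "opus", "flac"}

# Hypothetical defaults mirroring default_config.yaml from this proposal.
DEFAULT_CONFIG = {"voice": "af_bella", "speed": 1.1, "response_format": "mp3"}


def normalize_config(config):
    """Return the defaults with a validated response_format applied.

    Only response_format is handled in this sketch.
    """
    normalized = dict(DEFAULT_CONFIG)
    raw = config.get("response_format", normalized["response_format"])
    # Treat None/empty as "not set", then compare case-insensitively.
    response_format = str(raw or "").strip().lower()
    if response_format in VALID_FORMATS:
        normalized["response_format"] = response_format
    return normalized
```

With this shape, a value such as `" WAV "` is accepted after trimming and lowercasing, while an unsupported value such as `"aac"` silently falls back to the default.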

helpers/runtime.py: synthesize_sentences()

Pass response_format through to the backend and return the corresponding MIME type.

For local inference (current path):

# soundfile.write() supports WAV and FLAC natively.
# For MP3/Opus, write WAV first and convert afterwards.
format_map = {"wav": "WAV", "flac": "FLAC"}
sf.write(buffer, combined_audio, 24000, format=format_map.get(response_format, "WAV"))

For remote API (Kokoro-FastAPI):

json={
    "model": "kokoro",
    "input": text,
    "voice": voice,
    "response_format": response_format,
    "speed": speed,
}
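For context, here is one way the payload above could be sent, sketched with only the standard library. The `/v1/audio/speech` path assumes Kokoro-FastAPI's OpenAI-compatible speech endpoint, and the helper names, the `timeout` default, and the use of `urllib` (rather than whichever HTTP client the plugin actually uses) are illustrative assumptions.

```python
import json
import urllib.request


def build_speech_request(base_url, text, voice, response_format, speed):
    """Build a POST request carrying the synthesis payload as JSON."""
    payload = {
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        # Assumed endpoint path; adjust to match the deployed service.
        base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_remote(base_url, text, voice, response_format="mp3",
                      speed=1.0, timeout=30):
    """Send the request and return the raw audio bytes from the response body."""
    request = build_speech_request(base_url, text, voice, response_format, speed)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping request construction separate from the network call makes the payload easy to check without a running TTS server.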

api/synthesize.py

# Instead of hardcoded mime_type:
mime_type = MIME_TYPES.get(cfg["response_format"], "audio/mpeg")
return {
    "success": True,
    "audio": audio,
    "mime_type": mime_type,
}

webui/config.html

Add a format selector:

<div class="field">
  <div class="field-label">
    <div class="field-title">Audio Format</div>
    <div class="field-description">Output format for synthesized audio.</div>
  </div>
  <div class="field-control">
    <select x-model="config.response_format">
      <option value="mp3">MP3 (recommended)</option>
      <option value="wav">WAV (uncompressed)</option>
      <option value="opus">Opus (low bitrate)</option>
      <option value="flac">FLAC (lossless)</option>
    </select>
  </div>
</div>

Benefits

  • ~85% file size reduction for MP3 vs WAV at comparable speech quality
  • Faster network transfer between remote TTS service and browser
  • Browser compatibility: all modern browsers support MP3 playback via <audio> elements
  • User choice: lossless/low-latency users can still use WAV; bandwidth-constrained users can use MP3/Opus
  • Forward-compatible: new formats need only small additions (entries in VALID_FORMATS and MIME_TYPES, plus a dropdown option)

Environment

  • Agent Zero version: v1.14+
  • Plugin: _kokoro_tts
  • Tested with: Kokoro-FastAPI v0.9.4 (remote), local kokoro 0.9.4

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9cd5be8bf4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/_kokoro_tts/default_config.yaml Outdated
voice: am_puck,am_onyx
voice: am_onyx+am_echo
speed: 1.1
remote_url: http://ares.moon-dragon.us:18890

P1: Require explicit opt-in before using a third-party TTS host

With the shipped default config, any user who enables Kokoro TTS without creating their own config (and any migrated config missing remote_url) will have every synthesized text POSTed to ares.moon-dragon.us. That is a privacy and availability regression from the previous local-only path, because normal spoken chat content leaves the user's deployment by default; make the remote URL empty/localhost and require users to explicitly configure a remote service before sending audio requests off-box.

Useful? React with 👍 / 👎.

@Draco-Lunaris

Author

Found an issue with local / remote TTS usage. Cleaning it up and fixing it now. Should have a proper PR shortly.

- Add response_format config (mp3/wav/opus/flac) with MIME type mapping
- Add remote_url config for optional remote Kokoro-FastAPI server
- If remote_url is set, use remote API for synthesis; otherwise use local model
- If remote_url is set, use remote health check; otherwise use local model status
- Status endpoint reports both local model and remote health (if configured)
- Synthesize endpoint returns (audio, mime_type) tuple for proper content-type
- WebUI config page adds format dropdown and remote URL field
- WebUI main page shows remote health alongside local model status
- Preserves all local synthesis functionality (soundfile, KPipeline, etc.)
- Preserves upstream defaults (voice: am_puck,am_onyx, speed: 1.1)
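
The selection rule in the list above (remote only when `remote_url` is set, local otherwise) could be sketched as follows. This is an illustration under assumptions, not the code in this PR: `resolve_remote_url` and `select_backend` are hypothetical names, and rejecting non-HTTP(S) values is an extra safeguard added here so that a blank or malformed setting never sends text off-box.

```python
from urllib.parse import urlparse


def resolve_remote_url(config):
    """Return a cleaned remote URL, or None when remote synthesis is not configured."""
    raw = str(config.get("remote_url") or "").strip()
    if not raw:
        return None  # empty or missing: remote use is opt-in only
    parsed = urlparse(raw)
    # Require an explicit http(s) scheme and a host.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    return raw.rstrip("/")


def select_backend(config):
    """Pick "remote" only when a usable remote_url was explicitly configured."""
    return "remote" if resolve_remote_url(config) else "local"
```

With an empty default for `remote_url`, a fresh install stays on the local path until the user explicitly points the plugin at a server.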
@Draco-Lunaris-Echo Draco-Lunaris-Echo force-pushed the feature/kokoro-tts-audio-format-config branch from e3e440c to 5eaa508 Compare June 3, 2026 22:53
@Draco-Lunaris

Author

Cleaned up the PR. It now includes all the original local TTS functionality. Both local and remote TTS should be unaffected apart from the new config options added to the plugin config. I'm running the changes locally myself.
