feat(minimax): port MiniMax TTS plugin from Python agents#1287
toubatbrian wants to merge 13 commits into main
🦋 Changeset detected. Latest commit: c34b65d. The changes in this PR will be included in the next version bump. This PR includes changesets to release 27 packages.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 05c9ccdba6
const audioHex = data?.data?.audio as string | undefined;
if (audioHex) {
  const audio = hexToBuffer(audioHex);
  for (const frame of bstream.write(audio.buffer)) {
Preserve byte offsets when writing decoded audio
Both synthesis paths decode hex with Buffer.from(...) and then pass audio.buffer to AudioByteStream. In Node this Buffer is typically a slice of a pooled ArrayBuffer (byteOffset is non-zero), so using .buffer feeds unrelated bytes before/after the actual chunk, which corrupts PCM output and frame boundaries. Pass the Buffer/view directly (or slice the backing buffer with byteOffset and byteLength) here and in the matching WebSocket path.
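The offset problem described above can be reproduced without any audio. The sketch below simulates a pooled backing store with a plain `ArrayBuffer` (an assumption made so the result is deterministic; a real Node.js `Buffer` pool is larger and its offsets vary). The helper name `exactBytes` is hypothetical and is not part of the plugin.

```typescript
// Simulated pool: one large ArrayBuffer shared by several small views.
const pool = new ArrayBuffer(16);
new Uint8Array(pool).fill(0xff); // bytes that belong to unrelated allocations

// A 4-byte "decoded audio chunk" that lives at offset 8 inside the pool.
const chunk = new Uint8Array(pool, 8, 4);
chunk.set([1, 2, 3, 4]);

// Wrong: `.buffer` is the whole pool, so the chunk is surrounded by foreign bytes.
const wrong = new Uint8Array(chunk.buffer);

// Right: copy only the range [byteOffset, byteOffset + byteLength).
function exactBytes(view: Uint8Array) {
  return view.buffer.slice(view.byteOffset, view.byteOffset + view.byteLength);
}
const right = new Uint8Array(exactBytes(chunk));

console.log(wrong.length, right.length);
```

Passing the view itself, as the review suggests, avoids the copy entirely when the consumer accepts typed-array views.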
@@ -0,0 +1,52 @@
{
  "name": "@livekit/agents-plugin-minimax",
Update lockfile when adding the new workspace package
This commit introduces plugins/minimax/package.json as a new workspace package, but pnpm-lock.yaml was not regenerated to include a plugins/minimax importer. Because CI runs pnpm install --frozen-lockfile in build/test workflows, installs will fail when the lockfile is out of sync with workspace manifests. Regenerating and committing the lockfile with this package is required to keep CI green.
- Regenerate pnpm-lock.yaml so frozen-lockfile install finds the new minimax workspace package (otherwise Build/Test/Formatting all fail at install time).
- Add MINIMAX_API_KEY and MINIMAX_BASE_URL to turbo.json globalEnv so eslint-config-turbo stops rejecting the env var references.
- Pass the Node.js Buffer directly to AudioByteStream.write instead of unwrapping to Buffer.buffer. Node pools small Buffers inside a larger ArrayBuffer, so .buffer exposed ~8KB of unrelated pool memory and corrupted PCM output (flagged by Devin and Codex review).
- Fix three TS2322 errors: APIStatusError.options.body must be an object or null, not a JSON string.
- Add tts.test.ts with the standard 'skip when API key missing' pattern used by rime/cartesia/neuphonic.
- Add changeset entry.
Thanks for the reviews @devin-ai-integration @chatgpt-codex-connector — addressed in
Also fixed while in the area:
Generated by Claude Code
- updateOptions now re-validates merged opts via an extracted validateOptions helper, so speed/intensity/timbre/emotion+model constraints surface locally rather than as a server-side error. (Python's update_options silently assigns; the JS port tightens this.)
- Add the CLAUDE.md-required // Ref: python comments to every ported symbol in src/tts.ts and src/models.ts so reviewers can cross-check each function/type against the Python source.
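A minimal sketch of the merge-then-validate idea, using the ranges listed in the PR description (speed in [0.5, 2.0], intensity and timbre in [-100, 100], `fluent` only for `speech-2.6-*` models). The names `SketchOptions`, `validateOptionsSketch`, and `updateOptionsSketch` are hypothetical; the actual helper in src/tts.ts may have a different shape and different error types.

```typescript
// Hypothetical option shape, reduced to the fields the constraints mention.
interface SketchOptions {
  model: string;
  speed?: number;
  intensity?: number;
  timbre?: number;
  emotion?: string;
}

// Throws if a provided value is outside [min, max]; NaN is rejected as well.
function assertInRange(name: string, value: number | undefined, min: number, max: number): void {
  if (value !== undefined && !(value >= min && value <= max)) {
    throw new RangeError(`${name} must be between ${min} and ${max}, got ${value}`);
  }
}

function validateOptionsSketch(opts: SketchOptions): void {
  assertInRange('speed', opts.speed, 0.5, 2.0);
  assertInRange('intensity', opts.intensity, -100, 100);
  assertInRange('timbre', opts.timbre, -100, 100);
  if (opts.emotion === 'fluent' && !opts.model.startsWith('speech-2.6-')) {
    throw new Error(`emotion "fluent" is only accepted for speech-2.6-* models, got ${opts.model}`);
  }
}

// Merge first, validate the merged result, and only then return it,
// so an invalid patch never replaces the current options.
function updateOptionsSketch(current: SketchOptions, patch: Partial<SketchOptions>): SketchOptions {
  const merged: SketchOptions = { ...current, ...patch };
  validateOptionsSketch(merged);
  return merged;
}
```

Validating the merged object rather than the patch alone matters for cross-field rules: switching only the model to `speech-2.8-hd` while `emotion` is already `fluent` is rejected here, even though the patch itself contains no emotion.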
Promise.race between taskStarted.await and the setTimeout-based timeout promise leaked the timer. On successful task_start, the timer still fired and called reject() on an already-settled race, producing an unhandled promise rejection. Capture the handle and clearTimeout in a finally block. Matches the pattern used by waitForWebSocketOpen later in the same file and by cartesia.
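The capture-and-clear pattern described in this commit might look roughly like the following. `withTimeout` is a hypothetical helper written for illustration, not the function in the plugin, and the error type is a plain `Error` rather than the library's timeout error.

```typescript
// Races `work` against a timer and always clears the timer afterwards,
// so it cannot fire once the race has already settled.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Runs on success, on timeout, and when `work` itself rejects.
    clearTimeout(timer);
  }
}
```

Because the handle is cleared in `finally`, a fast success does not leave a pending timer behind to keep the event loop alive or reject later.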
#tokenStream is created once in the constructor; the TTS base class retries run() on retryable errors (timeout, 5xx, 429), so closing the stream here made every retry push into a closed stream and silently drop user input. Only close() closes it now, matching how cartesia handles its tokenizer stream.
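To make the retry failure mode concrete, here is a toy model of it. Both classes are hypothetical stand-ins (the real tokenizer stream and TTS base class in `@livekit/agents` are more involved); the `closeInRun` flag exists only to contrast the old and new behavior.

```typescript
// Stand-in for a tokenizer stream: once closed, pushed text is dropped.
class SketchTokenStream {
  private closed = false;
  readonly accepted: string[] = [];

  pushText(text: string): boolean {
    if (this.closed) return false; // silently dropped
    this.accepted.push(text);
    return true;
  }

  close(): void {
    this.closed = true;
  }
}

class SketchSynthesizeStream {
  // Created once, like the #tokenStream in the commit message.
  private readonly tokenStream = new SketchTokenStream();
  private readonly closeInRun: boolean;

  constructor(closeInRun: boolean) {
    this.closeInRun = closeInRun;
  }

  pushText(text: string): boolean {
    return this.tokenStream.pushText(text);
  }

  // One attempt; a retrying base class may call this more than once.
  runAttempt(): void {
    if (this.closeInRun) this.tokenStream.close(); // old behavior
  }

  close(): void {
    this.tokenStream.close(); // the only place that closes after the fix
  }
}

const buggy = new SketchSynthesizeStream(true);
buggy.runAttempt(); // first attempt ends, for example with a retryable error
const droppedOnRetry = !buggy.pushText('hello');

const fixed = new SketchSynthesizeStream(false);
fixed.runAttempt();
const acceptedOnRetry = fixed.pushText('hello');
```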
…tryable
- All new files now use SPDX-FileCopyrightText: 2026 per CLAUDE.md.
- When the server returns a non-zero base_resp.status_code, pass retryable: false to APIStatusError. These are MiniMax app-level codes (e.g. 1002 invalid param, 1004 auth), not HTTP status codes, so the default retryability heuristic would incorrectly retry permanent errors. Applies to both the HTTP SSE and WebSocket paths.
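The distinction between app-level codes and HTTP status codes can be sketched as below. `SketchStatusError` and `checkBaseResp` are hypothetical names; the real code uses `APIStatusError` from `@livekit/agents`, whose constructor signature is not reproduced here. Only the `base_resp.status_code` field mentioned in the commit message is assumed in the response body.

```typescript
// Hypothetical stand-in for the library's status error.
class SketchStatusError extends Error {
  readonly statusCode: number;
  readonly retryable: boolean;

  constructor(message: string, statusCode: number, retryable: boolean) {
    super(message);
    this.name = 'SketchStatusError';
    this.statusCode = statusCode;
    this.retryable = retryable;
  }
}

interface SketchResponseBody {
  base_resp?: { status_code?: number };
}

// A non-zero base_resp.status_code is an application-level failure
// (for example an invalid parameter), so it is marked non-retryable
// instead of being classified as if it were an HTTP status.
function checkBaseResp(body: SketchResponseBody): void {
  const code = body.base_resp?.status_code ?? 0;
  if (code !== 0) {
    throw new SketchStatusError(`MiniMax returned status_code ${code}`, code, false);
  }
}
```

Setting the flag explicitly avoids depending on how a generic heuristic would classify a four-digit code such as 1002 or 1004.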
Summary
Ports the MiniMax TTS plugin from livekit/agents (Python) into agents-js.

Triggered by automated routine observing merged PR livekit/agents#5518 ("(minimax): add new TTS models"). The Python change simply added speech-2.8-hd and speech-2.8-turbo to the already-existing MiniMax plugin. Because the MiniMax plugin did not yet exist in agents-js, this PR creates the full plugin scaffold (including the new 2.8 models) rather than attempting to land the two model literals alone.

Closes the JS-side gap with livekit-plugins-minimax (Python).

What's included

New workspace package @livekit/agents-plugin-minimax at plugins/minimax/:

- package.json, tsconfig.json, tsup.config.ts, api-extractor.json, README.md: matches the conventions of existing plugins (e.g. rime, cartesia, neuphonic).
- src/index.ts: plugin registration.
- src/models.ts: literal types (TTSModel, TTSVoice, TTSEmotion, TTSLanguageBoost, TTSSampleRate), defaults (DEFAULT_MODEL, DEFAULT_VOICE_ID, DEFAULT_BASE_URL). Includes the new speech-2.8-hd and speech-2.8-turbo model strings from (minimax): add new TTS models agents#5518.
- src/tts.ts:
  - TTS class with the same capability surface as the Python version (streaming: true, alignedTranscript: false).
  - ChunkedStream: one-shot synthesis via HTTP SSE (POST /v1/t2a_v2, stream: true, exclude_aggregated_audio: true), hex-decoding the audio chunks and pushing PCM frames into an AudioByteStream.
  - SynthesizeStream: real-time WebSocket synthesis via /ws/v1/t2a_v2 with the task_start / task_continue / task_finish event protocol. Uses a sentence tokenizer to chunk incoming text.
  - updateOptions() parity with the Python update_options.
  - speed ∈ [0.5, 2.0], intensity ∈ [-100, 100], timbre ∈ [-100, 100], and the fluent emotion is only accepted for speech-2.6-* models.
  - APIConnectionError / APIStatusError / APITimeoutError / APIError with MiniMax trace_id propagation on both HTTP and WS paths.

Implementation notes where JS differs from Python
Code-level parity is mostly 1:1, except:

- Audio format is restricted to PCM. The Python plugin exposes an audio_format option with pcm | mp3 | flac | wav. @livekit/agents's AudioByteStream is designed around raw PCM samples; decoding MP3/FLAC/WAV on the fly would require pulling in an external decoder and wiring it through the TTS pipeline, which is out of scope for the initial port. format: "pcm" is always sent on the wire, and the incoming hex audio is fed straight into AudioByteStream. The public bitrate option is still accepted for API parity, but is effectively ignored by MiniMax when format=pcm.
- No SentenceStreamPacer. Python's optional text_pacing (a sentence-level pacer that coordinates with the audio emitter) does not have a counterpart in @livekit/agents at the moment, so the option is omitted. Users can still pass a custom SentenceTokenizer via tokenizer.
- HTTP client. Python uses aiohttp; the JS port uses fetch for the chunked HTTP path (matches elevenlabs) and ws for the streaming path (matches cartesia/neuphonic).
- Request ID / trace ID. Python extracts Trace-Id / X-Trace-Id from response headers and trace_id from the body (both root.trace_id and base_resp.trace_id). JS does the same, preferring the header and falling back to the body.
- py.typed marker is not applicable in the JS world; types ship via the generated dist/index.d.ts.

Test plan
- pnpm install at repo root picks up the new workspace.
- pnpm --filter @livekit/agents-plugin-minimax build succeeds.
- pnpm --filter @livekit/agents-plugin-minimax lint is clean.
- new TTS({ apiKey }).synthesize("hello world") returns PCM audio.
- new TTS({ apiKey }).stream() with a pushed sentence emits framed PCM chunks end-to-end.
- emotion: 'fluent' with a non-speech-2.6-* model throws, matching Python behavior.

This is an automated port from livekit/agents#5518 by the Claude Code automation routine (experimental).
cc @toubatbrian @livekit/agent-devs