
feat: add in-process OpenAI-compatible STT API service#245

Draft
krystophny wants to merge 5 commits into main from feature/single-daemon-openai-stt-api

Conversation


krystophny (Collaborator) commented Mar 1, 2026

Summary

This PR adds an in-process OpenAI-compatible STT HTTP service to Voxtype. The service runs alongside the daemon and reuses the daemon's transcriber, so Voxtype does not load the Whisper model twice when both push-to-talk STT and an API endpoint are needed.

Scope

  • Adds /healthz, /v1/audio/transcriptions, and /v1/audio/translations.
  • Adds service config, CLI flags, and environment overrides for bind host/port, request timeout, upload limits, and allowed languages.
  • Adds request-level language and prompt overrides for the Whisper transcriber.
  • Adds json, text, and verbose_json response handling, including segment timestamps for long-form chunking.
  • Keeps a bounded Whisper state pool for concurrent service requests.
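The bounded state pool in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's actual types: `WhisperState`, the pool API, and the blocking strategy are assumptions. The idea is that at most `cap` Whisper states ever exist, and concurrent API requests block until one is free.

```rust
use std::sync::{Condvar, Mutex};

// Hypothetical stand-in for a whisper-rs inference state.
struct WhisperState {
    id: usize,
}

// A bounded pool: at most `cap` states exist; requests block until one is free.
struct StatePool {
    inner: Mutex<Vec<WhisperState>>,
    available: Condvar,
}

impl StatePool {
    fn new(cap: usize) -> Self {
        let states = (0..cap).map(|id| WhisperState { id }).collect();
        StatePool {
            inner: Mutex::new(states),
            available: Condvar::new(),
        }
    }

    // Block until a state is free, then hand it out.
    fn acquire(&self) -> WhisperState {
        let mut guard = self.inner.lock().unwrap();
        while guard.is_empty() {
            guard = self.available.wait(guard).unwrap();
        }
        guard.pop().unwrap()
    }

    // Return a state so a waiting request can proceed.
    fn release(&self, state: WhisperState) {
        self.inner.lock().unwrap().push(state);
        self.available.notify_one();
    }
}
```

Bounding the pool keeps GPU memory flat under load instead of allocating one state per in-flight request.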

This branch is intentionally independent from macOS support. It is based on main and does not include the macOS files from #129. For local Mac testing with both feature sets stacked, use feature/macos-openai-stt-stack.

Design Notes

  • No auth is implemented in the local Voxtype service; the intended deployment is loopback/private LAN first.
  • The service can constrain accepted request languages through config.
  • Audio upload decode/downmix/resample to 16 kHz mono is handled server-side.
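The server-side downmix/resample step can be illustrated with a minimal sketch. The function names and the linear-interpolation resampler here are assumptions for illustration; Voxtype's actual decode path may use a proper windowed-sinc resampler.

```rust
// Interleaved multi-channel samples -> mono by averaging each frame's channels.
fn downmix_to_mono(samples: &[f32], channels: usize) -> Vec<f32> {
    samples
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

// Naive linear-interpolation resample from `src_rate` to the 16 kHz Whisper expects.
fn resample_to_16k(mono: &[f32], src_rate: u32) -> Vec<f32> {
    const TARGET: u32 = 16_000;
    if src_rate == TARGET || mono.is_empty() {
        return mono.to_vec();
    }
    let ratio = src_rate as f64 / TARGET as f64;
    let out_len = (mono.len() as f64 / ratio).floor() as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = mono[idx];
            let b = mono[(idx + 1).min(mono.len() - 1)];
            a + (b - a) * frac // interpolate between neighbouring samples
        })
        .collect()
}
```

Handling this server-side means clients can upload whatever their recorder produces (stereo 44.1/48 kHz) without preprocessing.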

Verification

Executed on mailuefterl in /tmp/voxtype-openai.J5NRzM/repo after applying this branch on top of main:

$ cargo check
Finished `dev` profile
$ cargo test
549 unit tests passed; 25 integration tests passed; 0 failed
$ cargo build
Finished `dev` profile

Stacked Mac branch verification on feature/macos-openai-stt-stack:

$ cargo build --release --features gpu-metal
Finished `release` profile
$ curl -fsS http://127.0.0.1:8427/healthz
{"status":"ok"}
$ curl -fsS -F file=@tests/fixtures/vad/speech_hello.wav -F model=large-v3-turbo -F response_format=json -F language=en http://127.0.0.1:8427/v1/audio/transcriptions
{"text":"Hello world"}

Closes #244

The service previously created its own transcriber instance, loading
the same model into GPU memory a second time. Now the daemon passes
its existing transcriber via Arc to the service, halving VRAM usage.
Falls back to creating a separate instance when no shared transcriber
is available (on-demand loading, gpu_isolation).
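The sharing-with-fallback logic described in that commit can be sketched like this. The `Transcriber` trait and names below are simplified stand-ins, not Voxtype's actual API; the point is that the service clones the daemon's `Arc` instead of loading the model again.

```rust
use std::sync::Arc;

// Hypothetical simplified transcriber trait.
trait Transcriber: Send + Sync {
    fn name(&self) -> &str;
}

struct DummyTranscriber(&'static str);
impl Transcriber for DummyTranscriber {
    fn name(&self) -> &str {
        self.0
    }
}

// Reuse the daemon's transcriber when one is shared; otherwise the service
// loads its own (e.g. under on-demand loading or gpu_isolation).
fn service_transcriber(shared: Option<Arc<dyn Transcriber>>) -> Arc<dyn Transcriber> {
    match shared {
        Some(t) => t, // no second model load: same Arc, same GPU memory
        None => Arc::new(DummyTranscriber("service-owned")),
    }
}
```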

Return per-segment start/end timestamps when response_format is
verbose_json. Adds transcribe_segments method to Transcriber trait
with default fallback and WhisperTranscriber override that extracts
real timestamps from whisper-rs segment iterator.
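A default-fallback trait method of that shape might look like the sketch below. The signature and `Segment` type are hypothetical; the real trait in the PR may differ. The default implementation returns one segment spanning the whole clip, so backends without real timestamps still satisfy verbose_json, while a Whisper-backed override would replace it with per-segment times.

```rust
// A segment with start/end timestamps in milliseconds, as in verbose_json.
#[derive(Debug, PartialEq)]
struct Segment {
    start_ms: i64,
    end_ms: i64,
    text: String,
}

trait Transcriber {
    fn transcribe(&self, audio: &[f32]) -> String;

    // Default fallback: a single segment covering the full clip duration.
    fn transcribe_segments(&self, audio: &[f32], sample_rate: u32) -> Vec<Segment> {
        let end_ms = (audio.len() as i64 * 1000) / sample_rate as i64;
        vec![Segment {
            start_ms: 0,
            end_ms,
            text: self.transcribe(audio),
        }]
    }
}
```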

Bumps default max_upload_bytes to 200MB and request_timeout_ms to
600s to support long audio files.
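As a rough sketch, those defaults could be expressed as a config struct like the following. The struct and field names are assumptions matching the values in the commit message, not the PR's actual config code.

```rust
// Hypothetical service config mirroring the defaults described above.
struct ServiceConfig {
    max_upload_bytes: u64,
    request_timeout_ms: u64,
}

impl Default for ServiceConfig {
    fn default() -> Self {
        ServiceConfig {
            max_upload_bytes: 200 * 1024 * 1024, // 200 MB uploads
            request_timeout_ms: 600_000,         // 600 s for long audio
        }
    }
}
```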
  • krystophny force-pushed the feature/single-daemon-openai-stt-api branch from a7cae79 to ec9a74d on April 23, 2026 05:49
  • krystophny changed the title from "feat: single daemon with local OpenAI-compatible STT service" to "feat: add in-process OpenAI-compatible STT API service" on Apr 23, 2026
  • krystophny marked this pull request as ready for review on April 23, 2026 06:22
  • krystophny requested a review from peteonrails as a code owner on April 23, 2026 06:22
  • krystophny marked this pull request as draft on April 23, 2026 06:23


Development

Successfully merging this pull request may close these issues.

Feature: single daemon for hotkey dictation + OpenAI-compatible local STT API
