plugins-deepgram: add final_on_endpoint option to gate FINAL_TRANSCRIPT on speech_final #5505

sindarknave wants to merge 2 commits into livekit:main from
Conversation
…eech_final

Deepgram's streaming API emits `is_final=true` multiple times per spoken utterance as it stabilizes word batches, and the `transcript` field on subsequent `is_final` events can overlap when the model revises earlier words. The plugin currently forwards every `is_final=true` as a `FINAL_TRANSCRIPT`, which splits one utterance into many downstream turn boundaries and surfaces duplicated/overlapping text to consumers doing transcript segmentation or analytics per final event. Add an opt-in `final_on_endpoint` flag (default False, behavior unchanged) that emits `FINAL_TRANSCRIPT` only on `speech_final=true` (endpoint detected) and treats intermediate `is_final=true` as `INTERIM_TRANSCRIPT`. This makes `endpointing_ms` the authoritative control over transcript-segment boundaries. Trade-off: `FINAL_TRANSCRIPT` latency grows by up to `endpointing_ms`.
…nt mode

Addresses review: demoting intermediate `is_final=True` events to `INTERIM_TRANSCRIPT` would drop their text, because the downstream consumer in `audio_recognition.py` overwrites (rather than accumulates) the interim buffer on each INTERIM and clears it on FINAL. Fix: accumulate the text of intermediate `is_final` batches internally on the `SpeechStream` (`self._pending_final_alt`). Emit `INTERIM_TRANSCRIPT` events during the utterance carrying the cumulative text so downstream consumers that overwrite their interim buffer still see the full utterance-so-far, then emit a single `FINAL_TRANSCRIPT` on `speech_final` with the combined text. Deepgram sometimes emits cumulative text (a new batch's transcript re-includes prior words after the model revises an earlier hypothesis) and sometimes emits purely new words. Detect the cumulative case via a prefix check and replace rather than append, to avoid doubling up overlapping text. Reset the accumulator on `speech_final` (success path), `END_OF_SPEECH` (safety), and reconnect (don't splice fragments across a stream).
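The cumulative-vs-append merge described in this commit message can be sketched as follows. This is a minimal, self-contained illustration; `merge_transcripts` is a hypothetical helper name, not the plugin's actual code:

```python
def merge_transcripts(pending, incoming):
    """Merge an intermediate is_final batch into the utterance accumulator.

    Deepgram sometimes re-emits prior words (cumulative batch) and sometimes
    emits only new words; a prefix check distinguishes the two cases so
    overlapping text is not doubled up.
    """
    if not pending:
        return incoming
    if incoming.startswith(pending):
        # Cumulative batch: the new transcript already contains the old
        # words, so replace rather than append.
        return incoming
    # Purely new words: append with a separating space.
    return f"{pending} {incoming}"


# Overlapping case from the PR description: replace, don't concatenate.
print(merge_transcripts("Then you seek", "Then you seek incorrectly."))
# Non-overlapping case: append.
print(merge_transcripts("hello", "world"))
```

Note the prefix check is a heuristic: it catches the common revision pattern where a later batch re-includes earlier words verbatim, but not revisions that change the earlier words themselves.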
```python
if is_endpoint and self._speaking:
    self._speaking = False
    # Safety: clear any partial accumulator after speech ends — the
    # FINAL branch above normally resets it, but this guards against
    # speech_final arriving with no is_final batches accumulated.
    self._pending_final_alt = None
```
🔴 Accumulated transcripts silently lost when endpoint fires with empty text in `final_on_endpoint` mode
When `final_on_endpoint=True`, intermediate `is_final=True` batches are accumulated in `self._pending_final_alt` and only emitted as `FINAL_TRANSCRIPT` when `speech_final=True` (the endpoint). However, the entire accumulation/emission logic is gated by `if len(alts) > 0 and alts[0].text:` (line 723), which is falsy when the endpoint batch carries an empty transcript. Deepgram commonly sends `speech_final=True` with an empty transcript when all words were already finalized in prior batches — exactly the scenario `final_on_endpoint` is designed for. When this happens, the code skips past all the `final_on_endpoint` logic, and then the guard at lines 789-794 clears `self._pending_final_alt = None` without ever emitting the accumulated text. The entire utterance's transcript is silently dropped.
Example scenario

- DG sends `is_final=True, speech_final=False, text="hello"` → accumulated in `_pending_final_alt`
- DG sends `is_final=True, speech_final=False, text="world"` → merged into `_pending_final_alt`
- DG sends `is_final=True, speech_final=True, text=""` → `alts[0].text` is falsy, inner block skipped
- Line 794: `self._pending_final_alt = None` → accumulated "hello world" lost, no FINAL_TRANSCRIPT emitted
Prompt for agents

In `_process_stream_event`, when `final_on_endpoint` is True and `speech_final=True` arrives, the accumulated `_pending_final_alt` must be emitted as a `FINAL_TRANSCRIPT` even if the current batch's text is empty. The fix should be applied near the endpoint handling block (around line 789). Before clearing `_pending_final_alt` and emitting `END_OF_SPEECH`, check if `final_on_endpoint` is True and `_pending_final_alt` is not None. If so, emit a `FINAL_TRANSCRIPT` event using the pending accumulated data. This could be done by moving the endpoint+accumulator flush logic outside the `if len(alts) > 0 and alts[0].text:` guard, or by adding a separate check in the `if is_endpoint and self._speaking:` block that emits the pending data before clearing it.
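The suggested fix can be sketched as a standalone simulation of the event loop (illustrative only, not the plugin's actual code; `process_batches` and its tuple layout are hypothetical): the key is that the endpoint flushes the accumulator regardless of whether the endpoint batch itself carries text.

```python
def process_batches(batches):
    """Simulate final_on_endpoint=True handling.

    Each batch is (is_final, speech_final, text). Returns the list of
    FINAL_TRANSCRIPT texts that would be emitted.
    """
    pending = None
    finals = []
    for is_final, speech_final, text in batches:
        if is_final and text:
            # Accumulate intermediate is_final text (prefix check elided).
            pending = text if pending is None else f"{pending} {text}"
        if speech_final:
            # Endpoint: flush the accumulator even if this batch's text is
            # empty -- gating this on `text` is the reported bug.
            if pending is not None:
                finals.append(pending)
            pending = None
    return finals


# The failing scenario from the review: endpoint arrives with empty text,
# yet the accumulated utterance is still emitted as a single final.
events = [(True, False, "hello"), (True, False, "world"), (True, True, "")]
print(process_batches(events))
```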
```python
start_time = pending.start_time if pending.start_time else incoming.start_time
if incoming.start_time:
    start_time = min(start_time, incoming.start_time) if start_time else incoming.start_time
```
🟡 Falsy check on `start_time` treats valid 0.0 timestamp as missing
In `_merge_speech_data`, line 831 uses `if pending.start_time` to decide whether to use `pending.start_time` or fall back to `incoming.start_time`. Since `start_time` is a float defaulting to 0.0 (stt.py:57), a legitimate start time of 0.0 (the beginning of the audio stream) is treated as falsy/missing, causing the code to incorrectly use `incoming.start_time` (a later timestamp) instead. The same pattern on lines 832-833 (`if incoming.start_time`, `if start_time`) has the same issue. The correct check would use `is not None` or an explicit comparison, but since `start_time` can never be None (it's a float), the intended `pending.start_time is not None` check is always true — so the logic should simply be `min(pending.start_time, incoming.start_time)`.
```diff
-start_time = pending.start_time if pending.start_time else incoming.start_time
-if incoming.start_time:
-    start_time = min(start_time, incoming.start_time) if start_time else incoming.start_time
+start_time = min(pending.start_time, incoming.start_time)
```
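The difference is easy to demonstrate in isolation (illustrative snippet with hypothetical function names, not plugin code): with the falsy check, a pending segment that legitimately starts at 0.0 loses its start time to the later incoming timestamp.

```python
def buggy_merge_start(pending_start, incoming_start):
    # Falsy check: a valid 0.0 start time is treated as "missing".
    start_time = pending_start if pending_start else incoming_start
    if incoming_start:
        start_time = min(start_time, incoming_start) if start_time else incoming_start
    return start_time


def fixed_merge_start(pending_start, incoming_start):
    # start_time is always a float (default 0.0), so a plain min() suffices.
    return min(pending_start, incoming_start)


print(buggy_merge_start(0.0, 2.5))  # 2.5 -- the valid 0.0 start is discarded
print(fixed_merge_start(0.0, 2.5))  # 0.0 -- beginning of stream preserved
```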
Problem

Deepgram's streaming API emits `is_final=true` multiple times per spoken utterance as it stabilizes word batches internally. This is independent of the `speech_final` signal, which indicates a true endpoint (silence past `endpointing_ms`).

The plugin currently forwards every `is_final=true` as `stt.SpeechEventType.FINAL_TRANSCRIPT` (`stt.py` Results branch). Downstream, livekit-agents treats each `FINAL_TRANSCRIPT` as a transcript-segment/turn boundary:

- `_UserTranscriptionOutput.flush()` mints a new segment id on each final, splitting one spoken utterance across multiple `rtc.TranscriptionSegment`s in `RoomEvent.TranscriptionReceived`.
- `_user_input_transcribed` fires multiple `is_final=True` events per utterance.

So one sentence like "I walk into the tavern and look around" becomes 3 bubbles in a transcription UI and 3 analytics events, and `endpointing_ms` doesn't control the boundary at all — it only controls `speech_final`, which the plugin doesn't treat as authoritative.

The `transcript` text across those consecutive `is_final` events can also overlap — we've observed cases where Deepgram revises earlier words, so chunk N+1's transcript re-emits chunk N's words (e.g. `"Then you seek"` followed by `"Then you seek incorrectly."`), which makes naive per-final concatenation double up.

Change
Add an opt-in `final_on_endpoint: bool = False` option to `deepgram.STT(...)`:

- `False` (default): every `is_final=true` emits `FINAL_TRANSCRIPT` (current behavior).
- `True`: `FINAL_TRANSCRIPT` is emitted only on `speech_final=true`. Intermediate `is_final=true` events become `INTERIM_TRANSCRIPT` instead, so downstream sees one final per utterance.

This makes `endpointing_ms` the authoritative control over transcript-segment boundaries, which matches the mental model most consumers have for a "final" transcript.

Trade-off
Opt-in mode increases `FINAL_TRANSCRIPT` latency by up to `endpointing_ms` (you're waiting for the endpoint rather than a within-utterance `is_final`). For apps that want streaming display of stable partials as quickly as possible, the default `False` still fits.

Scope
The option is plumbed through `STTOptions`, `STT.__init__`, `STT.update_options`, and `SpeechStream.update_options` for consistency.
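The gating described under Change reduces to a small decision table. A sketch of that mapping (hypothetical function name and string event labels, not the plugin's internals):

```python
def classify_event(is_final, speech_final, final_on_endpoint):
    """Map a Deepgram result to the transcript event type the plugin emits."""
    if not is_final:
        return "INTERIM_TRANSCRIPT"
    if not final_on_endpoint:
        # Default behavior: every is_final batch is a final transcript.
        return "FINAL_TRANSCRIPT"
    # Opt-in mode: only the endpoint (speech_final) produces a final.
    return "FINAL_TRANSCRIPT" if speech_final else "INTERIM_TRANSCRIPT"


# Default mode: an intermediate is_final still becomes FINAL_TRANSCRIPT.
print(classify_event(True, False, False))
# final_on_endpoint=True: demoted to INTERIM until the endpoint fires.
print(classify_event(True, False, True))
print(classify_event(True, True, True))
```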