plugins-deepgram: add final_on_endpoint option to gate FINAL_TRANSCRIPT on speech_final #5505

Draft
sindarknave wants to merge 2 commits into livekit:main from sindarknave:brian/deepgram-speech-final-boundary

Conversation

@sindarknave
Contributor

Problem

Deepgram's streaming API emits is_final=true multiple times per spoken utterance as it stabilizes word batches internally. This is independent of the speech_final signal, which indicates a true endpoint (silence past endpointing_ms).

The plugin currently forwards every is_final=true as stt.SpeechEventType.FINAL_TRANSCRIPT (stt.py Results branch). Downstream, livekit-agents treats each FINAL_TRANSCRIPT as a transcript-segment/turn boundary:

  • _UserTranscriptionOutput.flush() mints a new segment id on each final, splitting one spoken utterance across multiple rtc.TranscriptionSegments in RoomEvent.TranscriptionReceived.
  • _user_input_transcribed fires multiple is_final=True events per utterance.

So one sentence like "I walk into the tavern and look around" becomes 3 bubbles in a transcription UI and 3 analytics events, and endpointing_ms doesn't control the boundary at all; it only controls speech_final, which the plugin doesn't treat as authoritative.

The transcript text across those consecutive is_final events can also overlap: we've observed cases where Deepgram revises earlier words, so chunk N+1's transcript re-emits chunk N's words (e.g. "Then you seek" followed by "Then you seek incorrectly."), which makes naive per-final concatenation double up.
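
For reference, the relevant fields of two consecutive Results messages for one utterance look roughly like this (abridged sketch; the transcript text is illustrative):

# Abridged Deepgram streaming Results messages for one utterance. Both
# batches carry is_final=True; only the second, after endpointing_ms of
# silence, also carries speech_final=True (the true endpoint).
batch_1 = {
    "is_final": True,
    "speech_final": False,
    "channel": {"alternatives": [{"transcript": "I walk into the tavern"}]},
}
batch_2 = {
    "is_final": True,
    "speech_final": True,
    "channel": {"alternatives": [{"transcript": "and look around"}]},
}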

Change

Add an opt-in final_on_endpoint: bool = False option to deepgram.STT(...):

  • When False (default): behavior is unchanged; every is_final=true emits FINAL_TRANSCRIPT.
  • When True: FINAL_TRANSCRIPT is emitted only on speech_final=true. Intermediate is_final=true events become INTERIM_TRANSCRIPT instead, so downstream sees one final per utterance.

This makes endpointing_ms the authoritative control over transcript-segment boundaries, which matches the mental model most consumers have for a "final" transcript.
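
Opting in would look roughly like this (sketch; the 500 ms value and any arguments other than final_on_endpoint are just examples against the existing plugin API):

from livekit.plugins import deepgram

# One FINAL_TRANSCRIPT per utterance, gated on speech_final. endpointing_ms
# now also bounds where the transcript segment closes.
stt_impl = deepgram.STT(
    endpointing_ms=500,
    final_on_endpoint=True,
)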

Trade-off

Opt-in mode increases FINAL_TRANSCRIPT latency by up to endpointing_ms (you're waiting for the endpoint rather than a within-utterance is_final). For apps that want to stream stable partials to a display as quickly as possible, the default False still fits.

Scope

  • No behavior change when the flag is unset.
  • Option is plumbed through STTOptions, STT.__init__, STT.update_options, and SpeechStream.update_options for consistency.
  • No test added: the plugin doesn't currently have unit tests for the Results handler; happy to add one if you point me at the right harness.

…eech_final

Deepgram's streaming API emits `is_final=true` multiple times per spoken
utterance as it stabilizes word batches, and the `transcript` field on
subsequent `is_final` events can overlap when the model revises earlier
words. The plugin currently forwards every `is_final=true` as a
`FINAL_TRANSCRIPT`, which splits one utterance into many downstream turn
boundaries and surfaces duplicated/overlapping text to consumers doing
transcript segmentation or analytics per final event.

Add an opt-in `final_on_endpoint` flag (default False, behavior
unchanged) that emits `FINAL_TRANSCRIPT` only on `speech_final=true`
(endpoint detected) and treats intermediate `is_final=true` as
`INTERIM_TRANSCRIPT`. This makes `endpointing_ms` the authoritative
control over transcript-segment boundaries. Trade-off: `FINAL_TRANSCRIPT`
latency grows by up to `endpointing_ms`.
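
A minimal sketch of that gating as a pure mapping (the helper below is hypothetical; the real change lives inside the plugin's Results handler):

from livekit.agents import stt

def event_type_for_batch(
    is_final: bool, speech_final: bool, final_on_endpoint: bool
) -> stt.SpeechEventType:
    # Default path (flag off): every is_final batch closes a segment, as today.
    if not final_on_endpoint:
        return (
            stt.SpeechEventType.FINAL_TRANSCRIPT
            if is_final
            else stt.SpeechEventType.INTERIM_TRANSCRIPT
        )
    # Opt-in path: only the endpoint (speech_final) closes the segment;
    # intermediate is_final batches are demoted to interim updates.
    if is_final and speech_final:
        return stt.SpeechEventType.FINAL_TRANSCRIPT
    return stt.SpeechEventType.INTERIM_TRANSCRIPT
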
devin-ai-integration[bot]

This comment was marked as resolved.

…nt mode

Addresses review: demoting intermediate is_final=True events to
INTERIM_TRANSCRIPT would drop their text, because the downstream consumer
in audio_recognition.py overwrites (rather than accumulates) the interim
buffer on each INTERIM and clears it on FINAL.

Fix: accumulate the text of intermediate is_final batches internally on
the SpeechStream (self._pending_final_alt). Emit INTERIM_TRANSCRIPT
events during the utterance carrying the cumulative text so downstream
consumers that overwrite their interim buffer still see the full
utterance-so-far, then emit a single FINAL_TRANSCRIPT on speech_final
with the combined text.

Deepgram sometimes emits cumulative text (a new batch's transcript
re-includes prior words after the model revises an earlier hypothesis)
and sometimes emits purely new words. Detect the cumulative case via a
prefix check and replace-rather-than-append, to avoid doubling up
overlapping text.

Reset the accumulator on speech_final (success path), END_OF_SPEECH
(safety), and reconnect (don't splice fragments across a stream).
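
A minimal sketch of the overlap-aware accumulation described above (standalone helper for illustration only; the actual change keeps this state on the SpeechStream as _pending_final_alt):

def merge_utterance_text(pending: str | None, incoming: str) -> str:
    # First is_final batch of the utterance: start the accumulator.
    if not pending:
        return incoming
    # Cumulative case: Deepgram re-emitted prior words after revising an
    # earlier hypothesis, so the new transcript already contains the old
    # one. Replace rather than append to avoid doubling up.
    if incoming.startswith(pending):
        return incoming
    # Otherwise the batch carries only new words: append.
    return f"{pending} {incoming}"

# e.g. merge_utterance_text("Then you seek", "Then you seek incorrectly.")
#      -> "Then you seek incorrectly."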

@devin-ai-integration (bot) left a comment


Devin Review found 2 new potential issues.

View 7 additional findings in Devin Review.


Comment on lines 789 to +794
if is_endpoint and self._speaking:
    self._speaking = False
    # Safety: clear any partial accumulator after speech ends -- the
    # FINAL branch above normally resets it, but this guards against
    # speech_final arriving with no is_final batches accumulated.
    self._pending_final_alt = None

🔴 Accumulated transcripts silently lost when endpoint fires with empty text in final_on_endpoint mode

When final_on_endpoint=True, intermediate is_final=True batches are accumulated in self._pending_final_alt and only emitted as FINAL_TRANSCRIPT when speech_final=True (the endpoint). However, the entire accumulation/emission logic is gated by if len(alts) > 0 and alts[0].text: (line 723), which is falsy when the endpoint batch carries an empty transcript. Deepgram commonly sends speech_final=True with an empty transcript when all words were already finalized in prior batches, which is exactly the scenario final_on_endpoint is designed for. When this happens, the code skips past all the final_on_endpoint logic, and then the guard at lines 789-794 clears self._pending_final_alt = None without ever emitting the accumulated text. The entire utterance's transcript is silently dropped.

Example scenario
  1. DG sends is_final=True, speech_final=False, text="hello" → accumulated in _pending_final_alt
  2. DG sends is_final=True, speech_final=False, text="world" → merged into _pending_final_alt
  3. DG sends is_final=True, speech_final=True, text="" → alts[0].text is falsy, inner block skipped
  4. Line 794: self._pending_final_alt = None → accumulated "hello world" lost, no FINAL_TRANSCRIPT emitted
Prompt for agents
In _process_stream_event, when final_on_endpoint is True and speech_final=True arrives, the accumulated _pending_final_alt must be emitted as a FINAL_TRANSCRIPT even if the current batch's text is empty. The fix should be applied near the endpoint handling block (around line 789). Before clearing _pending_final_alt and emitting END_OF_SPEECH, check if final_on_endpoint is True and _pending_final_alt is not None. If so, emit a FINAL_TRANSCRIPT event using the pending accumulated data. This could be done by moving the endpoint+accumulator flush logic outside the `if len(alts) > 0 and alts[0].text:` guard, or by adding a separate check in the `if is_endpoint and self._speaking:` block that emits the pending data before clearing it.
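
One possible shape of that fix (sketch only; self._opts and _emit_final are stand-ins, not the plugin's actual names):

if is_endpoint and self._speaking:
    self._speaking = False
    # Flush the accumulator before clearing it: an endpoint batch with an
    # empty transcript must still produce the utterance's FINAL_TRANSCRIPT
    # from the is_final batches accumulated earlier.
    if self._opts.final_on_endpoint and self._pending_final_alt is not None:
        self._emit_final(self._pending_final_alt)  # hypothetical emit helper
    self._pending_final_alt = None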

Comment on lines +831 to +833
start_time = pending.start_time if pending.start_time else incoming.start_time
if incoming.start_time:
    start_time = min(start_time, incoming.start_time) if start_time else incoming.start_time

🟑 Falsy check on start_time treats valid 0.0 timestamp as missing

In _merge_speech_data, line 831 uses if pending.start_time to decide whether to use pending.start_time or fall back to incoming.start_time. Since start_time is a float defaulting to 0.0 (stt.py:57), a legitimate start time of 0.0 (the beginning of the audio stream) is treated as falsy/missing, causing the code to incorrectly use incoming.start_time (a later timestamp) instead. The same pattern on lines 832-833 (if incoming.start_time, if start_time) has the same issue. The correct check would be is not None or an explicit comparison, but since start_time can never be None (it's a float), the intent is likely pending.start_time is not None, which is always true; so the logic should simply be min(pending.start_time, incoming.start_time).

Suggested change

-start_time = pending.start_time if pending.start_time else incoming.start_time
-if incoming.start_time:
-    start_time = min(start_time, incoming.start_time) if start_time else incoming.start_time
+start_time = min(pending.start_time, incoming.start_time)

@sindarknave marked this pull request as draft on April 21, 2026 at 18:54
