plugins-deepgram: add final_on_endpoint option to gate FINAL_TRANSCRIPT on speech_final #5505

sindarknave wants to merge 2 commits into livekit:main from
Conversation
…eech_final

Deepgram's streaming API emits `is_final=true` multiple times per spoken utterance as it stabilizes word batches, and the `transcript` field on subsequent `is_final` events can overlap when the model revises earlier words. The plugin currently forwards every `is_final=true` as a `FINAL_TRANSCRIPT`, which splits one utterance into many downstream turn boundaries and surfaces duplicated/overlapping text to consumers doing transcript segmentation or analytics per final event. Add an opt-in `final_on_endpoint` flag (default False, behavior unchanged) that emits `FINAL_TRANSCRIPT` only on `speech_final=true` (endpoint detected) and treats intermediate `is_final=true` as `INTERIM_TRANSCRIPT`. This makes `endpointing_ms` the authoritative control over transcript-segment boundaries. Trade-off: `FINAL_TRANSCRIPT` latency grows by up to `endpointing_ms`.
…nt mode

Addresses review: demoting intermediate `is_final=True` events to `INTERIM_TRANSCRIPT` would drop their text, because the downstream consumer in `audio_recognition.py` overwrites (rather than accumulates) the interim buffer on each INTERIM and clears it on FINAL. Fix: accumulate the text of intermediate `is_final` batches internally on the `SpeechStream` (`self._pending_final_alt`). Emit `INTERIM_TRANSCRIPT` events during the utterance carrying the cumulative text so downstream consumers that overwrite their interim buffer still see the full utterance-so-far, then emit a single `FINAL_TRANSCRIPT` on `speech_final` with the combined text. Deepgram sometimes emits cumulative text (a new batch's transcript re-includes prior words after the model revises an earlier hypothesis) and sometimes emits purely new words. Detect the cumulative case via a prefix check and replace rather than append, to avoid doubling up overlapping text. Reset the accumulator on `speech_final` (success path), `END_OF_SPEECH` (safety), and reconnect (don't splice fragments across a stream).
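The cumulative-vs-append merge described in this commit message can be sketched as follows. This is a minimal, self-contained illustration; `merge_transcripts` is a hypothetical helper name, not the plugin's actual code:

```python
def merge_transcripts(pending, incoming):
    """Merge an intermediate is_final batch into the utterance accumulator.

    Deepgram sometimes re-emits prior words (cumulative batch) and sometimes
    emits only new words; a prefix check distinguishes the two cases so
    overlapping text is not doubled up.
    """
    if not pending:
        return incoming
    if incoming.startswith(pending):
        # Cumulative batch: the new transcript already contains the old
        # words, so replace rather than append.
        return incoming
    # Purely new words: append with a separating space.
    return f"{pending} {incoming}"


# Overlapping case from the PR description: replace, don't concatenate.
print(merge_transcripts("Then you seek", "Then you seek incorrectly."))
# Non-overlapping case: append.
print(merge_transcripts("hello", "world"))
```

Note the prefix check is a heuristic: it catches the common revision pattern where a later batch re-includes earlier words verbatim, but not revisions that change the earlier words themselves.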
```python
if is_endpoint and self._speaking:
    self._speaking = False
    # Safety: clear any partial accumulator after speech ends — the
    # FINAL branch above normally resets it, but this guards against
    # speech_final arriving with no is_final batches accumulated.
    self._pending_final_alt = None
```
🔴 Accumulated transcripts silently lost when endpoint fires with empty text in `final_on_endpoint` mode
When `final_on_endpoint=True`, intermediate `is_final=True` batches are accumulated in `self._pending_final_alt` and only emitted as `FINAL_TRANSCRIPT` when `speech_final=True` (the endpoint). However, the entire accumulation/emission logic is gated by `if len(alts) > 0 and alts[0].text:` (line 723), which is falsy when the endpoint batch carries an empty transcript. Deepgram commonly sends `speech_final=True` with an empty transcript when all words were already finalized in prior batches — exactly the scenario `final_on_endpoint` is designed for. When this happens, the code skips past all the `final_on_endpoint` logic, and then the guard at lines 789-794 clears `self._pending_final_alt = None` without ever emitting the accumulated text. The entire utterance's transcript is silently dropped.
Example scenario

- DG sends `is_final=True, speech_final=False, text="hello"` → accumulated in `_pending_final_alt`
- DG sends `is_final=True, speech_final=False, text="world"` → merged into `_pending_final_alt`
- DG sends `is_final=True, speech_final=True, text=""` → `alts[0].text` is falsy, inner block skipped
- Line 794: `self._pending_final_alt = None` → accumulated "hello world" lost, no FINAL_TRANSCRIPT emitted
Prompt for agents

In `_process_stream_event`, when `final_on_endpoint` is True and `speech_final=True` arrives, the accumulated `_pending_final_alt` must be emitted as a `FINAL_TRANSCRIPT` even if the current batch's text is empty. The fix should be applied near the endpoint handling block (around line 789). Before clearing `_pending_final_alt` and emitting `END_OF_SPEECH`, check if `final_on_endpoint` is True and `_pending_final_alt` is not None. If so, emit a `FINAL_TRANSCRIPT` event using the pending accumulated data. This could be done by moving the endpoint+accumulator flush logic outside the `if len(alts) > 0 and alts[0].text:` guard, or by adding a separate check in the `if is_endpoint and self._speaking:` block that emits the pending data before clearing it.
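The suggested fix can be sketched as a standalone simulation of the event loop (illustrative only, not the plugin's actual code; `process_batches` and its tuple layout are hypothetical): the key is that the endpoint flushes the accumulator regardless of whether the endpoint batch itself carries text.

```python
def process_batches(batches):
    """Simulate final_on_endpoint=True handling.

    Each batch is (is_final, speech_final, text). Returns the list of
    FINAL_TRANSCRIPT texts that would be emitted.
    """
    pending = None
    finals = []
    for is_final, speech_final, text in batches:
        if is_final and text:
            # Accumulate intermediate is_final text (prefix check elided).
            pending = text if pending is None else f"{pending} {text}"
        if speech_final:
            # Endpoint: flush the accumulator even if this batch's text is
            # empty -- gating this on `text` is the reported bug.
            if pending is not None:
                finals.append(pending)
            pending = None
    return finals


# The failing scenario from the review: endpoint arrives with empty text,
# yet the accumulated utterance is still emitted as a single final.
events = [(True, False, "hello"), (True, False, "world"), (True, True, "")]
print(process_batches(events))
```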
```python
start_time = pending.start_time if pending.start_time else incoming.start_time
if incoming.start_time:
    start_time = min(start_time, incoming.start_time) if start_time else incoming.start_time
```
🟡 Falsy check on `start_time` treats valid 0.0 timestamp as missing
In `_merge_speech_data`, line 831 uses `if pending.start_time` to decide whether to use `pending.start_time` or fall back to `incoming.start_time`. Since `start_time` is a float defaulting to 0.0 (stt.py:57), a legitimate start time of 0.0 (the beginning of the audio stream) is treated as falsy/missing, causing the code to incorrectly use `incoming.start_time` (a later timestamp) instead. The same pattern on lines 832-833 (`if incoming.start_time`, `if start_time`) has the same issue. The correct check would use `is not None` or an explicit comparison, but since `start_time` can never be None (it's a float), the intended `pending.start_time is not None` check is always true — so the logic should simply be `min(pending.start_time, incoming.start_time)`.
```diff
-start_time = pending.start_time if pending.start_time else incoming.start_time
-if incoming.start_time:
-    start_time = min(start_time, incoming.start_time) if start_time else incoming.start_time
+start_time = min(pending.start_time, incoming.start_time)
```
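The difference is easy to demonstrate in isolation (illustrative snippet with hypothetical function names, not plugin code): with the falsy check, a pending segment that legitimately starts at 0.0 loses its start time to the later incoming timestamp.

```python
def buggy_merge_start(pending_start, incoming_start):
    # Falsy check: a valid 0.0 start time is treated as "missing".
    start_time = pending_start if pending_start else incoming_start
    if incoming_start:
        start_time = min(start_time, incoming_start) if start_time else incoming_start
    return start_time


def fixed_merge_start(pending_start, incoming_start):
    # start_time is always a float (default 0.0), so a plain min() suffices.
    return min(pending_start, incoming_start)


print(buggy_merge_start(0.0, 2.5))  # 2.5 -- the valid 0.0 start is discarded
print(fixed_merge_start(0.0, 2.5))  # 0.0 -- beginning of stream preserved
```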
Problem

Deepgram's streaming API emits `is_final=true` multiple times per spoken utterance as it stabilizes word batches internally. This is independent of the `speech_final` signal, which indicates a true endpoint (silence past `endpointing_ms`).

The plugin currently forwards every `is_final=true` as `stt.SpeechEventType.FINAL_TRANSCRIPT` (`stt.py` Results branch). Downstream, livekit-agents treats each `FINAL_TRANSCRIPT` as a transcript-segment/turn boundary:

- `_UserTranscriptionOutput.flush()` mints a new segment id on each final, splitting one spoken utterance across multiple `rtc.TranscriptionSegment`s in `RoomEvent.TranscriptionReceived`.
- `_user_input_transcribed` fires multiple `is_final=True` events per utterance.

So one sentence like "I walk into the tavern and look around" becomes 3 bubbles in a transcription UI and 3 analytics events, and `endpointing_ms` doesn't control the boundary at all — it only controls `speech_final`, which the plugin doesn't treat as authoritative.

The `transcript` text across those consecutive `is_final` events can also overlap — we've observed cases where Deepgram revises earlier words, so chunk N+1's transcript re-emits chunk N's words (e.g. `"Then you seek"` followed by `"Then you seek incorrectly."`), which makes naive per-final concatenation double up.

Change
Add an opt-in `final_on_endpoint: bool = False` option to `deepgram.STT(...)`:

- `False` (default): every `is_final=true` emits `FINAL_TRANSCRIPT` (current behavior).
- `True`: `FINAL_TRANSCRIPT` is emitted only on `speech_final=true`. Intermediate `is_final=true` events become `INTERIM_TRANSCRIPT` instead, so downstream sees one final per utterance.

This makes `endpointing_ms` the authoritative control over transcript-segment boundaries, which matches the mental model most consumers have for a "final" transcript.

Trade-off
Opt-in mode increases `FINAL_TRANSCRIPT` latency by up to `endpointing_ms` (you're waiting for the endpoint rather than a within-utterance `is_final`). For apps that want streaming display of stable partials as quickly as possible, the default `False` still fits.

Scope
The option is plumbed through `STTOptions`, `STT.__init__`, `STT.update_options`, and `SpeechStream.update_options` for consistency.
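The gating described under Change reduces to a small decision table. A sketch of that mapping (hypothetical function name and string event labels, not the plugin's internals):

```python
def classify_event(is_final, speech_final, final_on_endpoint):
    """Map a Deepgram result to the transcript event type the plugin emits."""
    if not is_final:
        return "INTERIM_TRANSCRIPT"
    if not final_on_endpoint:
        # Default behavior: every is_final batch is a final transcript.
        return "FINAL_TRANSCRIPT"
    # Opt-in mode: only the endpoint (speech_final) produces a final.
    return "FINAL_TRANSCRIPT" if speech_final else "INTERIM_TRANSCRIPT"


# Default mode: an intermediate is_final still becomes FINAL_TRANSCRIPT.
print(classify_event(True, False, False))
# final_on_endpoint=True: demoted to INTERIM until the endpoint fires.
print(classify_event(True, False, True))
print(classify_event(True, True, True))
```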