-
Notifications
You must be signed in to change notification settings - Fork 3.1k
feat(stt): back-date START_OF_SPEECH onset via server-provided timestamp #5479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
gsharp-aai
wants to merge
10
commits into
livekit:main
Choose a base branch
from
gsharp-aai:gsharp/stt-speech-start-time-fallback
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
10 commits
Select commit
Hold shift + click to select a range
42b5625
feat(stt): back-date START_OF_SPEECH onset via server-provided timestamp
gsharp-aai f847401
chore: tighten SpeechEvent.speech_start_time docstring, drop changeset
gsharp-aai d9f40a7
fix(stt): only use STT-provided onset when VAD hasn't already set it
gsharp-aai 61b6ec1
refactor(stt): make speech_start_time required, compute at call sites
gsharp-aai ac2429a
fix(assemblyai): reset _stream_wall_start on reconnect; accept timest…
gsharp-aai fa991e1
chore(assemblyai): drop private PR ref and model-specific latency num…
gsharp-aai 2cbdc10
chore(stt): tighten STT SOS handler comment
gsharp-aai 6060b87
fix(stt): pass back-dated onset to user-turn tracing span
gsharp-aai 9195c4d
refactor(stt): add start_time on base SpeechStream, migrate AAI plugin
gsharp-aai 8839283
test(stt): cover start_time field on base SpeechStream + AAI plugin
gsharp-aai File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,115 @@ | ||
| """Unit tests for base STT `RecognizeStream` fields (start_time, etc.).""" | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import asyncio | ||
| import time | ||
|
|
||
| import pytest | ||
|
|
||
| from livekit.agents import APIConnectionError | ||
| from livekit.agents.stt import ( | ||
| STT, | ||
| RecognizeStream, | ||
| SpeechData, | ||
| SpeechEvent, | ||
| SpeechEventType, | ||
| STTCapabilities, | ||
| ) | ||
| from livekit.agents.types import DEFAULT_API_CONNECT_OPTIONS | ||
| from livekit.agents.utils.audio import AudioBuffer | ||
|
|
||
|
|
||
| class _DummyStream(RecognizeStream): | ||
| """Minimal RecognizeStream for unit tests — does not hit the network.""" | ||
|
|
||
| def __init__( | ||
| self, | ||
| *, | ||
| stt: STT, | ||
| fail_first_run: bool = False, | ||
| ) -> None: | ||
| super().__init__(stt=stt, conn_options=DEFAULT_API_CONNECT_OPTIONS) | ||
| self._fail_first_run = fail_first_run | ||
| self._run_count = 0 | ||
|
|
||
| async def _run(self) -> None: | ||
| self._run_count += 1 | ||
| if self._fail_first_run and self._run_count == 1: | ||
| raise APIConnectionError("fake failure to trigger retry") | ||
| # emit a final and exit so _main_task can complete normally | ||
| self._event_ch.send_nowait( | ||
| SpeechEvent( | ||
| type=SpeechEventType.FINAL_TRANSCRIPT, | ||
| alternatives=[SpeechData(language="", text="hello")], | ||
| ) | ||
| ) | ||
|
|
||
|
|
||
| class _DummySTT(STT): | ||
| def __init__(self) -> None: | ||
| super().__init__(capabilities=STTCapabilities(streaming=True, interim_results=False)) | ||
|
|
||
| async def _recognize_impl(self, buffer: AudioBuffer, *, language, conn_options) -> SpeechEvent: | ||
| raise NotImplementedError | ||
|
|
||
| def stream(self, *, language=None, conn_options=DEFAULT_API_CONNECT_OPTIONS) -> _DummyStream: | ||
| return _DummyStream(stt=self) | ||
|
|
||
|
|
||
| async def test_start_time_seeded_on_init() -> None: | ||
| """start_time is initialized to approximately time.time() when the stream is created.""" | ||
| stt = _DummySTT() | ||
| before = time.time() | ||
| stream = stt.stream() | ||
| after = time.time() | ||
|
|
||
| assert before <= stream.start_time <= after | ||
| await stream.aclose() | ||
|
|
||
|
|
||
| async def test_start_time_setter_accepts_valid_values() -> None: | ||
| """Plugins can override start_time by assigning to the public property.""" | ||
| stt = _DummySTT() | ||
| stream = stt.stream() | ||
|
|
||
| new_anchor = time.time() + 10.0 | ||
| stream.start_time = new_anchor | ||
| assert stream.start_time == new_anchor | ||
|
|
||
| await stream.aclose() | ||
|
|
||
|
|
||
| async def test_start_time_setter_rejects_negative() -> None: | ||
| """start_time setter validates non-negative, matching start_time_offset behavior.""" | ||
| stt = _DummySTT() | ||
| stream = stt.stream() | ||
|
|
||
| with pytest.raises(ValueError, match="start_time must be non-negative"): | ||
| stream.start_time = -1.0 | ||
|
|
||
| await stream.aclose() | ||
|
|
||
|
|
||
| async def test_start_time_reseeded_on_retry() -> None: | ||
| """When _main_task retries after an APIError, start_time is re-seeded so plugin | ||
| overrides from the previous connection don't leak into the new one.""" | ||
| stt = _DummySTT() | ||
| stream = _DummyStream(stt=stt, fail_first_run=True) | ||
|
|
||
| # Simulate a plugin overriding start_time during the first (failing) _run() | ||
| # by assigning a sentinel value before the task picks up. | ||
| sentinel = 1.0 | ||
| stream.start_time = sentinel | ||
|
|
||
| # Let the main task run: it should retry past the first-run APIError, and | ||
| # on each attempt re-seed start_time to a fresh time.time() value before | ||
| # _run() is called. | ||
| await asyncio.wait_for(stream._task, timeout=5.0) | ||
|
|
||
| # After the retry, start_time must have been re-seeded (not equal to sentinel). | ||
| assert stream.start_time != sentinel | ||
| # And it should be a recent wall-clock value. | ||
| assert time.time() - stream.start_time < 5.0 | ||
|
|
||
| await stream.aclose() |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can add a condition where:
self._speech_start_time = ev.speech_start_time if ev.speech_start_time < self._speech_start_time else self._speech_start_timefor when the vad detects activity before the stt as well
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open to this! Just want to flag it changes behavior from the current PR. Two shapes:
_speech_start_timeis only set from STT when VAD hasn't already set it. VAD wins when it fires, preserving current behavior.I leaned toward #1 since local VAD's back-date is usually more accurate than the server timestamp (no network delay, no clock skew) plus less of a behavioral change (in relation to what currently exists), but happy to flip to #2 if you think the "STT caught it earlier" case is common enough to trust by default.
Let me know which shape the team prefers!