feat(stt): back-date START_OF_SPEECH onset via server-provided timestamp #5479
base: main
Changes from 8 commits
42b5625
f847401
d9f40a7
61b6ec1
ac2429a
fa991e1
2cbdc10
6060b87
9195c4d
8839283
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,6 +19,7 @@ | |
| import dataclasses | ||
| import json | ||
| import os | ||
| import time | ||
| import weakref | ||
| from dataclasses import dataclass | ||
| from typing import Literal | ||
|
|
@@ -282,6 +283,11 @@ def __init__( | |
| self._config_update_queue: asyncio.Queue[dict] = asyncio.Queue() | ||
| self._session_id: str | None = None | ||
| self._expires_at: int | None = None | ||
| # Wall-clock time (time.time()) when the first audio frame was sent to | ||
| # the server. Used to convert the server's stream-relative timestamp | ||
| # (returned in SpeechStarted.timestamp) into a wall-clock time so the | ||
| # framework can back-date _speech_start_time on START_OF_SPEECH. | ||
| self._stream_wall_start: float | None = None | ||
|
|
||
| @property | ||
| def session_id(self) -> str | None: | ||
|
|
@@ -356,6 +362,10 @@ def force_endpoint(self) -> None: | |
|
|
||
| async def _run(self) -> None: | ||
| """Run a single websocket connection to AssemblyAI.""" | ||
| # Reset on each (re)connection — the server's stream-relative timestamps | ||
| # restart at 0 with every new WebSocket, so the wall-clock anchor must | ||
| # also be re-captured from this connection's first frame. | ||
| self._stream_wall_start = None | ||
|
Member
We have a field called start_time_offset in the STT stream that plays a similar role, and it is assigned when the stream is initialized:
stream.start_time_offset = time.time() - _audio_input_started_at
I think we can add a second field
Contributor
Author
Good call! The key consideration is that, I believe, a "server-provided onset timestamp" is anchored to whatever zero-point the provider defines, which will vary with each provider's server-side implementation. Because of that, the framework can't reliably pin a single wall-clock moment that aligns with every provider's "zero" simultaneously; each plugin knows its own server's semantics and should probably own the anchoring moment.
What about putting the field on the base class (shared, discoverable, adoptable by other plugins), seeding a framework default at init so plugins that don't override it still get some value, and letting each plugin overwrite it at whatever moment corresponds to its own server's zero? The framework can handle resetting it on retries centrally, following the same pattern as start_time_offset.
Shape:
# base class SpeechStream
self._start_time: float = time.time() # framework default
@property
def start_time(self) -> float: ...
@start_time.setter
def start_time(self, value: float) -> None: ...
# Plus a reset in _main_task across retries, same pattern as start_time_offset.
What do you think?
Edit: updated to seed a framework default and let plugins overwrite it, instead of leaving it as purely plugin-set.
Member
That sounds reasonable. The framework provides a default, and plugins can override it if needed.
Contributor
Author
Updated the PR to reflect this. |
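The shape agreed in this thread can be sketched roughly as below. This is only an illustration under stated assumptions: the class names SpeechStreamSketch and PluginStreamSketch, the on_first_frame_sent hook, and the onset_wall_time helper are hypothetical and are not part of the framework or the plugin; the reset across retries mentioned above is not shown.

```python
from __future__ import annotations

import time


class SpeechStreamSketch:
    """Hypothetical stand-in for the base stream class (not the framework's code)."""

    def __init__(self) -> None:
        # Framework default: anchor the stream at construction time.
        self._start_time: float = time.time()

    @property
    def start_time(self) -> float:
        return self._start_time

    @start_time.setter
    def start_time(self, value: float) -> None:
        self._start_time = value

    def onset_wall_time(self, timestamp_ms: float) -> float:
        """Convert a stream-relative timestamp (ms) into wall-clock seconds."""
        return self._start_time + timestamp_ms / 1000


class PluginStreamSketch(SpeechStreamSketch):
    """Hypothetical plugin that re-anchors at its own server's zero point."""

    def on_first_frame_sent(self, now: float | None = None) -> None:
        # Assumption: this server counts time from the first audio frame,
        # so the plugin overwrites the framework default at that moment.
        self.start_time = time.time() if now is None else now
```

With the anchor overwritten at 100.0 s, a server timestamp of 1500 ms maps to a wall-clock onset of 101.5 s.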
||
| closing_ws = False | ||
|
|
||
| async def send_task(ws: aiohttp.ClientWebSocketResponse) -> None: | ||
|
|
@@ -378,6 +388,9 @@ async def send_task(ws: aiohttp.ClientWebSocketResponse) -> None: | |
| frames = audio_bstream.write(data.data.tobytes()) | ||
|
|
||
| for frame in frames: | ||
| if self._stream_wall_start is None: | ||
| # Anchor wall-clock time at first audio frame sent. | ||
| self._stream_wall_start = time.time() | ||
|
gsharp-aai marked this conversation as resolved.
Outdated
|
||
| self._speech_duration += frame.duration | ||
| await ws.send_bytes(frame.data.tobytes()) | ||
|
|
||
|
|
@@ -518,7 +531,21 @@ def _process_stream_event(self, data: dict) -> None: | |
| return | ||
|
|
||
| if message_type == "SpeechStarted": | ||
| self._event_ch.send_nowait(stt.SpeechEvent(type=stt.SpeechEventType.START_OF_SPEECH)) | ||
| # SpeechStarted can arrive well after actual speech onset. The | ||
| # `timestamp` field carries the server VAD's onset time in stream- | ||
| # relative ms. Convert to wall-clock by adding _stream_wall_start | ||
| # (recorded when the first audio frame was sent) so the framework | ||
| # records an accurate _speech_start_time instead of message arrival. | ||
| timestamp_ms = data.get("timestamp") | ||
| speech_start_time: float | None = None | ||
| if timestamp_ms is not None and self._stream_wall_start is not None: | ||
| speech_start_time = self._stream_wall_start + timestamp_ms / 1000 | ||
| self._event_ch.send_nowait( | ||
| stt.SpeechEvent( | ||
| type=stt.SpeechEventType.START_OF_SPEECH, | ||
| speech_start_time=speech_start_time, | ||
| ) | ||
| ) | ||
| return | ||
|
|
||
| if message_type == "Termination": | ||
|
|
||
Maybe we can add a condition where:
self._speech_start_time = ev.speech_start_time if ev.speech_start_time < self._speech_start_time else self._speech_start_time
for when the VAD detects activity before the STT as well.
Open to this! I just want to flag that it changes behavior relative to the current PR. Two shapes:
_speech_start_time is only set from STT when VAD hasn't already set it. VAD wins when it fires, preserving current behavior.
I leaned toward #1 since local VAD's back-date is usually more accurate than the server timestamp (no network delay, no clock skew) and it is less of a behavioral change relative to what currently exists, but I'm happy to flip to #2 if you think the "STT caught it earlier" case is common enough to trust by default.
Let me know which shape the team prefers!
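For comparison, the two policies discussed in this thread can be sketched as plain functions. The names vad_wins and earliest_wins are hypothetical and this is not the framework's code; the first keeps an onset that is already set (for example by local VAD), while the second takes the earlier of the two timestamps, as in the reviewer's suggested condition, with None handling added as an assumption.

```python
from __future__ import annotations


def vad_wins(current: float | None, stt_onset: float | None) -> float | None:
    """Keep an onset that is already set; fall back to the STT onset otherwise."""
    if current is not None:
        return current
    return stt_onset


def earliest_wins(current: float | None, stt_onset: float | None) -> float | None:
    """Take the earlier of the two onsets, tolerating a missing value on either side."""
    if current is None:
        return stt_onset
    if stt_onset is None:
        return current
    return min(current, stt_onset)
```

The two functions differ only when both values are present and the STT onset is earlier: vad_wins keeps the existing value, while earliest_wins moves the onset back.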