feat(audio): add stream audio encoder for turn detection #5494
chenghao-mou wants to merge 5 commits into main
Added a stream audio encoder for turn detection, supporting opus, mp3, and pcm
    return data

class AudioStreamEncoder:
We should encode in another thread, like we do for our AudioDecoder
I thought about this before, but I see a difference here:
- Decoder: we need a thread so that the blocking read() wait doesn't stall the event loop.
- Encoder: the caller pushes data (calling encode() when we have a frame), so there is no blocking wait and no thread is needed.
I can create a threaded version and show some benchmarks.
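For readers following the thread, here is a minimal sketch of what a threaded push-style encoder could look like. It is not the code from this PR: the class name `ThreadedEncoderSketch`, its `push()`/`close()` methods, and the use of `zlib` as a stand-in codec are all assumptions made for illustration. The point is only that `push()` enqueues and returns, while the actual encoding happens on a worker thread.

```python
import queue
import threading
import zlib
from typing import List


class ThreadedEncoderSketch:
    """Hypothetical push-style encoder; encoding runs on a worker thread."""

    _STOP = object()  # sentinel that tells the worker to exit

    def __init__(self) -> None:
        self._inbox: queue.Queue = queue.Queue()
        self._outbox: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def push(self, pcm: bytes) -> None:
        # Cheap for the caller: only enqueues, never encodes.
        self._inbox.put(pcm)

    def close(self) -> List[bytes]:
        # Signal end of input, wait for the worker, then drain the output.
        self._inbox.put(self._STOP)
        self._worker.join()
        chunks = []
        while not self._outbox.empty():
            chunks.append(self._outbox.get_nowait())
        return chunks

    def _run(self) -> None:
        compressor = zlib.compressobj()  # stand-in codec, NOT Opus
        while True:
            item = self._inbox.get()
            if item is self._STOP:
                break
            out = compressor.compress(item)
            if out:  # the codec may buffer and emit nothing for a frame
                self._outbox.put(out)
        self._outbox.put(compressor.flush())
```

The trade-off is the one the benchmark is meant to measure: the caller's push gets cheaper, at the cost of a queue hop before the first output appears.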
Here are the results:
| metric | sync | threaded |
|---|---|---|
| push mean | 1,186 us | 40 us |
| push p95 | 2,053 us | 45 us |
| push max | 4,303 us | 728 us |
| first page | 4.0 ms | 10.0 ms |
| inter-page mean | 990 ms | 989 ms |
| inter-page median | 990 ms | 990 ms |
| pages / bytes | 7 / 1520 | 7 / 1520 |
The threaded version adds about 6 ms to the first page, but the differences are pretty much invisible under real-time load (60 ms input frames; Opus needs about 16 frames per page).
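A rough back-of-the-envelope check of those numbers, using only the figures from the table and the comment. The helper `share_of_frame_budget` is a hypothetical name introduced here, and "16 frames per page" is the approximate figure quoted above, not a measured constant.

```python
FRAME_MS = 60.0        # input frame duration stated in the comment
FRAMES_PER_PAGE = 16   # "about 16 frames for a page" (approximate)


def share_of_frame_budget(push_us: float, frame_ms: float = FRAME_MS) -> float:
    """Fraction of one frame's real-time duration spent inside a push call."""
    return push_us / (frame_ms * 1000.0)


# Audio covered by one page: 16 frames of 60 ms each.
audio_per_page_ms = FRAME_MS * FRAMES_PER_PAGE

# Extra first-page latency of the threaded version (10.0 ms vs 4.0 ms).
first_page_delay_ms = 10.0 - 4.0

print(f"audio per page: {audio_per_page_ms:.0f} ms")
print(f"first-page delay vs page length: {first_page_delay_ms / audio_per_page_ms:.1%}")
print(f"sync push mean share of a frame: {share_of_frame_budget(1186):.1%}")
print(f"sync push max share of a frame:  {share_of_frame_budget(4303):.1%}")
```

Under these assumptions a page covers about 960 ms of audio, close to the measured 990 ms inter-page interval, and even the slowest sync push takes well under a tenth of one frame's duration.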
BTW, I updated the eot PR to use the threaded version: https://github.com/livekit/agents/pull/4722/changes#diff-07d680088a7c2a58bad7bec653cc4d5197cc212269eb0d76d35eab64a1195b07
So the Opus encode is almost instantaneous? Though what if you push more than 60 ms, say 500 ms? Isn't it going to block?
I understand we will push tiny frames for the barge-in model, but since this is a public utility, we still need to get the interface right.
The sync version still blocks for about 4 ms sometimes, which is not ideal for asyncio (it adds up with the user code and a lot of work inside our framework).
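To illustrate the concern, here is a small sketch of how a blocking call inside a coroutine delays every other task on the loop, and how offloading it to a thread avoids that. `fake_encode` and `max_loop_lag` are hypothetical names, and `time.sleep` stands in for the encoder; whether a real encoder binding releases the GIL while it works is an assumption this sketch does not test.

```python
import asyncio
import time


def fake_encode(duration_s: float) -> None:
    # Stand-in for a blocking encode call; time.sleep is NOT a real encoder.
    time.sleep(duration_s)


async def max_loop_lag(encode_inline: bool, block_s: float = 0.2) -> float:
    """Return the worst delay seen by a ticker that wants to run every 10 ms."""
    loop = asyncio.get_running_loop()
    worst = 0.0
    done = False

    async def ticker() -> None:
        nonlocal worst
        while not done:
            start = loop.time()
            await asyncio.sleep(0.01)
            # Anything beyond the requested 10 ms is event-loop lag.
            worst = max(worst, loop.time() - start - 0.01)

    task = asyncio.create_task(ticker())
    await asyncio.sleep(0.03)  # let the ticker start
    if encode_inline:
        fake_encode(block_s)  # runs on the loop thread and stalls it
    else:
        await loop.run_in_executor(None, fake_encode, block_s)  # loop stays free
    await asyncio.sleep(0.03)
    done = True
    await task
    return worst


if __name__ == "__main__":
    print("inline lag:  ", asyncio.run(max_loop_lag(True)))
    print("threaded lag:", asyncio.run(max_loop_lag(False)))
```

The 200 ms block is exaggerated to make the effect obvious; the same mechanism applies to a 4 ms stall, just on a smaller scale.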
Oh, reading this comment #5494 (comment)
Seems like we should close this PR then?