feat(audio): add stream audio encoder for turn detection#5494

Open
chenghao-mou wants to merge 5 commits into main from chenghao/feat/streaming-encoder

Conversation

@chenghao-mou
Member

@chenghao-mou chenghao-mou commented Apr 20, 2026

Added a streaming audio encoder for turn detection, supporting Opus, MP3, and PCM.

@chenghao-mou chenghao-mou requested a review from a team April 20, 2026 06:28
devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

return data


class AudioStreamEncoder:
Member

@theomonnom theomonnom Apr 24, 2026


We should encode in another thread, as we do for our AudioDecoder.

Member Author


I considered this earlier, but I see a difference between the two cases:

- Decoder: we need a thread so that the blocking read() wait doesn't stall the event loop.
- Encoder: the caller pushes data (calling encode() whenever a frame is available), so there is no blocking wait and no thread is needed.

I can create a threaded version and show some benchmarks.
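
For reference, a minimal sketch of what the threaded variant could look like. This is not the code in this PR: `ThreadedEncoderSketch`, `push`, `end_input`, `drain`, and `fake_encode` are hypothetical names, and `fake_encode` is only a stand-in for a real codec call. The point is that `push()` only enqueues, so the caller pays for a queue operation instead of the encode itself.

```python
import queue
import threading
from typing import Callable, List, Optional


class ThreadedEncoderSketch:
    """Hypothetical wrapper: push() only enqueues; a worker thread encodes."""

    def __init__(self, encode_fn: Callable[[bytes], bytes]) -> None:
        self._encode_fn = encode_fn
        self._input: "queue.Queue[Optional[bytes]]" = queue.Queue()
        self._output: "queue.Queue[Optional[bytes]]" = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def push(self, pcm: bytes) -> None:
        # Cheap for the caller: no encoding happens on this thread.
        self._input.put(pcm)

    def end_input(self) -> None:
        # None is the sentinel that tells the worker to stop.
        self._input.put(None)

    def _run(self) -> None:
        while True:
            pcm = self._input.get()
            if pcm is None:
                self._output.put(None)
                return
            self._output.put(self._encode_fn(pcm))

    def drain(self) -> List[bytes]:
        """Block until the worker finishes and return all encoded chunks."""
        chunks: List[bytes] = []
        while True:
            item = self._output.get()
            if item is None:
                break
            chunks.append(item)
        self._thread.join()
        return chunks


def fake_encode(pcm: bytes) -> bytes:
    # Stand-in for a real codec call: keeps every other byte.
    return pcm[::2]
```

In an asyncio setting the output side would more likely hand results back to the loop (for example with `loop.call_soon_threadsafe`) rather than a blocking `drain()`; the blocking version is used here only to keep the sketch short.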

Member Author


Here are the results:

| metric | sync | threaded |
| --- | --- | --- |
| push mean | 1,186 µs | 40 µs |
| push p95 | 2,053 µs | 45 µs |
| push max | 4,303 µs | 728 µs |
| first page | 4.0 ms | 10.0 ms |
| inter-page mean | 990 ms | 989 ms |
| inter-page median | 990 ms | 990 ms |
| pages / bytes | 7 / 1520 | 7 / 1520 |

The threaded version adds about 6 ms to the first page (10.0 ms vs 4.0 ms), but both latencies are effectively invisible under real-time load (60 ms input frames; Opus needs about 16 frames per page).
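
As a rough sanity check on these numbers, the arithmetic can be written out. The constants are taken from the comment above (60 ms frames, about 16 frames per page, mean push times of 1,186 µs and 40 µs); the helper names are made up for illustration.

```python
FRAME_MS = 60          # input frame size from the benchmark
FRAMES_PER_PAGE = 16   # "about 16 frames" per Opus page, per the comment


def page_interval_ms(frame_ms: int, frames_per_page: int) -> int:
    """Audio time that has to be pushed before one page comes out."""
    return frame_ms * frames_per_page


def loop_busy_fraction(push_us: float, frame_ms: int) -> float:
    """Share of each real-time frame period spent inside push()."""
    return push_us / (frame_ms * 1000)


print(page_interval_ms(FRAME_MS, FRAMES_PER_PAGE))         # 960
print(round(loop_busy_fraction(1186, FRAME_MS) * 100, 2))  # sync, in percent
print(round(loop_busy_fraction(40, FRAME_MS) * 100, 2))    # threaded, in percent
```

With these inputs one page corresponds to roughly 960 ms of audio, which is in the same range as the measured inter-page interval of about 990 ms. The mean sync push occupies about 2% of each 60 ms frame period, and the threaded push well under 0.1%.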


Member

@theomonnom theomonnom Apr 24, 2026


So the Opus encode is almost instantaneous? But what if you push more than 60 ms, say 500 ms? Isn't it going to block? I understand we will push tiny frames for the barge-in model, but since this is a public utility, we still need to get the interface right.
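
One way to bound the per-call work for a large push, short of a worker thread, is to split the buffer and yield to the event loop between pieces. The sketch below assumes 16 kHz, 16-bit mono PCM; `split_pcm` and `push_chunked` are hypothetical helpers, and `encode_fn` stands in for whatever synchronous encode call the encoder exposes. It does not remove the cost of each individual encode call, so it only complements the threaded approach.

```python
import asyncio
from typing import Callable, List

SAMPLE_RATE = 16000    # assumed sample rate
BYTES_PER_SAMPLE = 2   # assumed 16-bit mono PCM


def split_pcm(pcm: bytes, chunk_ms: int = 60) -> List[bytes]:
    """Split raw PCM into chunks of at most chunk_ms milliseconds."""
    chunk_bytes = SAMPLE_RATE * chunk_ms // 1000 * BYTES_PER_SAMPLE
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]


async def push_chunked(
    pcm: bytes,
    encode_fn: Callable[[bytes], bytes],
    chunk_ms: int = 60,
) -> List[bytes]:
    """Encode chunk by chunk, letting other tasks run between chunks."""
    out: List[bytes] = []
    for chunk in split_pcm(pcm, chunk_ms):
        out.append(encode_fn(chunk))  # still synchronous, but bounded per chunk
        await asyncio.sleep(0)        # hand control back to the event loop
    return out


# 500 ms of silence under the assumptions above is 16,000 bytes.
half_second = bytes(16000)
encoded = asyncio.run(push_chunked(half_second, lambda c: c[:1]))
print(len(encoded))  # 9 chunks: eight of 60 ms plus one shorter remainder
```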

Member


The sync version still sometimes blocks for about 4 ms. For asyncio that is not ideal, because it adds up with user code and a lot of other work inside our framework.

Member


Oh, reading this comment #5494 (comment)

Seems like we should close this PR then?
