Fix dual-source timestamp inflation in meeting mode #330

materemias wants to merge 2 commits into main
Conversation
The per-chunk start offset was derived from `transcript.duration_ms()`, which returns the max `end_ms` across all segments — conflating mic and loopback. On dual-track captures, each new loopback chunk got pushed past the last mic segment and vice versa, roughly doubling apparent elapsed time (e.g. an 80-minute meeting exporting as 2:17).

Track the offset per `AudioSource` so mic and loopback each maintain their own wall-clock-aligned timeline, advancing by the actual audio duration consumed each chunk. Pad loopback chunks in the daemon to the mic chunk length with silence before dispatch. If the loopback monitor lags at startup or briefly gaps, the padding keeps both source offsets advancing in lockstep with wall-clock time; VAD still skips transcription on silent frames.

Co-authored-by: Codex <noreply@openai.com>
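For readers skimming the fix, here is a minimal sketch of the per-source offset idea. `source_offsets` and `audio_duration_ms` appear in the diff quoted further down; the `AudioSource` variant names, the `sample_rate` field, and the helper methods are illustrative stand-ins, not the crate's actual API.

```rust
use std::collections::HashMap;

// Illustrative variant names; the real enum lives in the voxtype crate.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum AudioSource {
    Microphone,
    Loopback,
}

// Stand-in for the daemon state relevant to this change.
struct MeetingDaemon {
    // Cumulative milliseconds of audio already consumed, per source.
    source_offsets: HashMap<AudioSource, u64>,
    sample_rate: u32,
}

impl MeetingDaemon {
    // Start offset for the next chunk from `source`: its own timeline,
    // not the max end_ms across all transcript segments.
    fn chunk_start_offset_ms(&self, source: AudioSource) -> u64 {
        self.source_offsets.get(&source).copied().unwrap_or(0)
    }

    // Advance the source's timeline by the audio actually consumed.
    fn advance_offset(&mut self, source: AudioSource, samples: usize) {
        let ms = samples as u64 * 1000 / self.sample_rate as u64;
        *self.source_offsets.entry(source).or_insert(0) += ms;
    }
}
```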
Pull request overview
Fixes meeting-mode timestamp inflation on dual-source (mic + loopback) captures by tracking per-source time offsets instead of deriving per-chunk offsets from the overall transcript duration.
Changes:
- Track and advance a per-`AudioSource` cumulative offset in `MeetingDaemon` when creating new chunk buffers.
- Clear per-source offsets on meeting stop to avoid cross-meeting contamination.
- In the daemon meeting loop, drain loopback alongside mic and (currently) pad loopback to mic chunk size, then always dispatch loopback chunks to keep offsets advancing.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/meeting/mod.rs | Introduces per-source offset tracking and uses it to compute chunk start offsets; advances offsets by processed audio duration. |
| src/daemon.rs | Adjusts meeting chunk dispatch to pad loopback chunks to mic length and always process loopback chunks to keep offsets aligned. |
```diff
@@ -319,6 +325,12 @@ impl MeetingDaemon {
             .process_chunk(buffer)
             .map_err(crate::error::VoxtypeError::Transcribe)?;
 
+        // Advance the per-source offset by the actual audio duration consumed,
+        // regardless of whether VAD found speech in this chunk.
+        if let Some(offset) = self.source_offsets.get_mut(&source) {
+            *offset += result.audio_duration_ms;
+        }
+
```
The per-source offset is only advanced after process_chunk succeeds. If transcription/VAD processing returns an error, the caller (daemon.rs) has already drained and discarded the audio, but source_offsets won’t move forward—subsequent segments will reuse the same start_offset_ms and produce overlapping/incorrect timestamps. Consider advancing the offset based on the input samples length (and configured sample_rate) even on error paths, or computing audio_duration_ms upfront and updating source_offsets in a finally-style block before returning the error.
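One way to realize this suggestion, sketched with a plain function, a `&mut u64` offset, and a closure standing in for the crate's real `Transcriber` and error types (all names here are illustrative):

```rust
// Advance the offset from the input sample count before any fallible
// processing, so an error cannot leave the timeline behind audio that
// the caller has already drained.
fn process_chunk_with_source(
    offset_ms: &mut u64,
    sample_rate: u32,
    samples: &[f32],
    transcribe: impl FnOnce(&[f32]) -> Result<(), String>,
) -> Result<(), String> {
    let audio_duration_ms = samples.len() as u64 * 1000 / sample_rate as u64;
    *offset_ms += audio_duration_ms;

    // Only now run the fallible work; on error the offset is already correct.
    transcribe(samples)
}
```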
```diff
@@ -319,6 +325,12 @@ impl MeetingDaemon {
             .process_chunk(buffer)
             .map_err(crate::error::VoxtypeError::Transcribe)?;
 
+        // Advance the per-source offset by the actual audio duration consumed,
+        // regardless of whether VAD found speech in this chunk.
+        if let Some(offset) = self.source_offsets.get_mut(&source) {
+            *offset += result.audio_duration_ms;
+        }
+
```
New per-source offset behavior is central to meeting timestamp correctness, but there’s no unit test exercising process_chunk_with_source with interleaved sources to ensure offsets don’t inflate (and that silent/VAD-skipped chunks still advance time). Since this file already has tests, please add a focused test using a lightweight in-module Transcriber stub and fixed-size sample buffers to assert that: (1) mic and loopback maintain independent offsets, and (2) timestamps advance by chunk audio duration rather than transcript.duration_ms().
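A test along the lines the reviewer asks for could take roughly this shape. Everything below is a stand-in (the real test would drive `process_chunk_with_source` on `MeetingDaemon` with a stubbed `Transcriber`); it only illustrates the two assertions: mic and loopback keep independent offsets, and time advances by chunk audio duration rather than transcript duration.

```rust
#[cfg(test)]
mod offset_tests {
    use std::collections::HashMap;

    #[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
    enum AudioSource { Mic, Loopback }

    // Minimal stand-in for the per-source bookkeeping under test.
    struct Offsets { ms: HashMap<AudioSource, u64>, sample_rate: u32 }

    impl Offsets {
        fn advance(&mut self, source: AudioSource, samples: usize) {
            *self.ms.entry(source).or_insert(0) +=
                samples as u64 * 1000 / self.sample_rate as u64;
        }
        fn get(&self, source: AudioSource) -> u64 {
            self.ms.get(&source).copied().unwrap_or(0)
        }
    }

    #[test]
    fn interleaved_sources_keep_independent_offsets() {
        let mut o = Offsets { ms: HashMap::new(), sample_rate: 16_000 };
        let chunk = vec![0.0f32; 16_000 * 30]; // 30 s of (silent) audio

        // Interleave mic and loopback chunks, including VAD-silent ones.
        o.advance(AudioSource::Mic, chunk.len());
        o.advance(AudioSource::Loopback, chunk.len());
        o.advance(AudioSource::Mic, chunk.len());

        // Each source advances by its own consumed audio, not by the
        // combined transcript duration.
        assert_eq!(o.get(AudioSource::Mic), 60_000);
        assert_eq!(o.get(AudioSource::Loopback), 30_000);
    }
}
```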
```diff
                 // Also drain loopback buffer up to the same amount.
+                // Pad with silence so it matches the mic chunk length —
+                // keeps the per-source timestamp offsets advancing in
+                // lockstep with wall-clock even if the loopback monitor
+                // has a short startup lag or a transient gap.
                 let loopback_len = self.meeting_loopback_buffer.len().min(chunk_samples);
-                let loopback_chunk: Vec<f32> = self.meeting_loopback_buffer.drain(..loopback_len).collect();
+                let mut loopback_chunk: Vec<f32> = self.meeting_loopback_buffer.drain(..loopback_len).collect();
+                if loopback_chunk.len() < chunk_samples {
+                    loopback_chunk.resize(chunk_samples, 0.0);
+                }
```
Padding loopback_chunk up to chunk_samples can cause the current VAD implementation to miss real loopback speech when the loopback buffer is lagging. VoiceActivityDetector::contains_speech requires >10% of frames to be above threshold; if you pad a short (e.g., 1–2s) loopback slice to a 30s chunk, the speech-frame ratio is diluted and the entire chunk is treated as “no speech”, dropping that audio from transcription. A safer approach is to run VAD/transcription on the actual loopback samples and advance the loopback offset separately (e.g., by adding an explicit “advance offset by N ms” path when loopback is short/empty), rather than padding before VAD.
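A rough sketch of the alternative the reviewer describes; the helper and its signature are hypothetical, and only the split matters: run VAD/transcription on the real samples, and advance the loopback timeline explicitly instead of feeding silence into VAD.

```rust
// Transcribe only the loopback samples that actually arrived, then advance
// the loopback offset by an explicit amount so it never falls behind the
// mic timeline, without diluting the VAD speech-frame ratio with padding.
fn dispatch_loopback(
    loopback_chunk: &[f32],
    mic_chunk_ms: u64,
    sample_rate: u32,
    loopback_offset_ms: &mut u64,
    mut process: impl FnMut(&[f32]),
) {
    if !loopback_chunk.is_empty() {
        // VAD/transcription sees only real samples, never silence padding.
        process(loopback_chunk);
    }
    let real_ms = loopback_chunk.len() as u64 * 1000 / sample_rate as u64;
    // Advance by whichever is larger: the real audio consumed, or the
    // wall-clock span of the mic chunk (the explicit "advance by N ms" path).
    *loopback_offset_ms += real_ms.max(mic_chunk_ms);
}
```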
Copilot review on #330 flagged two real issues with the initial fix:

1. Padding the loopback chunk to mic length with silence diluted the VAD speech-frame ratio, risking dropping real remote audio when the loopback monitor was lagging — exactly the case the padding was meant to help.
2. The per-source offset was only advanced after `process_chunk` succeeded, so on a transcription/VAD error the caller had already drained the samples but the offset stayed put — subsequent chunks would reuse the same `start_offset_ms` and overlap.

Changes:

- Advance `source_offsets` up front from the input sample count, before any fallible processing, so offsets track wall-clock even on errors.
- Drop the silence padding in the meeting chunk dispatch and restore the original skip-when-empty guard.
- Add `MeetingDaemon::sync_source_offsets()` that levels all sources to the furthest-advanced one, and call it after each mic+loopback iteration so a short or skipped loopback chunk catches up before the next one is dispatched.

Co-authored-by: Copilot <copilot-pull-request-reviewer[bot]@users.noreply.github.com>
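For intuition, the leveling step could look something like the following sketch; `sync_source_offsets` is the name given in the commit, while the generic map shape is an assumption (the real map is keyed by `AudioSource`).

```rust
use std::collections::HashMap;

// Level every source's offset up to the furthest-advanced one, so a short
// or skipped loopback chunk catches up before the next one is dispatched.
fn sync_source_offsets<K>(source_offsets: &mut HashMap<K, u64>) {
    let max = source_offsets.values().copied().max().unwrap_or(0);
    for offset in source_offsets.values_mut() {
        *offset = max;
    }
}
```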
Addressed the Copilot review in 431a5f5: fixed (P1), fixed (P2), and added the offset sync. Not addressed: the unit-test suggestion. Adding a focused test for this path would require building a mock …
This initially looks great. I need a few days to give it a closer look and test. THANK YOU |
Summary
- The per-chunk start offset was derived from `transcript.duration_ms()`, which returns the max `end_ms` across all segments — conflating mic and loopback on dual-track captures. Every new chunk got pushed past the other source's end, roughly doubling apparent meeting length (observed: 80-minute meeting exporting as 2h17m).
- Track the offset per `AudioSource` so mic and loopback each advance on their own wall-clock timeline, by the actual audio duration consumed each chunk.
- Pad loopback chunks in `daemon.rs` to the mic chunk length with silence before dispatch — keeps both offsets advancing in lockstep if the loopback monitor lags at startup or briefly gaps. VAD still skips transcription on silent frames.

Why
Repro: run `voxtype meeting export latest --format markdown --timestamps` on any meeting where both sides speak. Real-time captures with ~1h20m of audio were exporting transcript timestamps up to `[02:17:29]`.

Test plan
- `cargo build --lib`
- `cargo test --lib meeting::` — 175 passing
- `cargo clippy --lib --no-deps` — no new warnings

Co-authored-by: Codex <noreply@openai.com>