
Add corpus capture for post-processing training#328

Open
materemias wants to merge 1 commit into main from feature/corpus-capture

Conversation

@materemias
Collaborator

Summary

Autosaves each successful push-to-talk session as an (audio, raw, processed, post, metadata) tuple into a flat directory, so users can iteratively build a training/evaluation corpus for the LLM post-processing stage.

What's in the box

  • New src/corpus module with CorpusWriter and a documented JSON sidecar schema
  • 16 kHz mono int16 WAV of the exact audio the transcriber saw (eager-mode path included)
  • Conditional .processed.txt / .post.txt files (elided when equal to raw or absent)
  • Metadata records active engine + model, language (null on auto-detect), profile, and post-process command name
  • [corpus] config section, VOXTYPE_CORPUS_ENABLED / VOXTYPE_CORPUS_PATH env vars, --corpus / --no-corpus / --corpus-path CLI flags
  • Fire-and-forget saves via spawn_blocking; failures logged only, never block dictation
  • Disabled by default, fully backwards compatible
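
To make the "16 kHz mono int16 WAV" bullet concrete, here is a minimal sketch of a PCM WAV encoder using only the standard library. The function name `encode_wav_i16` and the assumption that samples arrive as `f32` values in [-1.0, 1.0] are illustrative; the PR's actual `write_wav` may look different.

```rust
use std::io::{self, Write};

/// Sample rate named in the PR description.
const SAMPLE_RATE: u32 = 16_000;

/// Encode mono samples as a 16-bit PCM WAV stream.
/// Assumption: input is f32 in [-1.0, 1.0]; out-of-range values are clamped.
fn encode_wav_i16<W: Write>(mut out: W, samples: &[f32]) -> io::Result<()> {
    let channels: u16 = 1;
    let bits: u16 = 16;
    let block_align = channels * bits / 8; // bytes per sample frame
    let byte_rate = SAMPLE_RATE * u32::from(block_align);
    let data_len = (samples.len() * usize::from(block_align)) as u32;

    // RIFF header: total size is everything after the first 8 bytes.
    out.write_all(b"RIFF")?;
    out.write_all(&(36 + data_len).to_le_bytes())?;
    out.write_all(b"WAVE")?;

    // "fmt " chunk: 16 bytes, format tag 1 = uncompressed PCM.
    out.write_all(b"fmt ")?;
    out.write_all(&16u32.to_le_bytes())?;
    out.write_all(&1u16.to_le_bytes())?;
    out.write_all(&channels.to_le_bytes())?;
    out.write_all(&SAMPLE_RATE.to_le_bytes())?;
    out.write_all(&byte_rate.to_le_bytes())?;
    out.write_all(&block_align.to_le_bytes())?;
    out.write_all(&bits.to_le_bytes())?;

    // "data" chunk: little-endian int16 samples.
    out.write_all(b"data")?;
    out.write_all(&data_len.to_le_bytes())?;
    for &s in samples {
        let v = (s.clamp(-1.0, 1.0) * f32::from(i16::MAX)) as i16;
        out.write_all(&v.to_le_bytes())?;
    }
    Ok(())
}
```

Because the writer is generic over `Write`, the same function can target an already-opened `File` or an in-memory `Vec<u8>` in tests.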

Layout

<corpus_path>/
  2026-04-20T14-32-05_a7f3.wav
  2026-04-20T14-32-05_a7f3.raw.txt
  2026-04-20T14-32-05_a7f3.processed.txt   # only if differs from raw
  2026-04-20T14-32-05_a7f3.post.txt        # only if post-processor ran
  2026-04-20T14-32-05_a7f3.json
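
The layout above implies a small decision about which files exist for a given session. A rough sketch of that rule follows, assuming a hypothetical `SessionText` struct and `artifact_paths` helper (neither name comes from the PR) and following the layout comments: `.processed.txt` only when it differs from raw, `.post.txt` only when a post-processor ran.

```rust
use std::path::{Path, PathBuf};

/// Text stages of one session. Struct and field names are hypothetical.
struct SessionText<'a> {
    raw: &'a str,
    processed: &'a str,
    /// `None` when no post-processor ran.
    post: Option<&'a str>,
}

/// List the files a session with the given stem would produce,
/// e.g. stem = "2026-04-20T14-32-05_a7f3".
fn artifact_paths(dir: &Path, stem: &str, text: &SessionText) -> Vec<PathBuf> {
    // Audio and raw text are always written.
    let mut paths = vec![
        dir.join(format!("{stem}.wav")),
        dir.join(format!("{stem}.raw.txt")),
    ];
    // Elide the processed stage when it adds nothing over raw.
    if text.processed != text.raw {
        paths.push(dir.join(format!("{stem}.processed.txt")));
    }
    // Elide the post stage when no post-processor ran.
    if text.post.is_some() {
        paths.push(dir.join(format!("{stem}.post.txt")));
    }
    // The JSON sidecar is always written.
    paths.push(dir.join(format!("{stem}.json")));
    paths
}
```

A session therefore yields between three and five files, which is what the manual "verify file counts" step in the test plan would check.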

Related issues

  • Related to [Feature] Support audio and output history #209 (audio + transcription history) — corpus stores the same artifacts (audio, raw, processed) in a flat, tooling-friendly layout. Not a full replacement (no replay UX, no retention policy yet), but covers the "keep it for later" half.
  • Partially overlaps [Feature] Cache audio input #28 (audio caching) — corpus keeps raw audio on disk; queueing inputs during transcription is still out of scope.

Test plan

  • cargo test --lib corpus — 16 passed
  • cargo test --lib config::tests::test_corpus — 3 passed
  • cargo test --lib corpus_flags / corpus_and_no_corpus — 2 passed
  • voxtype --help shows the Corpus: section with all 3 flags
  • Manual: enable via config, do 3 push-to-talk sessions (no post, post-changing, spoken-punctuation); verify file counts and JSON contents
  • Manual: disable, verify no new files
  • Manual: make corpus dir read-only mid-session, verify warn! + no crash

Docs

  • docs/CONFIGURATION.md — new [corpus] section
  • docs/USER_MANUAL.md — "Building a Training Corpus" subsection
  • docs/TROUBLESHOOTING.md — "Corpus files aren't appearing" entry

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 21, 2026 08:30
@materemias materemias requested a review from peteonrails as a code owner April 21, 2026 08:30
Contributor

Copilot AI left a comment

Pull request overview

Adds an optional “corpus capture” feature that persistently saves successful push-to-talk sessions (audio + text stages + JSON metadata) to disk for downstream post-processing training/evaluation.

Changes:

  • Introduces src/corpus with CorpusWriter to write .wav/.txt/.json session artifacts.
  • Wires corpus capture into the daemon transcription pipeline (fire-and-forget via spawn_blocking) and adds config/env/CLI overrides.
  • Documents corpus configuration and usage in the user manual, troubleshooting guide, and configuration reference.
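
The fire-and-forget behavior can be illustrated without an async runtime. The PR uses `spawn_blocking`; the sketch below swaps in `std::thread::spawn` purely to stay dependency-free, and `save_in_background` is a hypothetical name. The point it shows is that a failed save is logged and never propagates to the caller.

```rust
use std::io;
use std::thread::{self, JoinHandle};

/// Run `save` off the calling thread and swallow any error after logging it.
/// Stand-in for the PR's spawn_blocking call; not the actual daemon code.
fn save_in_background<F>(save: F) -> JoinHandle<()>
where
    F: FnOnce() -> io::Result<()> + Send + 'static,
{
    thread::spawn(move || {
        if let Err(err) = save() {
            // The PR logs with warn!; eprintln! keeps this sketch std-only.
            eprintln!("corpus save failed: {err}");
        }
    })
}
```

In the fire-and-forget case the caller simply drops the returned handle, so dictation continues regardless of how the save ends.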

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| src/main.rs | Adds env/CLI overrides for corpus enablement and output path. |
| src/lib.rs | Exposes the new corpus module. |
| src/daemon.rs | Initializes a CorpusWriter and saves session artifacts after successful transcription/post-processing. |
| src/corpus/mod.rs | Implements corpus artifact writing (WAV + stage texts + JSON sidecar) and unit tests. |
| src/config.rs | Adds [corpus] config, defaults, and a stable TranscriptionEngine::name() identifier. |
| src/cli.rs | Adds --corpus, --no-corpus, and --corpus-path flags plus parsing tests. |
| docs/USER_MANUAL.md | Adds a “Building a Training Corpus” section. |
| docs/TROUBLESHOOTING.md | Adds troubleshooting steps for missing corpus files. |
| docs/CONFIGURATION.md | Documents the new [corpus] section and file outputs. |

Comment thread src/daemon.rs Outdated
Comment thread src/daemon.rs Outdated
Comment thread src/corpus/mod.rs Outdated
Comment thread src/corpus/mod.rs Outdated
Comment thread src/cli.rs Outdated
Comment thread docs/USER_MANUAL.md Outdated
Comment thread docs/USER_MANUAL.md Outdated
Comment thread src/main.rs Outdated
Autosaves each successful push-to-talk session as an
(audio, raw, processed, post, metadata) tuple into a flat directory so
users can build a training/evaluation corpus for LLM post-processing.

Highlights:
- New src/corpus module with CorpusWriter + sidecar JSON schema
- 16 kHz mono int16 WAV of the exact audio the transcriber saw
  (eager-mode path included)
- Conditional .processed.txt / .post.txt (elided when equal or absent)
- Records active engine + model, language (null on auto-detect),
  profile, and post-process command name
- [corpus] config section, VOXTYPE_CORPUS_ENABLED / VOXTYPE_CORPUS_PATH
  env vars, and --corpus / --no-corpus / --corpus-path CLI flags
- Fire-and-forget saves via spawn_blocking; failures are logged only
  and never block dictation
- Disabled by default; fully backwards compatible
- Docs: CONFIGURATION.md [corpus] section, USER_MANUAL.md "Building a
  Training Corpus", TROUBLESHOOTING.md entry

Related to #209 (audio + transcription history) and partially overlaps
with #28 (audio caching).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@materemias materemias force-pushed the feature/corpus-capture branch from 6e73af8 to 3736a6d Compare April 21, 2026 08:54
@materemias
Collaborator Author

Addressed all 8 Copilot review comments in a force-pushed amend (single signed commit):

  • CLI help (src/cli.rs): --corpus long_help now lists (audio, raw, processed, post, metadata) and notes when processed / post files are written.
  • USER_MANUAL (docs/USER_MANUAL.md): tuple description and file-list paragraph updated to include the conditional .processed.txt artifact.
  • env warn (src/main.rs): VOXTYPE_CORPUS_ENABLED rejection message now lists all accepted values (1/0, true/false, yes/no, on/off).
  • CorpusConfig doc (src/corpus/mod.rs): doc comment no longer claims path must be absolute; now states callers are expected to pre-resolve "auto" and both absolute/relative paths are accepted as-is. Also removed the stale docs/superpowers/... reference.
  • Eager-mode comment (src/daemon.rs): comment at the corpus capture site now describes the real reason for the empty-buffer guard (defensive fallback) and notes that the eager path threads its accumulated buffer through.
  • Atomic filename claim (src/corpus/mod.rs): save() now races via OpenOptions::create_new(true) on the .wav file; on AlreadyExists it retries with a fresh hex suffix (up to 3 attempts). write_wav takes the already-opened File so the claim and the write share one handle. Matching test updated.
  • model_override plumbing (src/daemon.rs): added pending_model_override: Option<String> to Daemon. Captured at the start of start_transcription_task before the Recording→Transcribing transition and consumed in the transcription-task-completion branch. Cleared on abort/cancel paths. Corpus sidecar model is now correct on the non-eager path too.
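
For reviewers who want the atomic filename claim spelled out, here is a minimal sketch of the pattern described in the list above: exclusive creation via `OpenOptions::create_new(true)`, retrying on `AlreadyExists` up to 3 attempts, and returning the open `File` so the claim and the write share one handle. The `claim_wav` name and the `next_suffix` closure are hypothetical; how the real code generates its hex suffix is not shown in this thread.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, ErrorKind};
use std::path::{Path, PathBuf};

/// Retry budget stated in the PR comment.
const MAX_ATTEMPTS: usize = 3;

/// Claim a unique `.wav` path by creating it exclusively.
/// `next_suffix` is a hypothetical source of fresh hex suffixes.
fn claim_wav(
    dir: &Path,
    timestamp: &str,
    mut next_suffix: impl FnMut() -> String,
) -> io::Result<(PathBuf, File)> {
    for _ in 0..MAX_ATTEMPTS {
        let path = dir.join(format!("{timestamp}_{}.wav", next_suffix()));
        // create_new fails with AlreadyExists instead of truncating,
        // so check-and-create happens in one step.
        match OpenOptions::new().write(true).create_new(true).open(&path) {
            Ok(file) => return Ok((path, file)),
            Err(err) if err.kind() == ErrorKind::AlreadyExists => continue,
            Err(err) => return Err(err),
        }
    }
    Err(io::Error::new(
        ErrorKind::AlreadyExists,
        "no free corpus filename after retries",
    ))
}
```

Any error other than `AlreadyExists` (for example a read-only directory) is returned immediately rather than retried, since a new suffix would not help.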

All 555 lib tests still pass (including the 19 corpus + config::test_corpus + cli corpus parse tests). cargo clippy --lib reports no new warnings in the touched files.
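
As a companion to the VOXTYPE_CORPUS_ENABLED item above, this is a small sketch of a parser that accepts exactly the listed values (1/0, true/false, yes/no, on/off). The `parse_bool_flag` name is hypothetical, and trimming whitespace and ignoring ASCII case are assumptions not confirmed in this thread.

```rust
/// Parse a boolean-like environment value.
/// Returns `None` for anything outside the accepted set, which is where
/// the caller would emit the rejection warning.
/// Assumption: surrounding whitespace and ASCII case are ignored.
fn parse_bool_flag(value: &str) -> Option<bool> {
    match value.trim().to_ascii_lowercase().as_str() {
        "1" | "true" | "yes" | "on" => Some(true),
        "0" | "false" | "no" | "off" => Some(false),
        _ => None,
    }
}
```

Returning `Option<bool>` rather than defaulting to `false` lets the caller distinguish "explicitly disabled" from "unrecognized value" and keep the configured setting in the latter case.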

@materemias
Collaborator Author

Ready for another look when you have a moment, @copilot-pull-request-reviewer. The force-pushed head is 3736a6d (signed). No changes outside the corpus feature scope; the model_override threading added one Daemon field and three touch points (capture, consume, clear-on-abort) but doesn't alter any existing control flow.
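
The three touch points (capture, consume, clear-on-abort) reduce to a simple pattern around `Option::take`. The sketch below models only the `pending_model_override` field named above; the method names and the `default_model` parameter are hypothetical and do not mirror the real `Daemon` API.

```rust
/// Minimal stand-in for the daemon state; only the field named in the PR
/// is modeled here.
#[derive(Default)]
struct Daemon {
    pending_model_override: Option<String>,
}

impl Daemon {
    /// Capture: remember the override when transcription starts.
    fn start_transcription(&mut self, model_override: Option<String>) {
        self.pending_model_override = model_override;
    }

    /// Consume: resolve the model name for the sidecar exactly once.
    /// `take()` leaves `None` behind so the value cannot leak into the
    /// next session.
    fn finish_transcription(&mut self, default_model: &str) -> String {
        self.pending_model_override
            .take()
            .unwrap_or_else(|| default_model.to_string())
    }

    /// Clear: drop the override when the session is aborted or cancelled.
    fn abort(&mut self) {
        self.pending_model_override = None;
    }
}
```

Using `take()` at the consume site means a forgotten clear on some future code path would still not carry a stale override past one completed session.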

@peteonrails
Owner

reviewing, building, testing for a few days

@peteonrails peteonrails added this to the 0.6.7 milestone Apr 24, 2026