Add corpus capture for post-processing training #328
Pull request overview
Adds an optional “corpus capture” feature that persistently saves successful push-to-talk sessions (audio + text stages + JSON metadata) to disk for downstream post-processing training/evaluation.
Changes:
- Introduces `src/corpus` with `CorpusWriter` to write `.wav`/`.txt`/`.json` session artifacts.
- Wires corpus capture into the daemon transcription pipeline (fire-and-forget via `spawn_blocking`) and adds config/env/CLI overrides.
- Documents corpus configuration and usage in the user manual, troubleshooting guide, and configuration reference.
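
The fire-and-forget save described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it uses `std::thread::spawn` as a stand-in for the `spawn_blocking` call so the example has no dependencies, and `Session`, `write_raw`, and `save_in_background` are hypothetical names. The file naming is also an assumption.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for one captured session (not the PR's actual type).
pub struct Session {
    pub stem: String,
    pub raw_text: String,
}

/// Blocking write of the raw transcript; returns the path that was written.
fn write_raw(dir: &Path, session: &Session) -> io::Result<PathBuf> {
    fs::create_dir_all(dir)?;
    let path = dir.join(format!("{}.txt", session.stem));
    fs::write(&path, &session.raw_text)?;
    Ok(path)
}

/// Starts the save on another thread and returns immediately.
/// A failure is only logged, so the caller's dictation path is never blocked
/// or aborted by it. The returned handle reports success for inspection only.
pub fn save_in_background(dir: PathBuf, session: Session) -> JoinHandle<bool> {
    thread::spawn(move || match write_raw(&dir, &session) {
        Ok(_) => true,
        Err(err) => {
            // Assumption: a real daemon would use its logging macro here.
            eprintln!("corpus save failed: {err}");
            false
        }
    })
}
```

In a real daemon the handle would normally be dropped rather than joined; it is returned here only so the behaviour can be checked.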
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/main.rs | Adds env/CLI overrides for corpus enablement and output path. |
| src/lib.rs | Exposes the new corpus module. |
| src/daemon.rs | Initializes a CorpusWriter and saves session artifacts after successful transcription/post-processing. |
| src/corpus/mod.rs | Implements corpus artifact writing (WAV + stage texts + JSON sidecar) and unit tests. |
| src/config.rs | Adds [corpus] config, defaults, and a stable TranscriptionEngine::name() identifier. |
| src/cli.rs | Adds --corpus, --no-corpus, and --corpus-path flags plus parsing tests. |
| docs/USER_MANUAL.md | Adds a “Building a Training Corpus” section. |
| docs/TROUBLESHOOTING.md | Adds troubleshooting steps for missing corpus files. |
| docs/CONFIGURATION.md | Documents the new [corpus] section and file outputs. |
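
As a rough illustration of how the config, env, and CLI layers listed in the table could combine, here is a small sketch. The precedence shown (CLI over env over config file), the accepted boolean spellings, and the rule that `--no-corpus` wins when both flags are given are all assumptions; the PR text does not state them. `CorpusSettings`, `CliFlags`, `parse_bool`, and `resolve` are hypothetical names.

```rust
/// Hypothetical resolved settings for the [corpus] section.
#[derive(Debug, Clone, PartialEq)]
pub struct CorpusSettings {
    pub enabled: bool,
    pub path: Option<String>,
}

/// Hypothetical parsed CLI flags: --corpus, --no-corpus, --corpus-path.
pub struct CliFlags {
    pub corpus: bool,
    pub no_corpus: bool,
    pub corpus_path: Option<String>,
}

/// Assumed boolean spellings for VOXTYPE_CORPUS_ENABLED; unknown values are ignored.
fn parse_bool(value: &str) -> Option<bool> {
    match value.trim().to_ascii_lowercase().as_str() {
        "1" | "true" | "yes" | "on" => Some(true),
        "0" | "false" | "no" | "off" => Some(false),
        _ => None,
    }
}

/// Applies the layers in order: config file first, then env vars, then CLI flags.
/// Env values are passed in as arguments so the function stays pure.
pub fn resolve(
    config: CorpusSettings,
    env_enabled: Option<&str>,
    env_path: Option<&str>,
    cli: &CliFlags,
) -> CorpusSettings {
    let mut out = config;
    if let Some(enabled) = env_enabled.and_then(parse_bool) {
        out.enabled = enabled;
    }
    if let Some(path) = env_path {
        out.path = Some(path.to_string());
    }
    if cli.corpus {
        out.enabled = true;
    }
    if cli.no_corpus {
        out.enabled = false; // assumption: the negative flag wins
    }
    if let Some(path) = &cli.corpus_path {
        out.path = Some(path.clone());
    }
    out
}
```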
Autosaves each successful push-to-talk session as an (audio, raw, processed, post, metadata) tuple into a flat directory so users can build a training/evaluation corpus for LLM post-processing.

Highlights:
- New src/corpus module with CorpusWriter + sidecar JSON schema
- 16 kHz mono int16 WAV of the exact audio the transcriber saw (eager-mode path included)
- Conditional .processed.txt / .post.txt (elided when equal or absent)
- Records active engine + model, language (null on auto-detect), profile, and post-process command name
- [corpus] config section, VOXTYPE_CORPUS_ENABLED / VOXTYPE_CORPUS_PATH env vars, and --corpus / --no-corpus / --corpus-path CLI flags
- Fire-and-forget saves via spawn_blocking; failures are logged only and never block dictation
- Disabled by default; fully backwards compatible
- Docs: CONFIGURATION.md [corpus] section, USER_MANUAL.md "Building a Training Corpus", TROUBLESHOOTING.md entry

Related to #209 (audio + transcription history) and partially overlaps with #28 (audio caching).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
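
The conditional stage files mentioned in the commit message could be decided along these lines. This is a sketch under stated assumptions: `stage_files` is a hypothetical helper, the raw transcript is assumed to live in `<stem>.txt`, and each later stage is compared against the raw text.

```rust
/// Returns the (file name, contents) pairs that would be written for one session.
/// The raw text is always kept; a later stage is skipped when it is absent
/// or identical to the raw text.
pub fn stage_files(
    stem: &str,
    raw: &str,
    processed: Option<&str>,
    post: Option<&str>,
) -> Vec<(String, String)> {
    // Assumption: the raw transcript uses the plain `.txt` name.
    let mut files = vec![(format!("{stem}.txt"), raw.to_string())];
    for (suffix, text) in [("processed", processed), ("post", post)] {
        if let Some(text) = text {
            if text != raw {
                files.push((format!("{stem}.{suffix}.txt"), text.to_string()));
            }
        }
    }
    files
}
```

With this rule, a session whose text-processing stage changed nothing but whose LLM stage rewrote the text produces only the raw file and the `.post.txt` file.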
Force-pushed from 6e73af8 to 3736a6d.
Addressed all 8 Copilot review comments in a force-pushed amend (single signed commit):
All 555 lib tests still pass (including the 19 corpus + config::test_corpus + cli corpus parse tests).
Ready for another look when you have a moment, @copilot-pull-request-reviewer. The force-pushed head is
reviewing, building, testing for a few days |
Summary
Autosaves each successful push-to-talk session as an `(audio, raw, processed, post, metadata)` tuple into a flat directory, so users can iteratively build a training/evaluation corpus for the LLM post-processing stage.

What's in the box
- `src/corpus` module with `CorpusWriter` and a documented JSON sidecar schema
- 16 kHz mono int16 WAV of the exact audio the transcriber saw (eager-mode path included)
- Conditional `.processed.txt` / `.post.txt` files (elided when equal to raw or absent)
- Records active engine + model, language (`null` on auto-detect), profile, and post-process command name
- `[corpus]` config section, `VOXTYPE_CORPUS_ENABLED` / `VOXTYPE_CORPUS_PATH` env vars, `--corpus` / `--no-corpus` / `--corpus-path` CLI flags
- Fire-and-forget saves via `spawn_blocking`; failures logged only, never block dictation

Layout
Related issues
Test plan
- `cargo test --lib corpus`: 16 passed
- `cargo test --lib config::tests::test_corpus`: 3 passed
- `cargo test --lib corpus_flags` / `corpus_and_no_corpus`: 2 passed
- `voxtype --help` shows the `Corpus:` section with all 3 flags
- `warn!` + no crash

Docs
- `docs/CONFIGURATION.md`: new `[corpus]` section
- `docs/USER_MANUAL.md`: "Building a Training Corpus" subsection
- `docs/TROUBLESHOOTING.md`: "Corpus files aren't appearing" entry

🤖 Generated with Claude Code