Your meetings said things. said finds them.
A CLI tool and agent toolkit for searching, slicing, and querying .vtt and .srt meeting transcripts. Ask what was discussed, who said it, and when — from the command line or directly from your AI agent.
said treats transcripts as a queryable data source, not just text files to convert:
- Search — find where a topic was discussed with exact, prefix, and fuzzy matching across morphologically rich languages
- Slice — extract a time window, a single speaker's contributions, or the section around a timestamp
- BM25 ranking — run natural-language queries against an indexed transcript; get the most relevant speaker runs ranked by relevance score
- Stats — get speaker list, talk-time breakdown, and duration before deciding how to query
- Prepare — when you need the full cleaned transcript: strips filler words, merges cues into paragraphs, collapses STT artifacts, fixes overlapping caption interleaving
When piped, every command runs non-interactively by default (as if `-y` were passed), which makes said a first-class tool for agentic workflows.
Six skills are provided in .agent/skills/ for use with GitHub Copilot CLI, Claude Code, or any compatible agent framework:
| Skill | When to invoke |
|---|---|
| `said-stats` | First: get speakers, duration, talk-time before querying |
| `said-search` | "Was X mentioned?", "when was Y first raised?", exact/fuzzy term lookup |
| `said-embed-search` | "What was discussed about Z?" (conceptual BM25 queries) |
| `said-embed-index` | Pre-build the BM25 index for a transcript |
| `said-slice` | "What did Alice say", "what happened around minute 15" |
| `said-prepare` | Full cleaned transcript for summarization or action item extraction |
# 1. Orient: who spoke, how long
said stats meeting.vtt -y
# 2. Find the relevant section
said search meeting.vtt -y --lang cs -s "budget,deadline" --context 2 --compact
# 3. Read the full surrounding context
said slice meeting.vtt -y --lang cs --from "00:23" --to "00:31" --compact
# 4. Or run a conceptual BM25 query
said embed meeting.vtt --query "concerns about the timeline" --top 5 -y
Requires .NET 10 SDK.
git clone <repo-url>
cd transcript-cleanup
make install
Or manually:
dotnet pack Said/Said.csproj -o ./nupkg
dotnet tool install -g --add-source ./nupkg said
Verify:
said --version
said prepare <input-file> [options]
| Flag | Default | Description |
|---|---|---|
| `-o, --output` | `<input>.md` | Output path. `-o -` forces stdout. |
| `-t, --title` | filename | Meeting title shown in the header. |
| `-d, --date` | today | Meeting date `YYYY-MM-DD`. |
| `-p, --participants` | inferred | `"Alice (PM), Bob (Eng)"`; canonicalizes speaker names. |
| `-l, --lang` | `en` | Filler-word preset: `en`, `cs`, `sk`, `de`, `fr`, `es`. |
| `--fillers` | none | Custom comma-separated fillers; overrides `--lang`. |
| `--keep-timestamps` | off | Prepend `[HH:MM]` to each speaker block. |
| `--chunk-size` | `0` | Split into `## Part N of M` sections at speaker boundaries. |
| `--compact` | off | Single-line `SPEAKER: text` format (~15% fewer tokens). |
| `-y, --yes` | off | Non-interactive. Skips all prompts. |
said search <input-file> -s "<terms>" [options]
| Flag | Default | Description |
|---|---|---|
| `-s, --search` | required | Comma-separated terms. |
| `--context` | `2` | Speaker-runs of context around each hit. |
| `--compact` | off | Single-line output. |
| `--first` | off | Return only the first hit per term. |
| `--last` | off | Return only the last hit per term. |
| `-l, --lang` | `en` | Filler preset. |
| `-y` | off | Non-interactive. |
Search uses three passes per token: exact whole-word → prefix (test → testovací) → fuzzy Levenshtein (≥78–85%). Timestamps marked `00:05` are exact; `~00:05` are prefix/fuzzy.
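
The three-pass cascade can be illustrated with a small Python sketch. This is not the tool's implementation (said is a .NET tool and uses FuzzySharp for the fuzzy pass); the function names, the similarity formula, and the single 0.80 threshold are assumptions chosen for illustration only.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalised similarity in [0, 1]; an assumed formula, not FuzzySharp's."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_token(term, word, threshold=0.80):
    """Return 'exact', 'prefix', 'fuzzy', or None for one term/word pair."""
    term, word = term.lower(), word.lower()
    if term == word:
        return "exact"
    if word.startswith(term):
        return "prefix"
    if similarity(term, word) >= threshold:  # hypothetical fixed threshold
        return "fuzzy"
    return None


def format_hit(timestamp, kind):
    """Exact hits keep the bare timestamp; prefix/fuzzy hits get a '~' marker."""
    return timestamp if kind == "exact" else "~" + timestamp
```

With these definitions, `match_token("test", "testovací")` falls through the exact pass and is caught by the prefix pass, so `format_hit("00:05", "prefix")` yields `~00:05`.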
said slice <input-file> [filters] [options]
| Flag | Default | Description |
|---|---|---|
| `--from` | start | Start of window `HH:MM`. |
| `--to` | end | End of window `HH:MM`. |
| `--around` | none | Center timestamp; use with `--window`. |
| `--window` | `6` | Runs either side when using `--around`. |
| `--speaker` | all | Comma-separated speaker name(s), substring-matched. |
| `--questions` | off | Extract only turns that contain a question. |
| `--compact` | off | Single-line output. |
| `-l, --lang` | `en` | Filler preset. |
| `-y` | off | Non-interactive. |
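
To make the `--around` / `--window` semantics concrete, here is a rough Python sketch of selecting runs on either side of a center timestamp. The run representation (tuples sorted by start time), the function names, and the rule for picking the center run are assumptions, not the tool's actual code.

```python
from bisect import bisect_right


def parse_hhmm(stamp):
    """Convert an 'HH:MM' timestamp to seconds."""
    hours, minutes = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60


def slice_around(runs, center, window=6):
    """runs: list of (start_seconds, speaker, text), sorted by start time.

    Returns the run active at `center` plus up to `window` runs on each side.
    """
    if not runs:
        return []
    starts = [run[0] for run in runs]
    # Assumed rule: the center run is the last one starting at or before `center`.
    idx = max(bisect_right(starts, parse_hhmm(center)) - 1, 0)
    return runs[max(idx - window, 0): idx + window + 1]
```

With the default window of 6, a mid-transcript slice contains at most 13 runs: the center run and six on each side.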
said stats <input-file> [-l <lang>] [-y]
Returns duration, participant list, and per-speaker word count and talk-time share. Use before other commands to identify speakers and decide on chunking.
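
As an illustration of what the stats output represents, the sketch below computes per-speaker word counts and talk-time share in Python. It assumes talk time is the sum of cue durations per speaker; how said actually measures it is not documented here, and the cue tuple layout is hypothetical.

```python
from collections import defaultdict


def speaker_stats(cues):
    """cues: iterable of (speaker, start_seconds, end_seconds, text)."""
    words = defaultdict(int)
    seconds = defaultdict(float)
    for speaker, start, end, text in cues:
        words[speaker] += len(text.split())
        seconds[speaker] += max(end - start, 0.0)  # ignore negative durations
    total = sum(seconds.values())
    return {
        speaker: {
            "words": words[speaker],
            "share": seconds[speaker] / total if total else 0.0,
        }
        for speaker in words
    }
```

A speaker with a large talk-time share is a natural candidate for `said slice --speaker`, and a long total duration suggests using `--chunk-size` with `said prepare`.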
said embed <input-file> [options]
| Flag | Default | Description |
|---|---|---|
| `-q, --query` | none | Natural-language query. If omitted, just builds the index. |
| `--top` | `5` | Maximum results to return. |
| `--rebuild` | off | Regenerate index even if `.idx.json` exists. |
| `-l, --lang` | `en` | Filler preset. |
| `-y` | off | Non-interactive. |
The index is stored as <input>.idx.json alongside the source file. BM25 (Okapi BM25, k1 = 1.5, b = 0.75) ranks runs by term-frequency saturation and document-length normalisation — better than plain keyword matching for document-level retrieval. It does not understand synonyms; combine with said search for full coverage.
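
For readers unfamiliar with BM25, the following Python sketch scores speaker runs with the same parameters (k1 = 1.5, b = 0.75). It is a minimal illustration, not the code in `Bm25Indexer.cs`: the whitespace tokenizer, the IDF variant (the common `log(1 + ...)` form), and the function names are assumptions.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # parameters quoted above


def tokenize(text):
    """Naive lowercase whitespace tokenizer (assumed; the real one may differ)."""
    return text.lower().split()


def bm25_rank(runs, query, top=5):
    """Return [(run_index, score)] for the best-matching runs, highest first."""
    docs = [tokenize(run) for run in runs]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: in how many runs each term appears.
    df = Counter(term for d in docs for term in set(d))
    scored = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + K1 * (1 - B + B * len(doc) / avgdl)
            score += idf * tf[term] * (K1 + 1) / norm
        scored.append((score, i))
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [(i, round(score, 3)) for score, i in ranked[:top] if score > 0]
```

Because scoring is purely lexical, a query such as "schedule" returns nothing for runs that only say "timeline", which is why pairing it with `said search` helps.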
| Code | Sample fillers |
|---|---|
| `en` | um, uh, hmm, mhm, yeah, mm, hm |
| `cs` | jo, no, jasně, ano, jj, dobře, přesně, hm, hmm, aha |
| `sk` | jo, no, jasné, áno, dobre, presne, hm, hmm, aha |
| `de` | ja, ok, okay, ähm, äh, mhm, hmm, genau, stimmt |
| `fr` | euh, hm, hmm, ouais, mm, mhm, d'accord |
| `es` | eh, hm, hmm, sí, mm, mhm, ajá, ok, vale |
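
The sketch below shows one way a filler preset could be applied, with a custom list overriding the language preset as `--fillers` does. It is an illustration only: the word-by-word removal rule and the names are assumptions, and the real tool may apply fillers differently (for example, only to stand-alone backchannel cues).

```python
# Two presets copied from the sample fillers listed above.
FILLERS = {
    "en": {"um", "uh", "hmm", "mhm", "yeah", "mm", "hm"},
    "de": {"ja", "ok", "okay", "ähm", "äh", "mhm", "hmm", "genau", "stimmt"},
}


def strip_fillers(text, lang="en", custom=None):
    """Drop filler words; `custom` (comma-separated) overrides the preset."""
    if custom:
        fillers = {f.strip().lower() for f in custom.split(",")}
    else:
        fillers = FILLERS[lang]
    kept = [
        word for word in text.split()
        # Compare without surrounding punctuation so "Um," still matches "um".
        if word.strip(".,!?;:").lower() not in fillers
    ]
    return " ".join(kept)
```

For example, `strip_fillers("Um, I think, uh, we should ship.")` keeps only the content words and their punctuation.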
make build # dotnet build
make pack # dotnet pack → ./nupkg/
make install # pack + dotnet tool install -g
make uninstall # dotnet tool uninstall -g said
Or use build.sh build|pack|install|uninstall.
Said/
  Said.csproj
  Program.cs
  Models/
    CaptionCue.cs
    TranscriptOptions.cs
    ProcessedTranscript.cs
    SearchMatch.cs
    TranscriptIndex.cs ← BM25 index model + IndexedRun record
  Parsing/
    ICaptionParser.cs
    VttParser.cs ← Teams VTT with <v Speaker> tags
    SrtParser.cs ← Zoom SRT with "Speaker:" prefixes
  Processing/
    NoiseSuppressor.cs ← fillers, markers, word-doubling, dedup
    SpeakerNormalizer.cs ← UPPERCASE, whitespace collapse, participant matching
    Merger.cs ← cue → paragraph merge + backchannel filter
    Chunker.cs ← word-count chunking at speaker boundaries
    Searcher.cs ← exact / prefix / fuzzy search + context extract
    Bm25Indexer.cs ← BM25 build + query; shared TF-IDF + tokenizer
  Output/
    MarkdownWriter.cs
.agent/skills/
  said-prepare/
  said-search/
  said-slice/
  said-stats/
  said-embed-index/
  said-embed-search/
- System.CommandLine — CLI parsing
- FuzzySharp — fuzzy string matching for search