said

Your meetings said things. said finds them.

A CLI tool and agent toolkit for searching, slicing, and querying .vtt and .srt meeting transcripts. Ask what was discussed, who said it, and when — from the command line or directly from your AI agent.

What it does

said treats transcripts as a queryable data source, not just text files to convert:

Search — find where a topic was discussed with exact, prefix, and fuzzy matching across morphologically rich languages
Slice — extract a time window, a single speaker's contributions, or the section around a timestamp
BM25 ranking — run natural-language queries against an indexed transcript; get the most relevant speaker runs ranked by relevance score
Stats — get speaker list, talk-time breakdown, and duration before deciding how to query
Prepare — when you need the full cleaned transcript: strips filler words, merges cues into paragraphs, collapses STT artifacts, fixes overlapping caption interleaving
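To make the Prepare step concrete, here is a minimal Python sketch of two of its stages: filler stripping and merging consecutive cues by the same speaker. It is an illustration only, not said's actual implementation. The `(speaker, text)` cue shape and the function names are assumptions, and the filler set is just the `en` sample list from the Language presets table.

```python
# Sample English fillers taken from the `en` preset table; the real preset may differ.
EN_FILLERS = {"um", "uh", "hmm", "mhm", "yeah", "mm", "hm"}

def strip_fillers(text, fillers=EN_FILLERS):
    """Drop standalone filler tokens; punctuation attached to a filler goes with it."""
    kept = [w for w in text.split() if w.strip(".,!?").lower() not in fillers]
    return " ".join(kept)

def merge_cues(cues, fillers=EN_FILLERS):
    """Merge consecutive cues by the same speaker into one paragraph each.

    `cues` is a list of (speaker, text) tuples in transcript order (assumed shape).
    """
    runs = []
    for speaker, text in cues:
        cleaned = strip_fillers(text, fillers)
        if not cleaned:
            continue  # the cue was nothing but fillers, e.g. a backchannel "Yeah."
        if runs and runs[-1][0] == speaker:
            # Same speaker as the previous run: extend that paragraph.
            runs[-1] = (speaker, runs[-1][1] + " " + cleaned)
        else:
            runs.append((speaker, cleaned))
    return runs
```

Because filler-only cues are dropped before merging, a short interjection from another speaker does not split the surrounding paragraph in this sketch.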

When piped, every command runs non-interactively, as if `-y` had been passed, which makes said well suited to agentic workflows.

Agent use

Six skills are provided in .agent/skills/ for use with GitHub Copilot CLI, Claude Code, or any compatible agent framework:

| Skill | When to invoke |
|---|---|
| `said-stats` | First: get speakers, duration, and talk-time before querying |
| `said-search` | "Was X mentioned?", "when was Y first raised?", exact/fuzzy term lookup |
| `said-embed-search` | "What was discussed about Z?": conceptual BM25 queries |
| `said-embed-index` | Pre-build the BM25 index for a transcript |
| `said-slice` | "What did Alice say", "what happened around minute 15" |
| `said-prepare` | Full cleaned transcript for summarization or action item extraction |

Recommended workflow

# 1. Orient: who spoke, how long
said stats meeting.vtt -y

# 2. Find the relevant section
said search meeting.vtt -y --lang cs -s "budget,deadline" --context 2 --compact

# 3. Read the full surrounding context
said slice meeting.vtt -y --lang cs --from "00:23" --to "00:31" --compact

# 4. Or run a conceptual BM25 query
said embed meeting.vtt --query "concerns about the timeline" --top 5 -y

Installation

Requires the .NET 10 SDK.

git clone <repo-url>
cd transcript-cleanup
make install

Or manually:

dotnet pack Said/Said.csproj -o ./nupkg
dotnet tool install -g --add-source ./nupkg said

Verify:

said --version

CLI reference

`said prepare` — full cleaned transcript

said prepare <input-file> [options]

| Flag | Default | Description |
|---|---|---|
| `-o, --output` | `<input>.md` | Output path. `-o -` forces stdout. |
| `-t, --title` | filename | Meeting title shown in the header. |
| `-d, --date` | today | Meeting date, `YYYY-MM-DD`. |
| `-p, --participants` | inferred | `"Alice (PM), Bob (Eng)"`; canonicalizes speaker names. |
| `-l, --lang` | `en` | Filler-word preset: `en`, `cs`, `sk`, `de`, `fr`, `es`. |
| `--fillers` | none | Custom comma-separated fillers; overrides `--lang`. |
| `--keep-timestamps` | off | Prepend `[HH:MM]` to each speaker block. |
| `--chunk-size` | `0` | Chunk size in words. Splits the output into `## Part N of M` sections at speaker boundaries. |
| `--compact` | off | Single-line `SPEAKER: text` format (~15% fewer tokens). |
| `-y, --yes` | off | Non-interactive. Skips all prompts. |

`said search` — keyword search with timestamps

said search <input-file> -s "<terms>" [options]

| Flag | Default | Description |
|---|---|---|
| `-s, --search` | required | Comma-separated terms. |
| `--context` | `2` | Speaker runs of context around each hit. |
| `--compact` | off | Single-line output. |
| `--first` | off | Return only the first hit per term. |
| `--last` | off | Return only the last hit per term. |
| `-l, --lang` | `en` | Filler preset. |
| `-y` | off | Non-interactive. |

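Search runs three passes per token: exact whole-word match, then prefix match (`test` → `testovací`), then fuzzy Levenshtein match with a similarity threshold of 78–85%. A plain timestamp such as `00:05` marks an exact hit; a tilde-prefixed one such as `~00:05` marks a prefix or fuzzy hit.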
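The three-pass cascade can be sketched in a few lines of Python. This is a rough illustration under stated assumptions rather than the tool's real matcher: said uses FuzzySharp, whose ratio is computed differently, so the `similarity` definition (1 minus edit distance over the longer length), the fixed `threshold=0.80`, and the function names here are all hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Normalised similarity in [0, 1]; an assumed definition, not FuzzySharp's."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def match_kind(term, word, threshold=0.80):
    """Return 'exact', 'prefix', 'fuzzy', or None for one term/word pair."""
    term, word = term.lower(), word.lower()
    if word == term:
        return "exact"
    if word.startswith(term):
        return "prefix"
    if similarity(term, word) >= threshold:
        return "fuzzy"
    return None
```

Trying the cheap checks first means the edit-distance computation only runs for words that are neither identical to the term nor an extension of it. The prefix pass is what lets a stem such as `test` reach inflected forms like `testovací`.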

`said slice` — extract by time, speaker, or question type

said slice <input-file> [filters] [options]

| Flag | Default | Description |
|---|---|---|
| `--from` | start | Start of window, `HH:MM`. |
| `--to` | end | End of window, `HH:MM`. |
| `--around` | none | Center timestamp; use with `--window`. |
| `--window` | `6` | Runs either side when using `--around`. |
| `--speaker` | all | Comma-separated speaker name(s), substring-matched. |
| `--questions` | off | Extract only turns that contain a question. |
| `--compact` | off | Single-line output. |
| `-l, --lang` | `en` | Filler preset. |
| `-y` | off | Non-interactive. |
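As an illustration of how `--around` and `--window` interact, the following Python sketch picks the run closest to the center timestamp and returns that many runs on either side. It is a guess at the behaviour described in the table, not the tool's code: the `(stamp, speaker, text)` run shape, the nearest-run rule, and the function names are assumptions.

```python
def to_minutes(stamp):
    """Convert an 'HH:MM' stamp to a minute count."""
    hours, minutes = stamp.split(":")
    return int(hours) * 60 + int(minutes)

def slice_around(runs, around, window=6):
    """Return up to `window` runs either side of the run nearest to `around`.

    `runs` is a chronological list of (stamp, speaker, text) tuples (assumed shape).
    """
    if not runs:
        return []
    target = to_minutes(around)
    # Index of the run whose start stamp is closest to the requested time.
    center = min(range(len(runs)),
                 key=lambda i: abs(to_minutes(runs[i][0]) - target))
    return runs[max(0, center - window): center + window + 1]
```

With the default window of 6, a full slice contains at most 13 runs: the center run plus six before and six after.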

`said stats` — metadata and speaker breakdown

said stats <input-file> [-l <lang>] [-y]

Returns duration, participant list, and per-speaker word count and talk-time share. Use before other commands to identify speakers and decide on chunking.
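A per-speaker breakdown of this kind can be sketched as follows. This is a minimal Python illustration, assuming cues arrive as `(speaker, start_seconds, end_seconds, text)` tuples and that talk-time share is each speaker's cue time divided by total cue time; the real command may compute share differently.

```python
from collections import defaultdict

def speaker_stats(cues):
    """Per-speaker word count and share of total spoken time.

    `cues` is a list of (speaker, start_seconds, end_seconds, text) tuples (assumed shape).
    """
    words = defaultdict(int)
    seconds = defaultdict(float)
    for speaker, start, end, text in cues:
        words[speaker] += len(text.split())
        seconds[speaker] += max(0.0, end - start)  # guard against inverted cues
    total = sum(seconds.values())
    return {
        speaker: {
            "words": words[speaker],
            "share": seconds[speaker] / total if total else 0.0,
        }
        for speaker in words
    }
```

A lopsided share is a useful hint for later queries: if one speaker dominates, slicing by speaker is likely to return most of the transcript.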

`said embed` — BM25 index and natural-language queries

said embed <input-file> [options]

| Flag | Default | Description |
|---|---|---|
| `-q, --query` | none | Natural-language query. If omitted, just builds the index. |
| `--top` | `5` | Maximum results to return. |
| `--rebuild` | off | Regenerate index even if `.idx.json` exists. |
| `-l, --lang` | `en` | Filler preset. |
| `-y` | off | Non-interactive. |

The index is stored as `<input>.idx.json` alongside the source file. Okapi BM25 (k1 = 1.5, b = 0.75) treats each speaker run as a document and scores it with term-frequency saturation and document-length normalisation, which ranks results more usefully than plain keyword matching. It does not understand synonyms; combine it with `said search` for full coverage.
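For readers unfamiliar with BM25, here is a compact Python sketch of the scoring formula with the parameters quoted above (k1 = 1.5, b = 0.75). It is not the code in `Bm25Indexer.cs`. The whitespace tokenizer, the `ln(1 + ...)` IDF variant, and the function name are assumptions made to keep the example short.

```python
import math
from collections import Counter

def bm25_rank(runs, query, k1=1.5, b=0.75, top=5):
    """Rank text runs against a query with Okapi BM25; returns [(run_index, score)]."""
    docs = [run.lower().split() for run in runs]  # naive tokenizer (assumption)
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: in how many runs each term appears.
    df = Counter(term for d in docs for term in set(d))
    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant; rarer terms weigh more.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long runs are penalised, scaled by b.
            norm = 1 - b + b * len(doc) / avg_len
            # Saturation: repeated terms add less and less, controlled by k1.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        if score > 0:
            scored.append((score, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, score) for score, idx in scored[:top]]
```

Runs that share no token with the query score zero and are left out, which is the synonym limitation in practice: a query for "schedule" will not surface a run that only says "timeline".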

Language presets

| Code | Sample fillers |
|---|---|
| `en` | um, uh, hmm, mhm, yeah, mm, hm |
| `cs` | jo, no, jasně, ano, jj, dobře, přesně, hm, hmm, aha |
| `sk` | jo, no, jasné, áno, dobre, presne, hm, hmm, aha |
| `de` | ja, ok, okay, ähm, äh, mhm, hmm, genau, stimmt |
| `fr` | euh, hm, hmm, ouais, mm, mhm, d'accord |
| `es` | eh, hm, hmm, sí, mm, mhm, ajá, ok, vale |

Build

make build      # dotnet build
make pack       # dotnet pack → ./nupkg/
make install    # pack + dotnet tool install -g
make uninstall  # dotnet tool uninstall -g said

Or use build.sh build|pack|install|uninstall.

Project structure

Said/
  Said.csproj
  Program.cs
  Models/
    CaptionCue.cs
    TranscriptOptions.cs
    ProcessedTranscript.cs
    SearchMatch.cs
    TranscriptIndex.cs    ← BM25 index model + IndexedRun record
  Parsing/
    ICaptionParser.cs
    VttParser.cs          ← Teams VTT with <v Speaker> tags
    SrtParser.cs          ← Zoom SRT with "Speaker:" prefixes
  Processing/
    NoiseSuppressor.cs    ← fillers, markers, word-doubling, dedup
    SpeakerNormalizer.cs  ← UPPERCASE, whitespace collapse, participant matching
    Merger.cs             ← cue → paragraph merge + backchannel filter
    Chunker.cs            ← word-count chunking at speaker boundaries
    Searcher.cs           ← exact / prefix / fuzzy search + context extract
    Bm25Indexer.cs        ← BM25 build + query; shared TF-IDF + tokenizer
  Output/
    MarkdownWriter.cs
.agent/skills/
  said-prepare/
  said-search/
  said-slice/
  said-stats/
  said-embed-index/
  said-embed-search/

Dependencies

System.CommandLine — CLI parsing
FuzzySharp — fuzzy string matching for search
