metjuperry/said

said

Your meetings said things. said finds them.

A CLI tool and agent toolkit for searching, slicing, and querying .vtt and .srt meeting transcripts. Ask what was discussed, who said it, and when — from the command line or directly from your AI agent.


What it does

said treats transcripts as a queryable data source, not just text files to convert:

  • Search — find where a topic was discussed with exact, prefix, and fuzzy matching across morphologically rich languages
  • Slice — extract a time window, a single speaker's contributions, or the section around a timestamp
  • BM25 ranking — run natural-language queries against an indexed transcript; get the most relevant speaker runs ranked by relevance score
  • Stats — get speaker list, talk-time breakdown, and duration before deciding how to query
  • Prepare — when you need the full cleaned transcript: strips filler words, merges cues into paragraphs, collapses STT artifacts, fixes overlapping caption interleaving

Every command defaults to non-interactive mode (-y) when piped, which makes said well suited to agentic workflows.


Agent use

Six skills are provided in .agent/skills/ for use with GitHub Copilot CLI, Claude Code, or any compatible agent framework:

| Skill | When to invoke |
|---|---|
| `said-stats` | First: get speakers, duration, and talk-time before querying |
| `said-search` | "Was X mentioned?", "When was Y first raised?", exact/fuzzy term lookup |
| `said-embed-search` | "What was discussed about Z?" (conceptual BM25 queries) |
| `said-embed-index` | Pre-build the BM25 index for a transcript |
| `said-slice` | "What did Alice say?", "What happened around minute 15?" |
| `said-prepare` | Full cleaned transcript for summarization or action item extraction |

Recommended workflow

# 1. Orient: who spoke, how long
said stats meeting.vtt -y

# 2. Find the relevant section
said search meeting.vtt -y --lang cs -s "budget,deadline" --context 2 --compact

# 3. Read the full surrounding context
said slice meeting.vtt -y --lang cs --from "00:23" --to "00:31" --compact

# 4. Or run a conceptual BM25 query
said embed meeting.vtt --query "concerns about the timeline" --top 5 -y
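
For agents that drive said from a script rather than a shell, the workflow above could be wrapped roughly as follows. This is a minimal sketch, not part of the repository: the helper names `search_cmd`, `slice_cmd`, and `run_said` are hypothetical, the code assumes `said` is on `PATH`, and it assumes the tool signals failure with a non-zero exit code. Only flags listed in this README are used.

```python
import subprocess
from typing import List, Sequence


def search_cmd(path: str, terms: Sequence[str], lang: str = "en",
               context: int = 2) -> List[str]:
    """Build argv for step 2 of the workflow: keyword search with context."""
    return ["said", "search", path, "-y", "--lang", lang,
            "-s", ",".join(terms), "--context", str(context), "--compact"]


def slice_cmd(path: str, start: str, end: str, lang: str = "en") -> List[str]:
    """Build argv for step 3: read the window between two timestamps."""
    return ["said", "slice", path, "-y", "--lang", lang,
            "--from", start, "--to", end, "--compact"]


def run_said(argv: List[str]) -> str:
    """Run said and return stdout (assumes a non-zero exit code on failure)."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing an argv list instead of a shell string avoids quoting problems when search terms contain spaces or non-ASCII characters.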

Installation

Requires .NET 10 SDK.

git clone <repo-url>
cd said
make install

Or manually:

dotnet pack Said/Said.csproj -o ./nupkg
dotnet tool install -g --add-source ./nupkg said

Verify:

said --version

CLI reference

said prepare — full cleaned transcript

said prepare <input-file> [options]
| Flag | Default | Description |
|---|---|---|
| `-o, --output` | `<input>.md` | Output path. `-o -` forces stdout. |
| `-t, --title` | filename | Meeting title shown in the header. |
| `-d, --date` | today | Meeting date `YYYY-MM-DD`. |
| `-p, --participants` | inferred | `"Alice (PM), Bob (Eng)"`; canonicalizes speaker names. |
| `-l, --lang` | `en` | Filler-word preset: `en`, `cs`, `sk`, `de`, `fr`, `es`. |
| `--fillers` | | Custom comma-separated fillers, overrides `--lang`. |
| `--keep-timestamps` | off | Prepend `[HH:MM]` to each speaker block. |
| `--chunk-size` | `0` | Split into `## Part N of M` sections at speaker boundaries. |
| `--compact` | off | Single-line `SPEAKER: text` format (~15% fewer tokens). |
| `-y, --yes` | off | Non-interactive. Skips all prompts. |
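
To make the cue-merging and `--compact` behaviour concrete, here is a rough Python sketch of the idea. It is an illustration only, not the tool's C# implementation: the `Cue` type and both function names are hypothetical, and the merge rule (join consecutive cues from the same speaker) is an assumption about what "merges cues into paragraphs" means.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    speaker: str
    text: str


def merge_cues(cues: List[Cue]) -> List[Cue]:
    """Join consecutive cues from the same speaker into one run."""
    runs: List[Cue] = []
    for cue in cues:
        text = cue.text.strip()
        if not text:
            continue  # skip cues that are empty after cleaning
        if runs and runs[-1].speaker == cue.speaker:
            runs[-1] = Cue(cue.speaker, runs[-1].text + " " + text)
        else:
            runs.append(Cue(cue.speaker, text))
    return runs


def to_compact(runs: List[Cue]) -> List[str]:
    """Render runs in the single-line SPEAKER: text form."""
    return [f"{run.speaker.upper()}: {run.text}" for run in runs]
```

For example, two cues from Alice followed by one from Bob collapse to two lines, `ALICE: ...` and `BOB: ...`.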

said search — keyword search with timestamps

said search <input-file> -s "<terms>" [options]
| Flag | Default | Description |
|---|---|---|
| `-s, --search` | required | Comma-separated terms. |
| `--context` | `2` | Speaker-runs of context around each hit. |
| `--compact` | off | Single-line output. |
| `--first` | off | Return only the first hit per term. |
| `--last` | off | Return only the last hit per term. |
| `-l, --lang` | `en` | Filler preset. |
| `-y` | off | Non-interactive. |

Search runs three passes per token: exact whole-word, then prefix (test → testovací), then fuzzy Levenshtein matching (similarity ≥ 78–85%). Timestamps shown as `00:05` come from exact matches; timestamps shown as `~00:05` come from prefix or fuzzy matches.
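
The three-pass idea can be sketched in Python as below. This is an illustration under stated assumptions, not the code in Searcher.cs: the function names are hypothetical, similarity is assumed to be `1 - distance / longer_length`, and a single threshold of 0.8 is picked from within the documented range.

```python
import re
from typing import List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def match_kind(term: str, word: str, threshold: float = 0.8) -> Optional[str]:
    """Classify a word as an exact, prefix, or fuzzy match for term."""
    term, word = term.lower(), word.lower()
    if word == term:
        return "exact"
    if word.startswith(term):
        return "prefix"
    longest = max(len(term), len(word))
    if longest and 1 - levenshtein(term, word) / longest >= threshold:
        return "fuzzy"
    return None


def find_hits(runs: List[Tuple[str, str]], term: str) -> List[Tuple[str, str]]:
    """Return (timestamp, kind) per matching run; non-exact hits get a ~ prefix."""
    hits = []
    for ts, text in runs:
        kinds = {match_kind(term, w) for w in re.findall(r"\w+", text)}
        for kind in ("exact", "prefix", "fuzzy"):  # strongest match wins
            if kind in kinds:
                hits.append((ts if kind == "exact" else "~" + ts, kind))
                break
    return hits
```

The prefix pass is what lets a stem such as `test` find inflected forms like `testovací`, which matters for Czech and Slovak transcripts.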

said slice — extract by time, speaker, or question type

said slice <input-file> [filters] [options]
| Flag | Default | Description |
|---|---|---|
| `--from` | start | Start of window `HH:MM`. |
| `--to` | end | End of window `HH:MM`. |
| `--around` | | Center timestamp; use with `--window`. |
| `--window` | `6` | Runs either side when using `--around`. |
| `--speaker` | all | Comma-separated speaker name(s), substring-matched. |
| `--questions` | off | Extract only turns that contain a question. |
| `--compact` | off | Single-line output. |
| `-l, --lang` | `en` | Filler preset. |
| `-y` | off | Non-interactive. |
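
As a rough picture of how `--around` and `--window` interact, the sketch below selects the run that contains the center timestamp plus a fixed number of runs on each side. It is not the tool's implementation: the function names are hypothetical, runs are assumed to be `(start_seconds, speaker, text)` tuples sorted by start time, and timestamps are parsed as `HH:MM` as the table states.

```python
import bisect
from typing import List, Tuple

Run = Tuple[int, str, str]  # (start in seconds, speaker, text)


def to_seconds(hhmm: str) -> int:
    """Parse an HH:MM timestamp into seconds."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 3600 + int(minutes) * 60


def slice_around(runs: List[Run], center: str, window: int = 6) -> List[Run]:
    """Return the run at or just before center, plus window runs either side."""
    starts = [run[0] for run in runs]
    # Index of the last run that starts at or before the center timestamp.
    idx = max(bisect.bisect_right(starts, to_seconds(center)) - 1, 0)
    lo = max(idx - window, 0)
    return runs[lo: idx + window + 1]
```

Counting in speaker runs rather than seconds keeps the slice aligned to whole turns, so no contribution is cut off mid-sentence.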

said stats — metadata and speaker breakdown

said stats <input-file> [-l <lang>] [-y]

Returns duration, participant list, and per-speaker word count and talk-time share. Use before other commands to identify speakers and decide on chunking.
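
The per-speaker numbers can be pictured with the short Python sketch below. It is only an illustration: the README does not say how talk time is measured, so summing cue durations is an assumption, and `speaker_stats` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TimedCue = Tuple[str, float, float, str]  # (speaker, start_s, end_s, text)


def speaker_stats(cues: List[TimedCue]) -> Dict[str, Dict[str, float]]:
    """Per-speaker word count and share of total talk time."""
    words: Dict[str, int] = defaultdict(int)
    seconds: Dict[str, float] = defaultdict(float)
    for speaker, start, end, text in cues:
        words[speaker] += len(text.split())
        seconds[speaker] += max(end - start, 0.0)  # ignore malformed cues
    total = sum(seconds.values())
    return {
        speaker: {
            "words": words[speaker],
            "share": seconds[speaker] / total if total else 0.0,
        }
        for speaker in words
    }
```

A speaker with a very small share is often a good candidate for a targeted `--speaker` slice instead of a full-transcript read.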

said embed — BM25 index and natural-language queries

said embed <input-file> [options]
| Flag | Default | Description |
|---|---|---|
| `-q, --query` | | Natural-language query. If omitted, just builds the index. |
| `--top` | `5` | Maximum results to return. |
| `--rebuild` | off | Regenerate index even if `.idx.json` exists. |
| `-l, --lang` | `en` | Filler preset. |
| `-y` | off | Non-interactive. |

The index is stored as <input>.idx.json alongside the source file. Okapi BM25 (k1 = 1.5, b = 0.75) treats each speaker run as a document and ranks runs using term-frequency saturation and length normalisation, which typically orders results better than plain keyword matching. It does not understand synonyms; combine it with said search for full coverage.
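
For readers unfamiliar with BM25, the following Python sketch scores a list of speaker runs against a query with the parameters quoted above. It is a minimal illustration, not the code in Bm25Indexer.cs: the tokenizer, the function names, and the choice of the non-negative IDF variant `ln((N - n + 0.5) / (n + 0.5) + 1)` are all assumptions.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

K1, B = 1.5, 0.75  # parameters quoted in this README


def tokenize(text: str) -> List[str]:
    """Lowercase word tokenizer (an assumption; no stemming)."""
    return re.findall(r"\w+", text.lower())


def bm25_rank(runs: List[str], query: str, top: int = 5) -> List[Tuple[int, float]]:
    """Return (run index, score) pairs, best first, for runs with a score > 0."""
    docs = [tokenize(run) for run in runs]
    n = len(docs)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n or 1.0
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scored = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in set(tokenize(query)):
            f = tf.get(term, 0)
            if f == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = f + K1 * (1 - B + B * len(doc) / avgdl)
            score += idf * f * (K1 + 1) / norm
        if score > 0:
            scored.append((score, i))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:top]]
```

In this sketch a query word such as `concerns` does not match `concern` in a run, which is the same limitation the paragraph above describes for synonyms: the ranking only sees literal token overlap.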


Language presets

| Code | Sample fillers |
|---|---|
| `en` | um, uh, hmm, mhm, yeah, mm, hm |
| `cs` | jo, no, jasně, ano, jj, dobře, přesně, hm, hmm, aha |
| `sk` | jo, no, jasné, áno, dobre, presne, hm, hmm, aha |
| `de` | ja, ok, okay, ähm, äh, mhm, hmm, genau, stimmt |
| `fr` | euh, hm, hmm, ouais, mm, mhm, d'accord |
| `es` | eh, hm, hmm, sí, mm, mhm, ajá, ok, vale |
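
A preset can be thought of as a word list applied with whole-word matching. The Python sketch below shows one naive way to do that; it is not the logic in NoiseSuppressor.cs, which may well be more selective (for example, removing words such as "yeah" or "no" only when they stand alone). The `strip_fillers` name is hypothetical, and only the `en` and `cs` samples from the table are included.

```python
import re
from typing import Dict, List, Optional

FILLERS: Dict[str, List[str]] = {
    "en": ["um", "uh", "hmm", "mhm", "yeah", "mm", "hm"],
    "cs": ["jo", "no", "jasně", "ano", "jj", "dobře", "přesně", "hm", "hmm", "aha"],
}


def strip_fillers(text: str, lang: str = "en",
                  custom: Optional[List[str]] = None) -> str:
    """Remove whole-word fillers; a custom list overrides the language preset."""
    fillers = custom if custom is not None else FILLERS[lang]
    if not fillers:
        return text
    # Longest first so "hmm" is tried before "hm".
    alternatives = "|".join(
        re.escape(f) for f in sorted(fillers, key=len, reverse=True)
    )
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)[,.]?\s*", re.IGNORECASE)
    cleaned = pattern.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

The `custom` argument mirrors how `--fillers` overrides `--lang`: when it is given, the preset is ignored entirely.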

Build

make build      # dotnet build
make pack       # dotnet pack → ./nupkg/
make install    # pack + dotnet tool install -g
make uninstall  # dotnet tool uninstall -g said

Or use build.sh build|pack|install|uninstall.


Project structure

Said/
  Said.csproj
  Program.cs
  Models/
    CaptionCue.cs
    TranscriptOptions.cs
    ProcessedTranscript.cs
    SearchMatch.cs
    TranscriptIndex.cs    ← BM25 index model + IndexedRun record
  Parsing/
    ICaptionParser.cs
    VttParser.cs          ← Teams VTT with <v Speaker> tags
    SrtParser.cs          ← Zoom SRT with "Speaker:" prefixes
  Processing/
    NoiseSuppressor.cs    ← fillers, markers, word-doubling, dedup
    SpeakerNormalizer.cs  ← UPPERCASE, whitespace collapse, participant matching
    Merger.cs             ← cue → paragraph merge + backchannel filter
    Chunker.cs            ← word-count chunking at speaker boundaries
    Searcher.cs           ← exact / prefix / fuzzy search + context extract
    Bm25Indexer.cs        ← BM25 build + query; shared TF-IDF + tokenizer
  Output/
    MarkdownWriter.cs
.agent/skills/
  said-prepare/
  said-search/
  said-slice/
  said-stats/
  said-embed-index/
  said-embed-search/
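
The structure above notes that VttParser.cs handles Teams VTT files with `<v Speaker>` tags. As a rough illustration of that input format (written in Python, not the project's C# code), the sketch below pulls the timing, speaker, and text out of each cue. The `parse_vtt` name and the returned tuple shape are hypothetical, and real Teams exports may contain cue identifiers or settings this sketch ignores.

```python
import re
from typing import List, Tuple

TIMING = re.compile(
    r"((?:\d{2,}:)?\d{2}:\d{2}\.\d{3}) --> ((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})"
)
VOICE = re.compile(r"<v ([^>]+)>(.*?)(?:</v>|$)", re.DOTALL)


def parse_vtt(text: str) -> List[Tuple[str, str, str, str]]:
    """Return (start, end, speaker, text) for each cue in a WebVTT document."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            timing = TIMING.search(line)
            if not timing:
                continue  # header or cue identifier line
            payload = "\n".join(lines[i + 1:])
            voice = VOICE.search(payload)
            speaker = voice.group(1).strip() if voice else ""
            body = voice.group(2) if voice else payload
            cues.append((timing.group(1), timing.group(2), speaker,
                         " ".join(body.split())))
            break
    return cues
```

Cues without a voice tag are kept with an empty speaker, so later stages can decide how to attribute them.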

Dependencies
