[Feature] Speed up file-based auto-tag (preload + parallelize)#6848
XTVB wants to merge 2 commits into stashapp:develop
Conversation
Replaces the per-file SQL QueryForAutoTag prefilter with an in-memory
2-rune prefix index over performers/studios/tags, preloaded once at job
start. Also:
- runs file processing through job.TaskQueue so scenes/images/
galleries tag in parallel instead of one file at a time
- keyset-paginates the query loop so batch N+1 doesn't pay the
O(offset) scan past large tables
- bulk-loads studio/tag aliases via a new optional AllAliasLoader
interface, avoiding N+1 GetAliases calls during preload
- caches compiled name regexps (same candidate names repeat across
thousands of files)
- hoists strings.ToLower(path) and allASCII(path) out of the per-
candidate match loop
- opens a fresh write txn per applied match instead of holding one
for every tagger phase
Tagger gains *AtPath methods that own the cache + txn manager, letting
the task code stay slim.
The preceding commit added lru.Cache for the compiled-regexp cache to match the style in pkg/sqlite/regex.go. That file's use case is different: a small bounded cache serving a read-dominated workload. The auto-tag regexp cache is job-scoped (so eviction buys us nothing) and hit by every worker on every candidate (so the LRU's per-Get mutex becomes contention, measurable under the parallel worker pool). sync.Map's read-optimised path avoids the contention without changing any observable behavior. Kept as a separate commit so it can be reverted independently if upstream prefers the LRU approach — the first commit stands on its own either way.
This is an insane amount of memory you're trying to allocate, and your tables don't quite seem to line up or extrapolate linearly. Can you provide your synthetic db and the commit you're testing against? There was a memory leak in 0.31.0 that was patched.
Sure, here's the commit with the synthetic db/benchmarking script: 338c0b0. I should've already had the memory leak fix pulled into my fork when I ran it; I believe I had all the commits other than some minor documentation that got added, but I'm happy to run the baseline again if you spot something silly I'm doing. I agree I was startled by the size of the allocations. To be clear, I think the allocation pressure is less relevant than the
The speed improvements I'm quite confident about: on my local stash instance with performer-only auto-tagging, where I have ~600 performers and ~70k files, it went from taking ~70 seconds to ~3 seconds on non-synthetic data. I then ran the synthetic-data benchmarking to validate that it was still useful across the full range of DB sizes.
Scope
File-based auto-tag is O(files × entities) in the worst case, and on large libraries the per-file SQL prefilter, per-file regex recompilation, and serial processing compound to make the job effectively unusable at very large scale (multi-hour runs at stashdb-comparable sizes — see bench below), and slower than it needs to be at smaller scales too. This PR keeps the matching semantics identical and attacks the constant factors.
Commits
Two commits, kept separate on purpose:
1. The main perf work: preload + parallelize (this commit introduced the compiled-regexp cache as lru.Cache to match pkg/sqlite/regex.go's house style).
2. Swap that cache to sync.Map: the cache is job-scoped and hit by every worker on every candidate, so the LRU's per-Get mutex becomes contention; sync.Map's read-optimised path avoids that. Kept as its own commit so it can be reverted cleanly if you'd prefer to keep style consistency with pkg/sqlite/regex.go and accept the slowdown.

Idea
Auto-tag's hot loop is: for every file, ask the DB for performers/studios/tags whose name prefix looks like it could match a path word, then regex-check each candidate against the path. When you're tagging 100k files against 100k performers, that's 100k prefilter queries plus an enormous amount of repeated regex compilation and strings.ToLower over the same strings. Flip it: the candidate set doesn't change across files in one job, so preload it once, index it for fast per-path lookup, and let workers process files in parallel.
Design choices
- Prefix index: map[2-rune-prefix] → []candidate over names and (for studios/tags) aliases. Per-file lookup then unions the candidate slices for each path word's 2-rune prefix. Mirrors the SQL name LIKE 'xx%' prefilter and feeds the same downstream regex matcher — so correctness is determined by the regex, not the prefix.
- Single-character names keep the singleFirstCharacterRegex fallback.
- The cache is threaded through the PathTo{Performers,Studio,Tags} entry points. There's no new configuration surface and no behavior change for non-bulk callers.
- Bulk alias loading: a new optional AllAliasLoader alongside AliasLoader. Callers type-assert; stores that don't implement it fall back to per-id GetAliases. Avoids an N+1 during preload without forcing every implementation to add a new method.
- Parallelism: file processing runs through job.TaskQueue. Worker count comes from the existing ParallelTasksWithAutoDetection setting rather than a new knob.
- Keyset pagination: WHERE id > lastID ORDER BY id replaces LIMIT/OFFSET so batch N doesn't pay an O(offset) scan past earlier rows. Without this, throughput degrades as the job progresses.
- Regexp caching: compiled regexps are cached by (name, useUnicode) key.
- Per-file hoisting: strings.ToLower(path) and allASCII(path) move out of the per-candidate inner loop into a pathMatcher built once per file.
- The Tagger gains *AtPath methods. They own the cache and the TxnManager, so task_autotag.go stays thin. Tests were updated to exercise the new entry point; the old free functions are removed (the package is internal/, so this has no external-API impact).

Trade-off
The preloaded index holds every non-ignored performer/studio/tag + aliases + compiled regexps in memory for the duration of the job. On a stashdb-comparable library (~100k performers / 13k studios / 3k tags) this is ~573 MiB heap / ~724 MiB peak RSS — job-scoped, released after the run. The baseline uses almost no extra RAM but takes ~4 hours at that scale; this code finishes in ~2 minutes. See bench below.
Testing
Correctness is guarded at three layers:
1. pkg/match/path_semantic_test.go (new, ~425 lines). Drives PathToPerformers, PathToStudio, and PathToTags through the existing generated testify mocks in pkg/models/mocks (the same infrastructure the internal/autotag/*_test.go suite uses via mocks.NewDatabase()) against representative inputs (plain name, separator variants, case-insensitivity, unicode, ignore_auto_tag, no-substring-match, multiple-match rightmost-wins, alias matching). Each case runs twice — once with cache=nil (the old SQL-prefilter path) and once with a preloaded cache — asserting identical output. This is the regression guard: if the preloaded path ever diverges from the SQL-prefilter path, the matching suite fails.
2. pkg/match/cache_test.go (new). Unit tests for firstTwoRunesLower (incl. unicode and single-rune edge cases), the regexp cache (hit/miss by (name, useUnicode) key, nil-cache fallback), and the preload + candidate-lookup path (prefix union across path words, always-check inclusion, ignore_auto_tag honored). Also uses the generated pkg/models/mocks.
3. internal/autotag/{scene,image,gallery}_test.go. Existing mock-based tagger tests were updated to drive the new Tagger.*AtPath methods with the same fixtures and assertions. Same regression coverage, new entry point.

Benchmarks
Comparison between baseline — file-based auto-tag as it stood before the perf work — and new — preload + parallelize + bulk aliases.
Both runs use the same seeded synthetic DB content designed to match realistic use so that the DB is identical across baseline and new at each preset size. File types split 60/30/10 scenes/images/galleries. Match distribution: ~30% multi-match, ~50% single-match, ~20% no-match.
Hardware: MacBook (ARM, Darwin 25.2.0), Go 1.26.
- Heap in use: runtime.MemStats.HeapInuse at end of run (post-GC of setup garbage)
- Peak RSS: getrusage(RUSAGE_SELF).ru_maxrss at end of run — process-lifetime peak

tiny — 100 performers · 20 studios · 50 tags · 1 000 files
Small absolute numbers, no regression at low scale — new is still meaningfully faster even when the preload overhead has little to amortize against.
small — 1 000 performers · 200 studios · 300 tags · 10 000 files
Allocation churn drops sharply (5.0 GiB → 461 MiB) — the prefix index and regex cache eliminate most of the per-file regex compilation & name string-lowering that the baseline pays on every file.
medium — 10 000 performers · 1 300 studios · 1 000 tags · 50 000 files
Baseline burns 187 GiB of transient allocations across 34 k GC cycles — a symptom of the per-file QueryForAutoTag + regex-recompile + path-lowercase treadmill. The preload fixes all three.

large — 100 000 performers · 13 000 studios · 3 000 tags · 100 000 files
Note on baseline: we let baseline-large run for 29:30 CPU time and checked DB state — it had processed 13 395 / 60 000 scenes (22.3%) and hadn't started images (30k) or galleries (10k). Extrapolating the observed ~7 scenes/sec rate, baseline-large would need ~3.5 more hours to complete (~4 hours total). We aborted to save compute. The ≥13× faster ratio uses only the ~30 min observed on baseline, not the extrapolated full run; with the full run the multiplier is ~60×.
Memory footprint at this scale: 573 MiB heap in use / 724 MiB peak RSS. The preloaded in-memory candidate set (100 k performers × pointer + prefix index + regex cache + all aliases) is the price for the throughput gain. For reference: stashdb's ~100 k performers / 13 k studios / 3 k tags fits comfortably in this envelope on any modern machine; the baseline is effectively unusable at this scale (4-hour auto-tag run) even though it uses little memory.
Summary
Speedup grows with scale. At the small end the preload + parallelism overhead is already paid back. At stashdb-comparable scale (large) the baseline is unusable for interactive auto-tagging; the new code finishes in 2 minutes.
Memory trade-off
Heap in use scales with entity count because the preloaded index holds all performers/studios/tags + their aliases + compiled regexps in memory:
The cache is job-scoped: it's held for the duration of one auto-tag run and released afterwards, so it doesn't inflate the stash process's steady-state RSS. For a user with a 100 k-performer library, paying 724 MiB peak RSS during a 2-minute auto-tag job (versus 4 hours with negligible extra RAM) is an unambiguous win.