
[Feature] Speed up file-based auto-tag (preload + parallelize) #6848

Open
XTVB wants to merge 2 commits into stashapp:develop from XTVB:autotag-perf

Conversation

@XTVB
Contributor

@XTVB XTVB commented Apr 19, 2026

Scope

File-based auto-tag is O(files × entities) in the worst case, and on large libraries the per-file SQL prefilter, per-file regex recompile, and serial processing compound to make the job effectively unusable at very large scale (multi-hour runs at stashdb-comparable scale — see bench), and slower than it could be at smaller scales too. This PR keeps the matching semantics identical and attacks the constant factors.

Commits

Two commits, kept separate on purpose:

  1. Speed up file-based auto-tag — the whole perf change: preload, prefix index, parallelism, keyset pagination, bulk alias loading, per-path precompute, regexp cache (as lru.Cache to match pkg/sqlite/regex.go's house style).
  2. Use sync.Map instead of LRU for the per-job regexp cache — swaps only the cache data structure. The auto-tag cache is job-scoped (so LRU's eviction buys nothing) and is hit by every worker on every candidate, so the hashicorp LRU's per-Get mutex becomes a contention point under the parallel worker pool — measurable regression at the large preset. sync.Map's read-optimised path avoids that. Kept as its own commit so it can be reverted cleanly if you'd prefer to keep style consistency with pkg/sqlite/regex.go and accept the slowdown.
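The cache swap in the second commit can be sketched as follows. This is a minimal illustration, not the PR's actual code: the type names, key shape, and placeholder pattern are assumptions. The point is the `sync.Map` idiom — a lock-free read path for the hot lookup, with `LoadOrStore` deduplicating concurrent first compiles of the same key.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// regexpCache is a job-scoped compiled-regexp cache keyed by
// (name, useUnicode). No eviction: the cache lives only for one job.
type regexpCache struct {
	m sync.Map // map[cacheKey]*regexp.Regexp
}

type cacheKey struct {
	name       string
	useUnicode bool
}

// get returns the compiled pattern for name, compiling at most once
// per key. sync.Map's read path avoids the per-Get mutex an LRU pays,
// which matters when every worker hits the cache on every candidate.
func (c *regexpCache) get(name string, useUnicode bool) (*regexp.Regexp, error) {
	k := cacheKey{name, useUnicode}
	if v, ok := c.m.Load(k); ok {
		return v.(*regexp.Regexp), nil
	}
	// Placeholder pattern; the real matcher builds a different regexp.
	re, err := regexp.Compile(`(?i)\b` + regexp.QuoteMeta(name) + `\b`)
	if err != nil {
		return nil, err
	}
	// LoadOrStore deduplicates concurrent first compiles of the same key.
	actual, _ := c.m.LoadOrStore(k, re)
	return actual.(*regexp.Regexp), nil
}

func main() {
	var c regexpCache
	re, _ := c.get("Jane Doe", false)
	fmt.Println(re.MatchString("movies/Jane Doe/scene.mp4"))
}
```

Growth is unbounded, but since the key space is the candidate name set that was already preloaded, the cache can never outgrow the preload itself.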

Idea

Auto-tag's hot loop is: for every file, ask the DB for performers/studios/tags whose name prefix looks like it could match a path word, then regex-check each candidate against the path. When you're tagging 100k files against 100k performers, that's 100k prefilter queries plus an enormous amount of repeated regex compilation and strings.ToLower over the same strings.

Flip it: the candidate set doesn't change across files in one job, so preload it once, index it for fast per-path lookup, and let workers process files in parallel.
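The preload-and-index idea can be sketched like this. All identifiers here are illustrative stand-ins, not the PR's actual code; the real index also covers aliases and feeds a regex matcher downstream, so over-inclusion in the candidate lookup is harmless.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

type candidate struct{ name string }

type prefixIndex struct {
	byPrefix    map[string][]candidate // lowercased first two runes of the name -> candidates
	alwaysCheck []candidate            // names starting with a single-rune word ("X Man")
}

// splitWords breaks a name or path into letter/digit runs.
func splitWords(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// firstTwoRunesLower returns the lowercased first two runes of s,
// reporting false when s has fewer than two runes.
func firstTwoRunesLower(s string) (string, bool) {
	runes := []rune(s)
	if len(runes) < 2 {
		return "", false
	}
	return string([]rune{unicode.ToLower(runes[0]), unicode.ToLower(runes[1])}), true
}

// buildIndex runs once per job over the preloaded candidate set.
func buildIndex(names []string) *prefixIndex {
	idx := &prefixIndex{byPrefix: map[string][]candidate{}}
	for _, n := range names {
		words := splitWords(n)
		// A single-rune first word can't be reached via any 2-rune
		// path-word key, so those names are always checked.
		if len(words) > 0 && utf8.RuneCountInString(words[0]) < 2 {
			idx.alwaysCheck = append(idx.alwaysCheck, candidate{n})
			continue
		}
		if p, ok := firstTwoRunesLower(n); ok {
			idx.byPrefix[p] = append(idx.byPrefix[p], candidate{n})
		}
	}
	return idx
}

// candidatesFor unions the slices for each path word's 2-rune prefix
// plus the always-check list; the downstream regex matcher decides
// correctness, so false positives here only cost a regex check.
func (idx *prefixIndex) candidatesFor(path string) []candidate {
	out := append([]candidate(nil), idx.alwaysCheck...)
	seen := map[string]bool{}
	for _, w := range splitWords(path) {
		if p, ok := firstTwoRunesLower(w); ok && !seen[p] {
			seen[p] = true
			out = append(out, idx.byPrefix[p]...)
		}
	}
	return out
}

func main() {
	idx := buildIndex([]string{"Jane Doe", "X Man", "John Smith"})
	for _, c := range idx.candidatesFor("media/jane doe/file.mp4") {
		fmt.Println(c.name)
	}
}
```

With this shape, the per-file cost is a handful of map lookups instead of a SQL round-trip, and each file's candidate list is typically a small fraction of the full entity set.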

Design choices

  • 2-rune prefix index. The preload builds map[2-rune-prefix] → []candidate over names and (for studios/tags) aliases. Per-file lookup then unions the candidate slices for each path word's 2-rune prefix. Mirrors the SQL name LIKE 'xx%' prefilter and feeds the same downstream regex matcher — so correctness is determined by the regex, not the prefix.
  • Always-check list for single-letter-word names. A name like "X Man" wouldn't be reached by any 2-rune path-word key, so those entries stay in a separate slice that's always unioned in. Mirrors the existing singleFirstCharacterRegex fallback.
  • Preload is opt-in by code path, not config. The file-based auto-tag job preloads; everything else (built-in scraper, single-scene tagging) still takes the old per-call SQL-prefilter path via the same PathTo{Performers,Studio,Tags} entry points. There's no new configuration surface and no behavior change for non-bulk callers.
  • Bulk alias loading via an optional interface. Added AllAliasLoader alongside AliasLoader. Callers type-assert; stores that don't implement it fall back to per-id GetAliases. Avoids an N+1 during preload without forcing every implementation to add a new method.
  • Parallelism reuses job.TaskQueue. Worker count comes from the existing ParallelTasksWithAutoDetection setting rather than a new knob.
  • Keyset pagination in the scene/image/gallery query loop. WHERE id > lastID ORDER BY id replaces LIMIT/OFFSET so batch N doesn't pay an O(offset) scan past earlier rows. Without this, throughput degrades as the job progresses.
  • Short-lived write transactions. One write txn per applied match instead of one long-held txn wrapping each tagger phase. Keeps contention low under the worker pool.
  • Compiled-regexp cache + per-path precompute. Same candidate names repeat across thousands of files, so compile once per (name, useUnicode) key. strings.ToLower(path) and allASCII(path) move out of the per-candidate inner loop into a pathMatcher built once per file.
  • Tagger gains *AtPath methods. They own the cache and the TxnManager, so task_autotag.go stays thin. Tests were updated to exercise the new entry point; the old free functions are removed (the package is internal/, so this has no external-API impact).
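The optional-interface pattern behind the bulk alias loading is a standard Go idiom and worth spelling out. This sketch uses illustrative method signatures, not the PR's exact ones:

```go
package main

import "fmt"

// AliasLoader is the existing per-id interface every store implements.
type AliasLoader interface {
	GetAliases(id int) ([]string, error)
}

// AllAliasLoader is the optional bulk variant; stores may or may not
// implement it.
type AllAliasLoader interface {
	GetAllAliases() (map[int][]string, error)
}

// loadAliases prefers the bulk path when the store supports it and
// falls back to per-id calls otherwise, avoiding an N+1 during preload
// without forcing every implementation to add a new method.
func loadAliases(store AliasLoader, ids []int) (map[int][]string, error) {
	if bulk, ok := store.(AllAliasLoader); ok {
		return bulk.GetAllAliases()
	}
	out := make(map[int][]string, len(ids))
	for _, id := range ids {
		a, err := store.GetAliases(id)
		if err != nil {
			return nil, err
		}
		out[id] = a
	}
	return out, nil
}

// perIDStore only implements the old interface, exercising the fallback.
type perIDStore struct{}

func (perIDStore) GetAliases(id int) ([]string, error) {
	return []string{fmt.Sprintf("alias-%d", id)}, nil
}

func main() {
	m, _ := loadAliases(perIDStore{}, []int{1, 2})
	fmt.Println(len(m), m[1][0])
}
```

The type assertion keeps the upgrade purely additive: existing stores compile unchanged, and any store can opt into the fast path later by adding one method.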

Trade-off

The preloaded index holds every non-ignored performer/studio/tag + aliases + compiled regexps in memory for the duration of the job. On a stashdb-comparable library (~100k performers / 13k studios / 3k tags) this is ~573 MiB heap / ~724 MiB peak RSS — job-scoped, released after the run. The baseline uses almost no extra RAM but takes ~4 hours at that scale; this code finishes in ~2 minutes. See bench below.

Testing

Correctness is guarded at three layers:

  1. pkg/match/path_semantic_test.go (new, ~425 lines). Drives PathToPerformers, PathToStudio, and PathToTags through the existing generated testify mocks in pkg/models/mocks (same infrastructure the internal/autotag/*_test.go suite uses via mocks.NewDatabase()) against representative inputs (plain name, separator variants, case-insensitivity, unicode, ignore_auto_tag, no-substring-match, multiple-match rightmost-wins, alias matching). Each case runs twice — once with cache=nil (the old SQL-prefilter path) and once with a preloaded cache — asserting identical output. This is the regression guard: if the preloaded path ever diverges from the SQL-prefilter path, the matching suite fails.
  2. pkg/match/cache_test.go (new). Unit tests for firstTwoRunesLower (incl. unicode and single-rune edge cases), the regexp cache (hit/miss by (name, useUnicode) key, nil-cache fallback), and the preload + candidate-lookup path (prefix union across path words, always-check inclusion, ignore_auto_tag honored). Also uses the generated pkg/models/mocks.
  3. internal/autotag/{scene,image,gallery}_test.go. Existing mock-based tagger tests were updated to drive the new Tagger.*AtPath methods with the same fixtures and assertions. Same regression coverage, new entry point.

Benchmarks

Comparison between baseline — file-based auto-tag as it stood before the perf work — and new — preload + parallelize + bulk aliases.

Both runs use the same seeded synthetic DB content designed to match realistic use so that the DB is identical across baseline and new at each preset size. File types split 60/30/10 scenes/images/galleries. Match distribution: ~30% multi-match, ~50% single-match, ~20% no-match.

Hardware: MacBook (ARM, Darwin 25.2.0), Go 1.26.

| column | meaning |
| --- | --- |
| time | wall-clock for the file-based auto-tag run (excludes DB setup) |
| heap in use | `runtime.MemStats.HeapInuse` at end of run (post-GC of setup garbage) |
| peak rss | `getrusage(RUSAGE_SELF).ru_maxrss` at end of run — process-lifetime peak |
| total alloc | cumulative allocations during the run (not retained memory) |
| gc | GC cycles during the run |

tiny — 100 performers · 20 studios · 50 tags · 1 000 files

| run | time | heap in use | peak rss | total alloc | gc |
| --- | --- | --- | --- | --- | --- |
| baseline | 309ms | 4.3 MiB | 38.0 MiB | 101.4 MiB | 77 |
| new | 107ms | 5.1 MiB | 39.5 MiB | 36.5 MiB | 22 |
| ratio | 2.9× faster | +0.8 MiB | +1.5 MiB | −64% | −71% |

Small absolute numbers, no regression at low scale — new is still meaningfully faster even when the preload overhead has little to amortize against.

small — 1 000 performers · 200 studios · 300 tags · 10 000 files

| run | time | heap in use | peak rss | total alloc | gc |
| --- | --- | --- | --- | --- | --- |
| baseline | 9.045s | 5.8 MiB | 46.3 MiB | 5.0 GiB | 2 339 |
| new | 1.238s | 14.7 MiB | 60.5 MiB | 461.7 MiB | 79 |
| ratio | 7.3× faster | +8.9 MiB | +14.2 MiB | −91% | −97% |

Allocation churn drops sharply (5.0 GiB → 461 MiB) — the prefix index and regex cache eliminate most of the per-file regex compilation & name string-lowering that the baseline pays on every file.

medium — 10 000 performers · 1 300 studios · 1 000 tags · 50 000 files

| run | time | heap in use | peak rss | total alloc | gc |
| --- | --- | --- | --- | --- | --- |
| baseline | 4m 7.016s | 11.5 MiB | 57.8 MiB | 187.3 GiB | 34 357 |
| new | 9.208s | 62.5 MiB | 129.2 MiB | 4.2 GiB | 135 |
| ratio | 26.8× faster | +51.0 MiB | +71.4 MiB | −98% | −99.6% |

Baseline burns 187 GiB of transient allocations across 34 k GC cycles — symptom of the per-file QueryForAutoTag + regex-recompile + path-lowercase treadmill. The preload fixes all three.

large — 100 000 performers · 13 000 studios · 3 000 tags · 100 000 files

| run | time | heap in use | peak rss | total alloc | gc |
| --- | --- | --- | --- | --- | --- |
| baseline | aborted after 29:30 (see note) | — | — | — | — |
| new | 2m 10.997s | 573.5 MiB | 724.4 MiB | 46.6 GiB | 170 |
| ratio | ≥13× faster (lower bound) | | | | |

Note on baseline: we let baseline-large run for 29:30 CPU time and checked DB state — it had processed 13 395 / 60 000 scenes (22.3%) and hadn't started images (30k) or galleries (10k). Extrapolating the observed ~7 scenes/sec rate, baseline-large would need ~3.5 more hours to complete (~4 hours total). We aborted to save compute. The ≥13× faster ratio uses only the ~30 min observed on baseline, not the extrapolated full run; with the full run the multiplier is ~60×.

Memory footprint at this scale: 573 MiB heap in use / 724 MiB peak RSS. The preloaded in-memory candidate set (100 k performers × pointer + prefix index + regex cache + all aliases) is the price for the throughput gain. For reference: stashdb's ~100 k performers / 13 k studios / 3 k tags fits comfortably in this envelope on any modern machine; the baseline is effectively unusable at this scale (4-hour auto-tag run) even though it uses little memory.


Summary

| preset | baseline | new | speedup |
| --- | --- | --- | --- |
| tiny | 309ms | 107ms | 2.9× |
| small | 9.045s | 1.238s | 7.3× |
| medium | 4m 7s | 9.2s | 26.8× |
| large | ≥~4h (aborted at 22%) | 2m 11s | ≥13× (~60× extrapolated) |

Speedup grows with scale. At the small end the preload + parallelism overhead is already paid back. At stashdb-comparable scale (large) the baseline is unusable for interactive auto-tagging; the new code finishes in 2 minutes.

Memory trade-off

Heap in use scales with entity count because the preloaded index holds all performers/studios/tags + their aliases + compiled regexps in memory:

| preset | heap in use (baseline → new) | peak rss (baseline → new) |
| --- | --- | --- |
| tiny | 4.3 MiB → 5.1 MiB | 38.0 MiB → 39.5 MiB |
| small | 5.8 MiB → 14.7 MiB | 46.3 MiB → 60.5 MiB |
| medium | 11.5 MiB → 62.5 MiB | 57.8 MiB → 129.2 MiB |
| large | — → 573.5 MiB | — → 724.4 MiB |

The cache is job-scoped: it's held for the duration of one auto-tag run and released afterwards, so this doesn't inflate the stash process's steady-state RSS. A user with a 100 k-performer library paying 725 MiB peak RSS during a 2-minute auto-tag job (versus 4 hours with negligible extra RAM) is an unambiguous win.

XTVB added 2 commits April 19, 2026 22:22
Replaces the per-file SQL QueryForAutoTag prefilter with an in-memory
2-rune prefix index over performers/studios/tags, preloaded once at job
start. Also:

  - runs file processing through job.TaskQueue so scenes/images/
    galleries tag in parallel instead of one file at a time
  - keyset-paginates the query loop so batch N+1 doesn't pay the
    O(offset) scan past large tables
  - bulk-loads studio/tag aliases via a new optional AllAliasLoader
    interface, avoiding N+1 GetAliases calls during preload
  - caches compiled name regexps (same candidate names repeat across
    thousands of files)
  - hoists strings.ToLower(path) and allASCII(path) out of the per-
    candidate match loop
  - opens a fresh write txn per applied match instead of holding one
    for every tagger phase

Tagger gains *AtPath methods that own the cache + txn manager, letting
the task code stay slim.
The preceding commit added lru.Cache for the compiled-regexp cache to
match the style in pkg/sqlite/regex.go. That file's use case is
different: a small bounded cache serving a read-dominated workload. The
auto-tag regexp cache is job-scoped (so eviction buys us nothing) and
hit by every worker on every candidate (so the LRU's per-Get mutex
becomes contention, measurable under the parallel worker pool).

sync.Map's read-optimised path avoids the contention without changing
any observable behavior. Kept as a separate commit so it can be
reverted independently if upstream prefers the LRU approach — the
first commit stands on its own either way.
@feederbox826
Member

The preloaded index holds every non-ignored performer/studio/tag + aliases + compiled regexps in memory for the duration of the job. On a stashdb-comparable library (~100k performers / 13k studios / 3k tags) this is ~573 MiB heap / ~724 MiB peak RSS — job-scoped, released after the run. The baseline uses almost no extra RAM but takes ~4 hours at that scale; this code finishes in ~2 minutes. See bench below.

This is an insane amount of memory you're trying to allocate, and your tables don't quite seem to line up nor extrapolate linearly. Can you provide your synthetic db and the commit you're testing against? There was a memory leak in 0.31.0 that was patched

@XTVB
Contributor Author

XTVB commented Apr 20, 2026

Sure, here's the commit with the synthetic db/benchmarking script: 338c0b0

I should've already had the memory leak fix pulled into my fork when I ran it (I believe I had all the commits other than some minor documentation that got added), but I'm happy to run the baseline again if you spot something silly I'm doing; I agree I was startled by the size of the allocations. To be clear, the `total alloc` recorded is the cumulative amount, not the amount allocated at any one time, and it is getting garbage collected. Total alloc goes up every time `new(X)` is called, even if that memory is freed a nanosecond later. Baseline allocates 187 GiB cumulatively while only holding 11.5 MiB at any moment (at medium scale) because it recompiles the same regexps and re-lowercases the same paths millions of times, each allocation discarded immediately.

I think the allocation pressure is less relevant than the heap in use, where my change trades holding more memory at once (the preload and regex caches) for throughput. That memory is released when the job ends; it's not a permanent footprint on the stash process.

The speed improvements I'm quite confident about: on my local stash instance with performer-only auto-tagging (~600 performers, ~70k files) the job went from ~70 seconds to ~3 seconds on non-synthetic data. I then ran the synthetic benchmarks to validate that the gains hold across the full range of DB sizes.
