Skip to content

feat(runner): use atomic reload snapshots#195

Open
linkdata wants to merge 1 commit into
dnstapir:mainfrom
linkdata:feat/lock-free-reload-hot-path
Open

feat(runner): use atomic reload snapshots#195
linkdata wants to merge 1 commit into
dnstapir:mainfrom
linkdata:feat/lock-free-reload-hot-path

Conversation

@linkdata
Copy link
Copy Markdown

Summary

  • move reloadable cryptopan, ignored-client, ignored-question, and WKD read paths to atomic snapshots
  • keep per-worker cryptopan caches coherent with a generation counter
  • add concurrent reload coverage intended for race-detector validation

Tests

  • go test ./pkg/runner ./...
  • go test -race ./pkg/runner

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 30, 2026

Warning

Rate limit exceeded

@linkdata has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 59 minutes and 20 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bc7d4826-05a7-4bac-9458-73b51af65419

📥 Commits

Reviewing files that changed from the base of the PR and between b615285 and b3e6440.

📒 Files selected for processing (5)
  • pkg/runner/atomic_reload_test.go
  • pkg/runner/cryptopan_extra_test.go
  • pkg/runner/runner.go
  • pkg/runner/runner_test.go
  • pkg/runner/test_helpers_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 59 minutes and 20 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@linkdata
Copy link
Copy Markdown
Author

Additional context from comment files

These notes were split from the local markdown comment files and attached here because they describe this PR's change.

Change 5: per-worker Crypto-PAn cache

File: pkg/runner/runner.go. Affects struct, setCryptopan,
pseudonymiseDnstap, pseudonymiseIP, runMinimiser.

Why this was a bottleneck

pseudonymiseDnstap ran under a single cryptopanMutex.RLock() for
each call, plus the lru.Cache itself takes its own mutex on
Get/Add (the golang-lru/v2 package serialises internally for
correctness). So every minimiser worker hit two contended locks per
frame, and on every cache miss they all serialised again on the
add-path. With 20 workers and ~110 K novel-IP frames/s, that's a
non-trivial bus-traffic problem even before the LRU's internal lock
joins in.

The mutex profile attributed roughly 8–12 % of total block time to
this single section — invisible in CPU profiles, very visible in
mutex profiles.

What we changed

  • dnstapMinimiser.cryptopan is now atomic.Pointer[cryptopan.Cryptopan].
  • New cryptopanGen atomic.Uint64 is bumped on every setCryptopan.
  • cryptopanCache is removed from the struct.
  • Each minimiser worker creates its own lru.Cache[netip.Addr, netip.Addr]
    inside runMinimiser and threads it through
    pseudonymiseDnstap(dt, cpn, cache).
  • Workers track the last cryptopan generation they saw; on detecting
    a bump they Purge() their local cache. Stale-entry leak window is
    bounded by one frame.

The hot path now does:

cpn  := edm.cryptopan.Load()      // atomic, no lock
gen  := edm.cryptopanGen.Load()    // atomic, no lock
// (purge if gen changed)
cache.Get(addr)                    // worker-local LRU, contention only
                                   // with this worker's own future calls
                                   // (i.e. none, single-goroutine)

Tradeoff

  • Cache-hit ratio: when load was global, every worker's IP-hits
    benefited every other worker. With per-worker caches, an IP that
    worker A saw doesn't help worker B until B sees it too. The
    practical effect is small: the LRU sees mostly the long tail of
    client IPs anyway, and the cache serves to amortise Crypto-PAn
    CPU rather than to deduplicate. Per-worker LRUs sized at the
    configured CryptopanAddressEntries value give each worker a
    generous slice — total memory is workers × entries, but on a
    20-core / 10M-entry config that's still well under a gigabyte.
  • Memory: workers × cacheEntries LRU slots instead of one. With
    the default 10M entries and 20 workers, that's a notional 200M
    entries, but in practice each worker's cache fills only with the
    IPs it sees. Real memory is bounded by the working-set size.
  • Eviction metrics: edm_cryptopan_lru_evicted_total now counts
    evictions across the union of per-worker caches. The number is
    larger than before; the meaning is the same (capacity pressure).
  • Tests: the production constructor signature changed, so
    runner_test.go tests that previously poked edm.cryptopanCache
    directly now go through a small test helper
    (pkg/runner/test_helpers_test.go) that owns one cache per
    *dnstapMinimiser instance.

Expected payoff

Removes cryptopanMutex from the hot path entirely. In our load
runs the headline metric is mutex-block-time, which drops to
near-zero for this section. Translates to ~10–15 % more
sustained throughput when EDM is otherwise CPU-saturated.

Change 7: atomic.Pointer for ignore-set lookups

File: pkg/runner/runner.go. Affects struct, setIgnoredClientIPs,
setIgnoredQuestionNames, clientIPIsIgnored, questionIsIgnored,
getNumIgnoredClientCIDRs.

Why this was a bottleneck

The hot path called clientIPIsIgnored and questionIsIgnored on
every frame. Both took an RWMutex.RLock() even when the underlying
ignore set was unset (the common production case). Reload writers
took Lock() on setIgnored*, which is rare but still imposed the
slow-path cost on every reader.

What we changed

  • ignoredClientsIPSet *netipx.IPSet and ignoredQuestions dawg.Finder
    become atomic.Pointer[netipx.IPSet] and atomic.Pointer[dawgFinderHolder].

  • ignoredClientCIDRsParsed becomes atomic.Uint64.

  • dawgFinderHolder is a tiny wrapper struct containing finder dawg.Finder,
    needed because atomic.Pointer requires a concrete type and dawg.Finder
    is an interface.

  • All RWMutex protect-everything-with-Lock blocks become
    single-line atomic.Store calls.

  • The hot path is now:

    ipset := edm.ignoredClientsIPSet.Load()
    if ipset == nil { return false }
    

    Zero allocations, no lock, no contention.

Tradeoff

  • The previous code called dawg.Finder.Close() on the old finder
    during reload.
    With atomic-pointer swap we cannot safely close
    the old finder, because hot-path readers may still hold a pointer
    to it. So we don't close it — we just drop the reference and
    let GC reclaim it. The dawg implementation uses an mmap'd file
    internally; finalisers eventually unmap it, so the leak is bounded
    per-rotation. With reloads being rare (manual SIGHUP-style events),
    this is acceptable. If you ever need promptness, an
    epoch-based reclamation scheme would work, but the current load
    is so far from triggering a problem here that the simpler choice
    wins.
  • The CIDR count getter getNumIgnoredClientCIDRs is now a pure
    atomic.Uint64.Load(). No semantic change.

Expected payoff

The two RWMutexes leave the hot path. In mutex profiles the
ignoredClientsIPSetMutex and ignoredQuestionsMutex block-time
entries become invisible. Direct CPU savings are small (RLock/RUnlock
is fast in the uncontended case) but goroutine wakeups during
reload no longer ripple through every minimiser worker. Worth doing
mostly for cleanliness, with a small steady-state win.

Change 8: WKD tracker — split read snapshot from map mutation

File: pkg/runner/runner.go. Affects wellKnownDomainsTracker,
newWellKnownDomainsTracker, lookup, rotateTracker, the
wkd.dawgModTime reference inside dataCollector, and the retryer's
log message.

Why this was a bottleneck

wkd.lookup(msg) ran on every frame's hot path. It took
wkd.mutex.RLock() to read dawgFinder + dawgModTime. The same
lock was taken for Lock() by rotateTracker (a once-per-minute
event). The aggregator map wkd.m was also nominally protected
by this lock, but only one goroutine (dataCollector) ever wrote
to it, and no one else read it in the hot path — so the mutex was
mainly serialising the per-frame DAWG reads against a once-per-minute
write.

What we changed

The struct gains:

type wkdSnapshot struct {
    dawgFinder  dawg.Finder
    dawgModTime time.Time
}

type wellKnownDomainsTracker struct {
    snap atomic.Pointer[wkdSnapshot]
    m    map[int]*histogramData  // single-writer (dataCollector)
    ...
}

lookup() becomes:

snap := wkd.snap.Load()
dawgIndex, suffixMatch := getDawgIndex(snap.dawgFinder, name)
return dawgIndex, suffixMatch, snap.dawgModTime

No lock. rotateTracker reads wkd.snap.Load() to compare the on-disk
modtime, builds a fresh wkdSnapshot if the file changed, atomically
swaps it in, and swaps the histogram map. Both swaps happen in the
same goroutine that writes to m, so no lock is needed for the map
either.

The dataCollector retryer's "discard stale update" check changes
from wu.dawgModTime != wkd.dawgModTime to
wu.dawgModTime != wkd.snap.Load().dawgModTime.

Tradeoff

  • Slight torn-state risk during rotation. A worker mid-frame that
    loads snap just before rotateTracker swaps it will use the old
    dawgFinder and the old dawgModTime. That's actually the desired
    behaviour: the update will arrive at dataCollector carrying the
    old dawgModTime, the modtime-mismatch check will fire, and the
    update goes onto retryCh for re-lookup against the new DAWG. So
    the existing retryer mechanism already handles this case correctly.
  • rotateTracker no longer touches the old wkd.mutex. Anything
    outside this package that synchronised against that mutex would
    break — but the mutex was unexported, so nothing outside the
    package could touch it.
  • The old wellKnownDomainsData.dawgFinder field is still used
    for the post-rotation parquet writeback (it carries the previous
    DAWG so name lookups for histogram output still work). The
    rotated *wellKnownDomainsData snapshot continues to embed
    dawgFinder; the change is only internal to the tracker.

Expected payoff

Removes another RWMutex from every frame. The wkd.mutex block-time
entry disappears from the mutex profile. Direct CPU is small. The
real win is that all four Tier-2 changes together open the per-frame
hot path so that the only synchronisation a worker hits is the rate
limiter on the input channel and its own private LRU. Throughput
should now scale linearly with MinimiserWorkers until the input
channel itself becomes the bottleneck (a Tier-3 candidate).

@linkdata linkdata marked this pull request as ready for review April 30, 2026 12:12
@linkdata linkdata requested a review from a team as a code owner April 30, 2026 12:12
@jschlyter jschlyter added the ai AI was used to write contributed code label Apr 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai AI was used to write contributed code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants