feat(runner): use atomic reload snapshots by linkdata · Pull Request #195 · dnstapir/edm

linkdata · 2026-04-30T12:04:48Z

Summary

move reloadable cryptopan, ignored-client, ignored-question, and WKD read paths to atomic snapshots
keep per-worker cryptopan caches coherent with a generation counter
add concurrent reload coverage intended for race-detector validation

Tests

go test ./pkg/runner ./...
go test -race ./pkg/runner

coderabbitai · 2026-04-30T12:04:56Z

Warning

Rate limit exceeded

@linkdata has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 59 minutes and 20 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bc7d4826-05a7-4bac-9458-73b51af65419

📥 Commits

Reviewing files that changed from the base of the PR and between b615285 and b3e6440.

📒 Files selected for processing (5)

pkg/runner/atomic_reload_test.go
pkg/runner/cryptopan_extra_test.go
pkg/runner/runner.go
pkg/runner/runner_test.go
pkg/runner/test_helpers_test.go

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Review rate limit: 0/1 reviews remaining, refill in 59 minutes and 20 seconds.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

linkdata · 2026-04-30T12:12:04Z

Additional context from comment files

These notes were split from the local markdown comment files and attached here because they describe this PR's change.

Change 5: per-worker Crypto-PAn cache

File: pkg/runner/runner.go. Affects struct, setCryptopan,
pseudonymiseDnstap, pseudonymiseIP, runMinimiser.

Why this was a bottleneck

pseudonymiseDnstap ran under a single cryptopanMutex.RLock() for
each call, plus the lru.Cache itself takes its own mutex on
Get/Add (the golang-lru/v2 package serialises internally for
correctness). So every minimiser worker hit two contended locks per
frame, and on every cache miss they all serialised again on the
add-path. With 20 workers and ~110 K novel-IP frames/s, that's a
non-trivial bus-traffic problem even before the LRU's internal lock
joins in.

The mutex profile attributed roughly 8–12 % of total block time to
this single section — invisible in CPU profiles, very visible in
mutex profiles.

What we changed

dnstapMinimiser.cryptopan is now atomic.Pointer[cryptopan.Cryptopan].
New cryptopanGen atomic.Uint64 is bumped on every setCryptopan.
cryptopanCache is removed from the struct.
Each minimiser worker creates its own lru.Cache[netip.Addr, netip.Addr]
inside runMinimiser and threads it through
pseudonymiseDnstap(dt, cpn, cache).
Workers track the last cryptopan generation they saw; on detecting
a bump they Purge() their local cache. Stale-entry leak window is
bounded by one frame.

The hot path now does:

cpn  := edm.cryptopan.Load()      // atomic, no lock
gen  := edm.cryptopanGen.Load()    // atomic, no lock
// (purge if gen changed)
cache.Get(addr)                    // worker-local LRU, contention only
                                   // with this worker's own future calls
                                   // (i.e. none, single-goroutine)

Tradeoff

Cache-hit ratio: when load was global, every worker's IP-hits
benefited every other worker. With per-worker caches, an IP that
worker A saw doesn't help worker B until B sees it too. The
practical effect is small: the LRU sees mostly the long tail of
client IPs anyway, and the cache serves to amortise Crypto-PAn
CPU rather than to deduplicate. Per-worker LRUs sized at the
configured CryptopanAddressEntries value give each worker a
generous slice — total memory is workers × entries, but on a
20-core / 10M-entry config that's still well under a gigabyte.
Memory: workers × cacheEntries LRU slots instead of one. With
the default 10M entries and 20 workers, that's a notional 200M
entries, but in practice each worker's cache fills only with the
IPs it sees. Real memory is bounded by the working-set size.
Eviction metrics: edm_cryptopan_lru_evicted_total now counts
evictions across the union of per-worker caches. The number is
larger than before; the meaning is the same (capacity pressure).
Tests: the production constructor signature changed, so
runner_test.go tests that previously poked edm.cryptopanCache
directly now go through a small test helper
(pkg/runner/test_helpers_test.go) that owns one cache per
*dnstapMinimiser instance.

Expected payoff

Removes cryptopanMutex from the hot path entirely. In our load
runs the headline metric is mutex-block-time, which drops to
near-zero for this section. Translates to ~10–15 % more
sustained throughput when EDM is otherwise CPU-saturated.

Change 7: atomic.Pointer for ignore-set lookups

File: pkg/runner/runner.go. Affects struct, setIgnoredClientIPs,
setIgnoredQuestionNames, clientIPIsIgnored, questionIsIgnored,
getNumIgnoredClientCIDRs.

Why this was a bottleneck

The hot path called clientIPIsIgnored and questionIsIgnored on
every frame. Both took an RWMutex.RLock() even when the underlying
ignore set was unset (the common production case). Reload writers
took Lock() on setIgnored*, which is rare but still imposed the
slow-path cost on every reader.

What we changed

ignoredClientsIPSet *netipx.IPSet and ignoredQuestions dawg.Finder
become atomic.Pointer[netipx.IPSet] and atomic.Pointer[dawgFinderHolder].
ignoredClientCIDRsParsed becomes atomic.Uint64.
dawgFinderHolder is a tiny wrapper struct containing finder dawg.Finder,
needed because atomic.Pointer requires a concrete type and dawg.Finder
is an interface.
All RWMutex protect-everything-with-Lock blocks become
single-line atomic.Store calls.

The hot path is now:

ipset := edm.ignoredClientsIPSet.Load()
if ipset == nil { return false }

Zero allocations, no lock, no contention.

Tradeoff

The previous code called dawg.Finder.Close() on the old finder
during reload. With atomic-pointer swap we cannot safely close
the old finder, because hot-path readers may still hold a pointer
to it. So we don't close it — we just drop the reference and
let GC reclaim it. The dawg implementation uses an mmap'd file
internally; finalisers eventually unmap it, so the leak is bounded
per-rotation. With reloads being rare (manual SIGHUP-style events),
this is acceptable. If you ever need promptness, an
epoch-based reclamation scheme would work, but the current load
is so far from triggering a problem here that the simpler choice
wins.
The CIDR count getter getNumIgnoredClientCIDRs is now a pure
atomic.Uint64.Load(). No semantic change.

Expected payoff

The two RWMutexes leave the hot path. In mutex profiles the
ignoredClientsIPSetMutex and ignoredQuestionsMutex block-time
entries become invisible. Direct CPU savings are small (RLock/RUnlock
is fast in the uncontended case) but goroutine wakeups during
reload no longer ripple through every minimiser worker. Worth doing
mostly for cleanliness, with a small steady-state win.

Change 8: WKD tracker — split read snapshot from map mutation

File: pkg/runner/runner.go. Affects wellKnownDomainsTracker,
newWellKnownDomainsTracker, lookup, rotateTracker, the
wkd.dawgModTime reference inside dataCollector, and the retryer's
log message.

Why this was a bottleneck

wkd.lookup(msg) ran on every frame's hot path. It took
wkd.mutex.RLock() to read dawgFinder + dawgModTime. The same
lock was taken for Lock() by rotateTracker (a once-per-minute
event). The aggregator map wkd.m was also nominally protected
by this lock, but only one goroutine (dataCollector) ever wrote
to it, and no one else read it in the hot path — so the mutex was
mainly serialising the per-frame DAWG reads against a once-per-minute
write.

What we changed

The struct gains:

type wkdSnapshot struct {
    dawgFinder  dawg.Finder
    dawgModTime time.Time
}

type wellKnownDomainsTracker struct {
    snap atomic.Pointer[wkdSnapshot]
    m    map[int]*histogramData  // single-writer (dataCollector)
    ...
}

lookup() becomes:

snap := wkd.snap.Load()
dawgIndex, suffixMatch := getDawgIndex(snap.dawgFinder, name)
return dawgIndex, suffixMatch, snap.dawgModTime

No lock. rotateTracker reads wkd.snap.Load() to compare the on-disk
modtime, builds a fresh wkdSnapshot if the file changed, atomically
swaps it in, and swaps the histogram map. Both swaps happen in the
same goroutine that writes to m, so no lock is needed for the map
either.

The dataCollector retryer's "discard stale update" check changes
from wu.dawgModTime != wkd.dawgModTime to
wu.dawgModTime != wkd.snap.Load().dawgModTime.

Tradeoff

Slight torn-state risk during rotation. A worker mid-frame that
loads snap just before rotateTracker swaps it will use the old
dawgFinder and the old dawgModTime. That's actually the desired
behaviour: the update will arrive at dataCollector carrying the
old dawgModTime, the modtime-mismatch check will fire, and the
update goes onto retryCh for re-lookup against the new DAWG. So
the existing retryer mechanism already handles this case correctly.
rotateTracker no longer touches the old wkd.mutex. Anything
outside this package that synchronised against that mutex would
break — but the mutex was unexported, so nothing outside the
package could touch it.
The old wellKnownDomainsData.dawgFinder field is still used
for the post-rotation parquet writeback (it carries the previous
DAWG so name lookups for histogram output still work). The
rotated *wellKnownDomainsData snapshot continues to embed
dawgFinder; the change is only internal to the tracker.

Expected payoff

Removes another RWMutex from every frame. The wkd.mutex block-time
entry disappears from the mutex profile. Direct CPU is small. The
real win is that all four Tier-2 changes together open the per-frame
hot path so that the only synchronisation a worker hits is the rate
limiter on the input channel and its own private LRU. Throughput
should now scale linearly with MinimiserWorkers until the input
channel itself becomes the bottleneck (a Tier-3 candidate).

feat(runner): use atomic reload snapshots

b3e6440

linkdata marked this pull request as ready for review April 30, 2026 12:12

linkdata requested a review from a team as a code owner April 30, 2026 12:12

jschlyter added the ai AI was used to write contributed code label Apr 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(runner): use atomic reload snapshots#195

feat(runner): use atomic reload snapshots#195
linkdata wants to merge 1 commit into
dnstapir:mainfrom
linkdata:feat/lock-free-reload-hot-path

linkdata commented Apr 30, 2026

Uh oh!

coderabbitai Bot commented Apr 30, 2026 •

edited

Loading

Rate limit exceeded

Uh oh!

linkdata commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

linkdata commented Apr 30, 2026

Summary

Tests

Uh oh!

coderabbitai Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Uh oh!

linkdata commented Apr 30, 2026

Additional context from comment files

Change 5: per-worker Crypto-PAn cache

Why this was a bottleneck

What we changed

Tradeoff

Expected payoff

Change 7: atomic.Pointer for ignore-set lookups

Why this was a bottleneck

What we changed

Tradeoff

Expected payoff

Change 8: WKD tracker — split read snapshot from map mutation

Why this was a bottleneck

What we changed

Tradeoff

Expected payoff

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Apr 30, 2026 •

edited

Loading