Skip to content

feat(runner): tune minimiser hot path buffers#194

Open
linkdata wants to merge 1 commit into
dnstapir:mainfrom
linkdata:feat/minimiser-throughput-tuning
Open

feat(runner): tune minimiser hot path buffers#194
linkdata wants to merge 1 commit into
dnstapir:mainfrom
linkdata:feat/minimiser-throughput-tuning

Conversation

@linkdata
Copy link
Copy Markdown

Summary

  • increase the dnstap input buffer to reduce producer stalls under high packet rates
  • reuse a per-worker scratch buffer for client IP HLL hashing
  • allow minimiser-workers=0 to reach the existing GOMAXPROCS fallback

Tests

  • go test ./pkg/runner ./...

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 30, 2026

Warning

Rate limit exceeded

@linkdata has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 59 minutes and 22 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fdebe485-fe61-4108-b3d6-49d569ffbb18

📥 Commits

Reviewing files that changed from the base of the PR and between b615285 and 76eeecb.

📒 Files selected for processing (1)
  • pkg/runner/runner.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 59 minutes and 22 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@linkdata
Copy link
Copy Markdown
Author

Additional context from comment files

These notes were split from the local markdown comment files and attached here because they describe this PR's change.

Change 2: raise the input-channel buffer (32 → 1024)

File: pkg/runner/runner.go (~line 1588).

Before:

edm.inputChannel = make(chan []byte, 32)

After:

edm.inputChannel = make(chan []byte, 1024)

Why this was a bottleneck

inputChannel is the single fan-out point between the framestream
listener goroutine and the N minimiser workers. The listener is
single-threaded (one connection, one reader per dnstap.ReadInto), and
all minimiser workers receive from the same channel.

A 32-deep buffer drains in **~ 160 µs** at 200 K qps. That is well below
typical Linux scheduler wake-up latencies (1–10 ms under load) and
similar to a single GC cycle pause. Any time a minimiser worker takes
longer than 160 µs on a frame — for example, while waiting on the
Crypto-PAn LRU mutex, doing a Pebble lookup, or waiting on a full
session-collector channel — the input channel fills and the listener
stalls in a chan send. With one full buffer worth of stall the
upstream TCP socket buffer also fills, which propagates back-pressure
through the kernel to the producer.

The result is a system that works cleanly at a steady state but is
extremely sensitive to any per-frame jitter: a single 1 ms hiccup
anywhere in the worker pool freezes ingestion for that millisecond.

What we changed

The buffer is now 1024 frames. At an average frame size below 1 KiB,
worst-case memory growth is ~ 1 MiB — negligible against the 4 GiB
container limits we typically run with. There is no API change.

Tradeoff

A larger buffer absorbs more upstream jitter, which is what we want.
The (small) downside is that the buffer also absorbs an extra millisecond
or so of staleness under steady-state load: a frame that arrives just
before a worker stalls now sits a little longer before being processed.
That is invisible to downstream consumers — parquet timestamps come from
the frame's own query_time / response_time fields, and minute-bucket
aggregation is unaffected. We are not aware of any correctness path that
depends on a small input-channel depth.

Expected payoff

Eliminates a class of jitter-induced stalls. Smooths out throughput
under bursty input. Modest direct gain (5–15 %) on our load tests; the
larger benefit is reduced tail latency and easier scaling for Tier 2
changes that further reduce per-frame work.


Change 3: per-worker scratch buffer for dangerRealClientIP

File: pkg/runner/runner.go, in runMinimiser.

Before:

dangerRealClientIP := make([]byte, len(dt.Message.QueryAddress))
copy(dangerRealClientIP, dt.Message.QueryAddress)

After: declared once per worker as var dangerScratch [16]byte
above the frame loop, and re-used per frame:

n := len(dt.Message.QueryAddress)
var dangerRealClientIP []byte
if n <= len(dangerScratch) {
    dangerRealClientIP = dangerScratch[:n]
} else {
    // Defensive fallback for unexpected address sizes.
    dangerRealClientIP = make([]byte, n)
}
copy(dangerRealClientIP, dt.Message.QueryAddress)

Why this was a bottleneck

The original code does a heap allocation per frame, per minimiser
worker
— a small []byte of length 4 (IPv4) or 16 (IPv6). At 200 K
frames per second, that is 200 K small allocations per second on the hot
path. Combined with the per-frame DNS message and session struct
allocations elsewhere, the GC pressure registered at ~ 14 % of total CPU
in runtime.gcBgMarkWorker and runtime.gcDrain in our profile.

A 16-byte scratch buffer allocated once per worker covers IPv4 and IPv6
without any per-frame heap activity.

Tradeoff: aliasing safety

The risk with re-using a buffer is that some downstream code retains a
reference to the slice past the call boundary, and the next frame
overwrites the bytes underneath them.

dangerRealClientIP is passed only to wkdTracker.sendUpdate, which
calls:

  • netip.AddrFromSlice(ipBytes) — copies bytes into the netip.Addr
    value type and returns it; does not retain the input slice;
  • murmur3.Sum64(ipBytes) — reads and hashes; does not retain the
    input slice.

Neither call stores ipBytes in any persistent structure. The hash and
the parsed netip.Addr (a value type) are stored in a wkdUpdate
struct that is later sent on a channel, but those are independent of the
original slice. Verified by inspection at pkg/runner/runner.go
~line 1734.

If a future change to sendUpdate or its callees ever needed to retain
the slice, this scratch-buffer reuse would become unsafe. To make that
mistake observable rather than silent, the per-worker buffer is named
dangerScratch and the slice is named dangerRealClientIP — keeping the
existing "danger" prefix that flags it as not-for-storage.

The defensive fallback (make([]byte, n) if n > 16) preserves
correctness for any future dnstap address that exceeds 16 bytes. This
should never happen in practice (dnstap defines query_address as the
on-the-wire address bytes, which are 4 for IPv4 and 16 for IPv6), but
the cost of the branch is trivial and it keeps us out of an
out-of-bounds slice.

Expected payoff

Removes one heap allocation per frame on the hot path. Roughly 2–5 %
direct CPU gain plus a corresponding reduction in GC overhead. More
importantly, this is a precedent for the Tier-3 sync.Pool-based
allocator work — if and when we go after the dns.Msg and session
allocations, the same aliasing analysis pattern applies.


Change 4: allow --minimiser-workers 0 to mean GOMAXPROCS

File: pkg/runner/runner.go config struct (~line 86).

Before:

MinimiserWorkers int `mapstructure:"minimiser-workers" validate:"required"`

After:

MinimiserWorkers int `mapstructure:"minimiser-workers"`

Why this was a bottleneck (or rather: foot-gun)

The CLI help text reads:

--minimiser-workers int   how many minimiser workers to start
                          (0 means same as GOMAXPROCS) (default 1)

…and the runtime path at ~line 1376 honours this:

numMinimiserWorkers := startConf.MinimiserWorkers
if numMinimiserWorkers <= 0 {
    numMinimiserWorkers = runtime.GOMAXPROCS(0)
}

But the config struct's validate:"required" tag rejects the literal
0 value at startup with:

unable to validate config: Key: 'config.MinimiserWorkers'
Error:Field validation for 'MinimiserWorkers' failed on the 'required' tag

So the documented "use 0 for GOMAXPROCS" path is unreachable. Operators
have to either pick a number explicitly (which couples the config to a
particular host's core count) or omit the flag and accept the default
of 1, which leaves EDM single-threaded on the minimiser side.

What we changed

Removed the required validator. The default value (1) and the runtime
treatment of 0 are unchanged. Operators who set --minimiser-workers 0
now get the GOMAXPROCS behaviour the help text already described.

Tradeoff

None on the runtime side — behaviour for any positive value is
identical to before. The only change is that an explicit 0 no longer
crashes startup. Operators who want the 1-worker default still get it
(by setting nothing or by setting a positive integer).

Expected payoff

Indirect. Lets the documented "use all cores" mode actually work. On
multi-core hosts this typically multiplies minimiser throughput by the
ratio of cores to whatever fixed value was previously used.


@linkdata linkdata marked this pull request as ready for review April 30, 2026 12:12
@linkdata linkdata requested a review from a team as a code owner April 30, 2026 12:12
@jschlyter jschlyter added the ai AI was used to write contributed code label Apr 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai AI was used to write contributed code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants