
perf: reduce detection overhead from ~1.30x to ~1.03x #52

Merged
taobojlen merged 15 commits into main from perf/autoresearch-optimizations on Mar 18, 2026
Conversation

@taobojlen
Owner

Summary

  • Replaced inspect.stack() with sys._getframe() — the biggest win. The hot path (notify()) now walks raw frame objects to find the caller instead of creating FrameInfo named tuples for every frame in the stack. Both the default path and the SHOW_ALL_CALLERS path are optimized.
  • Eliminated redundant work on the alert path — cached allowlisted (model, field) pairs so repeated N+1s on the same field skip _alert() entirely after the first check. Removed redundant get_stack() call in _alert().
  • Reduced per-call allocations — removed @functools.wraps from hot-path closures (profiling showed update_wrapper was the top zeal-specific cost), switched to tuple keys, dropped unnecessary ContextVar.set() calls.
  • Cached settings lookups — ZEAL_NPLUSONE_THRESHOLD and ZEAL_SHOW_ALL_CALLERS are now read once per context instead of via hasattr(settings, ...) on every notify().
  • Updated README — overhead claim updated from "2.5x slower" to "~3-5% overhead".

Benchmark results

Overhead ratio (zeal-enabled time / baseline time, lower is better):

| Path | Before | After |
| --- | --- | --- |
| Default | 1.30x | ~1.03x |
| SHOW_ALL_CALLERS=True | 1.32x | ~1.03x |

Absolute time on the test suite workload: 95.5ms → 72.3ms (-24%).

Methodology

Developed through 6 automated experiments using an autoresearch loop (inspired by Shopify/liquid#2056): edit one thing → run tests → benchmark → keep/discard. Each change was validated against the full test suite before benchmarking. Iteration 7 confirmed we hit the noise floor (~0.23ms real overhead on a ~74ms workload).
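The keep/discard decision in such a loop can be sketched as follows; the function names are hypothetical stand-ins for the scripts in auto/, not their actual interface.

```python
# Minimal sketch of one autoresearch iteration (hypothetical interface,
# not the actual auto/ scripts): apply a change, gate it on the test
# suite, then keep it only if the overhead ratio improves.
def try_change(apply_change, revert_change, tests_pass, measure_ratio, best_ratio):
    apply_change()
    if not tests_pass():
        revert_change()          # never benchmark a broken change
        return best_ratio, False
    ratio = measure_ratio()
    if ratio < best_ratio:
        return ratio, True       # keep: new best overhead ratio
    revert_change()              # no improvement; discard
    return best_ratio, False
```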

The auto/ directory contains the benchmark infrastructure for future optimization work.

Test plan

  • All 57 existing unit tests pass (unchanged)
  • ZEAL_SHOW_ALL_CALLERS feature preserved and tested
  • Benchmarked across all iterations to verify progressive improvement
  • Profiled at noise floor to confirm no remaining significant overhead

🤖 Generated with Claude Code

taobojlen and others added 12 commits March 18, 2026 10:42
…h → overhead_ratio=1.06

On the hot path (notify), use sys._getframe() to walk raw frame objects
instead of inspect.stack(context=0) which allocates FrameInfo named tuples
for every frame. Also skip storing full stacks in context.calls when
ZEAL_SHOW_ALL_CALLERS is not enabled — just store empty lists for counting.

The full inspect.stack() path is preserved for when ZEAL_SHOW_ALL_CALLERS
is enabled, maintaining full backward compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
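The frame walk described above looks roughly like this (an illustrative sketch, not zeal's exact code; the `is_internal` predicate stands in for its frame filtering):

```python
import sys

def get_caller(is_internal):
    """Return (filename, lineno, funcname) for the first non-internal frame.

    Walking raw frames via f_back avoids the FrameInfo named tuple that
    inspect.stack() allocates for every frame on the stack.
    """
    frame = sys._getframe(1)  # skip get_caller's own frame
    while frame is not None:
        filename = frame.f_code.co_filename
        if not is_internal(filename):
            return (filename, frame.f_lineno, frame.f_code.co_name)
        frame = frame.f_back
    return None
```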
…() calls → overhead_ratio=1.05

After _alert() determines a (model, field) pair is allowlisted, cache that
result in the context so subsequent notify() calls for the same pair skip
the expensive _alert() path entirely (message formatting, allowlist property
allocation, fnmatch checks).

Also exclude auto/ and .venv/ from ruff and pyright checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
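The caching idea can be sketched like this (hypothetical names; the real context object and allowlist check live in zeal's internals):

```python
# Sketch of allowlist caching: once a (model, field) pair is known to be
# allowlisted, later notify() calls for it skip the expensive check
# (message formatting, fnmatch) entirely.
class Context:
    def __init__(self):
        self.allowlisted = set()

def notify(context, model, field, check_allowlist):
    key = (model, field)
    if key in context.allowlisted:
        return False                    # cached: no alert, no re-check
    if check_allowlist(model, field):
        context.allowlisted.add(key)    # pay the fnmatch cost only once
        return False
    return True                         # would raise the N+1 alert here
```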
… overhead_ratio=1.04

- Remove @functools.wraps from hot-path closures in patch_queryset_fetch_all
  and patch_queryset_function (avoids update_wrapper overhead on every queryset)
- Use tuple key (model, field, fn, lineno) instead of f-string in notify()
  fast path (avoids string allocation per call)
- Append None instead of [] to calls list (avoids empty list allocation per call)
- Cache calls[key] in local variable to avoid redundant dict lookup
- Remove redundant _nplusone_context.set(context) from notify() and ignore()
  (the context is mutated in-place; .set() with the same object just wastes
  a Token allocation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
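A rough sketch of the allocation-avoiding choices above (illustrative, not the actual notify() code):

```python
# Illustrative fast path: a tuple key avoids per-call f-string allocation,
# and appending None (instead of a fresh empty list) avoids another
# allocation per call when the entries are only used for counting.
calls = {}

def record(model, field, fn, lineno):
    key = (model, field, fn, lineno)   # not f"{model}.{field}:{fn}:{lineno}"
    bucket = calls.get(key)            # single lookup, cached in a local
    if bucket is None:
        bucket = calls[key] = []
    bucket.append(None)                # sentinel entry, just for counting
    return len(bucket)
```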
…S path → overhead_ratio=1.05, overhead_ratio_allcallers=1.07

Replace get_stack() (which calls inspect.stack(context=0) creating expensive
FrameInfo named tuples) with get_stack_fast() using sys._getframe() to build
lightweight (filename, lineno, funcname) tuples. Also eliminate redundant
get_stack() call in _alert() by using get_caller_fast() for the non-SHOW_ALL_CALLERS
path and deriving caller info from the already-captured stack for SHOW_ALL_CALLERS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…frame filtering → overhead_ratio=1.05, overhead_ratio_allcallers=1.06

Replace `any(pattern in fn for pattern in PATTERNS)` with two direct
`"site-packages" not in fn and "/zeal/" not in fn` checks in
get_caller_fast() and get_stack_fast(). Micro-benchmarks show this is
~5x faster for the per-frame pattern matching, eliminating generator
object allocation and 4-item iteration on every frame of the call stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
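The two filtering styles, side by side (the PATTERNS list here is a hypothetical reconstruction, not zeal's original):

```python
# Hypothetical reconstruction of the old vs. new per-frame filter. The
# any() form builds a generator object for every frame checked; the
# direct form is two short-circuiting substring tests with no allocation.
PATTERNS = ("site-packages", "/zeal/")  # illustrative pattern list

def is_internal_old(filename):
    return any(pattern in filename for pattern in PATTERNS)

def is_internal_new(filename):
    return "site-packages" in filename or "/zeal/" in filename
```

Note that a later commit in this PR replaces the bare "/zeal/" substring with a check against the package's real directory.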
…asattr() overhead → overhead_ratio=1.04, overhead_ratio_allcallers=1.03

Cache ZEAL_SHOW_ALL_CALLERS and ZEAL_NPLUSONE_THRESHOLD on the
NPlusOneContext dataclass using lazy initialization (None sentinel).
On the first notify() call per context, the settings are read via
hasattr() and cached; subsequent calls (~429 per workload) use the
cached value directly, avoiding ~1.2us of hasattr(settings, ...) overhead
per call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
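A sketch of the lazy, per-context settings cache (the settings object and the default threshold of 2 here are illustrative stand-ins, not Django's settings or zeal's actual default):

```python
from dataclasses import dataclass
from typing import Any

# Sketch of lazy settings caching with a None sentinel: the getattr()
# lookup runs once per context rather than once per notify() call.
@dataclass
class NPlusOneContext:
    _threshold: "int | None" = None  # None sentinel: not yet read

    def threshold(self, settings: Any) -> int:
        if self._threshold is None:
            self._threshold = getattr(settings, "ZEAL_NPLUSONE_THRESHOLD", 2)
        return self._threshold
```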
…tion 7)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update the README warning from "2.5x slower" to "~3-5% overhead" based
on benchmarking results from 6 optimization iterations.

Add auto/ directory with benchmark and autoresearch scripts used to
drive the optimizations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codspeed-hq

codspeed-hq bot commented Mar 18, 2026

Merging this PR will improve performance by ×6.3

⚡ 1 improved benchmark

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- |
| test_performance | 1,957 ms | 310.5 ms | ×6.3 |

Comparing perf/autoresearch-optimizations (9bfad41) with main (2aabdaa)


taobojlen and others added 3 commits March 18, 2026 11:47
…ions

Remove unused PATTERNS list, old get_stack()/get_caller() that used
inspect.stack(), and the unused inspect import. Rename get_caller_fast()
→ get_caller() and get_stack_fast() → get_stack().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…al/" substring

The "/zeal/" substring check would incorrectly filter out user code if
their project path contained "/zeal/" (e.g. /home/user/zeal-app/). Now
uses os.path.dirname(__file__) to check against the actual zeal package
directory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
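The fixed check can be sketched like this; `zeal_dir` is passed in for illustration, where the real code derives it from os.path.dirname(__file__):

```python
import os

# Sketch of the corrected filter: match against the package's actual
# directory prefix instead of a bare "/zeal/" substring, so user projects
# with "zeal" in their path are not misclassified as internal frames.
def is_internal(filename, zeal_dir):
    zeal_prefix = zeal_dir.rstrip(os.sep) + os.sep
    return "site-packages" in filename or filename.startswith(zeal_prefix)
```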
Ensures zeal internals and site-packages are correctly identified as
internal frames, while user code — including projects with "zeal" in
the path — is not filtered out.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@taobojlen taobojlen merged commit 6b15475 into main Mar 18, 2026
24 checks passed
