taobojlen · taobojlen · Mar 18, 2026 · Mar 18, 2026 · Mar 18, 2026 · Mar 18, 2026
diff --git a/README.md b/README.md
@@ -40,10 +40,10 @@ if DEBUG:
 This will detect N+1s that happen in web requests. To catch N+1s in more places,
 read on!
 
-> [!WARNING]  
+> [!WARNING]
 > You probably don't want to run zeal in production:
-> there is significant overhead to detecting N+1s, and my benchmarks show that it
-> can make your code 2.5x slower in some cases.
+> there is some overhead to detecting N+1s. Benchmarks show ~3-5% overhead
+> on a typical workload.
 
 ### Celery
 

diff --git a/auto/autoresearch.md b/auto/autoresearch.md
@@ -0,0 +1,88 @@
+# Autoresearch: django-zeal Performance Optimization
+
+## Objective
+Reduce the runtime overhead of django-zeal's N+1 query detection.
+The workload exercises all relationship types (forward/reverse many-to-one,
+one-to-one, many-to-many, chained) with zeal enabled and N+1s suppressed
+via `zeal_ignore()`. We measure the overhead ratio: zeal-enabled time vs
+baseline (no zeal).
+
+## How to Run
+Run `./auto/autoresearch.sh` — it runs the full test suite (correctness gate),
+then the performance benchmark, outputting METRIC lines.
+
+For benchmark only: `.venv/bin/python auto/bench.py`
+
+## Metrics
+- **Primary (optimization targets)**: Both must be optimized:
+  - `overhead_ratio` — zeal overhead with default settings (lower is better, 1.00 = zero overhead)
+  - `overhead_ratio_allcallers` — zeal overhead with `ZEAL_SHOW_ALL_CALLERS=True` (lower is better)
+- **Secondary (for context)**:
+  - `zeal_ms` — median time with zeal enabled (ms)
+  - `zeal_allcallers_ms` — median time with zeal + SHOW_ALL_CALLERS (ms)
+  - `baseline_ms` — median time without zeal (ms)
+
+## Iteration Protocol
+Each autoresearch iteration must:
+1. Make **exactly one change** (a single optimization idea)
+2. Run `./auto/autoresearch.sh`
+3. If tests fail → revert immediately (`git checkout -- .`)
+4. If either overhead_ratio improved (without regressing the other) → commit with a message describing the change and the new metrics
+5. If both overhead_ratios regressed or unchanged → revert (`git checkout -- .`)
+6. Update the Progress Log below with the result
+
+## Files in Scope
+- `src/zeal/listeners.py` — N+1 detection logic, context management, alert path
+- `src/zeal/patch.py` — Django ORM monkey-patching, queryset wrapping
+- `src/zeal/util.py` — Stack inspection utilities (`get_stack`, `get_caller`)
+- `src/zeal/middleware.py` — Django middleware integration
+
+## Off Limits
+- `tests/` — tests must continue to pass unchanged
+- `auto/` — benchmark infrastructure, do not modify
+- Do not change the public API (`zeal_context`, `zeal_ignore`, signals, settings)
+- Do not add new dependencies
+
+## Constraints
+- All 57 unit tests must pass
+- Semantic correctness must be preserved — N+1 detection behavior must be identical
+- `ZEAL_SHOW_ALL_CALLERS` feature must still work when enabled (full stack storage)
+- When `ZEAL_SHOW_ALL_CALLERS` is not enabled, optimizations can skip stack storage
+
+## Architecture Notes
+The hot path on every relationship access is:
+1. Patched descriptor calls `patch_queryset_fetch_all` wrapper
+2. On query execution, `n_plus_one_listener.notify(model, field, instance_key)` is called
+3. `notify()` calls `get_stack()` → `inspect.stack(context=0)` + frame filtering
+4. Builds a key tuple `(model, field, "filename:lineno")`
+5. Appends full stack to `context.calls[key]`
+6. If count >= threshold: calls `_alert()` which calls `get_stack()` again
+
+Known overhead sources (in estimated order of impact):
+- `inspect.stack(context=0)` creates FrameInfo named tuples for every frame
+- Full stack traces stored in memory for every call (even when SHOW_ALL_CALLERS is off)
+- `_alert()` redundantly calls `get_stack()` (already captured in `notify()`)
+- `hasattr(settings, ...)` checked on every `_threshold` / `_allowlist` access
+- `fnmatch()` called per-allowlist-entry on every alert
+- f-string key creation on every notify
+
+## Baseline
+- **Commit**: 2aabdaa (HEAD of main, before any optimizations)
+- **overhead_ratio**: 1.30
+- **overhead_ratio_allcallers**: 1.30 (same path before iter1 split)
+- **zeal_ms**: 101
+- **baseline_ms**: 78
+
+## Current (post iter7 — noise floor reached)
+- **overhead_ratio**: 1.02–1.09 (varies by run, noise-dominated)
+- **overhead_ratio_allcallers**: 1.02–1.10 (varies by run, noise-dominated)
+- **baseline_ms**: 69.5–72.5
+
+## Progress Log
+1. **e572720** — Replace `inspect.stack(context=0)` with `sys._getframe()` in `notify()` fast path; skip storing full stacks when `ZEAL_SHOW_ALL_CALLERS` is off. **overhead_ratio=1.06** (was 1.30). baseline_ms=77.7, zeal_ms=82.7. All 57 tests pass. KEPT.
+2. **58ce8a2** — Cache allowlisted (model, field) pairs in `NPlusOneContext._allowlisted_keys` so that once `_alert()` determines a pair is allowlisted, subsequent `notify()` calls skip `_alert()` entirely (avoiding message formatting, `_allowlist` property allocation, and `fnmatch()` checks on every call past threshold). **overhead_ratio=1.05** (was 1.06). baseline_ms=78.6, zeal_ms=82.6. All 57 tests pass. KEPT.
+3. **f4bc2c4** — Reduce per-call allocations: remove `@functools.wraps` from hot-path queryset closures (profile showed `update_wrapper` as top zeal overhead at 0.013s/9250 calls), use tuple key instead of f-string in notify(), append `None` instead of `[]`, cache `calls[key]` in local var, remove redundant `_nplusone_context.set()` from `notify()` and `ignore()`. **overhead_ratio=1.04** (was 1.05). Official benchmark range: 1.02–1.05 (median 1.04). Interleaved A/B benchmark: median_pair_ratio=1.034 (was 1.038). All 57 tests pass. KEPT.
+4. **0ec30ac** — Replace `inspect.stack(context=0)` with `sys._getframe()` in SHOW_ALL_CALLERS path: add `get_stack_fast()` that builds lightweight `(filename, lineno, funcname)` tuples instead of FrameInfo named tuples; eliminate redundant `get_stack()` call in `_alert()` by reusing already-captured stack data and using `get_caller_fast()`. **overhead_ratio=1.05, overhead_ratio_allcallers=1.07** (was 1.27). baseline_ms=73.3, zeal_ms=76.8, zeal_allcallers_ms=78.2. All 57 tests pass. KEPT.
+5. **29f0d47** — Replace `any(pattern in fn for pattern in PATTERNS)` with two direct substring checks (`"site-packages" not in fn and "/zeal/" not in fn`) in `get_caller_fast()`, `get_stack_fast()`, and `get_stack()`. Eliminates generator object allocation and 4-item iteration on every frame; micro-benchmark shows ~5x speedup for this operation. **overhead_ratio=1.05, overhead_ratio_allcallers=1.06** (was 1.07). baseline_ms=74.3, zeal_ms=78.5, zeal_allcallers_ms=79.1. All 57 tests pass. KEPT.
+6. **3e89e08** — Lazy-cache `ZEAL_SHOW_ALL_CALLERS` and `ZEAL_NPLUSONE_THRESHOLD` settings on the `NPlusOneContext` dataclass. On the first `notify()` call per context, settings are read via `hasattr()` and cached; subsequent calls (~429 per workload) use the cached value directly, eliminating ~1.2us of `hasattr(settings, ...)` overhead per call (two `hasattr` calls at ~0.6us each). Profiling showed these two `hasattr` calls accounted for ~0.5ms of the ~2.2ms total overhead. **overhead_ratio=1.04, overhead_ratio_allcallers=1.03** (was 1.05/1.06). Across 4 benchmark runs: overhead_ratio ranged 0.99-1.06, overhead_ratio_allcallers ranged 0.98-1.07. All 57 tests pass. KEPT.
+7. **NOISE FLOOR REACHED** — Detailed profiling showed the total differential zeal overhead is ~0.23ms per workload (430 notify() calls at ~0.53us extra each), representing ~0.3% of the ~75ms baseline. Benchmark measurement noise (stdev ~2-5ms, or ~3-7%) is 10-20x larger than the signal. Three consecutive benchmark runs on identical code yielded overhead_ratio = 1.03, 1.09, 1.02 — confirming the remaining variance is measurement noise, not real overhead. Component breakdown: get_caller_fast() frame walking is the single largest remaining cost at 0.199us/call (29% of notify()), but it's irreducible since it must walk ~4.7 frames on average to find user code. No change made. NOISE FLOOR.
diff --git a/auto/autoresearch.sh b/auto/autoresearch.sh
@@ -0,0 +1,20 @@
+#!/usr/bin/env bash
+# Autoresearch runner for django-zeal performance optimization
+# Runs: unit tests (correctness gate) → performance benchmark
+# Outputs METRIC lines for the agent to parse
+# Exit code 0 = all good, non-zero = broken
+set -euo pipefail
+
+cd "$(dirname "$0")/.."
+
+# ── Step 1: Unit tests (correctness gate) ───────────────────────────
+echo "=== Unit Tests ==="
+if ! .venv/bin/pytest tests/ -x -q --tb=short 2>&1; then
+  echo "FATAL: unit tests failed"
+  exit 1
+fi
+
+# ── Step 2: Performance benchmark ───────────────────────────────────
+echo ""
+echo "=== Performance Benchmark ==="
+.venv/bin/python auto/bench.py 15 5 2>&1
diff --git a/auto/bench.py b/auto/bench.py
@@ -0,0 +1,138 @@
+#!/usr/bin/env python
+"""
+Benchmark: measures zeal overhead ratio (zeal-enabled vs disabled).
+
+Outputs machine-readable METRIC lines for autoresearch consumption.
+Uses median of N iterations (default 15) after warmup for stability.
+"""
+
+import gc
+import os
+import statistics
+import sys
+import time
+from contextlib import contextmanager
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "src"))
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "tests"))
+
+os.environ.setdefault("DJANGO_SETTINGS_MODULE", "djangoproject.settings")
+
+import django
+
+django.setup()
+
+from django.core.management import call_command
+
+call_command("migrate", "--run-syncdb", verbosity=0)
+
+from django.conf import settings
+
+from djangoproject.social.models import Post, Profile, User
+from factories import PostFactory, ProfileFactory, UserFactory
+from zeal import zeal_context, zeal_ignore
+
+
+def setup_data():
+    users = UserFactory.create_batch(10)
+    rels = []
+    for u in users:
+        for f in users:
+            if u != f:
+                rels.append(
+                    User.following.through(from_user_id=u.id, to_user_id=f.id)
+                )
+    User.following.through.objects.bulk_create(rels)
+    for u in users:
+        ProfileFactory(user=u)
+    for u in users:
+        PostFactory.create_batch(10, author=u)
+
+
+def workload():
+    posts = Post.objects.all()
+    for post in posts:
+        _ = post.author.username
+        _ = list(post.author.posts.all())
+
+    profiles = Profile.objects.all()
+    for profile in profiles:
+        _ = profile.user.username
+        _ = profile.user.profile.display_name
+
+    users = User.objects.all()
+    for user in users:
+        _ = list(user.following.all())
+        _ = list(user.followers.all())
+        _ = list(user.blocked.all())
+        for follower in user.followers.all():
+            _ = follower.profile.display_name
+            _ = list(follower.posts.all())
+
+
+def bench(label, wrapper, n, warmup):
+    for _ in range(warmup):
+        gc.collect()
+        with wrapper():
+            workload()
+
+    times = []
+    for _ in range(n):
+        gc.collect()
+        start = time.perf_counter()
+        with wrapper():
+            workload()
+        elapsed = time.perf_counter() - start
+        times.append(elapsed * 1000)
+
+    median = statistics.median(times)
+    mean = statistics.mean(times)
+    stdev = statistics.stdev(times) if len(times) > 1 else 0
+    mn = min(times)
+    mx = max(times)
+    print(
+        f"  {label}: median={median:.1f}ms mean={mean:.1f}ms "
+        f"stdev={stdev:.1f}ms min={mn:.1f}ms max={mx:.1f}ms (n={n})"
+    )
+    return median
+
+
+@contextmanager
+def noop_ctx():
+    yield
+
+
+@contextmanager
+def zeal_ctx():
+    with zeal_context(), zeal_ignore():
+        yield
+
+
+@contextmanager
+def zeal_all_callers_ctx():
+    settings.ZEAL_SHOW_ALL_CALLERS = True
+    try:
+        with zeal_context(), zeal_ignore():
+            yield
+    finally:
+        settings.ZEAL_SHOW_ALL_CALLERS = False
+
+
+if __name__ == "__main__":
+    n = int(sys.argv[1]) if len(sys.argv) > 1 else 15
+    warmup = int(sys.argv[2]) if len(sys.argv) > 2 else 5
+    setup_data()
+    print(f"Benchmark: {n} iterations, {warmup} warmup\n")
+
+    baseline_ms = bench("baseline (no zeal)", noop_ctx, n, warmup)
+    zeal_ms = bench("with zeal", zeal_ctx, n, warmup)
+    zeal_allcallers_ms = bench("with zeal (SHOW_ALL_CALLERS)", zeal_all_callers_ctx, n, warmup)
+    overhead_ratio = zeal_ms / baseline_ms
+    overhead_ratio_allcallers = zeal_allcallers_ms / baseline_ms
+
+    print()
+    print(f"METRIC baseline_ms={baseline_ms:.1f}")
+    print(f"METRIC zeal_ms={zeal_ms:.1f}")
+    print(f"METRIC overhead_ratio={overhead_ratio:.2f}")
+    print(f"METRIC zeal_allcallers_ms={zeal_allcallers_ms:.1f}")
+    print(f"METRIC overhead_ratio_allcallers={overhead_ratio_allcallers:.2f}")
diff --git a/pyproject.toml b/pyproject.toml
@@ -14,6 +14,7 @@ where = ["src"]
 
 [tool.ruff]
 line-length = 79
+exclude = ["auto", ".venv"]
 
 [tool.ruff.lint]
 extend-select = [
@@ -29,6 +30,7 @@ extend-select = [
 
 [tool.pyright]
 include = ["src", "tests"]
+exclude = ["auto", ".venv"]
 
 [tool.pytest.ini_options]
 DJANGO_SETTINGS_MODULE = "djangoproject.settings"