Skip to content

Feat/phase 12 performance benchmarking#13

Merged
DanielDeshmukh merged 4 commits into
mainfrom
feat/phase-12-performance-benchmarking
Jun 28, 2026
Merged

Feat/phase 12 performance benchmarking#13
DanielDeshmukh merged 4 commits into
mainfrom
feat/phase-12-performance-benchmarking

Conversation

@DanielDeshmukh

@DanielDeshmukh DanielDeshmukh commented Jun 28, 2026

Copy link
Copy Markdown
Owner

Summary by CodeRabbit

  • New Features

    • Search responses can now include stage-by-stage timing details when available.
    • Added benchmarking tools for quick profiling, single runs, and sweep comparisons.
    • Introduced configuration files and sample queries to support repeatable benchmark runs.
  • Bug Fixes

    • Search results now return richer performance metadata alongside standard response data.
  • Tests

    • Added regression coverage for benchmark config loading, query handling, and reporting behavior.

Create benchmark/ directory with config-driven performance benchmarking
CLI adapted from NVIDIA rag-perf patterns.

benchmark/configs/quick_profile.yaml:
- Profile-only mode (~30s runtime)
- 3 warmup + 10 profile requests
- No load testing — fast iteration on retrieval tuning

benchmark/configs/single_run.yaml:
- One concurrency level (5) with profiling + load test (~2 min)
- 5 warmup + 20 profile requests
- 3 iterations at concurrency 5

benchmark/configs/sweep.yaml:
- Multi-axis sweep: concurrency [1,5,10,20] x top_k [5,10,20]
- 12 grid points for bottleneck analysis
- Cartesian product with configurable sleep between points

benchmark/queries.jsonl:
- 25 legal queries covering all HECTOR route types
- JSONL format for streaming load

benchmark/rag_benchmark.py:
- Config-driven CLI: -c <config.yaml>
- Phase 1: Profiling — sequential queries with latency stats
  * avg, median, p95, p99, min, max, std dev
- Phase 2: Load test — concurrent queries at specified concurrency
  * throughput (QPS), error rate, latency percentiles
- Phase 3: Sweep — Cartesian product across axis combinations
- Outputs: report.json, report.md, profiling_results.json,
  load_test_results.json, sweep_results.json, sweep_results.csv

requirements.txt:
- Added pyyaml~=6.0.0
Add timing instrumentation to HectorOrchestrator.execute() that
measures each pipeline stage in milliseconds:

- route_ms: Intent routing (keyword matching + LLM fallback)
- normalize_ms: IPC-to-BNS query normalization
- generate_ms: Retrieval + response generation
- verify_ms: Chain-of-Verification (if enabled)
- total_ms: End-to-end latency

Timing is stored in self._last_timing and exposed via get_last_timing().

api/schemas.py:
- Added stage_timings: dict | None field to SearchResponse model

api/services.py:
- Populates stage_timings from orchestrator._last_timing in search response

This enables the benchmark CLI and API consumers to identify which
pipeline stage is the bottleneck (routing, retrieval, generation,
or verification).
benchmark/adapters/hector_adapter.py:
- Profiles individual queries against HECTOR API
- Extracts per-stage timing from SearchResponse.stage_timings
- Identifies bottleneck stage (routing/normalization/generation/verification)
- JSON output for programmatic analysis

benchmark/sweep_comparison.py:
- Reads sweep_results.json and prints formatted comparison table
- Sorts by concurrency x top_k
- Highlights optimal point (highest QPS)
- Shows throughput, avg/p95 latency, error rate per grid point

benchmark/__init__.py, benchmark/adapters/__init__.py:
- Package markers for imports
Add 19 tests validating the performance benchmarking framework:

TestBenchmarkConfig (4 tests):
- Quick profile config loads correctly
- Single run config has aiperf enabled with concurrency=5
- Sweep config loads list-valued axes (concurrency, top_k)
- BenchConfig has sensible defaults

TestQueryLoading (3 tests):
- Loads 20+ queries from benchmark/queries.jsonl
- Empty JSONL returns empty list
- Blank lines are skipped

TestBenchmarkMetrics (3 tests):
- Percentile computation correctness
- Throughput = requests / elapsed_seconds
- Error rate = errors / total * 100

TestHectorAdapter (2 tests):
- Adapter module is importable
- Connection errors handled gracefully

TestBenchmarkCli (4 tests):
- rag_benchmark module importable with all expected functions
- sweep_comparison module importable
- generate_report produces valid structure
- Report includes load test data when provided

TestPerformanceThresholds (3 tests):
- Config loading < 100ms (10 iterations)
- Query loading < 50ms (10 iterations)
- Report generation < 200ms

Total: 900 tests passing across all test files
@DanielDeshmukh DanielDeshmukh merged commit 87c0ed1 into main Jun 28, 2026
2 of 4 checks passed
@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 84cb5606-1afd-4766-9ecc-0e4b848e6838

📥 Commits

Reviewing files that changed from the base of the PR and between b10fb0d and 1e7f9d8.

📒 Files selected for processing (14)
  • api/schemas.py
  • api/services.py
  • benchmark/__init__.py
  • benchmark/adapters/__init__.py
  • benchmark/adapters/hector_adapter.py
  • benchmark/configs/quick_profile.yaml
  • benchmark/configs/single_run.yaml
  • benchmark/configs/sweep.yaml
  • benchmark/queries.jsonl
  • benchmark/rag_benchmark.py
  • benchmark/sweep_comparison.py
  • core/orchestrator.py
  • requirements.txt
  • tests/test_perf_benchmark.py

📝 Walkthrough

Walkthrough

Adds per-stage timing instrumentation to HectorOrchestrator.execute(), stores timings in _last_timing, and exposes them through SearchResponse.stage_timings in the API. Introduces a new benchmark package with config-driven profiling, load testing, sweep execution, report generation, a standalone adapter CLI, a sweep comparison script, sample YAML configs, benchmark queries, and a test suite.

Changes

Stage Timing Instrumentation and API Exposure

Layer / File(s) Summary
Orchestrator per-stage timing
core/orchestrator.py
Adds import time, wraps route/normalize/generate/verify steps with perf_counter timestamps in execute(), stores computed millisecond deltas in self._last_timing, and adds get_last_timing() accessor.
SearchResponse field and service wiring
api/schemas.py, api/services.py
Adds optional stage_timings: dict | None field to SearchResponse and populates it from orchestrator._last_timing via getattr in HectorApiService.search().

Benchmark Framework

Layer / File(s) Summary
Config schema, YAML loading, and queries
benchmark/__init__.py, benchmark/adapters/__init__.py, benchmark/rag_benchmark.py (L1–126), benchmark/queries.jsonl, requirements.txt
Defines BenchConfig and sub-config dataclasses, implements load_config (YAML→dataclass) and load_queries (JSONL), adds 25 benchmark queries, and pins pyyaml~=6.0.0.
HTTP client, profiling, load test, and sweep
benchmark/rag_benchmark.py (L127–355)
Implements query_hector POST wrapper with latency capture, run_profiling (warmup + sequential timed requests + percentile stats), run_load_test (concurrent ThreadPoolExecutor with QPS aggregation), and run_sweep (parameter-space iteration over concurrency × top_k).
Report generation, persistence, and CLI
benchmark/rag_benchmark.py (L356–587)
Implements generate_report, save_results (JSON/CSV/Markdown output), _format_markdown_report, and main() that wires config loading through execution phases to saved output paths.
Standalone hector_adapter CLI
benchmark/adapters/hector_adapter.py
Adds profile_query() single-request HTTP adapter with stage timing extraction and main() CLI that prints JSON results and identifies the bottleneck stage.
Sweep comparison script
benchmark/sweep_comparison.py
Loads sweep_results.json, filters/sorts points by concurrency and top_k, prints a formatted table, and selects the optimal configuration by maximum throughput QPS.
Sample configs and tests
benchmark/configs/*.yaml, tests/test_perf_benchmark.py
Adds quick_profile, single_run, and sweep YAML configs; tests cover config loading, query loading, metric math, adapter error propagation, CLI entrypoints, report structure, and performance regression thresholds.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 Hoppity-hop through the stages I go,
Timing each step with a perf_counter glow.
Routes, normalize, generate—check!
Sweep and compare, no config a wreck.
My benchmarks now run in a YAML-fed race,
The slowest stage found at a bunny's swift pace! 🕐

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/phase-12-performance-benchmarking

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant