Feat/phase 12 performance benchmarking by DanielDeshmukh · Pull Request #13 · DanielDeshmukh/Hector

DanielDeshmukh · 2026-06-28T13:17:25Z

Summary by CodeRabbit

New Features
- Search responses can now include stage-by-stage timing details when available.
- Added benchmarking tools for quick profiling, single runs, and sweep comparisons.
- Introduced configuration files and sample queries to support repeatable benchmark runs.
Bug Fixes
- Search results now return richer performance metadata alongside standard response data.
Tests
- Added regression coverage for benchmark config loading, query handling, and reporting behavior.

Create benchmark/ directory with config-driven performance benchmarking CLI adapted from NVIDIA rag-perf patterns.

benchmark/configs/quick_profile.yaml:
- Profile-only mode (~30s runtime)
- 3 warmup + 10 profile requests
- No load testing; fast iteration on retrieval tuning

benchmark/configs/single_run.yaml:
- One concurrency level (5) with profiling + load test (~2 min)
- 5 warmup + 20 profile requests
- 3 iterations at concurrency 5

benchmark/configs/sweep.yaml:
- Multi-axis sweep: concurrency [1,5,10,20] x top_k [5,10,20]
- 12 grid points for bottleneck analysis
- Cartesian product with configurable sleep between points

benchmark/queries.jsonl:
- 25 legal queries covering all HECTOR route types
- JSONL format for streaming load

benchmark/rag_benchmark.py:
- Config-driven CLI: -c <config.yaml>
- Phase 1: Profiling, sequential queries with latency stats
  * avg, median, p95, p99, min, max, std dev
- Phase 2: Load test, concurrent queries at specified concurrency
  * throughput (QPS), error rate, latency percentiles
- Phase 3: Sweep, Cartesian product across axis combinations
- Outputs: report.json, report.md, profiling_results.json, load_test_results.json, sweep_results.json, sweep_results.csv

requirements.txt:
- Added pyyaml~=6.0.0
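To make the sweep grid concrete, here is a minimal sketch of how the two axes described for sweep.yaml expand into 12 grid points. The names `SWEEP_AXES` and `build_grid` are hypothetical and are not taken from rag_benchmark.py; only the axis values come from the description above.

```python
from itertools import product

# Axis values as listed for benchmark/configs/sweep.yaml.
SWEEP_AXES = {
    "concurrency": [1, 5, 10, 20],
    "top_k": [5, 10, 20],
}


def build_grid(axes):
    """Return one dict per grid point (Cartesian product of all axis values)."""
    names = list(axes)
    combos = product(*(axes[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


grid = build_grid(SWEEP_AXES)
print(len(grid))  # 12
print(grid[0])    # {'concurrency': 1, 'top_k': 5}
```

A runner would then iterate over `grid`, execute a load test per point, and sleep for the configured interval between points.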

Add timing instrumentation to HectorOrchestrator.execute() that measures each pipeline stage in milliseconds:
- route_ms: Intent routing (keyword matching + LLM fallback)
- normalize_ms: IPC-to-BNS query normalization
- generate_ms: Retrieval + response generation
- verify_ms: Chain-of-Verification (if enabled)
- total_ms: End-to-end latency

Timing is stored in self._last_timing and exposed via get_last_timing().

api/schemas.py:
- Added stage_timings: dict | None field to SearchResponse model

api/services.py:
- Populates stage_timings from orchestrator._last_timing in search response

This enables the benchmark CLI and API consumers to identify which pipeline stage is the bottleneck (routing, retrieval, generation, or verification).
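The instrumentation pattern can be sketched roughly as follows. `TimedPipeline` is a hypothetical stand-in for HectorOrchestrator with stub stages; only the timing keys, `_last_timing`, and `get_last_timing()` come from the description above. Recording `verify_ms` as `None` when verification is disabled is an assumption of this sketch.

```python
import time


class TimedPipeline:
    """Hypothetical stand-in for HectorOrchestrator; the stage bodies are stubs."""

    def __init__(self, verify_enabled=True):
        self.verify_enabled = verify_enabled
        self._last_timing = None

    # Stub stages: the real routing, normalization, generation and
    # verification logic is not part of this sketch.
    def _route(self, query):
        return "statute_lookup"

    def _normalize(self, query):
        return query.strip()

    def _generate(self, query, route):
        return f"[{route}] answer for: {query}"

    def _verify(self, answer):
        return answer

    @staticmethod
    def _timed(fn, *args):
        """Run fn(*args) and return (result, elapsed milliseconds)."""
        start = time.perf_counter()
        result = fn(*args)
        return result, (time.perf_counter() - start) * 1000.0

    def execute(self, query):
        timing = {}
        start = time.perf_counter()
        route, timing["route_ms"] = self._timed(self._route, query)
        normalized, timing["normalize_ms"] = self._timed(self._normalize, query)
        answer, timing["generate_ms"] = self._timed(self._generate, normalized, route)
        if self.verify_enabled:
            answer, timing["verify_ms"] = self._timed(self._verify, answer)
        else:
            timing["verify_ms"] = None  # assumption: stage skipped, no measurement
        timing["total_ms"] = (time.perf_counter() - start) * 1000.0
        self._last_timing = timing
        return answer

    def get_last_timing(self):
        return self._last_timing


pipeline = TimedPipeline()
pipeline.execute("  punishment for theft  ")
print(pipeline.get_last_timing())
```

Because the timings live on the instance, concurrent calls to `execute()` on a shared object would overwrite each other's measurements; returning the dict alongside the result is one way to avoid that.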

benchmark/adapters/hector_adapter.py:
- Profiles individual queries against HECTOR API
- Extracts per-stage timing from SearchResponse.stage_timings
- Identifies bottleneck stage (routing/normalization/generation/verification)
- JSON output for programmatic analysis

benchmark/sweep_comparison.py:
- Reads sweep_results.json and prints formatted comparison table
- Sorts by concurrency x top_k
- Highlights optimal point (highest QPS)
- Shows throughput, avg/p95 latency, error rate per grid point

benchmark/__init__.py, benchmark/adapters/__init__.py:
- Package markers for imports
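Bottleneck identification from a `stage_timings` dict might look like the sketch below. `find_bottleneck` is a hypothetical helper, not the adapter's actual function, and the sample numbers are invented for illustration.

```python
def find_bottleneck(stage_timings):
    """Return (stage, ms, share_of_total) for the slowest stage, or None."""
    if not stage_timings:
        return None
    # Ignore the end-to-end figure and any stage that was not measured.
    stages = {
        name: ms
        for name, ms in stage_timings.items()
        if name != "total_ms" and ms is not None
    }
    if not stages:
        return None
    slowest = max(stages, key=stages.get)
    total = stage_timings.get("total_ms") or sum(stages.values())
    share = stages[slowest] / total if total else 0.0
    return slowest, stages[slowest], share


# Invented sample values, shaped like the timing keys described above.
sample = {
    "route_ms": 4.0,
    "normalize_ms": 1.0,
    "generate_ms": 180.0,
    "verify_ms": 15.0,
    "total_ms": 200.0,
}
print(find_bottleneck(sample))  # ('generate_ms', 180.0, 0.9)
```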

Add 19 tests validating the performance benchmarking framework:

TestBenchmarkConfig (4 tests):
- Quick profile config loads correctly
- Single run config has aiperf enabled with concurrency=5
- Sweep config loads list-valued axes (concurrency, top_k)
- BenchConfig has sensible defaults

TestQueryLoading (3 tests):
- Loads 20+ queries from benchmark/queries.jsonl
- Empty JSONL returns empty list
- Blank lines are skipped

TestBenchmarkMetrics (3 tests):
- Percentile computation correctness
- Throughput = requests / elapsed_seconds
- Error rate = errors / total * 100

TestHectorAdapter (2 tests):
- Adapter module is importable
- Connection errors handled gracefully

TestBenchmarkCli (4 tests):
- rag_benchmark module importable with all expected functions
- sweep_comparison module importable
- generate_report produces valid structure
- Report includes load test data when provided

TestPerformanceThresholds (3 tests):
- Config loading < 100ms (10 iterations)
- Query loading < 50ms (10 iterations)
- Report generation < 200ms

Total: 900 tests passing across all test files
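The metric formulas named in TestBenchmarkMetrics can be sketched as below. The throughput and error-rate formulas follow the test descriptions; the percentile method is not stated in this PR, so nearest-rank is used here as an assumption, and all function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (an assumed method); 0.0 for an empty list."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def throughput_qps(requests, elapsed_seconds):
    """Throughput = requests / elapsed_seconds."""
    return requests / elapsed_seconds if elapsed_seconds > 0 else 0.0


def error_rate_pct(errors, total):
    """Error rate = errors / total * 100."""
    return errors / total * 100 if total else 0.0


latencies_ms = [float(n) for n in range(1, 101)]  # 1.0 .. 100.0
print(percentile(latencies_ms, 95))  # 95.0
print(throughput_qps(60, 12.0))      # 5.0
print(error_rate_pct(5, 20))         # 25.0
```

Guarding against a zero denominator keeps a failed or empty run from raising while the report is being assembled.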

coderabbitai · 2026-06-28T13:17:37Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 84cb5606-1afd-4766-9ecc-0e4b848e6838

📥 Commits

Reviewing files that changed from the base of the PR and between b10fb0d and 1e7f9d8.

📒 Files selected for processing (14)

api/schemas.py
api/services.py
benchmark/__init__.py
benchmark/adapters/__init__.py
benchmark/adapters/hector_adapter.py
benchmark/configs/quick_profile.yaml
benchmark/configs/single_run.yaml
benchmark/configs/sweep.yaml
benchmark/queries.jsonl
benchmark/rag_benchmark.py
benchmark/sweep_comparison.py
core/orchestrator.py
requirements.txt
tests/test_perf_benchmark.py

📝 Walkthrough

Adds per-stage timing instrumentation to HectorOrchestrator.execute(), stores timings in _last_timing, and exposes them through SearchResponse.stage_timings in the API. Introduces a new benchmark package with config-driven profiling, load testing, sweep execution, report generation, a standalone adapter CLI, a sweep comparison script, sample YAML configs, benchmark queries, and a test suite.

Changes

Stage Timing Instrumentation and API Exposure

| Layer / File(s) | Summary |
|---|---|
| Orchestrator per-stage timing `core/orchestrator.py` | Adds `import time`, wraps route/normalize/generate/verify steps with `perf_counter` timestamps in `execute()`, stores computed millisecond deltas in `self._last_timing`, and adds `get_last_timing()` accessor. |
| SearchResponse field and service wiring `api/schemas.py`, `api/services.py` | Adds optional `stage_timings: dict \| None` field to `SearchResponse` and populates it from `orchestrator._last_timing` via `getattr` in `HectorApiService.search()`. |
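The `getattr` wiring described here can be illustrated with a small sketch. `SearchResponseSketch` is a plain dataclass standing in for the real `SearchResponse` model, and `build_response` plus both orchestrator classes are hypothetical; only the `stage_timings` field and the `_last_timing` attribute name come from the summary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchResponseSketch:
    """Hypothetical stand-in for SearchResponse with the new optional field."""
    answer: str
    stage_timings: Optional[dict] = None


class InstrumentedOrchestrator:
    """Stand-in for an orchestrator that has recorded timings."""
    _last_timing = {"route_ms": 2.5, "total_ms": 40.0}


class LegacyOrchestrator:
    """Stand-in for an orchestrator without timing support."""


def build_response(orchestrator, answer):
    # getattr with a default keeps the response valid even when the
    # orchestrator has no _last_timing attribute.
    timings = getattr(orchestrator, "_last_timing", None)
    return SearchResponseSketch(answer=answer, stage_timings=timings)


print(build_response(InstrumentedOrchestrator(), "ok").stage_timings)
print(build_response(LegacyOrchestrator(), "ok").stage_timings)  # None
```

Defaulting to `None` means existing API consumers that ignore the field see no change in behavior.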

Benchmark Framework

| Layer / File(s) | Summary |
|---|---|
| Config schema, YAML loading, and queries `benchmark/__init__.py`, `benchmark/adapters/__init__.py`, `benchmark/rag_benchmark.py` (L1–126), `benchmark/queries.jsonl`, `requirements.txt` | Defines `BenchConfig` and sub-config dataclasses, implements `load_config` (YAML→dataclass) and `load_queries` (JSONL), adds 25 benchmark queries, and pins `pyyaml~=6.0.0`. |
| HTTP client, profiling, load test, and sweep `benchmark/rag_benchmark.py` (L127–355) | Implements `query_hector` POST wrapper with latency capture, `run_profiling` (warmup + sequential timed requests + percentile stats), `run_load_test` (concurrent `ThreadPoolExecutor` with QPS aggregation), and `run_sweep` (parameter-space iteration over concurrency × top_k). |
| Report generation, persistence, and CLI `benchmark/rag_benchmark.py` (L356–587) | Implements `generate_report`, `save_results` (JSON/CSV/Markdown output), `_format_markdown_report`, and `main()` that wires config loading through execution phases to saved output paths. |
| Standalone hector_adapter CLI `benchmark/adapters/hector_adapter.py` | Adds `profile_query()` single-request HTTP adapter with stage timing extraction and `main()` CLI that prints JSON results and identifies the bottleneck stage. |
| Sweep comparison script `benchmark/sweep_comparison.py` | Loads `sweep_results.json`, filters/sorts points by concurrency and top_k, prints a formatted table, and selects the optimal configuration by maximum throughput QPS. |
| Sample configs and tests `benchmark/configs/*.yaml`, `tests/test_perf_benchmark.py` | Adds `quick_profile`, `single_run`, and `sweep` YAML configs; tests cover config loading, query loading, metric math, adapter error propagation, CLI entrypoints, report structure, and performance regression thresholds. |
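As a rough illustration of the concurrent load-test phase summarized in the table, the sketch below drives a stub request function through a `ThreadPoolExecutor` and aggregates QPS and error rate. `fake_query` replaces the real HTTP call, and the signature and result keys are assumptions, not the actual `run_load_test` interface.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_query(query):
    """Hypothetical stand-in for the HTTP call; returns (ok, latency_ms)."""
    start = time.perf_counter()
    time.sleep(0.01)          # simulate network and processing time
    ok = bool(query.strip())  # blank queries count as errors, for illustration
    return ok, (time.perf_counter() - start) * 1000.0


def run_load_test(queries, concurrency, send=fake_query):
    """Send all queries with `concurrency` worker threads and summarize."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(send, queries))
    elapsed = time.perf_counter() - started

    errors = sum(1 for ok, _ in results if not ok)
    latencies = [ms for ok, ms in results if ok]
    return {
        "requests": len(results),
        "errors": errors,
        "error_rate_pct": errors / len(results) * 100 if results else 0.0,
        "throughput_qps": len(results) / elapsed if elapsed > 0 else 0.0,
        "avg_latency_ms": sum(latencies) / len(latencies) if latencies else 0.0,
    }


summary = run_load_test(["q1", "q2", "", "q4"] * 5, concurrency=5)
print(summary)
```

Threads suit this workload because each request spends most of its time waiting on I/O, so the interpreter lock is not the limiting factor.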

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 Hoppity-hop through the stages I go,
Timing each step with a perf_counter glow.
Routes, normalize, generate—check!
Sweep and compare, no config a wreck.
My benchmarks now run in a YAML-fed race,
The slowest stage found at a bunny's swift pace! 🕐

DanielDeshmukh added 4 commits June 28, 2026 18:17

DanielDeshmukh merged commit 87c0ed1 into main Jun 28, 2026
2 of 4 checks passed

DanielDeshmukh merged 4 commits into main from feat/phase-12-performance-benchmarking
