Feat/phase 12 performance benchmarking #13
Create benchmark/ directory with config-driven performance benchmarking CLI adapted from NVIDIA rag-perf patterns.

benchmark/configs/quick_profile.yaml:
- Profile-only mode (~30s runtime)
- 3 warmup + 10 profile requests
- No load testing, for fast iteration on retrieval tuning

benchmark/configs/single_run.yaml:
- One concurrency level (5) with profiling + load test (~2 min)
- 5 warmup + 20 profile requests
- 3 iterations at concurrency 5

benchmark/configs/sweep.yaml:
- Multi-axis sweep: concurrency [1,5,10,20] x top_k [5,10,20]
- 12 grid points for bottleneck analysis
- Cartesian product with configurable sleep between points

benchmark/queries.jsonl:
- 25 legal queries covering all HECTOR route types
- JSONL format for streaming load

benchmark/rag_benchmark.py:
- Config-driven CLI: -c <config.yaml>
- Phase 1 (profiling): sequential queries with latency stats
  * avg, median, p95, p99, min, max, std dev
- Phase 2 (load test): concurrent queries at specified concurrency
  * throughput (QPS), error rate, latency percentiles
- Phase 3 (sweep): Cartesian product across axis combinations
- Outputs: report.json, report.md, profiling_results.json, load_test_results.json, sweep_results.json, sweep_results.csv

requirements.txt:
- Added pyyaml~=6.0.0
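For readers unfamiliar with grid sweeps, here is a minimal sketch of how the two axes in sweep.yaml could expand into 12 grid points. The function name `expand_sweep` and the dict-of-lists input shape are assumptions made for illustration; the actual code in benchmark/rag_benchmark.py may be organized differently.

```python
from itertools import product


def expand_sweep(axes: dict) -> list:
    """Return one dict per grid point: the Cartesian product of all axis values.

    `axes` maps an axis name to its list of values (hypothetical input shape).
    """
    names = list(axes)
    combos = product(*(axes[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Axis values taken from the sweep.yaml description in this commit message.
grid = expand_sweep({"concurrency": [1, 5, 10, 20], "top_k": [5, 10, 20]})

print(len(grid))  # 12
print(grid[0])    # {'concurrency': 1, 'top_k': 5}
print(grid[-1])   # {'concurrency': 20, 'top_k': 20}
```

The last axis varies fastest, so every top_k value is visited for a given concurrency before the concurrency changes.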
Add timing instrumentation to HectorOrchestrator.execute() that measures each pipeline stage in milliseconds:
- route_ms: Intent routing (keyword matching + LLM fallback)
- normalize_ms: IPC-to-BNS query normalization
- generate_ms: Retrieval + response generation
- verify_ms: Chain-of-Verification (if enabled)
- total_ms: End-to-end latency

Timing is stored in self._last_timing and exposed via get_last_timing().

api/schemas.py:
- Added stage_timings: dict | None field to SearchResponse model

api/services.py:
- Populates stage_timings from orchestrator._last_timing in search response

This enables the benchmark CLI and API consumers to identify which pipeline stage is the bottleneck (routing, normalization, retrieval + generation, or verification).
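One way to picture the instrumentation is the sketch below, which times each stage with `time.perf_counter()` and stores the results under the key names listed above. `StageTimer` and `SketchOrchestrator` are hypothetical stand-ins and the stage bodies are placeholders; this is not the HectorOrchestrator implementation, only an illustration under those assumptions.

```python
from __future__ import annotations

import time
from contextlib import contextmanager


class StageTimer:
    """Collects per-stage wall-clock durations in milliseconds."""

    def __init__(self) -> None:
        self.timings: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.timings[name] = (time.perf_counter() - start) * 1000.0


class SketchOrchestrator:
    """Hypothetical stand-in for HectorOrchestrator; stage bodies are placeholders."""

    def __init__(self, verify: bool = True) -> None:
        self.verify = verify
        self._last_timing: dict[str, float] | None = None

    def execute(self, query: str) -> str:
        timer = StageTimer()
        with timer.stage("total_ms"):
            with timer.stage("route_ms"):
                route = "placeholder_route"
            with timer.stage("normalize_ms"):
                normalized = query.strip()
            with timer.stage("generate_ms"):
                answer = f"[{route}] {normalized}"
            if self.verify:  # verify_ms is only recorded when verification runs
                with timer.stage("verify_ms"):
                    pass
        self._last_timing = timer.timings
        return answer

    def get_last_timing(self) -> dict[str, float] | None:
        return self._last_timing


orchestrator = SketchOrchestrator()
orchestrator.execute("  example query  ")
print(sorted(orchestrator.get_last_timing()))
```

Because total_ms wraps all other stages, it is always at least as large as any single stage, and the difference between total_ms and the sum of the stages shows the untimed overhead.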
benchmark/adapters/hector_adapter.py:
- Profiles individual queries against HECTOR API
- Extracts per-stage timing from SearchResponse.stage_timings
- Identifies bottleneck stage (routing/normalization/generation/verification)
- JSON output for programmatic analysis

benchmark/sweep_comparison.py:
- Reads sweep_results.json and prints formatted comparison table
- Sorts by concurrency x top_k
- Highlights optimal point (highest QPS)
- Shows throughput, avg/p95 latency, error rate per grid point

benchmark/__init__.py, benchmark/adapters/__init__.py:
- Package markers for imports
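The two selection steps described above (bottleneck stage and optimal sweep point) can be sketched as follows. The function names, the `qps` field name, and the sample numbers are all invented for illustration; the real adapter and sweep_results.json schema may differ.

```python
from __future__ import annotations


def find_bottleneck(stage_timings: dict[str, float] | None) -> tuple[str, float] | None:
    """Return (stage, milliseconds) for the slowest stage, ignoring total_ms."""
    if not stage_timings:
        return None
    stages = {k: v for k, v in stage_timings.items() if k != "total_ms"}
    if not stages:
        return None
    name = max(stages, key=stages.get)
    return name, stages[name]


def best_point(rows: list[dict]) -> dict | None:
    """Return the grid point with the highest throughput (assumed key: 'qps')."""
    return max(rows, key=lambda row: row["qps"], default=None)


# Illustrative values only, not measured results.
sample_timings = {
    "route_ms": 12.0,
    "normalize_ms": 3.5,
    "generate_ms": 840.2,
    "verify_ms": 310.0,
    "total_ms": 1165.7,
}
sample_rows = [
    {"concurrency": 1, "top_k": 5, "qps": 0.9},
    {"concurrency": 10, "top_k": 5, "qps": 4.2},
    {"concurrency": 20, "top_k": 5, "qps": 3.8},
]

print(find_bottleneck(sample_timings))  # ('generate_ms', 840.2)
print(best_point(sample_rows))          # {'concurrency': 10, 'top_k': 5, 'qps': 4.2}
```

Excluding total_ms matters: it is always the largest value, so including it would hide the stage that actually dominates.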
Add 19 tests validating the performance benchmarking framework:

TestBenchmarkConfig (4 tests):
- Quick profile config loads correctly
- Single run config has aiperf enabled with concurrency=5
- Sweep config loads list-valued axes (concurrency, top_k)
- BenchConfig has sensible defaults

TestQueryLoading (3 tests):
- Loads 20+ queries from benchmark/queries.jsonl
- Empty JSONL returns empty list
- Blank lines are skipped

TestBenchmarkMetrics (3 tests):
- Percentile computation correctness
- Throughput = requests / elapsed_seconds
- Error rate = errors / total * 100

TestHectorAdapter (2 tests):
- Adapter module is importable
- Connection errors handled gracefully

TestBenchmarkCli (4 tests):
- rag_benchmark module importable with all expected functions
- sweep_comparison module importable
- generate_report produces valid structure
- Report includes load test data when provided

TestPerformanceThresholds (3 tests):
- Config loading < 100ms (10 iterations)
- Query loading < 50ms (10 iterations)
- Report generation < 200ms

Total: 900 tests passing across all test files
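The metric formulas checked by TestBenchmarkMetrics can be sketched as below. The throughput and error-rate formulas come from the commit message; the percentile method is not stated there, so the nearest-rank definition used here is an assumption, as are the function names.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (assumed method): smallest value with at least
    pct percent of the samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def throughput_qps(requests, elapsed_seconds):
    """Throughput = requests / elapsed_seconds (0.0 if no time elapsed)."""
    return requests / elapsed_seconds if elapsed_seconds > 0 else 0.0


def error_rate_pct(errors, total):
    """Error rate = errors / total * 100 (0.0 if there were no requests)."""
    return errors / total * 100 if total else 0.0


latencies_ms = list(range(1, 101))  # synthetic latencies: 1..100 ms

print(percentile(latencies_ms, 95))  # 95
print(percentile(latencies_ms, 99))  # 99
print(throughput_qps(60, 12.0))      # 5.0
print(error_rate_pct(1, 4))          # 25.0
```

Other percentile definitions (for example, linear interpolation) give slightly different values on small samples, so a test asserting exact percentiles should match whichever method the benchmark actually uses.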
Caution: Review failed. The pull request is closed.

Files selected for processing: 14
Walkthrough

Adds per-stage timing instrumentation to

Changes

- Stage Timing Instrumentation and API Exposure
- Benchmark Framework
Estimated code review effort: 4 (Complex), ~60 minutes