Feat/phase 11 ragas evaluation by DanielDeshmukh · Pull Request #12 · DanielDeshmukh/Hector

DanielDeshmukh · 2026-06-27T18:18:56Z

No description provided.

Create evaluation/ directory with RAGAS-inspired retrieval quality evaluation framework for HECTOR.

evaluation/train.json:
- 25 legal QA test pairs covering all HECTOR route types
- Categories: legal_research, cross_reference, constitutional_law, family_law, evidence_law
- Each entry has: query, ground_truth, expected_sections, expected_acts, category
- Covers IPC-to-BNS cross-references, fundamental rights, criminal law, contract law, and constitutional law

evaluation/evaluate_rag.py:
- Calls HECTOR POST /v1/search API for each query
- Computes 4 RAGAS-inspired metrics:
  * Faithfulness: overlap between answer and retrieved contexts
  * Answer Relevance: keyword overlap with ground truth
  * Context Precision: fraction of contexts containing ground truth
  * Context Recall: ground truth sentences found in contexts
- Computes citation quality metrics:
  * Section Recall: expected sections found in citations
  * Act Recall: expected acts found in citations
- Performance metrics: avg latency, p95 latency
- Per-category breakdown in results
- JSON output: summary + per-query details
- CLI with --dataset-paths, --host, --port, --top-k flags

requirements.txt:
- Added ragas~=0.2.0 and datasets>=2.0.0
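The overlap-based metrics listed for evaluation/evaluate_rag.py could be sketched roughly as below. This is an illustration only, not the code from the PR: the function names, the regex tokenizer, and the 0.5 match threshold are all assumptions, and the real script may define "overlap" and "containing ground truth" differently.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens as a set (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Fraction of the tokens of `a` that also occur in `b`; 0.0 if `a` is empty."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta) if ta else 0.0


def faithfulness(answer, contexts):
    """Share of answer tokens supported by the retrieved contexts."""
    return overlap(answer, " ".join(contexts))


def answer_relevance(answer, ground_truth):
    """Share of ground-truth keywords that appear in the answer."""
    return overlap(ground_truth, answer)


def context_precision(contexts, ground_truth, threshold=0.5):
    """Fraction of contexts that cover the ground truth (threshold is assumed)."""
    if not contexts:
        return 0.0
    hits = sum(1 for c in contexts if overlap(ground_truth, c) >= threshold)
    return hits / len(contexts)


def context_recall(contexts, ground_truth, threshold=0.5):
    """Fraction of ground-truth sentences found in the combined contexts."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", ground_truth.strip()) if s]
    if not sentences or not contexts:
        return 0.0
    joined = " ".join(contexts)
    found = sum(1 for s in sentences if overlap(s, joined) >= threshold)
    return found / len(sentences)


def section_recall(expected_sections, cited_sections):
    """Expected sections found in citations; None (N/A) if nothing is expected."""
    if not expected_sections:
        return None
    cited = {s.strip().lower() for s in cited_sections}
    found = sum(1 for s in expected_sections if s.strip().lower() in cited)
    return found / len(expected_sections)
```

Every score is a ratio of counts, so each stays between 0 and 1, and empty contexts or an empty answer give zero. Act recall would follow the same pattern as `section_recall`.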

evaluation/run_eval.sh:
- One-command evaluation runner script
- Checks HECTOR API connectivity before running
- Supports CLI args: --host, --port, --top-k, --output-dir
- Reads HECTOR_API_HOST, HECTOR_API_PORT, HECTOR_API_KEY env vars
- Auto-detects API key from environment

evaluation/analyze_results.py:
- Parses RAGAS evaluation summary JSON files
- Prints formatted summary tables with quality indicators
- --compare mode: side-by-side baseline vs treatment comparison with delta and percentage change per metric
- --trend-dir mode: prints trend across multiple evaluation runs
- Per-category breakdown with faithfulness/relevance/recall scores
- Color-coded improvement/regression indicators for latency metrics
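The delta and percentage change that --compare mode reports per metric might be computed along these lines. This is a minimal sketch, assuming each summary is a flat dict of metric name to number; `compare_metrics`, `is_improvement`, and the metric keys in the usage example are hypothetical names, not taken from analyze_results.py.

```python
def compare_metrics(baseline, treatment):
    """Return delta and percentage change for metrics present in both summaries."""
    rows = {}
    for name in sorted(baseline.keys() & treatment.keys()):
        base, new = baseline[name], treatment[name]
        delta = new - base
        # Percentage change is undefined when the baseline is zero.
        pct = (delta / base * 100.0) if base else None
        rows[name] = {
            "baseline": base,
            "treatment": new,
            "delta": delta,
            "pct_change": pct,
        }
    return rows


def is_improvement(name, delta):
    """Lower is better for latency metrics; higher is better for the rest."""
    lower_is_better = "latency" in name
    return delta < 0 if lower_is_better else delta > 0


if __name__ == "__main__":
    baseline = {"faithfulness": 0.5, "avg_latency_ms": 200.0}
    treatment = {"faithfulness": 0.75, "avg_latency_ms": 150.0}
    for name, row in compare_metrics(baseline, treatment).items():
        marker = "improved" if is_improvement(name, row["delta"]) else "regressed"
        print(name, row["delta"], row["pct_change"], marker)
```

Treating latency separately matters because a negative delta is a gain there, while for the quality scores it is a regression.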

Add 24 tests validating the RAGAS evaluation framework itself (not requiring a running HECTOR instance):

TestDatasetLoading (6 tests):
- Loads valid train.json with 25+ entries
- Validates required fields (query, ground_truth)
- Validates optional fields (expected_sections, acts, category)
- Verifies category coverage (legal_research, cross_reference, etc.)
- Cross-reference entries have expected sections/acts
- Error handling: missing files, invalid JSON

TestRagasMetrics (5 tests):
- Perfect match yields positive relevance/faithfulness
- Empty contexts yield zero context metrics
- Empty answer yields zero answer metrics
- Irrelevant contexts yield low precision
- All metrics bounded between 0 and 1

TestCitationMetrics (5 tests):
- Perfect section/act recall when all found
- Partial recall when some missing
- N/A recall when no expected sections
- Section extraction from item metadata
- Citation count accuracy

TestResponseExtraction (5 tests):
- Extract document texts from response items
- Handle empty/missing items lists
- Skip items with empty document field
- Extract generated_response
- Handle missing generated_response

TestCliEntryPoint (3 tests):
- Module is importable
- CLI accepts expected argument flags
- Argument parsing works correctly

Total: 881 tests passing across all test files
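The behaviour that TestResponseExtraction checks can be illustrated with a small sketch. The response field names (items, document, generated_response) come from the test descriptions above; the helper names `extract_contexts` and `extract_answer`, the overall response shape, and the choice to return an empty string for a missing answer are assumptions for illustration, not the PR's actual code.

```python
def extract_contexts(response):
    """Collect non-empty document texts from a search response dict."""
    items = response.get("items") or []  # tolerate a missing or null list
    return [
        item["document"]
        for item in items
        if isinstance(item, dict) and item.get("document")
    ]


def extract_answer(response):
    """Return the generated answer, or an empty string when it is absent."""
    return response.get("generated_response") or ""


def test_skips_items_with_empty_document():
    response = {"items": [{"document": "Section 303 text"}, {"document": ""}, {}]}
    assert extract_contexts(response) == ["Section 303 text"]


def test_handles_missing_items_and_answer():
    assert extract_contexts({}) == []
    assert extract_contexts({"items": None}) == []
    assert extract_answer({}) == ""


if __name__ == "__main__":
    test_skips_items_with_empty_document()
    test_handles_missing_items_and_answer()
    print("ok")
```

Tests of this shape need only hand-built dicts, which is consistent with the suite not requiring a running HECTOR instance.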

coderabbitai · 2026-06-27T18:19:06Z

Warning

Review limit reached

@DanielDeshmukh, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 27 minutes and 22 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9d0c5f80-78d0-455c-b44a-5a175998af7a

📥 Commits

Reviewing files that changed from the base of the PR and between f31aa43 and e102b04.

📒 Files selected for processing (6)

evaluation/analyze_results.py
evaluation/evaluate_rag.py
evaluation/run_eval.sh
evaluation/train.json
requirements.txt
tests/test_rag_quality.py


DanielDeshmukh added 3 commits June 27, 2026 23:34

DanielDeshmukh merged commit b10fb0d into main Jun 27, 2026
5 of 6 checks passed

DanielDeshmukh merged 3 commits into main from feat/phase-11-ragas-evaluation
