Feat/phase 10 corpus expansion by DanielDeshmukh · Pull Request #11 · DanielDeshmukh/Hector

DanielDeshmukh · 2026-06-27T17:46:19Z

Summary by CodeRabbit

New Features
- Added support for ingesting standalone .txt files alongside PDFs.
- Improved handling for scanned PDFs with an additional OCR fallback, helping recover more text when initial extraction is incomplete.
Bug Fixes
- Better handling for empty, short, unreadable, or already-processed files during ingestion.
- Reduced duplicate processing by skipping content already stored.

Add process_txt_book() method to EnhancedHectorIngestor that ingests from pre-OCR'd text files. This enables ingestion of scanned PDFs that were OCR'd externally (via scripts/ocr_scanned_pdfs.py) and saved as .txt alongside the PDF. The run() method now picks up .txt files from data/Books/ when no corresponding .pdf exists, treating each .txt as a single continuous page. This integrates the 4 existing OCR-extracted .txt files (Family_Courts_Act, Gram_Nyayalayas_Act, Juvenile_Justice_Act, Prevention_of_Corruption_Act) into the ingestion pipeline.

Features:
- Resume support (skips already-ingested .txt books)
- Content hash deduplication
- Legal structure metadata enrichment
- Same chunking and quality gates as PDF processing
- Uses .pdf filename for metadata (not .txt)
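
The flow described above (resume check, empty check, content hash deduplication, .pdf source naming) can be sketched roughly as follows. This is a minimal illustration, not the code in utils/enhanced_ingestor.py: the function name `ingest_txt`, the in-memory sets that stand in for ChromaDB lookups, and the status strings other than 'empty' and 'error' are assumptions. Chunking and metadata enrichment are omitted.

```python
import hashlib
from pathlib import Path

MIN_CHARS = 20  # assumed cutoff, based on the "<20 char" test description


def ingest_txt(path: Path, seen_hashes: set[str], done_books: set[str]) -> dict:
    """Hypothetical stand-in for process_txt_book(); returns a status dict."""
    # Metadata uses the .pdf filename even though the input is a .txt file.
    source = path.with_suffix(".pdf").name

    # Resume support: skip books that were already ingested.
    if source in done_books:
        return {"status": "skipped", "source": source}

    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except OSError as exc:  # covers a missing or unreadable file
        return {"status": "error", "source": source, "error": str(exc)}

    if len(text.strip()) < MIN_CHARS:
        return {"status": "empty", "source": source}

    # The whole file is treated as one continuous page and hashed once.
    page_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if page_hash in seen_hashes:
        return {"status": "duplicate", "source": source}

    seen_hashes.add(page_hash)
    done_books.add(source)
    return {"status": "ok", "source": source, "page": 1, "page_hash": page_hash}
```

In the real pipeline the two sets would presumably be replaced by queries against the vector store and the resume state.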

Add _nvidia_ocr_fallback() method to EnhancedHectorIngestor that uses the NVIDIA Nemotron OCR API as a final fallback for scanned pages that failed both pypdf text extraction and Tesseract OCR.

The method:
- Renders the PDF page to PNG via pdf2image
- Sends it to NVIDIA Nemotron OCR endpoint (base64-encoded)
- Parses the markdown/text response
- Requires NVIDIA_API_KEY env var
- Falls back gracefully if key is missing or API call fails
- 60-second timeout per page

This completes the 3-tier OCR fallback chain:
1. pypdf text extraction (fastest)
2. Tesseract OCR (local, moderate quality)
3. NVIDIA Nemotron OCR (API, highest quality for scanned docs)
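
The tiered fallback above can be illustrated with a small generic helper. This is a sketch under stated assumptions: `extract_with_fallback`, the `MIN_TEXT_CHARS` threshold, and passing each tier as a zero-argument callable are all hypothetical, and the real extractors (pypdf, Tesseract, the NVIDIA endpoint) are deliberately not called here.

```python
from typing import Callable, Sequence

MIN_TEXT_CHARS = 20  # assumed "good enough" threshold; the PR does not state one


def extract_with_fallback(
    extractors: Sequence[tuple[str, Callable[[], str]]],
    min_chars: int = MIN_TEXT_CHARS,
) -> tuple[str, str]:
    """Try each (name, extractor) in order; return the first adequate result.

    If no tier produces at least min_chars of text, the longest partial
    result is returned so that nothing already recovered is thrown away.
    """
    best_name, best_text = "", ""
    for name, extract in extractors:
        try:
            text = extract() or ""
        except Exception:
            # A failing tier must not abort the chain; move on to the next one.
            text = ""
        if len(text.strip()) >= min_chars:
            return name, text
        if len(text.strip()) > len(best_text.strip()):
            best_name, best_text = name, text
    return best_name, best_text
```

Ordering the tiers from cheapest to most expensive means the remote API is only reached for pages where both local methods came back empty or too short.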

Add 10 new tests to test_enhanced_ingestor.py covering the new corpus expansion features.

TestProcessTxtBook (7 tests):
- Basic .txt ingestion produces chunks via ChromaDB
- Empty .txt files reported as 'empty' status
- Short (<20 char) .txt files reported as 'empty'
- .txt metadata uses .pdf extension for source field
- Already-ingested .txt books are skipped (resume support)
- ChromaDB page hash dedup skips duplicate content
- Missing file returns error status

TestNvidiaOcrFallback (3 tests):
- Missing NVIDIA_API_KEY returns empty string
- Internal errors caught gracefully (missing pdf2image)
- Method exists and is callable

Total: 857 tests passing across all test files
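
As an illustration of the two TestNvidiaOcrFallback behaviours listed above (empty string on a missing key, empty string on internal errors), here is a self-contained sketch. `ocr_page_or_empty` is a hypothetical stand-in for _nvidia_ocr_fallback(), and the injected `send` callable replaces the real HTTP request, whose endpoint and payload shape are not shown in this PR.

```python
import base64
import os
from typing import Callable
from unittest import mock


def ocr_page_or_empty(png_bytes: bytes, send: Callable[[str, str], str]) -> str:
    """Hypothetical stand-in for _nvidia_ocr_fallback(); never raises."""
    api_key = os.environ.get("NVIDIA_API_KEY")
    if not api_key:
        return ""  # missing key: fall back gracefully
    try:
        payload = base64.b64encode(png_bytes).decode("ascii")
        return send(api_key, payload).strip()
    except Exception:
        return ""  # any internal failure is reported as "no text"


def test_missing_key_returns_empty():
    with mock.patch.dict(os.environ, {}, clear=True):
        assert ocr_page_or_empty(b"png", lambda key, img: "text") == ""


def test_internal_errors_are_swallowed():
    def boom(key, img):
        raise ImportError("pdf2image not installed")

    with mock.patch.dict(os.environ, {"NVIDIA_API_KEY": "dummy"}):
        assert ocr_page_or_empty(b"png", boom) == ""
```

Injecting the sender keeps the tests offline; the real method may be structured differently.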

Add DeprecationWarning and logging warning to utils/ingestor.py directing users to utils.enhanced_ingestor instead. The legacy ingestor lacks boundary-aware chunking, NVIDIA OCR fallback, resume support, and validation gates present in the enhanced ingestor.
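
A deprecation notice of this kind is typically a few lines using the standard `warnings` and `logging` modules. The sketch below is an assumption about the shape of the change, not a copy of utils/ingestor.py: the message text and the `warn_deprecated` helper are hypothetical, and in the real module the calls would presumably sit at module top level so that they fire at import time.

```python
import logging
import warnings

logger = logging.getLogger(__name__)

# Hypothetical wording; the actual message in utils/ingestor.py may differ.
_MESSAGE = "utils.ingestor is deprecated; use utils.enhanced_ingestor instead."


def warn_deprecated() -> None:
    """Emit both a DeprecationWarning and a log record."""
    # stacklevel=2 attributes the warning to the caller of warn_deprecated().
    warnings.warn(_MESSAGE, DeprecationWarning, stacklevel=2)
    # DeprecationWarning is hidden by default in many contexts, so the log
    # record makes the notice visible in ordinary application logs as well.
    logger.warning(_MESSAGE)
```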

coderabbitai · 2026-06-27T17:46:30Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 969fa229-6598-47f1-947a-e17086b67c5d

📥 Commits

Reviewing files that changed from the base of the PR and between 4a4617d and c052189.

📒 Files selected for processing (3)

tests/test_enhanced_ingestor.py
utils/enhanced_ingestor.py
utils/ingestor.py

📝 Walkthrough

Adds _nvidia_ocr_fallback to EnhancedHectorIngestor as a final OCR step using the NVIDIA Nemotron REST API, introduces process_txt_book for ingesting standalone .txt files, updates run() to discover and dispatch both file types, and adds a module-level deprecation warning to utils/ingestor.py.

Enhanced Ingestor — OCR Fallback, TXT Ingestion, and Deprecation

| Layer / File(s) | Summary |
| --- | --- |
| NVIDIA OCR fallback `utils/enhanced_ingestor.py`, `tests/test_enhanced_ingestor.py` | `_nvidia_ocr_fallback` renders a PDF page to PNG, posts it to the NVIDIA Nemotron OCR endpoint, extracts text from the response, and returns `""` on any failure. Integrated into per-page extraction when text is empty or below threshold. Tests assert `""` on missing key and on internal exceptions. |
| TXT ingestion and run() dispatch `utils/enhanced_ingestor.py`, `tests/test_enhanced_ingestor.py` | `process_txt_book` reads a `.txt` file as a single page, deduplicates by `page_hash`, builds chunks with `.pdf` source naming, writes to Chroma, and marks completion. `run()` now discovers standalone `.txt` files and dispatches them to `process_txt_book`. `TestProcessTxtBook` covers success, empty input, skip, hash dedup, and read errors. |
| ingestor.py deprecation `utils/ingestor.py` | Emits a `DeprecationWarning` and logger warning at import time directing users to `utils.enhanced_ingestor.HectorIngestor`. |
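
The discovery step in `run()`, where a `.txt` file is only picked up when no PDF with the same stem exists, might look roughly like this. It is a hedged sketch: `discover_books` and the `("pdf" | "txt", path)` job tuples are hypothetical names used for illustration, and the real `run()` may order or filter files differently.

```python
from pathlib import Path


def discover_books(books_dir: Path) -> list[tuple[str, Path]]:
    """Return (kind, path) jobs: every PDF, plus .txt files without a PDF twin."""
    pdfs = sorted(books_dir.glob("*.pdf"))
    pdf_stems = {p.stem for p in pdfs}

    jobs = [("pdf", p) for p in pdfs]
    # A .txt is "standalone" only if no PDF shares its stem; otherwise the
    # PDF path is used and the text file is ignored to avoid double ingestion.
    jobs += [
        ("txt", t)
        for t in sorted(books_dir.glob("*.txt"))
        if t.stem not in pdf_stems
    ]
    return jobs
```

A caller would then dispatch "pdf" jobs to the existing PDF path and "txt" jobs to `process_txt_book`.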

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 A .txt hops in, no PDF in sight,
NVIDIA's OCR shines its light,
Old ingestor warned, "I'm fading away,"
Chunks find their home in Chroma to stay,
The rabbit ingests every page just right! ✨


DanielDeshmukh added 4 commits June 27, 2026 22:35

DanielDeshmukh merged commit f31aa43 into main Jun 27, 2026
2 of 4 checks passed

coderabbitai Bot mentioned this pull request Jun 28, 2026

Feat/phase 14 nemo retriever #15

Merged


DanielDeshmukh merged 4 commits into main from feat/phase-10-corpus-expansion
