Feat/phase 10 corpus expansion #11

Merged
DanielDeshmukh merged 4 commits into main from feat/phase-10-corpus-expansion
Jun 27, 2026

DanielDeshmukh (Owner) commented Jun 27, 2026

Summary by CodeRabbit

  • New Features

    • Added support for ingesting standalone .txt files alongside PDFs.
    • Improved handling for scanned PDFs with an additional OCR fallback, helping recover more text when initial extraction is incomplete.
  • Bug Fixes

    • Better handling for empty, short, unreadable, or already-processed files during ingestion.
    • Reduced duplicate processing by skipping content already stored.

Add process_txt_book() method to EnhancedHectorIngestor that ingests
from pre-OCR'd text files. This enables ingestion of scanned PDFs
that were OCR'd externally (via scripts/ocr_scanned_pdfs.py) and
saved as .txt alongside the PDF.

The run() method now picks up .txt files from data/Books/ when no
corresponding .pdf exists, treating each .txt as a single continuous
page. This integrates the 4 existing OCR-extracted .txt files
(Family_Courts_Act, Gram_Nyayalayas_Act, Juvenile_Justice_Act,
Prevention_of_Corruption_Act) into the ingestion pipeline.
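
The discovery rule described above (a .txt is picked up only when no .pdf with the same stem exists) could look roughly like the following. This is a minimal sketch, not the code from this PR: the function name `discover_books` and its return shape are hypothetical.

```python
from pathlib import Path


def discover_books(books_dir):
    """Return (pdfs, standalone_txts) found directly inside books_dir.

    Hypothetical helper: a .txt counts as standalone only when no .pdf
    with the same stem sits next to it.
    """
    books_dir = Path(books_dir)
    pdfs = sorted(books_dir.glob("*.pdf"))
    pdf_stems = {pdf.stem for pdf in pdfs}
    standalone_txts = [
        txt for txt in sorted(books_dir.glob("*.txt")) if txt.stem not in pdf_stems
    ]
    return pdfs, standalone_txts
```

With `Alpha.pdf`, `Alpha.txt`, and `Beta.txt` in the directory, only `Beta.txt` would be routed to the .txt path; `Alpha.txt` is ignored because its PDF is processed directly.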

Features:
- Resume support (skips already-ingested .txt books)
- Content hash deduplication
- Legal structure metadata enrichment
- Same chunking and quality gates as PDF processing
- Uses .pdf filename for metadata (not .txt)

Add _nvidia_ocr_fallback() method to EnhancedHectorIngestor that uses
the NVIDIA Nemotron OCR API as a final fallback for scanned pages
that failed both pypdf text extraction and Tesseract OCR.

The method:
- Renders the PDF page to PNG via pdf2image
- Sends it to NVIDIA Nemotron OCR endpoint (base64-encoded)
- Parses the markdown/text response
- Requires NVIDIA_API_KEY env var
- Falls back gracefully if key is missing or API call fails
- 60-second timeout per page
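
A rough sketch of the behaviour listed above, under several loudly stated assumptions: the endpoint URL is passed in by the caller, and the JSON field name `image_base64`, the bearer-token header, and the "return the raw response body" handling are placeholders rather than the documented NVIDIA request or response format. Only the control flow (key check, render, encode, POST with a 60-second timeout, return "" on any failure) follows the description.

```python
import base64
import io
import json
import os
import urllib.request


def nvidia_ocr_fallback(pdf_path, page_number, endpoint_url, timeout=60):
    """Return OCR text for one 1-indexed page, or "" on any failure."""
    api_key = os.environ.get("NVIDIA_API_KEY")
    if not api_key:
        return ""  # missing key: degrade gracefully instead of raising

    try:
        # Lazy import so a missing pdf2image is caught like any other error.
        from pdf2image import convert_from_path

        images = convert_from_path(
            pdf_path, first_page=page_number, last_page=page_number
        )
        buffer = io.BytesIO()
        images[0].save(buffer, format="PNG")
        image_b64 = base64.b64encode(buffer.getvalue()).decode("ascii")

        # ASSUMPTION: placeholder payload and headers, not the real API schema.
        body = json.dumps({"image_base64": image_b64}).encode("utf-8")
        request = urllib.request.Request(
            endpoint_url,
            data=body,
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # ASSUMPTION: the body is returned as-is; real parsing is not shown.
            return response.read().decode("utf-8", errors="replace").strip()
    except Exception:
        return ""  # rendering, network, and parsing errors all yield ""
```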

This completes the 3-tier OCR fallback chain:
1. pypdf text extraction (fastest)
2. Tesseract OCR (local, moderate quality)
3. NVIDIA Nemotron OCR (API, highest quality for scanned docs)
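
The three tiers above can be read as "try each extractor in order and stop at the first one that yields enough text". The sketch below illustrates that pattern only; `extract_with_fallback`, the `min_chars` value, and the injected extractor callables are hypothetical and stand in for the pypdf, Tesseract, and NVIDIA steps.

```python
def extract_with_fallback(page, extractors, min_chars=20):
    """Try each (name, extractor) pair in order.

    Returns (name, text) for the first extractor whose output has at least
    min_chars non-whitespace-trimmed characters, or ("none", "") if all fail.
    The threshold value is an assumption for illustration.
    """
    for name, extract in extractors:
        try:
            text = extract(page) or ""
        except Exception:
            text = ""  # a failing tier must not abort the chain
        if len(text.strip()) >= min_chars:
            return name, text
    return "none", ""


# Usage with stand-in extractors (the real tiers would call pypdf,
# Tesseract, and the NVIDIA endpoint respectively).
tiers = [
    ("pypdf", lambda page: ""),
    ("tesseract", lambda page: "too short"),
    ("nvidia", lambda page: "Section 1. Short title, extent and commencement."),
]
winner, text = extract_with_fallback("page-1", tiers)
```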

Add 10 new tests to tests/test_enhanced_ingestor.py covering the new
corpus expansion features:

TestProcessTxtBook (7 tests):
- Basic .txt ingestion produces chunks via ChromaDB
- Empty .txt files reported as 'empty' status
- Short (<20 char) .txt files reported as 'empty'
- .txt metadata uses .pdf extension for source field
- Already-ingested .txt books are skipped (resume support)
- ChromaDB page hash dedup skips duplicate content
- Missing file returns error status
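
The statuses exercised by these tests suggest a control flow along the following lines. This is a simplified sketch, not the PR's implementation: in-memory sets stand in for ChromaDB and the resume state, chunking is omitted, and the status strings other than 'empty' (as well as the function signature) are assumptions.

```python
import hashlib
from pathlib import Path

MIN_CHARS = 20  # threshold taken from the "short (<20 char)" test description


def process_txt_book(txt_path, seen_hashes, done_books):
    """Ingest one .txt file as a single page and return a status dict."""
    path = Path(txt_path)
    source = path.with_suffix(".pdf").name  # metadata uses the .pdf name

    if source in done_books:
        return {"status": "skipped", "source": source}  # resume support

    try:
        text = path.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as exc:
        return {"status": "error", "source": source, "error": str(exc)}

    if len(text.strip()) < MIN_CHARS:
        return {"status": "empty", "source": source}

    page_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if page_hash in seen_hashes:
        return {"status": "duplicate", "source": source}  # content already stored

    seen_hashes.add(page_hash)
    done_books.add(source)
    return {"status": "ok", "source": source, "page_hash": page_hash}
```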

TestNvidiaOcrFallback (3 tests):
- Missing NVIDIA_API_KEY returns empty string
- Internal errors caught gracefully (missing pdf2image)
- Method exists and is callable

Total: 857 tests passing across all test files

Add DeprecationWarning and logging warning to utils/ingestor.py
directing users to utils.enhanced_ingestor instead. The legacy
ingestor lacks boundary-aware chunking, NVIDIA OCR fallback,
resume support, and validation gates present in the enhanced
ingestor.
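
For illustration, a deprecation notice of this kind is commonly written as below. The message text and the wrapper function `warn_legacy_ingestor` are hypothetical; in a module such as utils/ingestor.py the two calls would typically sit at module level so they fire on import.

```python
import logging
import warnings

logger = logging.getLogger(__name__)

# Hypothetical wording; the actual message in the PR is not shown here.
MESSAGE = "utils.ingestor is deprecated; use utils.enhanced_ingestor instead."


def warn_legacy_ingestor():
    """Emit both a DeprecationWarning and a log warning."""
    # stacklevel=2 attributes the warning to the caller, not this helper.
    warnings.warn(MESSAGE, DeprecationWarning, stacklevel=2)
    logger.warning(MESSAGE)
```

Emitting both is a deliberate redundancy: DeprecationWarning is hidden by default in many run contexts, while the log line still shows up in ingestion logs.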
coderabbitai Bot commented Jun 27, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 969fa229-6598-47f1-947a-e17086b67c5d

📥 Commits

Reviewing files that changed from the base of the PR and between 4a4617d and c052189.

📒 Files selected for processing (3)
  • tests/test_enhanced_ingestor.py
  • utils/enhanced_ingestor.py
  • utils/ingestor.py

Walkthrough

Adds _nvidia_ocr_fallback to EnhancedHectorIngestor as a final OCR step using the NVIDIA Nemotron REST API, introduces process_txt_book for ingesting standalone .txt files, updates run() to discover and dispatch both file types, and adds a module-level deprecation warning to utils/ingestor.py.

Enhanced Ingestor — OCR Fallback, TXT Ingestion, and Deprecation

| Layer | File(s) | Summary |
| --- | --- | --- |
| NVIDIA OCR fallback | utils/enhanced_ingestor.py, tests/test_enhanced_ingestor.py | _nvidia_ocr_fallback renders a PDF page to PNG, posts it to the NVIDIA Nemotron OCR endpoint, extracts text from the response, and returns "" on any failure. Integrated into per-page extraction when text is empty or below threshold. Tests assert "" on missing key and on internal exceptions. |
| TXT ingestion and run() dispatch | utils/enhanced_ingestor.py, tests/test_enhanced_ingestor.py | process_txt_book reads a .txt file as a single page, deduplicates by page_hash, builds chunks with .pdf source naming, writes to Chroma, and marks completion. run() now discovers standalone .txt files and dispatches them to process_txt_book. TestProcessTxtBook covers success, empty input, skip, hash dedup, and read errors. |
| ingestor.py deprecation | utils/ingestor.py | Emits a DeprecationWarning and logger warning at import time directing users to utils.enhanced_ingestor.HectorIngestor. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 A .txt hops in, no PDF in sight,
NVIDIA's OCR shines its light,
Old ingestor warned, "I'm fading away,"
Chunks find their home in Chroma to stay,
The rabbit ingests every page just right! ✨

@DanielDeshmukh DanielDeshmukh merged commit f31aa43 into main Jun 27, 2026
2 of 4 checks passed