Feat/phase 10 corpus expansion #11
Add process_txt_book() method to EnhancedHectorIngestor that ingests from pre-OCR'd text files. This enables ingestion of scanned PDFs that were OCR'd externally (via scripts/ocr_scanned_pdfs.py) and saved as .txt alongside the PDF.

The run() method now picks up .txt files from data/Books/ when no corresponding .pdf exists, treating each .txt as a single continuous page. This integrates the 4 existing OCR-extracted .txt files (Family_Courts_Act, Gram_Nyayalayas_Act, Juvenile_Justice_Act, Prevention_of_Corruption_Act) into the ingestion pipeline.

Features:
- Resume support (skips already-ingested .txt books)
- Content hash deduplication
- Legal structure metadata enrichment
- Same chunking and quality gates as PDF processing
- Uses .pdf filename for metadata (not .txt)
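The discovery and single-page handling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the helper names `find_txt_only_books` and `txt_to_page_record`, the `seen_hashes` set, the SHA-256 hash, and the "ok" and "duplicate" status strings are all hypothetical. Only the "empty" and error outcomes, the 20-character threshold, and the use of the .pdf name for metadata come from the description.

```python
import hashlib
from pathlib import Path


def find_txt_only_books(books_dir: Path) -> list[Path]:
    """Return .txt files that have no sibling .pdf with the same stem."""
    pdf_stems = {p.stem for p in books_dir.glob("*.pdf")}
    return sorted(p for p in books_dir.glob("*.txt") if p.stem not in pdf_stems)


def txt_to_page_record(txt_path: Path, seen_hashes: set[str], min_chars: int = 20) -> dict:
    """Treat a whole .txt file as one page and report it under the .pdf name."""
    # Metadata uses the .pdf filename, not the .txt one.
    source = txt_path.with_suffix(".pdf").name
    if not txt_path.is_file():
        return {"status": "error", "source": source}

    text = txt_path.read_text(encoding="utf-8", errors="replace").strip()
    if len(text) < min_chars:
        # Empty and very short files are both reported as "empty".
        return {"status": "empty", "source": source}

    # Hypothetical dedup: hash the page text and skip content seen before.
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if content_hash in seen_hashes:
        return {"status": "duplicate", "source": source}
    seen_hashes.add(content_hash)

    return {
        "status": "ok",
        "source": source,
        "page": 1,  # assumption: the single continuous page is numbered 1
        "content_hash": content_hash,
        "text": text,
    }
```

In the real ingestor the dedup check goes through ChromaDB page hashes rather than an in-memory set; the set here only keeps the sketch self-contained.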
Add _nvidia_ocr_fallback() method to EnhancedHectorIngestor that uses the NVIDIA Nemotron OCR API as a final fallback for scanned pages that failed both pypdf text extraction and Tesseract OCR.

The method:
- Renders the PDF page to PNG via pdf2image
- Sends it to NVIDIA Nemotron OCR endpoint (base64-encoded)
- Parses the markdown/text response
- Requires NVIDIA_API_KEY env var
- Falls back gracefully if key is missing or API call fails
- 60-second timeout per page

This completes the 3-tier OCR fallback chain:
1. pypdf text extraction (fastest)
2. Tesseract OCR (local, moderate quality)
3. NVIDIA Nemotron OCR (API, highest quality for scanned docs)
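The control flow of the fallback chain can be sketched as below, assuming each tier is a plain callable that takes a PDF path and page index and returns text. The names `extract_with_fallback`, `require_env`, and the `min_chars` acceptance threshold are hypothetical and not taken from this PR. The actual pypdf, Tesseract, and NVIDIA calls are left out, since the endpoint and payload details are not given in the description.

```python
import os
from typing import Callable, Sequence

# A tier takes (pdf_path, page_index) and returns extracted text, possibly empty.
Extractor = Callable[[str, int], str]


def require_env(var_name: str, extractor: Extractor) -> Extractor:
    """Wrap a tier so it returns an empty string when an env var is missing."""
    def guarded(pdf_path: str, page_index: int) -> str:
        if not os.environ.get(var_name):
            return ""
        return extractor(pdf_path, page_index)
    return guarded


def extract_with_fallback(
    pdf_path: str,
    page_index: int,
    tiers: Sequence[tuple[str, Extractor]],
    min_chars: int = 20,  # assumption: shorter results count as a failed tier
) -> tuple[str, str]:
    """Try each tier in order and return (tier_name, text) for the first usable result."""
    for name, extractor in tiers:
        try:
            text = (extractor(pdf_path, page_index) or "").strip()
        except Exception:
            # A failing tier (missing dependency, API error) falls through to the next.
            text = ""
        if len(text) >= min_chars:
            return name, text
    return "none", ""
```

A caller would pass the tiers cheapest first, for example `[("pypdf", ...), ("tesseract", ...), ("nvidia", require_env("NVIDIA_API_KEY", ...))]`, so the hosted API is only reached for pages the local tiers could not read.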
Add 10 new tests to test_enhanced_ingestor.py covering the new corpus expansion features:

TestProcessTxtBook (7 tests):
- Basic .txt ingestion produces chunks via ChromaDB
- Empty .txt files reported as 'empty' status
- Short (<20 char) .txt files reported as 'empty'
- .txt metadata uses .pdf extension for source field
- Already-ingested .txt books are skipped (resume support)
- ChromaDB page hash dedup skips duplicate content
- Missing file returns error status

TestNvidiaOcrFallback (3 tests):
- Missing NVIDIA_API_KEY returns empty string
- Internal errors caught gracefully (missing pdf2image)
- Method exists and is callable

Total: 857 tests passing across all test files
Add DeprecationWarning and logging warning to utils/ingestor.py directing users to utils.enhanced_ingestor instead. The legacy ingestor lacks boundary-aware chunking, NVIDIA OCR fallback, resume support, and validation gates present in the enhanced ingestor.
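A minimal sketch of the double warning, assuming it is raised from a helper called when the legacy module is used. The function name `warn_legacy_ingestor` and the message wording are hypothetical; only the DeprecationWarning, the logging warning, and the pointer to utils.enhanced_ingestor come from the description.

```python
import logging
import warnings

logger = logging.getLogger(__name__)

# Hypothetical wording; the actual message in utils/ingestor.py may differ.
_DEPRECATION_MESSAGE = (
    "utils.ingestor is deprecated; use utils.enhanced_ingestor instead."
)


def warn_legacy_ingestor() -> None:
    """Emit both a DeprecationWarning and a log warning for the legacy ingestor."""
    # stacklevel=2 attributes the warning to the caller rather than this helper.
    warnings.warn(_DEPRECATION_MESSAGE, DeprecationWarning, stacklevel=2)
    logger.warning(_DEPRECATION_MESSAGE)
```

Emitting both is a reasonable choice because Python hides DeprecationWarning by default outside `__main__` and test runners, so the log line is what most pipeline runs would actually surface.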
Walkthrough
Adds Enhanced Ingestor: OCR Fallback, TXT Ingestion, and Deprecation
Summary by CodeRabbit
New Features
.txt files alongside PDFs.
Bug Fixes