Add OCR fallback for scanned/non-searchable PDFs (#1156) #1268
Sghosh1999 wants to merge 2 commits into microsoft:main from
Conversation
@microsoft-github-policy-service agree
Thanks for the contribution. This looks promising. Let me do some testing. NOTE: I'm not sure we should throw a dependency error if no text is found. What if the PDF just doesn't have text?
I think this scenario will be rare, mostly limited to something like the last page of a PDF. In 99% of cases, a PDF with no extractable text holds non-extractive content such as images or charts.
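For illustration, here is a minimal sketch of one way to handle the concern above: treat OCR as optional, and return an empty result instead of raising when no text is found and OCR is unavailable. This is not the PR's code. The function name `extract_with_optional_ocr` and the two injected callables are hypothetical stand-ins for the pdfminer extraction and the pytesseract/pdf2image OCR pass.

```python
from typing import Callable, Optional


def extract_with_optional_ocr(
    extract_text: Callable[[], str],
    run_ocr: Optional[Callable[[], str]] = None,
) -> str:
    """Return mined text if any, else OCR text if OCR is available, else ''.

    `extract_text` and `run_ocr` are hypothetical callables standing in for
    the pdfminer and OCR steps; pass `run_ocr=None` when the OCR
    dependencies are not installed.
    """
    text = extract_text()
    if text and text.strip():
        return text

    if run_ocr is None:
        # No OCR available: the PDF may simply have no text, so return an
        # empty result rather than raising a dependency error.
        return ""

    return run_ocr()
```

Whether a silent empty result or an explicit error is the better default is a design choice for the maintainers. The sketch only shows that the two cases can be separated.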
I had thought of a very similar feature, but leveraging an optional
    file_stream.seek(0)
    text = pdfminer.high_level.extract_text(file_stream)
    if text and text.strip():
        return DocumentConverterResult(markdown=text)

    # If no text found, fall back to OCR
    if _ocr_dependency_exc_info is not None:
        raise MissingDependencyException(
            "OCR dependencies are missing. Please install pytesseract and pdf2image for OCR support."
        ) from _ocr_dependency_exc_info[1].with_traceback(_ocr_dependency_exc_info[2])

    file_stream.seek(0)
    images = convert_from_bytes(file_stream.read())
    ocr_text = []
    for img in images:
        ocr_text.append(pytesseract.image_to_string(img))
    ocr_output = "\n\n".join(ocr_text)
    return DocumentConverterResult(markdown=ocr_output)
There is an issue with PDFs that contain both mineable text and images that contain text. It would be nice to have a more sophisticated branching mechanism that accounts for this, and/or an API that lets the markitdown caller override the behavior.
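As a rough sketch of what such branching could look like, the snippet below decides per page and lets the caller override the behavior. Everything here is an assumption for discussion, not part of the PR or of markitdown's API: the `OcrMode` enum, the `merge_page_text` function, and the `ocr_page` callback (which would wrap the pdf2image and pytesseract calls for a single page) are all hypothetical names.

```python
from enum import Enum
from typing import Callable, List, Sequence


class OcrMode(Enum):
    """Hypothetical caller-facing override for OCR behavior."""

    AUTO = "auto"      # OCR only pages where no text could be mined
    ALWAYS = "always"  # OCR every page and append the result to mined text
    NEVER = "never"    # never run OCR


def merge_page_text(
    mined_pages: Sequence[str],
    ocr_page: Callable[[int], str],
    mode: OcrMode = OcrMode.AUTO,
) -> str:
    """Combine per-page mined text with per-page OCR text.

    `mined_pages[i]` is the text mined from page i; `ocr_page(i)` is a
    hypothetical callback that returns the OCR text for page i.
    """
    parts: List[str] = []
    for index, mined in enumerate(mined_pages):
        mined = mined.strip()
        if mode is OcrMode.NEVER:
            page = mined
        elif mode is OcrMode.ALWAYS:
            ocr = ocr_page(index).strip()
            # Keep both sources; this may duplicate text found by both.
            page = "\n\n".join(p for p in (mined, ocr) if p)
        else:  # OcrMode.AUTO
            page = mined if mined else ocr_page(index).strip()
        parts.append(page)
    # Skip pages that produced no text at all.
    return "\n\n".join(p for p in parts if p)
```

Note that `AUTO` still misses text inside an image on a page that also has mineable text. `ALWAYS` covers that case at the cost of possible duplication, which is why a caller override seems useful.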
related discussion #1285 (comment)
Description
Added OCR support to the PDF converter to handle scanned and non-searchable PDF files. When a PDF does not contain extractable text, the converter will now use OCR (via pytesseract and pdf2image) to extract text content from the PDF images.
Changes
PdfConverter to first attempt text extraction with pdfminer as before.
Example Usage
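The original usage snippet did not survive on this page. The following is a minimal sketch of how the converter would typically be called, assuming the standard `MarkItDown().convert(...)` entry point and the `text_content` attribute on its result. The helper `pdf_to_markdown`, its `converter` parameter, and the file name `scanned.pdf` are hypothetical and exist only to keep the example self-contained.

```python
def pdf_to_markdown(path, converter=None):
    """Convert a PDF at `path` to Markdown text.

    `converter` is a hypothetical injection point for testing; by default
    a MarkItDown instance is created.
    """
    if converter is None:
        # Imported lazily so the sketch loads even without markitdown installed.
        from markitdown import MarkItDown

        converter = MarkItDown()
    result = converter.convert(path)
    return result.text_content


# Example call (assumes a local file named "scanned.pdf" exists and, for
# scanned input, that pytesseract and pdf2image are installed):
# print(pdf_to_markdown("scanned.pdf"))
```

With this PR applied, the call is the same for searchable and scanned PDFs. The OCR fallback would only run when pdfminer finds no text.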
Related Issues
Closes #1156 — Pdf file conversion not working when pdf file is non scanable