Add OCR fallback for scanned/non-searchable PDFs (#1156) by Sghosh1999 · Pull Request #1268 · microsoft/markitdown

Sghosh1999 · 2025-05-25T15:01:19Z

Description

Added OCR support to the PDF converter to handle scanned and non-searchable PDF files. When a PDF does not contain extractable text, the converter will now use OCR (via pytesseract and pdf2image) to extract text content from the PDF images.

Changes

Updated PdfConverter to first attempt text extraction with pdfminer as before.
If no text is found, the converter falls back to OCR using pytesseract and pdf2image.
Added clear error messages if OCR dependencies are missing.
Updated documentation/comments to include installation instructions for new dependencies.

Example Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("scanned-document.pdf")
print(result.text_content)  # Will show OCR-extracted text if the PDF was not searchable

Related Issues

Closes #1156 — Pdf file conversion not working when pdf file is non scanable

Sghosh1999 · 2025-05-25T15:02:31Z

@Sghosh1999 please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.
@microsoft-github-policy-service agree [company="{your company}"]
Options:

(default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
(when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
Contributor License Agreement

@microsoft-github-policy-service agree

afourney · 2025-05-28T16:59:26Z

Thanks for the contribution. This looks promising. Let me do some testing.

NOTE: I'm not sure we should throw a dependency error if no text is found. What if the PDF just doesn't have text?
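
One possible way to address this concern, sketched under assumptions: when no text is found and OCR is unavailable, warn and return whatever the text extractor produced instead of raising. The function name `text_or_ocr` and the `run_ocr` callback are hypothetical and not part of the PR.

```python
import warnings
from typing import Callable, Optional


def text_or_ocr(text: str, run_ocr: Optional[Callable[[], str]]) -> str:
    """Prefer embedded text; use OCR if available; otherwise degrade gracefully.

    `run_ocr` is None when the OCR dependencies are not installed.
    """
    if text and text.strip():
        return text
    if run_ocr is None:
        # The PDF may simply contain no text, so a missing optional
        # dependency is reported as a warning rather than an error.
        warnings.warn(
            "No extractable text found and OCR is unavailable; "
            "returning the extractor output unchanged."
        )
        return text
    return run_ocr()
```

With this shape, a text-free PDF converts to empty output exactly as it did before the OCR fallback existed, and users with the OCR packages installed still get the new behavior.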

Sghosh1999 · 2025-05-30T21:41:30Z

> Thanks for the contribution. This looks promising. Let me do some testing.
>
> NOTE: I'm not sure we should throw a dependency error if no text is found. What if the PDF just doesn't have text?

I think this scenario will be rare, mostly something like a blank last page of a PDF. In 99% of cases, a PDF with no extractable text holds non-extractable content such as images or charts.

gjmveloso · 2025-06-06T23:21:37Z

I had thought of a very similar feature, but leveraging an optional llm_client instead.

#1285

dillonstreator · 2025-06-10T21:49:05Z

packages/markitdown/src/markitdown/converters/_pdf_converter.py

+        file_stream.seek(0)
+        text = pdfminer.high_level.extract_text(file_stream)
+        if text and text.strip():
+            return DocumentConverterResult(markdown=text)
+
+        # If no text found, fall back to OCR
+        if _ocr_dependency_exc_info is not None:
+            raise MissingDependencyException(
+                "OCR dependencies are missing. Please install pytesseract and pdf2image for OCR support."
+            ) from _ocr_dependency_exc_info[1].with_traceback(_ocr_dependency_exc_info[2])
+
+        file_stream.seek(0)
+        images = convert_from_bytes(file_stream.read())
+        ocr_text = []
+        for img in images:
+            ocr_text.append(pytesseract.image_to_string(img))
+        ocr_output = "\n\n".join(ocr_text)
+        return DocumentConverterResult(markdown=ocr_output)


There is an issue with PDFs that contain both mineable text and images that contain text. It would be nice to have a more sophisticated branching mechanism that accounts for this, and/or an API that lets the markitdown caller override the behavior.

related discussion #1285 (comment)
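
As a rough illustration of the branching idea, the sketch below decides per page whether to use OCR and lets the caller override that decision. Everything here is hypothetical: the `mode` parameter, the `min_chars` threshold, and the `ocr_page` callback are assumptions for illustration and do not exist in markitdown.

```python
from typing import Callable, List, Literal

OcrMode = Literal["auto", "always", "never"]


def merge_page_text(
    page_texts: List[str],
    ocr_page: Callable[[int], str],
    mode: OcrMode = "auto",
    min_chars: int = 20,
) -> str:
    """Combine per-page text, running OCR only on pages that need it.

    page_texts: text already extracted from each page (may be empty).
    ocr_page:   callback that returns OCR text for a zero-based page index.
    mode:       "auto" uses OCR on pages with fewer than `min_chars`
                characters of embedded text; "always" and "never" let the
                caller override that heuristic.
    """
    merged = []
    for index, text in enumerate(page_texts):
        stripped = text.strip()
        if mode == "always":
            use_ocr = True
        elif mode == "never":
            use_ocr = False
        else:
            use_ocr = len(stripped) < min_chars
        merged.append(ocr_page(index).strip() if use_ocr else stripped)
    return "\n\n".join(merged)
```

This heuristic still misses a page that has plenty of embedded text plus an image containing more text. Handling that case would likely require inspecting the images on each page, which is beyond this sketch.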

Sghosh1999 added 2 commits May 25, 2025 19:43

Add OCR fallback for non-searchable PDFs (fixes microsoft#1156)

35e32d5

Added pytesseract and pdf2image dependency.

523f796

Sghosh1999 wants to merge 2 commits into microsoft:main from Sghosh1999:fix-pdf-ocr-support
