Extending LLM usage for PDFs when the extracted text is empty after pdfminer #1285
gjmveloso wants to merge 5 commits into microsoft:main
@microsoft-github-policy-service agree
        prompt=llm_prompt,
    )

    return DocumentConverterResult(markdown=str(markdown))
There is an issue with PDFs that contain both mineable text and images that contain text. It would be nice to have a more sophisticated branching mechanism that accounts for this, and/or an API that lets the markitdown caller override the behavior.
Are you thinking of something like replacing extract_text with extract_pages and iterating over its non-text elements, such as LTImage and LTFigure?
Layout system reference:
https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html#topic-pdf-to-text-layout
Yes - that would allow a much more reliable, predictable, and comprehensive text extraction.
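For illustration, the extract_pages approach discussed above could look roughly like the sketch below. It is not the code in this PR: `iter_layout_images` and `collect_pdf_images` are hypothetical names, and the walker takes the image class as a parameter so it can be exercised without pdfminer.six installed. It assumes pdfminer layout containers (LTPage, LTFigure, and so on) are iterable and that embedded images appear as LTImage nodes.

```python
from collections.abc import Iterable
from typing import Any, Iterator, List


def iter_layout_images(element: Any, image_type: type) -> Iterator[Any]:
    """Recursively yield every node of `image_type` under a layout element."""
    if isinstance(element, image_type):
        yield element
    elif isinstance(element, Iterable) and not isinstance(element, (str, bytes)):
        # Containers such as pages and figures iterate over their children.
        for child in element:
            yield from iter_layout_images(child, image_type)


def collect_pdf_images(pdf_path: str) -> List[Any]:
    """Collect all LTImage objects from a PDF (hypothetical helper)."""
    # Imported lazily so the walker above has no hard dependency on pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTImage

    images: List[Any] = []
    for page in extract_pages(pdf_path):
        images.extend(iter_layout_images(page, LTImage))
    return images
```

Each collected image could then be handed to the LLM for captioning, while the text elements on the same page still go through normal extraction.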
- Proper handling of file_stream positioning after an empty result from pdfminer
- Resolve merge conflicts that were baked into the previous commits
- Add llm_caption import and two prompt constants (_PDF_IMAGE_LLM_PROMPT, _PDF_FULL_LLM_PROMPT) to avoid inline prompt strings
- Add _collect_lt_images() and _get_lt_image_data() helpers for extracting JPEG/JPEG2000 image data from pdfminer LTImage objects; use pdfminer's own LITERALS_DCT_DECODE / LITERALS_JPX_DECODE for filter comparison instead of fragile PSLiteral string conversion
- When no form pages are detected, use pdfminer extract_text for prose quality, then do a second pass with extract_pages to find LTFigure elements containing embedded images and caption each one via the LLM
- Add last-resort whole-document LLM fallback for fully non-searchable PDFs where no captionable images were found
- Guard _merge_partial_numbering_lines call against None return from llm_caption
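As a rough sketch of the stream-positioning and None-guard items above (not the code in this PR): the function name `extract_with_llm_fallback` and the signatures of the two callables are assumptions made for illustration. The point is that the extractor may consume the stream, so the position has to be restored before the same stream is handed to the LLM path, and that a None caption should not propagate to later string processing.

```python
import io
from typing import BinaryIO, Callable, Optional


def extract_with_llm_fallback(
    file_stream: BinaryIO,
    extract_text: Callable[[BinaryIO], str],
    llm_caption: Callable[[BinaryIO], Optional[str]],
) -> str:
    """Return extracted text, falling back to an LLM when the text is empty."""
    start = file_stream.tell()  # remember where the caller left the stream
    text = extract_text(file_stream)
    if text and text.strip():
        return text

    # The extractor has likely read to EOF; rewind before reusing the stream.
    file_stream.seek(start)
    caption = llm_caption(file_stream)

    # Guard against a None result so downstream string handling never sees None.
    return caption if caption is not None else ""
```

Recording `tell()` rather than seeking to 0 keeps the sketch correct even if the caller passes a stream that does not start at the beginning.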
c83bacc to 6742995
Initial work to use an LLM to perform OCR on a PDF when pdfminer returns empty text.