Fix inconsistent recognition_done_ state on empty pages #4528

Open
eyupcanakman wants to merge 1 commit into tesseract-ocr:main from eyupcanakman:fix/recognition-done-empty-page

eyupcanakman commented Mar 18, 2026

When Recognize() encounters an empty page (no text blocks detected), it sets page_res_ but returns without setting recognition_done_ to true.

Some renderers (hOCR, ALTO, TSV) check page_res_ == nullptr to decide whether to re-run recognition, while others (GetUTF8Text, GetBoxText, GetUNLVText) check !recognition_done_. The second group triggers a redundant Recognize() call on empty pages. If the second pass non-deterministically finds text, later renderers get text while earlier ones (hOCR) return empty output.

Set recognition_done_ = true in the empty-page early-return path, same as the non-empty path. Add a regression test that verifies hOCR, UTF8, and TSV output are all non-null after recognizing a blank image.

Fixes #4112

eyupcanakman force-pushed the fix/recognition-done-empty-page branch from 003e38e to 90928a9 on March 18, 2026 15:42

Development

Successfully merging this pull request may close these issues.

Tesseract creates hOCR output without text results