-
Notifications
You must be signed in to change notification settings - Fork 10.6k
Add languages parameter to hOCR output metadata #4531
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -18,9 +18,12 @@ | |||||||||||||||||||||||||||||||||
| **********************************************************************/ | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| #include <tesseract/baseapi.h> // for TessBaseAPI | ||||||||||||||||||||||||||||||||||
| #include <cstring> // for strcmp | ||||||||||||||||||||||||||||||||||
| #include <locale> // for std::locale::classic | ||||||||||||||||||||||||||||||||||
| #include <memory> // for std::unique_ptr | ||||||||||||||||||||||||||||||||||
| #include <sstream> // for std::stringstream | ||||||||||||||||||||||||||||||||||
| #include <unordered_map> // for std::unordered_map | ||||||||||||||||||||||||||||||||||
| #include <unordered_set> // for std::unordered_set | ||||||||||||||||||||||||||||||||||
| #include <tesseract/renderer.h> | ||||||||||||||||||||||||||||||||||
| #include "helpers.h" // for copy_string | ||||||||||||||||||||||||||||||||||
| #include "tesseractclass.h" // for Tesseract | ||||||||||||||||||||||||||||||||||
|
|
@@ -477,6 +480,60 @@ TessHOcrRenderer::TessHOcrRenderer(const char *outputbase, bool font_info) | |||||||||||||||||||||||||||||||||
| font_info_ = font_info; | ||||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| void TessHOcrRenderer::SetInputLanguages(const char *languages) { | ||||||||||||||||||||||||||||||||||
| if (languages) { | ||||||||||||||||||||||||||||||||||
| input_languages_ = languages; | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
| if (languages) { | |
| input_languages_ = languages; | |
| if (languages && languages[0] != '\0') { | |
| input_languages_ = languages; | |
| } else { | |
| input_languages_.clear(); |
Copilot
AI
Mar 24, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iso639Map() allocates the map with new and intentionally leaks it for the lifetime of the process. Unless there is a specific reason to avoid static destruction here, prefer a function-local static const std::unordered_map<...> value to avoid heap allocation/leaks and simplify the code.
Copilot
AI
Mar 24, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ScriptNames() also uses new for a function-local static and leaks it. Prefer a function-local static const std::unordered_set<std::string> object unless there's a measured reason to avoid its destructor at shutdown.
Copilot
AI
Mar 24, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new header metadata behavior (splitting/mapping languages and emitting new meta tags) isn't covered by existing unit tests. Since src/api/hocrrenderer.cpp already has test coverage for other hOCR functionality, it would be good to add a focused test that exercises TessHOcrRenderer::BeginDocumentHandler() output with (1) languages set and (2) languages unset, and asserts the exact meta tag(s) emitted (including proper escaping).
Copilot
AI
Mar 24, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation transforms the -l string by splitting on +, converting some tokens (e.g. eng->en) and joining with spaces, so the emitted <meta name='ocr-langs' ...> does not preserve the original -l value (e.g. eng+fra). That differs from the PR description/issue request and also makes ocr-langs inconsistent with the existing lang='eng' attributes emitted elsewhere in hOCR. Please either emit the raw -l value (escaped) or update the approach/spec and ensure consistency across the document.
Copilot
AI
Mar 24, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ocr-langs / ocr-scripts meta tag contents are derived from the user-controlled -l argument (via GetInitLanguagesAsString()), but they are appended without HTML/XML escaping. This allows malformed output or HTML injection if the language string contains characters like ', &, <, etc. Consider escaping with the existing tesseract::HOcrEscape() helper before emitting into the hOCR header.
| AppendString(" <meta name='ocr-langs' content='"); | |
| AppendString(langs.c_str()); | |
| AppendString("' />\n"); | |
| } | |
| if (!scripts.empty()) { | |
| AppendString(" <meta name='ocr-scripts' content='"); | |
| AppendString(scripts.c_str()); | |
| std::string escaped_langs = HOcrEscape(langs.c_str()); | |
| AppendString(" <meta name='ocr-langs' content='"); | |
| AppendString(escaped_langs.c_str()); | |
| AppendString("' />\n"); | |
| } | |
| if (!scripts.empty()) { | |
| std::string escaped_scripts = HOcrEscape(scripts.c_str()); | |
| AppendString(" <meta name='ocr-scripts' content='"); | |
| AppendString(escaped_scripts.c_str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding
std::string input_languages_toTessHOcrRendererchanges the size/layout of this exported (TESS_API) C++ class, which can break binary compatibility for downstream code that links against libtesseract and instantiatesTessHOcrRenderer. If ABI stability is a concern, consider storing the new state behind an indirection (pimpl/opaque pointer) or in a separate internal structure to minimize ABI impact.