Add languages parameter to hOCR output metadata #4531
mvanhorn wants to merge 3 commits into tesseract-ocr:main
Adds an 'ocr-langs' meta tag to the hOCR header containing the languages string passed via -l (e.g. "eng+fra"). This lets downstream tools know which language models were used for OCR. The implementation adds SetInputLanguages() to TessHOcrRenderer, called from the CLI before processing begins. The meta tag is only emitted when languages are set. Fixes tesseract-ocr#4455
|
Your implementation is currently not compatible with the spec.
You also need to take into account the script models, e.g. Latin, Greek, Hebrew |
|
@stweil, I didn't see your comment. I basically wrote the same thing roughly in parallel time.... :-) |
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Pushed a fix. Languages are now converted to ISO 639-1 codes (space-separated), and script models emit a separate ocr-scripts meta tag. Example:
<meta name='ocr-langs' content='en fr' />
<meta name='ocr-scripts' content='Latin' />
|
@copilot, please review this pull request. It currently does not handle more complex model names like for example |
Pull request overview
Adds language metadata to hOCR output so downstream consumers can determine which recognition languages were used.
Changes:
- Adds TessHOcrRenderer::SetInputLanguages() and stores the input language string in the renderer.
- Emits new <meta name='ocr-langs' ...> (and currently also ocr-scripts) tags in the hOCR document header.
- Wires the CLI path to pass TessBaseAPI::GetInitLanguagesAsString() into the hOCR renderer.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| include/tesseract/renderer.h | Extends the public hOCR renderer API with a setter and new stored state. |
| src/api/hocrrenderer.cpp | Implements language handling and emits new hOCR header meta tags. |
| src/tesseract.cpp | Passes initialized languages into the hOCR renderer in the CLI renderer setup. |
static const std::unordered_map<std::string, std::string> &Iso639Map() {
  static const auto *map = new std::unordered_map<std::string, std::string>{
      {"afr", "af"}, {"amh", "am"}, {"ara", "ar"}, {"asm", "as"},
      {"aze", "az"}, {"bel", "be"}, {"ben", "bn"}, {"bod", "bo"},
Iso639Map() allocates the map with new and intentionally leaks it for the lifetime of the process. Unless there is a specific reason to avoid static destruction here, prefer a function-local static const std::unordered_map<...> value to avoid heap allocation/leaks and simplify the code.
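The function-local static suggested here could look roughly like the sketch below. The name Iso639MapSketch is hypothetical and only the first few entries from the excerpt are shown; it is not the PR's actual table.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the PR's Iso639Map(); only a few entries shown.
const std::unordered_map<std::string, std::string> &Iso639MapSketch() {
  // A function-local static is initialized once, on first call (thread-safe
  // since C++11), and destroyed at normal program exit. No heap-allocated
  // pointer is held and nothing is intentionally leaked.
  static const std::unordered_map<std::string, std::string> map{
      {"afr", "af"}, {"amh", "am"}, {"ara", "ar"}, {"asm", "as"},
  };
  return map;
}
```

The trade-off is that the map's destructor now runs at shutdown, which only matters if other static objects might still use the map during their own destruction.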
static const std::unordered_set<std::string> &ScriptNames() {
  static const auto *set = new std::unordered_set<std::string>{
      "Arabic", "Armenian", "Bengali", "Canadian_Aboriginal", "Cherokee",
      "Cyrillic", "Devanagari", "Ethiopic", "Fraktur", "Georgian", "Greek",
ScriptNames() also uses new for a function-local static and leaks it. Prefer a function-local static const std::unordered_set<std::string> object unless there's a measured reason to avoid its destructor at shutdown.
if (languages) {
  input_languages_ = languages;
SetInputLanguages(nullptr) currently leaves any previously set value intact, which can produce stale ocr-langs metadata if a renderer instance is reused and the caller tries to clear the languages. Consider explicitly clearing input_languages_ when languages == nullptr (or provide a ClearInputLanguages()), so the behavior is unambiguous.
Suggested change:
- if (languages) {
-   input_languages_ = languages;
+ if (languages && languages[0] != '\0') {
+   input_languages_ = languages;
+ } else {
+   input_languages_.clear();
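To make the clearing behavior concrete, here is a minimal self-contained sketch. LanguageHolder is a hypothetical stand-in for the renderer's language state, not part of Tesseract; only the setter logic mirrors the suggestion.

```cpp
#include <string>

// Hypothetical stand-in for the renderer's language state.
class LanguageHolder {
 public:
  // A null or empty argument clears any previously stored value, so a reused
  // instance cannot emit stale metadata.
  void SetInputLanguages(const char *languages) {
    if (languages && languages[0] != '\0') {
      input_languages_ = languages;
    } else {
      input_languages_.clear();
    }
  }

  const std::string &input_languages() const { return input_languages_; }

 private:
  std::string input_languages_;
};
```

With this shape, calling SetInputLanguages("eng+fra") and then SetInputLanguages(nullptr) leaves the stored string empty, which is the unambiguous behavior the review comment asks for.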
 private:
  bool font_info_; // whether to print font information
  std::string input_languages_;
Adding std::string input_languages_ to TessHOcrRenderer changes the size/layout of this exported (TESS_API) C++ class, which can break binary compatibility for downstream code that links against libtesseract and instantiates TessHOcrRenderer. If ABI stability is a concern, consider storing the new state behind an indirection (pimpl/opaque pointer) or in a separate internal structure to minimize ABI impact.
Suggested change:
- std::string input_languages_;
+ // NOTE: Additional per-instance state (e.g. input languages) is stored
+ // out-of-line to avoid changing the ABI-visible layout of this class.
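One common way to store state out-of-line is the pimpl idiom. The sketch below uses a hypothetical RendererSketch class, not the real TessHOcrRenderer, and assumes C++14 for std::make_unique.

```cpp
#include <memory>
#include <string>

// Hypothetical class for illustration; this is not the real TessHOcrRenderer.
class RendererSketch {
 public:
  RendererSketch();
  ~RendererSketch();  // defined below, where Impl is a complete type

  void SetInputLanguages(const char *languages);
  const std::string &input_languages() const;

 private:
  struct Impl;                  // declared only; contents hidden from users
  std::unique_ptr<Impl> impl_;  // one pointer, whatever Impl later contains
};

// Everything below would normally live in the .cpp file.
struct RendererSketch::Impl {
  std::string input_languages;
};

RendererSketch::RendererSketch() : impl_(std::make_unique<Impl>()) {}
RendererSketch::~RendererSketch() = default;

void RendererSketch::SetInputLanguages(const char *languages) {
  impl_->input_languages = languages ? languages : "";
}

const std::string &RendererSketch::input_languages() const {
  return impl_->input_languages;
}
```

Note that adding the pointer member to a class that does not already have one is itself a one-time layout change; the benefit is that later additions to Impl no longer affect the exported class's size.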
@@ -496,8 +553,47 @@ bool TessHOcrRenderer::BeginDocumentHandler() {
  if (font_info_) {
    AppendString(" ocrp_font ocrp_fsize");
  }
  AppendString("'/>\n");
  if (!input_languages_.empty()) {
    std::string langs;
    std::string scripts;
    const auto &iso_map = Iso639Map();
    const auto &script_set = ScriptNames();
    std::istringstream stream(input_languages_);
    std::string token;
    while (std::getline(stream, token, '+')) {
      if (token.empty()) {
The new header metadata behavior (splitting/mapping languages and emitting new meta tags) isn't covered by existing unit tests. Since src/api/hocrrenderer.cpp already has test coverage for other hOCR functionality, it would be good to add a focused test that exercises TessHOcrRenderer::BeginDocumentHandler() output with (1) languages set and (2) languages unset, and asserts the exact meta tag(s) emitted (including proper escaping).
      AppendString(" <meta name='ocr-langs' content='");
      AppendString(langs.c_str());
      AppendString("' />\n");
    }
    if (!scripts.empty()) {
      AppendString(" <meta name='ocr-scripts' content='");
      AppendString(scripts.c_str());
The ocr-langs / ocr-scripts meta tag contents are derived from the user-controlled -l argument (via GetInitLanguagesAsString()), but they are appended without HTML/XML escaping. This allows malformed output or HTML injection if the language string contains characters like ', &, <, etc. Consider escaping with the existing tesseract::HOcrEscape() helper before emitting into the hOCR header.
Suggested change:
-       AppendString(" <meta name='ocr-langs' content='");
-       AppendString(langs.c_str());
-       AppendString("' />\n");
-     }
-     if (!scripts.empty()) {
-       AppendString(" <meta name='ocr-scripts' content='");
-       AppendString(scripts.c_str());
+       std::string escaped_langs = HOcrEscape(langs.c_str());
+       AppendString(" <meta name='ocr-langs' content='");
+       AppendString(escaped_langs.c_str());
+       AppendString("' />\n");
+     }
+     if (!scripts.empty()) {
+       std::string escaped_scripts = HOcrEscape(scripts.c_str());
+       AppendString(" <meta name='ocr-scripts' content='");
+       AppendString(escaped_scripts.c_str());
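For readers unfamiliar with why escaping matters here, the following standalone sketch shows the idea. EscapeForAttribute and LangsMetaTag are hypothetical helpers written for this illustration; the PR itself would rely on tesseract::HOcrEscape(), whose exact output may differ from this stand-in.

```cpp
#include <string>

// Hypothetical helper; replaces characters that could break out of a
// single-quoted attribute value or start new markup.
std::string EscapeForAttribute(const std::string &text) {
  std::string out;
  out.reserve(text.size());
  for (char c : text) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      case '"': out += "&quot;"; break;
      case '\'': out += "&#39;"; break;
      default: out += c;
    }
  }
  return out;
}

// Hypothetical helper that builds the meta tag from an already-joined
// language string, escaping it first.
std::string LangsMetaTag(const std::string &langs) {
  return " <meta name='ocr-langs' content='" + EscapeForAttribute(langs) +
         "' />\n";
}
```

An input such as `x' onload='y` would otherwise close the content attribute early; after escaping, the quote is emitted as `&#39;` and stays inside the attribute value.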
  if (!input_languages_.empty()) {
    std::string langs;
    std::string scripts;
    const auto &iso_map = Iso639Map();
    const auto &script_set = ScriptNames();
    std::istringstream stream(input_languages_);
    std::string token;
    while (std::getline(stream, token, '+')) {
      if (token.empty()) {
        continue;
      }
      if (script_set.count(token)) {
        if (!scripts.empty()) {
          scripts += ' ';
        }
        scripts += token;
      } else {
        if (!langs.empty()) {
          langs += ' ';
        }
        auto it = iso_map.find(token);
        if (it != iso_map.end()) {
          langs += it->second;
        } else {
          langs += token;
        }
      }
This implementation transforms the -l string by splitting on +, converting some tokens (e.g. eng->en) and joining with spaces, so the emitted <meta name='ocr-langs' ...> does not preserve the original -l value (e.g. eng+fra). That differs from the PR description/issue request and also makes ocr-langs inconsistent with the existing lang='eng' attributes emitted elsewhere in hOCR. Please either emit the raw -l value (escaped) or update the approach/spec and ensure consistency across the document.
|
Good point about compound model names. The current implementation does exact lookup, so I'll add prefix-based fallback: strip trailing For truly arbitrary names like Also addressing Copilot's suggestions in the same push:
Re ABI: the new |
- Add prefix-based fallback for compound model names (deu_latf -> de, chi_tra_vert -> zh) by progressively stripping _suffix parts
- Switch Iso639Map/ScriptNames from heap-allocated leak to function-local static const (Copilot suggestion)
- Clear input_languages_ on nullptr/empty to prevent stale metadata
- Add HOcrEscape for ocr-langs/ocr-scripts meta tag content to prevent HTML injection from untrusted language strings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
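The prefix-based fallback described in this commit message might be sketched as follows. ToIso639_1 is a hypothetical name, and which keys the PR's real table contains (for example whether it has "chi_tra" or only "chi") is an assumption; the map is passed in so the sketch stays self-contained.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical lookup: try the full model name first, then strip one trailing
// "_suffix" at a time until a key matches or no underscore is left.
std::string ToIso639_1(
    const std::string &model,
    const std::unordered_map<std::string, std::string> &iso_map) {
  std::string key = model;
  while (true) {
    auto it = iso_map.find(key);
    if (it != iso_map.end()) {
      return it->second;
    }
    const std::string::size_type pos = key.rfind('_');
    if (pos == std::string::npos) {
      break;  // nothing left to strip
    }
    key.erase(pos);  // drop the last "_suffix" and retry
  }
  return model;  // unknown name: pass through unchanged
}
```

With a table containing "deu" and "chi_tra", the name deu_latf resolves via "deu" and chi_tra_vert via "chi_tra", while a name with no matching prefix is returned as given.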
|
Addressed in 43706b8:
Still need to add unit tests - will push those next. |
|
Addressed the Copilot findings. Items 1-3 and 6 were already fixed in 43706b8 (Copilot ran against the prior revision):
On the remaining items:
- ABI (item 4): Fair point.
- Tests (item 5): Will add a test exercising
- Raw value vs ISO 639-1 (item 7): The hOCR spec requires ISO 639-1 codes in
Summary
Adds an ocr-langs meta tag to the hOCR HTML header containing the -l languages parameter (e.g. eng+fra).

Problem
When processing images with tesseract -l eng+fra image.tif output hocr, the hOCR output includes ocr-system and ocr-capabilities meta tags but not the languages used for recognition. Downstream tools processing hOCR files have no way to determine which language models produced the results. Requested in #4455, related to #4046.

Changes
- include/tesseract/renderer.h: Added SetInputLanguages() method and input_languages_ member to TessHOcrRenderer
- src/api/hocrrenderer.cpp: Implemented SetInputLanguages() and emit <meta name='ocr-langs' content='...' /> in BeginDocumentHandler() when languages are set
- src/tesseract.cpp: Call renderer->SetInputLanguages(api.GetInitLanguagesAsString()) when creating the hOCR renderer (line 519)

The existing GetInitLanguagesAsString() API (baseapi.cpp:372) already returns the -l value - it just wasn't being passed to the hOCR renderer.

Example output with -l eng+fra:

Testing
The meta tag is only emitted when SetInputLanguages() is called with a non-null, non-empty string. The C API (TessHOcrRendererCreate/TessHOcrRendererCreate2) is unchanged - the meta tag won't appear unless SetInputLanguages is explicitly called.

Fixes #4455
This contribution was developed with AI assistance (Claude Code).