Add languages parameter to hOCR output metadata by mvanhorn · Pull Request #4531 · tesseract-ocr/tesseract

mvanhorn · 2026-03-20T17:39:55Z

Summary

Adds an ocr-langs meta tag to the hOCR HTML header containing the -l languages parameter (e.g. eng+fra).

Problem

When processing images with tesseract -l eng+fra image.tif output hocr, the hOCR output includes ocr-system and ocr-capabilities meta tags but not the languages used for recognition. Downstream tools processing hOCR files have no way to determine which language models produced the results. Requested in #4455, related to #4046.

Changes

include/tesseract/renderer.h: Added SetInputLanguages() method and input_languages_ member to TessHOcrRenderer
src/api/hocrrenderer.cpp: Implemented SetInputLanguages() and emit <meta name='ocr-langs' content='...' /> in BeginDocumentHandler() when languages are set
src/tesseract.cpp: Call renderer->SetInputLanguages(api.GetInitLanguagesAsString()) when creating the hOCR renderer (line 519)

The existing GetInitLanguagesAsString() API (baseapi.cpp:372) already returns the -l value - it just wasn't being passed to the hOCR renderer.

Example output with -l eng+fra:

<meta name='ocr-system' content='tesseract 5.5.1' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ...'/>
<meta name='ocr-langs' content='eng+fra' />

Testing

The meta tag is only emitted when SetInputLanguages() is called with a non-null, non-empty string. The C API (TessHOcrRendererCreate/TessHOcrRendererCreate2) is unchanged - the meta tag won't appear unless SetInputLanguages is explicitly called.

Fixes #4455

This contribution was developed with AI assistance (Claude Code).

Adds an 'ocr-langs' meta tag to the hOCR header containing the languages string passed via -l (e.g. "eng+fra"). This lets downstream tools know which language models were used for OCR. The implementation adds SetInputLanguages() to TessHOcrRenderer, called from the CLI before processing begins. The meta tag is only emitted when languages are set. Fixes tesseract-ocr#4455

amitdo · 2026-03-23T10:12:28Z

https://github.com/kba/hocr-spec

https://kba.github.io/hocr-spec/1.2/

ocr-langs

ocr-scripts

metadata example

stweil

hOCR expects ISO 639-1 codes, so this pull request requires more work for languages.

In addition, Tesseract's -l parameter can also get script models which are not language specific and which require a different handling.

amitdo · 2026-03-23T10:26:57Z

Your implementation is currently not compatible with the spec.

"eng" -> "en"
"fra" -> "fr"
"eng+fra" -> "en fr"

You also need to take into account the script models, e.g. Latin, Greek, Hebrew

amitdo · 2026-03-23T10:40:49Z

@stweil, I didn't see your comment. I wrote essentially the same thing at about the same time. :-)

mvanhorn · 2026-03-24T21:10:09Z

Pushed a fix. Languages are now converted to ISO 639-1 codes (space-separated), and script models emit a separate ocr-scripts meta tag. Covers 90+ language mappings and all 37 script models from tessdata_best. Codes without an ISO 639-1 equivalent pass through as-is.

Example with -l eng+fra+Latin:

<meta name='ocr-langs' content='en fr' />
<meta name='ocr-scripts' content='Latin' />
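
A minimal, self-contained sketch of the conversion described in this comment, not the PR's actual code. The two lookup tables are deliberately tiny, and `LangMeta` and `SplitLanguages` are hypothetical names used only for illustration.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Result of splitting a -l string into hOCR metadata values.
struct LangMeta {
  std::string langs;    // space-separated, ISO 639-1 where known
  std::string scripts;  // space-separated script model names
};

// Hypothetical helper: split on '+', route script models to `scripts`,
// map known language codes, and pass unknown codes through unchanged.
LangMeta SplitLanguages(const std::string &input) {
  // Tiny illustrative tables; the PR covers far more entries.
  static const std::unordered_map<std::string, std::string> iso_map{
      {"eng", "en"}, {"fra", "fr"}, {"deu", "de"}};
  static const std::unordered_set<std::string> script_set{
      "Latin", "Greek", "Hebrew"};

  LangMeta out;
  std::istringstream stream(input);
  std::string token;
  while (std::getline(stream, token, '+')) {
    if (token.empty()) {
      continue;  // tolerate inputs such as "eng++fra"
    }
    const bool is_script = script_set.count(token) != 0;
    std::string &target = is_script ? out.scripts : out.langs;
    if (!target.empty()) {
      target += ' ';
    }
    if (is_script) {
      target += token;
    } else {
      auto it = iso_map.find(token);
      target += (it != iso_map.end()) ? it->second : token;
    }
  }
  return out;
}
```

With this sketch, `SplitLanguages("eng+fra+Latin")` yields `langs == "en fr"` and `scripts == "Latin"`, matching the example above.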

stweil · 2026-03-24T21:19:01Z

@copilot, please review this pull request. It currently does not handle more complex model names such as deu_latf or chi_tra_vert. Please also suggest a better solution for arbitrary model names such as german_print.

Copilot

Pull request overview

Adds language metadata to hOCR output so downstream consumers can determine which recognition languages were used.

Changes:

Adds TessHOcrRenderer::SetInputLanguages() and stores the input language string in the renderer.
Emits new <meta name='ocr-langs' ...> (and currently also ocr-scripts) tags in the hOCR document header.
Wires the CLI path to pass TessBaseAPI::GetInitLanguagesAsString() into the hOCR renderer.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File	Description
`include/tesseract/renderer.h`	Extends the public hOCR renderer API with a setter and new stored state.
`src/api/hocrrenderer.cpp`	Implements language handling and emits new hOCR header meta tags.
`src/tesseract.cpp`	Passes initialized languages into the hOCR renderer in the CLI renderer setup.

Copilot · 2026-03-24T21:24:29Z

+static const std::unordered_map<std::string, std::string> &Iso639Map() {
+  static const auto *map = new std::unordered_map<std::string, std::string>{
+      {"afr", "af"}, {"amh", "am"}, {"ara", "ar"}, {"asm", "as"},
+      {"aze", "az"}, {"bel", "be"}, {"ben", "bn"}, {"bod", "bo"},


Iso639Map() allocates the map with new and intentionally leaks it for the lifetime of the process. Unless there is a specific reason to avoid static destruction here, prefer a function-local static const std::unordered_map<...> value to avoid heap allocation/leaks and simplify the code.
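
For illustration, the suggested pattern could look roughly like the sketch below. `Iso639MapSketch` is a hypothetical name and the table is truncated to the four entries visible in the diff above.

```cpp
#include <string>
#include <unordered_map>

// Function-local static object: initialized once on first call
// (thread-safe since C++11) and destroyed normally at program exit,
// so nothing is heap-allocated with `new` and nothing is leaked.
const std::unordered_map<std::string, std::string> &Iso639MapSketch() {
  static const std::unordered_map<std::string, std::string> map{
      {"afr", "af"}, {"amh", "am"}, {"ara", "ar"}, {"asm", "as"},
  };
  return map;
}
```

Every call returns a reference to the same object, so callers can keep using `Iso639MapSketch().find(...)` exactly as with the pointer-based version.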

Copilot · 2026-03-24T21:24:29Z

+static const std::unordered_set<std::string> &ScriptNames() {
+  static const auto *set = new std::unordered_set<std::string>{
+      "Arabic", "Armenian", "Bengali", "Canadian_Aboriginal", "Cherokee",
+      "Cyrillic", "Devanagari", "Ethiopic", "Fraktur", "Georgian", "Greek",


ScriptNames() also uses new for a function-local static and leaks it. Prefer a function-local static const std::unordered_set<std::string> object unless there's a measured reason to avoid its destructor at shutdown.

Copilot · 2026-03-24T21:24:29Z

+  if (languages) {
+    input_languages_ = languages;


SetInputLanguages(nullptr) currently leaves any previously set value intact, which can produce stale ocr-langs metadata if a renderer instance is reused and the caller tries to clear the languages. Consider explicitly clearing input_languages_ when languages == nullptr (or provide a ClearInputLanguages()), so the behavior is unambiguous.

Suggested change

-  if (languages) {
-    input_languages_ = languages;
+  if (languages && languages[0] != '\0') {
+    input_languages_ = languages;
+  } else {
+    input_languages_.clear();
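
A small standalone sketch of the clearing behavior this suggestion aims for. `LanguageState` is a hypothetical stand-in for the renderer's language state, not the real class.

```cpp
#include <string>

// Hypothetical stand-in for the renderer's language state.
class LanguageState {
 public:
  // A null or empty argument clears any previously stored value,
  // so a reused instance cannot emit stale metadata.
  void SetInputLanguages(const char *languages) {
    if (languages && languages[0] != '\0') {
      input_languages_ = languages;
    } else {
      input_languages_.clear();
    }
  }

  const std::string &input_languages() const { return input_languages_; }

 private:
  std::string input_languages_;
};
```

Calling `SetInputLanguages("eng+fra")` and then `SetInputLanguages(nullptr)` leaves the stored string empty, which is the unambiguous behavior the review comment asks for.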

Copilot · 2026-03-24T21:24:30Z


 private:
  bool font_info_; // whether to print font information
+  std::string input_languages_;


Adding std::string input_languages_ to TessHOcrRenderer changes the size/layout of this exported (TESS_API) C++ class, which can break binary compatibility for downstream code that links against libtesseract and instantiates TessHOcrRenderer. If ABI stability is a concern, consider storing the new state behind an indirection (pimpl/opaque pointer) or in a separate internal structure to minimize ABI impact.

Suggested change

-  std::string input_languages_;
+  // NOTE: Additional per-instance state (e.g. input languages) is stored
+  // out-of-line to avoid changing the ABI-visible layout of this class.
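
One way the pimpl indirection mentioned here could be sketched, assuming a hypothetical `RendererSketch` class rather than Tesseract's actual declaration. Note that introducing the pointer member is itself a one-time layout change; the benefit is that later state can be added to `Impl` without touching the exported class again.

```cpp
#include <memory>
#include <string>

// Hypothetical renderer-like class; not Tesseract's actual declaration.
class RendererSketch {
 public:
  RendererSketch();
  ~RendererSketch();  // defined where Impl is a complete type
  void SetInputLanguages(const char *languages);
  const std::string &InputLanguages() const;

 private:
  struct Impl;                  // opaque to users of the header
  std::unique_ptr<Impl> impl_;  // the only ABI-visible member
};

// --- the part below would live in the .cpp file ---
struct RendererSketch::Impl {
  std::string input_languages;  // further state can be added here
};

RendererSketch::RendererSketch() : impl_(new Impl) {}
RendererSketch::~RendererSketch() = default;

void RendererSketch::SetInputLanguages(const char *languages) {
  // Null is treated like an empty string, which clears the value.
  impl_->input_languages = languages ? languages : "";
}

const std::string &RendererSketch::InputLanguages() const {
  return impl_->input_languages;
}
```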

Copilot · 2026-03-24T21:24:30Z

@@ -496,8 +553,47 @@ bool TessHOcrRenderer::BeginDocumentHandler() {
  if (font_info_) {
    AppendString(" ocrp_font ocrp_fsize");
  }
+  AppendString("'/>\n");
+  if (!input_languages_.empty()) {
+    std::string langs;
+    std::string scripts;
+    const auto &iso_map = Iso639Map();
+    const auto &script_set = ScriptNames();
+    std::istringstream stream(input_languages_);
+    std::string token;
+    while (std::getline(stream, token, '+')) {
+      if (token.empty()) {


The new header metadata behavior (splitting/mapping languages and emitting new meta tags) isn't covered by existing unit tests. Since src/api/hocrrenderer.cpp already has test coverage for other hOCR functionality, it would be good to add a focused test that exercises TessHOcrRenderer::BeginDocumentHandler() output with (1) languages set and (2) languages unset, and asserts the exact meta tag(s) emitted (including proper escaping).
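
One possible shape for such a test, sketched with an in-memory stand-in instead of the real TessHOcrRenderer. `FakeHeaderRenderer`, `BeginDocument` and `HeaderMetaChecksPass` are assumptions made for illustration; a real test would go through the project's existing test framework and the actual renderer output.

```cpp
#include <string>

// Hypothetical stand-in that returns the language meta tag as a string.
class FakeHeaderRenderer {
 public:
  void SetInputLanguages(const char *languages) {
    input_languages_ = (languages != nullptr) ? languages : "";
  }

  // Returns only the language meta tag portion of the header,
  // or an empty string when no languages are set.
  std::string BeginDocument() const {
    if (input_languages_.empty()) {
      return "";
    }
    return "  <meta name='ocr-langs' content='" + input_languages_ +
           "' />\n";
  }

 private:
  std::string input_languages_;
};

// Returns true when both cases (languages unset / set) behave as expected.
bool HeaderMetaChecksPass() {
  FakeHeaderRenderer unset;
  FakeHeaderRenderer set;
  set.SetInputLanguages("en fr");
  return unset.BeginDocument().empty() &&
         set.BeginDocument() ==
             "  <meta name='ocr-langs' content='en fr' />\n";
}
```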

Copilot · 2026-03-24T21:24:30Z

+      AppendString("  <meta name='ocr-langs' content='");
+      AppendString(langs.c_str());
+      AppendString("' />\n");
+    }
+    if (!scripts.empty()) {
+      AppendString("  <meta name='ocr-scripts' content='");
+      AppendString(scripts.c_str());


The ocr-langs / ocr-scripts meta tag contents are derived from the user-controlled -l argument (via GetInitLanguagesAsString()), but they are appended without HTML/XML escaping. This allows malformed output or HTML injection if the language string contains characters like ', &, <, etc. Consider escaping with the existing tesseract::HOcrEscape() helper before emitting into the hOCR header.

Suggested change

-      AppendString("  <meta name='ocr-langs' content='");
-      AppendString(langs.c_str());
-      AppendString("' />\n");
-    }
-    if (!scripts.empty()) {
-      AppendString("  <meta name='ocr-scripts' content='");
-      AppendString(scripts.c_str());
+      std::string escaped_langs = HOcrEscape(langs.c_str());
+      AppendString("  <meta name='ocr-langs' content='");
+      AppendString(escaped_langs.c_str());
+      AppendString("' />\n");
+    }
+    if (!scripts.empty()) {
+      std::string escaped_scripts = HOcrEscape(scripts.c_str());
+      AppendString("  <meta name='ocr-scripts' content='");
+      AppendString(escaped_scripts.c_str());
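
To show why escaping matters here, the sketch below uses a simple stand-in escaper. `EscapeForAttribute` and `LangsMetaTag` are hypothetical names, and the set of characters handled is an assumption for illustration, not a description of what HOcrEscape() does.

```cpp
#include <string>

// Stand-in for an HTML attribute escaper (assumed character set).
std::string EscapeForAttribute(const std::string &text) {
  std::string out;
  for (char c : text) {
    switch (c) {
      case '&':  out += "&amp;";  break;
      case '<':  out += "&lt;";   break;
      case '>':  out += "&gt;";   break;
      case '"':  out += "&quot;"; break;
      case '\'': out += "&#39;";  break;
      default:   out += c;        break;
    }
  }
  return out;
}

// Builds the meta tag with the content escaped, so a stray quote in the
// language string cannot terminate the single-quoted attribute early.
std::string LangsMetaTag(const std::string &langs) {
  return "  <meta name='ocr-langs' content='" + EscapeForAttribute(langs) +
         "' />\n";
}
```

Without the escaping step, an input containing `'` would close the `content` attribute early and produce malformed header markup.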

Copilot · 2026-03-24T21:24:30Z

+  if (!input_languages_.empty()) {
+    std::string langs;
+    std::string scripts;
+    const auto &iso_map = Iso639Map();
+    const auto &script_set = ScriptNames();
+    std::istringstream stream(input_languages_);
+    std::string token;
+    while (std::getline(stream, token, '+')) {
+      if (token.empty()) {
+        continue;
+      }
+      if (script_set.count(token)) {
+        if (!scripts.empty()) {
+          scripts += ' ';
+        }
+        scripts += token;
+      } else {
+        if (!langs.empty()) {
+          langs += ' ';
+        }
+        auto it = iso_map.find(token);
+        if (it != iso_map.end()) {
+          langs += it->second;
+        } else {
+          langs += token;
+        }
+      }


This implementation transforms the -l string by splitting on +, converting some tokens (e.g. eng->en) and joining with spaces, so the emitted <meta name='ocr-langs' ...> does not preserve the original -l value (e.g. eng+fra). That differs from the PR description/issue request and also makes ocr-langs inconsistent with the existing lang='eng' attributes emitted elsewhere in hOCR. Please either emit the raw -l value (escaped) or update the approach/spec and ensure consistency across the document.

mvanhorn · 2026-03-25T02:10:18Z

Good point about compound model names. The current implementation does exact lookup, so deu_latf passes through unmapped instead of resolving to de.

I'll add prefix-based fallback: strip trailing _suffix parts progressively and retry shorter keys. deu_latf -> try deu -> "de". chi_tra_vert -> try chi_tra -> "zh".

For truly arbitrary names like german_print where no prefix matches, they pass through as-is. The spec doesn't define a mapping for custom model names, and guessing would be worse than emitting the raw identifier.
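
A rough sketch of the fallback described above, with a deliberately small table. `ResolveIso639` is a hypothetical name, and the real change may differ in details such as how script names are checked first.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: try the full model name first, then progressively
// strip trailing "_suffix" parts until a key matches or nothing is left.
std::string ResolveIso639(const std::string &model) {
  // Tiny illustrative table; chi_tra -> zh follows the example above.
  static const std::unordered_map<std::string, std::string> iso_map{
      {"deu", "de"}, {"eng", "en"}, {"chi_sim", "zh"}, {"chi_tra", "zh"}};

  std::string key = model;
  while (true) {
    auto it = iso_map.find(key);
    if (it != iso_map.end()) {
      return it->second;
    }
    const std::string::size_type pos = key.rfind('_');
    if (pos == std::string::npos) {
      break;  // no more suffixes to strip
    }
    key.erase(pos);  // drop the last "_suffix" and retry
  }
  return model;  // no prefix matched: emit the raw identifier unchanged
}
```

Under these assumptions, `deu_latf` resolves to `de`, `chi_tra_vert` resolves to `zh`, and `german_print` is returned as-is because neither `german_print` nor `german` is in the table.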

Also addressing Copilot's suggestions in the same push:

Switch from new to function-local static const for Iso639Map/ScriptNames
Clear input_languages_ on nullptr
Add HOcrEscape() for meta tag content
Add unit tests

Re ABI: the new std::string member does change class layout. Happy to move it behind an opaque pointer if that's a concern - let me know.

- Add prefix-based fallback for compound model names (deu_latf -> de, chi_tra_vert -> zh) by progressively stripping _suffix parts
- Switch Iso639Map/ScriptNames from heap-allocated leak to function-local static const (Copilot suggestion)
- Clear input_languages_ on nullptr/empty to prevent stale metadata
- Add HOcrEscape for ocr-langs/ocr-scripts meta tag content to prevent HTML injection from untrusted language strings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mvanhorn · 2026-03-25T02:21:25Z

Addressed in 43706b8:

Compound model names now resolve via prefix fallback: deu_latf -> try deu -> "de", chi_tra_vert -> try chi_tra -> "zh". Arbitrary names like german_print pass through as-is since no ISO 639-1 mapping exists for custom models.
Switched Iso639Map/ScriptNames from heap-allocated new to function-local static const
SetInputLanguages(nullptr) now clears stale metadata
Added HOcrEscape() for meta tag content

Still need to add unit tests - will push those next.

mvanhorn · 2026-03-27T21:12:58Z

Addressed the Copilot findings. Items 1-3 and 6 were already fixed in 43706b8 (Copilot ran against the prior revision):

Static function-local maps (not new/leak), items 1-2 - already done
SetInputLanguages(nullptr) clears the string, item 3 - already done
HOcrEscape() applied to meta tag content, item 6 - already done

On the remaining items:

ABI (item 4): Fair point. input_languages_ adds a std::string member to an exported class. If tesseract guarantees ABI stability across patch releases, this should go behind a pimpl. Happy to refactor if maintainers prefer that approach.

Tests (item 5): Will add a test exercising BeginDocumentHandler() with and without languages set.

Raw value vs ISO 639-1 (item 7): The hOCR spec requires ISO 639-1 codes in ocr-langs (per the spec @amitdo linked). @stweil and @amitdo both confirmed this direction. The conversion is intentional and spec-compliant.

mvanhorn mentioned this pull request Mar 20, 2026

write languages parameter value to hocr output file #4455

Open

egorpugin approved these changes Mar 20, 2026

stweil requested changes Mar 23, 2026

fix: use ISO 639-1 codes and separate script models in hOCR metadata

fbec60b

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stweil requested a review from Copilot March 24, 2026 21:19

Copilot started reviewing on behalf of stweil March 24, 2026 21:20 View session

Copilot AI reviewed Mar 24, 2026
