Add languages parameter to hOCR output metadata #4531

Open
mvanhorn wants to merge 3 commits into tesseract-ocr:main from mvanhorn:osc/4455-hocr-languages-meta

@mvanhorn

Summary

Adds an ocr-langs meta tag to the hOCR HTML header containing the -l languages parameter (e.g. eng+fra).

Problem

When processing images with tesseract -l eng+fra image.tif output hocr, the hOCR output includes ocr-system and ocr-capabilities meta tags but not the languages used for recognition. Downstream tools processing hOCR files have no way to determine which language models produced the results. Requested in #4455, related to #4046.

Changes

  • include/tesseract/renderer.h: Added SetInputLanguages() method and input_languages_ member to TessHOcrRenderer
  • src/api/hocrrenderer.cpp: Implemented SetInputLanguages(); BeginDocumentHandler() now emits <meta name='ocr-langs' content='...' /> when languages are set
  • src/tesseract.cpp: Call renderer->SetInputLanguages(api.GetInitLanguagesAsString()) when creating the hOCR renderer (line 519)

The existing GetInitLanguagesAsString() API (baseapi.cpp:372) already returns the -l value - it just wasn't being passed to the hOCR renderer.

Example output with -l eng+fra:

<meta name='ocr-system' content='tesseract 5.5.1' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ...'/>
<meta name='ocr-langs' content='eng+fra' />

Testing

The meta tag is only emitted when SetInputLanguages() is called with a non-null, non-empty string. The C API (TessHOcrRendererCreate/TessHOcrRendererCreate2) is unchanged - the meta tag won't appear unless SetInputLanguages is explicitly called.
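
As a rough illustration of the rule above (not the code in this PR), the emission logic of the first revision can be sketched as a free function. BuildLangsMeta is a hypothetical name used only here; the actual renderer appends to its output with AppendString() inside BeginDocumentHandler(), and this sketch writes the raw -l value without any conversion or escaping.

```cpp
#include <string>

// Hypothetical stand-in for the header-building step of the renderer.
// Returns the meta tag line, or an empty string when nothing should be emitted.
std::string BuildLangsMeta(const char *languages) {
  // A null or empty language string produces no tag at all.
  if (languages == nullptr || languages[0] == '\0') {
    return "";
  }
  return std::string(" <meta name='ocr-langs' content='") + languages +
         "' />\n";
}
```

Callers that never set a language string (for example through the unchanged C API) would get an empty result, so the header stays exactly as before.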

Fixes #4455

This contribution was developed with AI assistance (Claude Code).

Adds an 'ocr-langs' meta tag to the hOCR header containing the
languages string passed via -l (e.g. "eng+fra"). This lets
downstream tools know which language models were used for OCR.

The implementation adds SetInputLanguages() to TessHOcrRenderer,
called from the CLI before processing begins. The meta tag is
only emitted when languages are set.

Fixes tesseract-ocr#4455
@amitdo
Collaborator

amitdo commented Mar 23, 2026

Member

@stweil left a comment

hOCR expects ISO 639-1 codes, so this pull request requires more work for languages.

In addition, Tesseract's -l parameter also accepts script models, which are not language-specific and require different handling.

@amitdo
Collaborator

amitdo commented Mar 23, 2026

Your implementation is currently not compatible with the spec.

  • "eng" -> "en"
  • "fra" -> "fr"
  • "eng+fra" -> "en fr"

You also need to take into account the script models, e.g. Latin, Greek, Hebrew.

@amitdo
Collaborator

amitdo commented Mar 23, 2026

@stweil, I didn't see your comment. I wrote basically the same thing at roughly the same time... :-)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mvanhorn
Author

Pushed a fix. Languages are now converted to ISO 639-1 codes (space-separated), and script models emit a separate ocr-scripts meta tag. Covers 90+ language mappings and all 37 script models from tessdata_best. Codes without an ISO 639-1 equivalent pass through as-is.

Example with -l eng+fra+Latin:

<meta name='ocr-langs' content='en fr' />
<meta name='ocr-scripts' content='Latin' />
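
The conversion described above can be sketched in isolation roughly as follows. SplitLanguages and its tiny lookup tables are assumptions made for illustration; the tables in the PR are much larger, and the real code runs inside the renderer rather than as a free function.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>

// Hypothetical helper: splits a "-l" style string on '+' and returns
// {langs, scripts}, each as a space-separated list.
std::pair<std::string, std::string> SplitLanguages(const std::string &input) {
  // Deliberately small tables; only a few entries are shown.
  static const std::unordered_map<std::string, std::string> iso_map{
      {"eng", "en"}, {"fra", "fr"}, {"deu", "de"}};
  static const std::unordered_set<std::string> script_set{"Latin", "Greek",
                                                          "Hebrew"};
  std::string langs;
  std::string scripts;
  std::istringstream stream(input);
  std::string token;
  while (std::getline(stream, token, '+')) {
    if (token.empty()) {
      continue;  // tolerate "eng++fra"
    }
    const bool is_script = script_set.count(token) > 0;
    std::string &out = is_script ? scripts : langs;
    if (!out.empty()) {
      out += ' ';
    }
    if (is_script) {
      out += token;
    } else {
      // Unknown codes pass through unchanged.
      auto it = iso_map.find(token);
      out += (it != iso_map.end()) ? it->second : token;
    }
  }
  return {langs, scripts};
}
```

With the input eng+fra+Latin this yields "en fr" for the language list and "Latin" for the script list, matching the two meta tags shown above.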

@stweil
Member

stweil commented Mar 24, 2026

@copilot, please review this pull request. It currently does not handle more complex model names such as deu_latf or chi_tra_vert. Please also suggest a better solution for arbitrary model names such as german_print.

Copilot AI left a comment

Pull request overview

Adds language metadata to hOCR output so downstream consumers can determine which recognition languages were used.

Changes:

  • Adds TessHOcrRenderer::SetInputLanguages() and stores the input language string in the renderer.
  • Emits new <meta name='ocr-langs' ...> (and currently also ocr-scripts) tags in the hOCR document header.
  • Wires the CLI path to pass TessBaseAPI::GetInitLanguagesAsString() into the hOCR renderer.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File | Description
include/tesseract/renderer.h | Extends the public hOCR renderer API with a setter and new stored state.
src/api/hocrrenderer.cpp | Implements language handling and emits new hOCR header meta tags.
src/tesseract.cpp | Passes initialized languages into the hOCR renderer in the CLI renderer setup.

Comment thread src/api/hocrrenderer.cpp
Comment on lines +489 to +492
static const std::unordered_map<std::string, std::string> &Iso639Map() {
  static const auto *map = new std::unordered_map<std::string, std::string>{
      {"afr", "af"}, {"amh", "am"}, {"ara", "ar"}, {"asm", "as"},
      {"aze", "az"}, {"bel", "be"}, {"ben", "bn"}, {"bod", "bo"},

Copilot AI Mar 24, 2026

Iso639Map() allocates the map with new and intentionally leaks it for the lifetime of the process. Unless there is a specific reason to avoid static destruction here, prefer a function-local static const std::unordered_map<...> value to avoid heap allocation/leaks and simplify the code.
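
A minimal sketch of the suggested alternative, assuming only the eight entries visible in the quoted snippet (the full table in the PR is longer):

```cpp
#include <string>
#include <unordered_map>

const std::unordered_map<std::string, std::string> &Iso639Map() {
  // Constructed once on first call (initialization of function-local statics
  // is thread-safe since C++11) and destroyed at normal program exit, so
  // there is no explicit heap allocation and nothing is leaked.
  static const std::unordered_map<std::string, std::string> map{
      {"afr", "af"}, {"amh", "am"}, {"ara", "ar"}, {"asm", "as"},
      {"aze", "az"}, {"bel", "be"}, {"ben", "bn"}, {"bod", "bo"},
  };
  return map;
}
```

The trade-off is that the map now has a destructor that runs during static destruction; that only matters if other code might still call Iso639Map() that late in shutdown.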

Comment thread src/api/hocrrenderer.cpp
Comment on lines +524 to +527
static const std::unordered_set<std::string> &ScriptNames() {
  static const auto *set = new std::unordered_set<std::string>{
      "Arabic", "Armenian", "Bengali", "Canadian_Aboriginal", "Cherokee",
      "Cyrillic", "Devanagari", "Ethiopic", "Fraktur", "Georgian", "Greek",

Copilot AI Mar 24, 2026

ScriptNames() also uses new for a function-local static and leaks it. Prefer a function-local static const std::unordered_set<std::string> object unless there's a measured reason to avoid its destructor at shutdown.

Comment thread src/api/hocrrenderer.cpp Outdated
Comment on lines +484 to +485
if (languages) {
input_languages_ = languages;

Copilot AI Mar 24, 2026

SetInputLanguages(nullptr) currently leaves any previously set value intact, which can produce stale ocr-langs metadata if a renderer instance is reused and the caller tries to clear the languages. Consider explicitly clearing input_languages_ when languages == nullptr (or provide a ClearInputLanguages()), so the behavior is unambiguous.

Suggested change
-  if (languages) {
-    input_languages_ = languages;
+  if (languages && languages[0] != '\0') {
+    input_languages_ = languages;
+  } else {
+    input_languages_.clear();


private:
bool font_info_; // whether to print font information
std::string input_languages_;

Copilot AI Mar 24, 2026

Adding std::string input_languages_ to TessHOcrRenderer changes the size/layout of this exported (TESS_API) C++ class, which can break binary compatibility for downstream code that links against libtesseract and instantiates TessHOcrRenderer. If ABI stability is a concern, consider storing the new state behind an indirection (pimpl/opaque pointer) or in a separate internal structure to minimize ABI impact.

Suggested change
-  std::string input_languages_;
+  // NOTE: Additional per-instance state (e.g. input languages) is stored
+  // out-of-line to avoid changing the ABI-visible layout of this class.
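
For readers unfamiliar with the pimpl idiom mentioned here, a rough sketch follows. Renderer is a hypothetical class, not the actual TessHOcrRenderer declaration, and all member names are assumptions. Note that introducing the pointer member still changes the class layout once; the benefit is that later additions to Impl no longer do.

```cpp
#include <memory>
#include <string>

// Hypothetical renderer-like class, shown only to illustrate the idiom.
class Renderer {
 public:
  Renderer();
  ~Renderer();  // defined below, where Impl is a complete type
  void SetInputLanguages(const char *languages);
  const std::string &InputLanguages() const;

 private:
  struct Impl;                  // opaque to users of the header
  std::unique_ptr<Impl> impl_;  // the only ABI-visible data member
};

// Everything below would normally live in the .cpp file.
struct Renderer::Impl {
  std::string input_languages;  // new state can be added here freely
};

Renderer::Renderer() : impl_(new Impl) {}
Renderer::~Renderer() = default;

void Renderer::SetInputLanguages(const char *languages) {
  // A null pointer clears the stored value.
  impl_->input_languages = languages ? languages : "";
}

const std::string &Renderer::InputLanguages() const {
  return impl_->input_languages;
}
```

Whether this indirection is worth it depends on the ABI guarantees the project wants to give, which is a maintainer decision rather than something the sketch can settle.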

Comment thread src/api/hocrrenderer.cpp
Comment on lines 537 to +565
@@ -496,8 +553,47 @@ bool TessHOcrRenderer::BeginDocumentHandler() {
  if (font_info_) {
    AppendString(" ocrp_font ocrp_fsize");
  }
  AppendString("'/>\n");
  if (!input_languages_.empty()) {
    std::string langs;
    std::string scripts;
    const auto &iso_map = Iso639Map();
    const auto &script_set = ScriptNames();
    std::istringstream stream(input_languages_);
    std::string token;
    while (std::getline(stream, token, '+')) {
      if (token.empty()) {

Copilot AI Mar 24, 2026

The new header metadata behavior (splitting/mapping languages and emitting new meta tags) isn't covered by existing unit tests. Since src/api/hocrrenderer.cpp already has test coverage for other hOCR functionality, it would be good to add a focused test that exercises TessHOcrRenderer::BeginDocumentHandler() output with (1) languages set and (2) languages unset, and asserts the exact meta tag(s) emitted (including proper escaping).

Comment thread src/api/hocrrenderer.cpp Outdated
Comment on lines +586 to +592
AppendString(" <meta name='ocr-langs' content='");
AppendString(langs.c_str());
AppendString("' />\n");
}
if (!scripts.empty()) {
AppendString(" <meta name='ocr-scripts' content='");
AppendString(scripts.c_str());

Copilot AI Mar 24, 2026

The ocr-langs / ocr-scripts meta tag contents are derived from the user-controlled -l argument (via GetInitLanguagesAsString()), but they are appended without HTML/XML escaping. This allows malformed output or HTML injection if the language string contains characters like ', &, <, etc. Consider escaping with the existing tesseract::HOcrEscape() helper before emitting into the hOCR header.

Suggested change
-    AppendString(" <meta name='ocr-langs' content='");
-    AppendString(langs.c_str());
-    AppendString("' />\n");
-  }
-  if (!scripts.empty()) {
-    AppendString(" <meta name='ocr-scripts' content='");
-    AppendString(scripts.c_str());
+    std::string escaped_langs = HOcrEscape(langs.c_str());
+    AppendString(" <meta name='ocr-langs' content='");
+    AppendString(escaped_langs.c_str());
+    AppendString("' />\n");
+  }
+  if (!scripts.empty()) {
+    std::string escaped_scripts = HOcrEscape(scripts.c_str());
+    AppendString(" <meta name='ocr-scripts' content='");
+    AppendString(escaped_scripts.c_str());
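
To make the injection concern concrete, here is a small self-contained sketch of attribute escaping. EscapeAttribute is a hypothetical stand-in written for this example; the real tesseract::HOcrEscape() helper may differ in exactly which characters it rewrites and how.

```cpp
#include <string>

// Hypothetical stand-in for an hOCR/HTML escaping helper.
// Replaces the characters that could break out of a quoted attribute value.
std::string EscapeAttribute(const std::string &text) {
  std::string out;
  for (char c : text) {
    switch (c) {
      case '&':
        out += "&amp;";
        break;
      case '<':
        out += "&lt;";
        break;
      case '>':
        out += "&gt;";
        break;
      case '"':
        out += "&quot;";
        break;
      case '\'':
        out += "&#39;";
        break;
      default:
        out += c;
    }
  }
  return out;
}
```

An ordinary value such as eng passes through unchanged, while a value containing a single quote can no longer terminate the content='...' attribute early.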

Comment thread src/api/hocrrenderer.cpp
Comment on lines +557 to +583
  if (!input_languages_.empty()) {
    std::string langs;
    std::string scripts;
    const auto &iso_map = Iso639Map();
    const auto &script_set = ScriptNames();
    std::istringstream stream(input_languages_);
    std::string token;
    while (std::getline(stream, token, '+')) {
      if (token.empty()) {
        continue;
      }
      if (script_set.count(token)) {
        if (!scripts.empty()) {
          scripts += ' ';
        }
        scripts += token;
      } else {
        if (!langs.empty()) {
          langs += ' ';
        }
        auto it = iso_map.find(token);
        if (it != iso_map.end()) {
          langs += it->second;
        } else {
          langs += token;
        }
      }

Copilot AI Mar 24, 2026

This implementation transforms the -l string by splitting on +, converting some tokens (e.g. eng->en) and joining with spaces, so the emitted <meta name='ocr-langs' ...> does not preserve the original -l value (e.g. eng+fra). That differs from the PR description/issue request and also makes ocr-langs inconsistent with the existing lang='eng' attributes emitted elsewhere in hOCR. Please either emit the raw -l value (escaped) or update the approach/spec and ensure consistency across the document.

@mvanhorn
Author

Good point about compound model names. The current implementation does exact lookup, so deu_latf passes through unmapped instead of resolving to de.

I'll add prefix-based fallback: strip trailing _suffix parts progressively and retry shorter keys. deu_latf -> try deu -> "de". chi_tra_vert -> try chi_tra -> "zh".

For truly arbitrary names like german_print where no prefix matches, they pass through as-is. The spec doesn't define a mapping for custom model names, and guessing would be worse than emitting the raw identifier.
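
The fallback described above can be sketched roughly like this. ResolveLanguage is a hypothetical name, and the map holds only the entries named in this thread; the PR's own table and function layout may differ.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: maps a Tesseract model name to an ISO 639-1 code by
// trying the full name first, then progressively shorter "_"-separated
// prefixes. Names with no matching prefix are returned unchanged.
std::string ResolveLanguage(const std::string &model) {
  static const std::unordered_map<std::string, std::string> iso_map{
      {"eng", "en"}, {"fra", "fr"}, {"deu", "de"}, {"chi_tra", "zh"}};
  std::string key = model;
  while (true) {
    auto it = iso_map.find(key);
    if (it != iso_map.end()) {
      return it->second;
    }
    const auto pos = key.rfind('_');
    if (pos == std::string::npos) {
      break;  // nothing left to strip
    }
    key.erase(pos);  // drop the last "_suffix" and retry
  }
  return model;  // e.g. "german_print": emit the raw identifier
}
```

Under these assumptions deu_latf resolves to "de", chi_tra_vert resolves to "zh" via the chi_tra prefix, and german_print comes back unchanged because neither german_print nor german is in the table.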

Also addressing Copilot's suggestions in the same push:

  • Switch from new to function-local static const for Iso639Map/ScriptNames
  • Clear input_languages_ on nullptr
  • Add HOcrEscape() for meta tag content
  • Add unit tests

Re ABI: the new std::string member does change class layout. Happy to move it behind an opaque pointer if that's a concern - let me know.

- Add prefix-based fallback for compound model names (deu_latf -> de,
  chi_tra_vert -> zh) by progressively stripping _suffix parts
- Switch Iso639Map/ScriptNames from heap-allocated leak to function-local
  static const (Copilot suggestion)
- Clear input_languages_ on nullptr/empty to prevent stale metadata
- Add HOcrEscape for ocr-langs/ocr-scripts meta tag content to prevent
  HTML injection from untrusted language strings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mvanhorn
Author

Addressed in 43706b8:

  • Compound model names now resolve via prefix fallback: deu_latf -> try deu -> "de", chi_tra_vert -> try chi_tra -> "zh". Arbitrary names like german_print pass through as-is since no ISO 639-1 mapping exists for custom models.
  • Switched Iso639Map/ScriptNames from heap-allocated new to function-local static const
  • SetInputLanguages(nullptr) now clears stale metadata
  • Added HOcrEscape() for meta tag content

Still need to add unit tests - will push those next.

@mvanhorn
Author

Addressed the Copilot findings. Items 1-3 and 6 were already fixed in 43706b8 (Copilot ran against the prior revision):

  • Static function-local maps (not new/leak) - already done
  • SetInputLanguages(nullptr) clears the string - already done
  • HOcrEscape() applied to meta tag content - already done

On the remaining items:

ABI (item 4): Fair point. input_languages_ adds a std::string member to an exported class. If tesseract guarantees ABI stability across patch releases, this should go behind a pimpl. Happy to refactor if maintainers prefer that approach.

Tests (item 5): Will add a test exercising BeginDocumentHandler() with and without languages set.

Raw value vs ISO 639-1 (item 7): The hOCR spec requires ISO 639-1 codes in ocr-langs (per the spec @amitdo linked). @stweil and @amitdo both confirmed this direction. The conversion is intentional and spec-compliant.

Development

Successfully merging this pull request may close these issues.

write languages parameter value to hocr output file

5 participants