doc: Add comprehensive PARAMETERS section to the tesseract man page #4526
Changes from all commits
da5ea05
abafead
f270828
7e49f71
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -406,6 +406,318 @@ one per line. The format of the latter is documented in 'dict/trie.h' | |
| on 'read_pattern_list()'. | ||
|
|
||
|
|
||
| [[PARAMETERS]] | ||
| PARAMETERS | ||
| ---------- | ||
|
|
||
| Tesseract parameters control the behaviour of the OCR engine and can be set | ||
| using the *-c* option (e.g. `-c tessedit_char_whitelist=0123456789`) or by | ||
| placing them in a <<CONFIGFILE,'CONFIGFILE'>>. | ||
| The example above restricts recognized characters to digits only. | ||
| Run *--print-parameters* to list all available parameters with their current | ||
| values and short descriptions. | ||
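As an illustration of the two ways of setting parameters described above, here is a minimal Python sketch. The helper names `build_command` and `render_config` are hypothetical, and the config-file layout (one `name value` pair per line) is an assumption based on the config files shipped with Tesseract. The code only assembles the argument list and file contents; it does not run Tesseract.

```python
def build_command(image, output_base, params=None, configs=()):
    """Assemble a tesseract argument list (hypothetical helper).

    Each parameter is passed as a separate "-c name=value" pair,
    followed by any config file names.
    """
    cmd = ["tesseract", image, output_base]
    for name, value in (params or {}).items():
        cmd += ["-c", f"{name}={value}"]
    cmd += list(configs)
    return cmd


def render_config(params):
    """Render parameters as config-file text, one "name value" per line.

    The line format is an assumption, not taken from the man page text.
    """
    return "".join(f"{name} {value}\n" for name, value in params.items())


if __name__ == "__main__":
    print(build_command("page.png", "out",
                        {"tessedit_char_whitelist": "0123456789"}))
    print(render_config({"tessedit_create_hocr": 1}), end="")
```

The resulting list can be handed to `subprocess.run` unchanged, which avoids shell quoting problems when a parameter value contains spaces or punctuation.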
|
|
||
| The engine column indicates which OCR engine modes support a parameter: + | ||
| *Both* -- works with both the LSTM and Legacy engines; + | ||
| *LSTM* -- only applies to the neural-network LSTM engine (OEM 1 or 2); + | ||
| *Legacy* -- only applies to the legacy Tesseract engine (OEM 0 or 2). | ||
|
|
||
| OUTPUT FORMAT PARAMETERS | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| *tessedit_create_txt* (bool, default: 0) [Both]:: | ||
| Write plain-text output to a `.txt` file. This is the default output format | ||
| when no config file or *-c* option overrides it. | ||
|
|
||
| *tessedit_create_hocr* (bool, default: 0) [Both]:: | ||
| Write hOCR output to a `.hocr` file. hOCR is an HTML-based format that | ||
| encodes the OCR results together with their bounding boxes and confidences. | ||
| Use the *hocr* config file to enable this format. | ||
|
Having a link here to the specification may be useful.
||
| See the hOCR specification at https://kba.github.io/hocr-spec/1.2/ for | ||
| details of the output format. | ||
|
|
||
| *hocr_font_info* (bool, default: 0) [Both]:: | ||
| Add per-word font metadata to hOCR output. When enabled, each word span | ||
| includes `x_font` (font name) and `x_fsize` (point size) attributes in its | ||
| `title` field. Font information is derived from the recognition results and | ||
| is most reliable with the Legacy OCR engine; the LSTM engine may produce | ||
| less precise font names. | ||
|
|
||
| *hocr_char_boxes* (bool, default: 0) [Both]:: | ||
| Add per-character bounding-box coordinates to hOCR output as `ocrx_cinfo` | ||
| spans. Note that character-level bounding boxes may be less precise when | ||
| using the LSTM engine, because LSTM operates at the line level and | ||
| character positions are approximated from the line result. | ||
| See also *lstm_choice_mode* for LSTM-specific character alternative output. | ||
|
|
||
| *tessedit_create_alto* (bool, default: 0) [Both]:: | ||
| Write ALTO XML output to a `.xml` file. ALTO (Analyzed Layout and Text | ||
| Object) is a standard XML schema for describing the layout and content of | ||
| pages. Use the *alto* config file to enable this format. | ||
|
|
||
| *tessedit_create_page_xml* (bool, default: 0) [Both]:: | ||
| Write PAGE XML output to a `.page.xml` file. PAGE (Page Analysis and | ||
| Ground Truth Elements) is a standard XML format widely used in digital | ||
| humanities projects, library and archive workflows, and document annotation | ||
| tools such as Transkribus and eScriptorium. | ||
| See https://github.com/PRImA-Research-Lab/PAGE-XML for the specification. | ||
| Use the *page* config file to enable this format. | ||
|
|
||
| *page_xml_polygon* (bool, default: 1) [Both]:: | ||
| When writing PAGE XML output, create polygon outlines around text regions | ||
| instead of simple bounding boxes. | ||
|
|
||
| *page_xml_level* (int, default: 0) [Both]:: | ||
| Granularity of PAGE XML output: 0 = line level, 1 = word level. | ||
|
|
||
| *tessedit_create_tsv* (bool, default: 0) [Both]:: | ||
| Write tab-separated-values output to a `.tsv` file. Each recognized word is | ||
| output as one row with its bounding box, confidence and text. Use the *tsv* | ||
| config file to enable this format. | ||
|
|
||
| *tessedit_create_pdf* (bool, default: 0) [Both]:: | ||
| Write a searchable PDF to a `.pdf` file. The PDF contains the original image | ||
| with an invisible text layer for copy-paste and searching. Use the *pdf* | ||
| config file to enable this format. | ||
|
|
||
| *textonly_pdf* (bool, default: 0) [Both]:: | ||
| Write a text-only PDF (no image, only invisible text) to a `.pdf` file. | ||
|
|
||
| *tessedit_create_boxfile* (bool, default: 0) [Both]:: | ||
| Write a Tesseract box file (`.box`) that lists each recognized character with | ||
| its bounding box, one per line. Can be produced by either engine. These | ||
| files are primarily used as ground truth for legacy engine training. | ||
|
|
||
| *tessedit_create_wordstrbox* (bool, default: 0) [Both]:: | ||
| Write a WordStr-format box file (`.box`). Similar to *tessedit_create_boxfile* | ||
| but records whole words instead of individual characters. | ||
|
|
||
| *tessedit_create_lstmbox* (bool, default: 0) [LSTM]:: | ||
| Write an LSTM box file (`.box`) suitable for LSTM training. | ||
|
|
||
| *preserve_interword_spaces* (bool, default: 0) [Both]:: | ||
| Preserve multiple consecutive inter-word spaces in the output instead of | ||
| collapsing them to a single space. | ||
|
|
||
| CHARACTER SET PARAMETERS | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| *tessedit_char_whitelist* (string, default: "") [Both]:: | ||
| Restrict the set of characters that Tesseract will recognize to only those | ||
| listed in this string (allowlist). For example, setting this to `0123456789` | ||
| will make Tesseract return only digits. An empty value (the default) means | ||
| all characters in the trained data are allowed. | ||
|
|
||
| *tessedit_char_blacklist* (string, default: "") [Both]:: | ||
| Prevent Tesseract from recognizing the characters listed in this string. | ||
| Characters in this list will never appear in the output. This exclusion list | ||
| is applied after the allowlist (*tessedit_char_whitelist*). | ||
|
|
||
| *tessedit_char_unblacklist* (string, default: "") [Both]:: | ||
|
Is this a legacy parameter? More about the originals multiple passes. Bit odd having an unblacklist but that could be me and my thought process.
Member
@copilot, please try to avoid "blacklist" and "whitelist" anywhere in the documentation as far as possible, but of course not for the name of parameters. Use more inclusive language.
Contributor
Author
Done in commit 7e49f71. The descriptive text now uses "allowlist" and "exclusion list" instead of "whitelist"/"blacklist". All three parameter names (
Is it worth deprecating these terms and having the same functionality tied to tessedit_chars_allowed? That way the terms still work, so it's not a breaking change, but the documentation surfaces better language. Feels like this would address your comment better.
||
| Re-allow specific characters that were excluded by *tessedit_char_blacklist*. | ||
| Characters in this list override the exclusion list and apply to both the | ||
| LSTM and Legacy engines. | ||
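The interaction of the three character-set parameters can be modelled with plain set operations. The sketch below is a simplified model of the behaviour described above, not Tesseract's implementation; the function name `effective_charset` and the idea of passing the model's character set as a string are assumptions made for illustration.

```python
def effective_charset(model_chars, allow="", exclude="", unexclude=""):
    """Model how the allowlist, exclusion list and un-exclusion list combine.

    model_chars: every character the trained data can produce (assumed input).
    allow:       value of tessedit_char_whitelist ("" means no restriction).
    exclude:     value of tessedit_char_blacklist.
    unexclude:   value of tessedit_char_unblacklist.
    """
    model = set(model_chars)
    # An empty allowlist leaves every character of the model enabled.
    enabled = set(allow) & model if allow else set(model)
    # The exclusion list is applied after the allowlist.
    enabled -= set(exclude)
    # Un-excluded characters are re-enabled, overriding the exclusion list.
    enabled |= set(unexclude) & model
    return enabled


if __name__ == "__main__":
    chars = effective_charset("0123456789abc",
                              allow="0123456789", exclude="01", unexclude="1")
    print("".join(sorted(chars)))
```

In this example the allowlist keeps the ten digits, the exclusion list removes `0` and `1`, and the un-exclusion list brings `1` back.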
|
|
||
| IMAGE PROCESSING PARAMETERS | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| *thresholding_method* (int, default: 0) [Both]:: | ||
| Select the algorithm used to convert a greyscale image to binary before OCR: | ||
| 0 = Otsu global thresholding (default); | ||
| 1 = LeptonicaOtsu (tiled Otsu, better for uneven lighting); | ||
| 2 = Sauvola local adaptive thresholding (best for heavily degraded documents). | ||
|
|
||
| *thresholding_window_size* (double, default: 0.33) [Both]:: | ||
| Window size (multiplied by image DPI) used to compute local statistics for | ||
| the Sauvola thresholding method (*thresholding_method* = 2). | ||
|
|
||
| *thresholding_kfactor* (double, default: 0.34) [Both]:: | ||
| Sensitivity factor for Sauvola thresholding (*thresholding_method* = 2). | ||
| Controls how much the local variance reduces the threshold. Typical range: | ||
| 0.2 -- 0.5. Higher values produce more aggressive thresholding. | ||
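To make the role of the sensitivity factor concrete, the following sketch evaluates the textbook Sauvola formula T = m * (1 + k * (s / R - 1)) for a single window, where m and s are the local mean and standard deviation. The dynamic range R = 128 is the value commonly used for 8-bit images and is an assumption here, as is the function name; the exact computation inside Tesseract and Leptonica may differ in detail.

```python
def sauvola_threshold(mean, stddev, k=0.34, r=128.0):
    """Textbook Sauvola threshold for one local window.

    mean, stddev: local grey-level statistics of the window.
    k:            sensitivity factor (thresholding_kfactor).
    r:            assumed dynamic range of the standard deviation.
    """
    return mean * (1.0 + k * (stddev / r - 1.0))


if __name__ == "__main__":
    # Flat background (no local contrast): the threshold drops well below
    # the mean, so background pixels are not mistaken for ink.
    print(sauvola_threshold(mean=200.0, stddev=0.0))
    # High-contrast region: the threshold stays close to the local mean.
    print(sauvola_threshold(mean=200.0, stddev=100.0))
```

For a fixed window, a larger `k` lowers the threshold further in low-contrast areas, so fewer pixels there end up classified as text.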
|
|
||
| *thresholding_tile_size* (double, default: 0.33) [Both]:: | ||
| Desired tile size (multiplied by image DPI) for the LeptonicaOtsu tiled | ||
| thresholding method (*thresholding_method* = 1). | ||
|
|
||
| *thresholding_smooth_kernel_size* (double, default: 0) [Both]:: | ||
| Kernel size for smoothing the threshold array produced by LeptonicaOtsu | ||
| (*thresholding_method* = 1). Use 0 for no smoothing. | ||
|
|
||
| *thresholding_score_fraction* (double, default: 0.1) [Both]:: | ||
| Fraction of the maximum Otsu score used by LeptonicaOtsu | ||
| (*thresholding_method* = 1). Use 0.0 for standard Otsu behaviour; | ||
| 0.1 is recommended for better robustness. | ||
|
|
||
| *tessedit_do_invert* (bool, default: 1) [Both]:: | ||
| Deprecated -- will be removed in a future release. When enabled, Tesseract | ||
| tries OCR on an inverted (white-on-black) copy of lines whose mean confidence | ||
| falls below *invert_threshold* and keeps the result with higher confidence. | ||
| To disable automatic inversion, set *invert_threshold* = 0 rather than | ||
| setting this parameter to 0. | ||
|
|
||
| *invert_threshold* (double, default: 0.7) [Both]:: | ||
| Mean confidence threshold below which Tesseract will also attempt OCR on | ||
| the inverted image. Lower values make inversion less likely. Set to 0 to | ||
| disable automatic inversion entirely (preferred over setting | ||
| *tessedit_do_invert* = 0, which is deprecated). | ||
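The decision rule described for *invert_threshold* can be sketched as follows. The callables `recognize` and `invert` are hypothetical stand-ins for the engine internals, and the sketch assumes confidences on a 0 to 1 scale; it illustrates the documented behaviour only.

```python
def recognize_with_inversion(line, recognize, invert, invert_threshold=0.7):
    """Retry a text line on its inverted image when confidence is low.

    recognize(line) -> (text, mean_confidence)   # hypothetical callable
    invert(line)    -> inverted line image       # hypothetical callable
    """
    text, conf = recognize(line)
    # Confident enough, or inversion disabled with a threshold of 0.
    if conf >= invert_threshold:
        return text, conf
    inv_text, inv_conf = recognize(invert(line))
    # Keep whichever reading has the higher mean confidence.
    if inv_conf > conf:
        return inv_text, inv_conf
    return text, conf


if __name__ == "__main__":
    scores = {"white-on-black": ("???", 0.3), "black-on-white": ("hello", 0.9)}

    def fake_recognize(line):
        return scores[line]

    def fake_invert(line):
        return "black-on-white" if line == "white-on-black" else "white-on-black"

    print(recognize_with_inversion("white-on-black", fake_recognize, fake_invert))
```

Because confidences are never negative, a threshold of 0 means the first result is always accepted, which is why that setting disables inversion.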
|
|
||
| *user_defined_dpi* (int, default: 0) [Both]:: | ||
| Override the resolution of the input image in DPI. Use this when the image | ||
| metadata contains an incorrect or missing DPI value. A value of 0 means | ||
| the resolution is read from the image metadata or guessed automatically. | ||
| This parameter is equivalent to the *--dpi* command-line option; when *--dpi* | ||
| is given on the command line it simply sets this parameter. | ||
|
|
||
| *textord_heavy_nr* (bool, default: 0) [Both]:: | ||
| Aggressively remove noise blobs during page layout analysis. When disabled | ||
| (the default), Tesseract applies a moderate noise threshold that preserves | ||
| most legitimate characters. Enabling this raises the threshold significantly, | ||
| which can improve results on scans with heavy speckle or background noise, but | ||
| may also remove small legitimate characters such as punctuation marks, | ||
| diacritics, or small symbols. Useful when OCR-ing degraded or low-quality | ||
| document scans where accuracy on punctuation is less important than overall | ||
| text extraction. | ||
|
|
||
| DICTIONARY PARAMETERS | ||
| ~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| *load_system_dawg* (bool, default: 1) [Both]:: | ||
| Load the main system word list (DAWG) from the traineddata file. Disabling | ||
| this can speed up recognition and may improve results when OCR-ing content | ||
| that does not resemble natural language (e.g. codes, identifiers). | ||
|
|
||
| *load_freq_dawg* (bool, default: 1) [Both]:: | ||
| Load the list of frequent words from the traineddata file. | ||
|
|
||
| *load_unambig_dawg* (bool, default: 1) [Legacy]:: | ||
| Load the list of unambiguous words from the traineddata file. | ||
|
|
||
| *load_punc_dawg* (bool, default: 1) [Legacy]:: | ||
| Load the dawg containing punctuation patterns from the traineddata file. | ||
|
|
||
| *load_number_dawg* (bool, default: 1) [Legacy]:: | ||
| Load the dawg containing number patterns from the traineddata file. | ||
|
|
||
| *load_bigram_dawg* (bool, default: 1) [Legacy]:: | ||
| Load the dawg containing special word bigrams from the traineddata file. | ||
|
|
||
| *user_words_file* (string, default: "") [Both]:: | ||
| Path to a plain-text file containing additional words (one per line) that | ||
| Tesseract should treat as valid dictionary words. | ||
|
|
||
| *user_words_suffix* (string, default: "") [Both]:: | ||
| Filename suffix (relative to the tessdata directory) for a per-language file | ||
| of additional valid words. For example, setting this to `user-words` causes | ||
| Tesseract to look for `eng.user-words` when using the English model. | ||
|
|
||
| *user_patterns_file* (string, default: "") [Both]:: | ||
|
Should you wish to beef this up a tiny bit... Defined in dict/trie.h, but in simplified terms: `\A` becomes uppercase letter. Other symbols match themselves.
||
| Path to a plain-text file containing additional pattern strings (one per | ||
| line) that Tesseract should accept as valid words. In the pattern language, | ||
| backslash-escaped sequences specify character classes: | ||
| `\d` = any digit; | ||
| `\c` = any letter; | ||
| `\a` = any lowercase letter; | ||
| `\A` = any uppercase letter; | ||
| `\n` = any alphanumeric character; | ||
| `\p` = any punctuation character. | ||
| All other characters match themselves. For example, `1-\d\d\d-GOOG-411` | ||
| matches a phone-number-like string. | ||
| These are structural templates, not regular expressions. For the full | ||
| pattern syntax, see `dict/trie.h` in the Tesseract source. | ||
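The character classes listed above can be tried out with a small matcher. This sketch translates a pattern into a Python regular expression purely as an implementation convenience; Tesseract itself does not use regular expressions. The function names are hypothetical, the classes are limited to ASCII, and any further syntax documented in `dict/trie.h` is ignored.

```python
import re
import string

# ASCII-only approximations of the documented character classes.
_CLASSES = {
    "d": "[0-9]",
    "c": "[A-Za-z]",
    "a": "[a-z]",
    "A": "[A-Z]",
    "n": "[0-9A-Za-z]",
    "p": "[" + re.escape(string.punctuation) + "]",
}


def pattern_to_regex(pattern):
    """Translate a user pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern) and pattern[i + 1] in _CLASSES:
            # Backslash-escaped sequence naming a character class.
            parts.append(_CLASSES[pattern[i + 1]])
            i += 2
        else:
            # Every other character matches itself.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def matches(pattern, word):
    """Return True if the whole word fits the pattern."""
    return pattern_to_regex(pattern).fullmatch(word) is not None


if __name__ == "__main__":
    print(matches(r"1-\d\d\d-GOOG-411", "1-800-GOOG-411"))
    print(matches(r"1-\d\d\d-GOOG-411", "1-8x0-GOOG-411"))
```

Raw strings (`r"..."`) keep Python from interpreting `\n` in a pattern as a newline before the matcher sees it.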
|
|
||
| *user_patterns_suffix* (string, default: "") [Both]:: | ||
| Filename suffix (relative to the tessdata directory) for a per-language file | ||
| of additional patterns. | ||
|
|
||
| LSTM ENGINE PARAMETERS | ||
| ~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| These parameters are only meaningful when using the LSTM OCR engine | ||
| (*--oem 1* or *--oem 2*). | ||
|
|
||
| *lstm_use_matrix* (bool, default: 1) [LSTM]:: | ||
| Use the ratings matrix and beam search during LSTM decoding. Disabling this | ||
| reverts to a simpler, faster greedy decoding strategy that may be adequate | ||
| for very clean, high-quality images but generally gives lower accuracy. | ||
|
|
||
| *lstm_choice_mode* (int, default: 0) [LSTM]:: | ||
| Enables alternative character hypotheses in hOCR output (requires | ||
| *tessedit_create_hocr* = 1): | ||
| 0 = disabled (default); | ||
| 1 = include per-timestep alternative choices; | ||
| 2 = extract alternative choices from the CTC output mapped per character. | ||
|
Maybe have a 'see' back up to the hocr character option as these tend to go hand in hand.
||
| See also *hocr_char_boxes* for character bounding-box output. | ||
|
|
||
| *lstm_choice_iterations* (int, default: 5) [LSTM]:: | ||
| Number of cascading beam-search iterations used when *lstm_choice_mode* is | ||
| non-zero. | ||
|
|
||
| *lstm_rating_coefficient* (double, default: 5) [LSTM]:: | ||
| Scaling factor applied to LSTM character ratings. Smaller values produce | ||
| higher (better) confidence scores and preserve more information before the | ||
| zero cut-off. | ||
|
|
||
| LEGACY ENGINE PARAMETERS | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| The following parameters apply only when using the legacy Tesseract engine | ||
| (*--oem 0* or *--oem 2*), which requires a traineddata file that includes the | ||
| legacy model, such as those from https://github.com/tesseract-ocr/tessdata. | ||
|
|
||
| *tessedit_enable_bigram_correction* (bool, default: 1) [Legacy]:: | ||
| Apply bigram-based correction to improve recognition of adjacent word pairs | ||
| that commonly appear together (e.g. "is a", "in the", "New York"). The | ||
| correction uses a bigram dictionary from the traineddata file to re-score | ||
| word hypotheses in context. | ||
|
|
||
| *tessedit_enable_dict_correction* (bool, default: 0) [Legacy]:: | ||
| Use the dictionary to post-correct uncertain word hypotheses. | ||
|
|
||
| *tessedit_fix_fuzzy_spaces* (bool, default: 1) [Legacy]:: | ||
| Try to fix spaces that were ambiguously classified as inter-word or | ||
| inter-character gaps. | ||
|
|
||
| *language_model_penalty_non_dict_word* (double, default: 0.15) [Legacy]:: | ||
| Penalty added to the score of word hypotheses that do not appear in the | ||
| dictionary. Increase to bias recognition more strongly towards dictionary | ||
| words. | ||
|
|
||
| *language_model_penalty_non_freq_dict_word* (double, default: 0.1) [Legacy]:: | ||
| Additional penalty for words that are in the dictionary but not in the list | ||
| of frequent words. | ||
|
|
||
| *language_model_penalty_case* (double, default: 0.1) [Legacy]:: | ||
| Penalty applied when the capitalisation of a recognised word is inconsistent | ||
| with the surrounding context. | ||
|
|
||
| *language_model_penalty_script* (double, default: 0.5) [Legacy]:: | ||
| Penalty applied when a recognised character belongs to a different script | ||
| from the surrounding text. | ||
|
|
||
| *language_model_penalty_punc* (double, default: 0.2) [Legacy]:: | ||
| Penalty applied for punctuation usage that is inconsistent with the language | ||
| model. | ||
|
|
||
| *wordrec_enable_assoc* (bool, default: 1) [Legacy]:: | ||
| Enable the associator, which considers combinations of character fragments | ||
| when forming word hypotheses. Disabling may speed up recognition at the | ||
| cost of accuracy on fragmented characters. | ||
|
|
||
| DEBUG PARAMETERS | ||
| ~~~~~~~~~~~~~~~~ | ||
|
|
||
| *debug_file* (string, default: "") [Both]:: | ||
| Redirect Tesseract debug/diagnostic output to this file instead of stderr. | ||
| Set to `/dev/null` (or use the *quiet* config file) to suppress all debug | ||
| output. | ||
|
|
||
| *tessedit_write_params_to_file* (string, default: "") [Both]:: | ||
| If set to a filename, Tesseract will write the values of all its parameters | ||
| to that file when it starts up. Useful for capturing the effective | ||
| configuration for debugging or reproducibility. | ||
|
|
||
|
|
||
| ENVIRONMENT VARIABLES | ||
| --------------------- | ||
| *`TESSDATA_PREFIX`*:: | ||
|
|
||
Might be worth including textord_heavy_nr. It's a common one in noisy scans when you are cool with forsaking punctuation 😀