doc: Add comprehensive PARAMETERS section to the tesseract man page #4526
base: main
Changes from 2 commits
da5ea05
abafead
f270828
7e49f71
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -406,6 +406,277 @@ one per line. The format of the latter is documented in 'dict/trie.h' | |
| on 'read_pattern_list()'. | ||
|
|
||
|
|
||
| [[PARAMETERS]] | ||
| PARAMETERS | ||
| ---------- | ||
|
|
||
| Tesseract parameters control the behaviour of the OCR engine and can be set | ||
| using the *-c* option (e.g. `-c tessedit_char_whitelist=0123456789`) or by | ||
| placing them in a <<CONFIGFILE,'CONFIGFILE'>>. | ||
| Run *--print-parameters* to list all available parameters with their current | ||
| values and short descriptions. | ||
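As a minimal sketch of the two mechanisms described above, the following Python helpers assemble a `tesseract` command line with one *-c* option per parameter and render the same settings as config-file lines. The helper names (`build_command`, `render_config`) and the file names are hypothetical, and the sketch assumes the config file holds one whitespace-separated `name value` pair per line. It only builds strings and does not invoke Tesseract.

```python
def build_command(image, output_base, params):
    """Return an argv list that sets each parameter with its own -c option."""
    argv = ["tesseract", image, output_base]
    for name, value in params.items():
        argv += ["-c", f"{name}={value}"]
    return argv


def render_config(params):
    """Return config-file text, assuming one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in params.items())


# Hypothetical settings and file names, for illustration only.
params = {"tessedit_char_whitelist": "0123456789", "tessedit_create_tsv": 1}
command = build_command("scan.png", "out", params)
config_text = render_config(params)

print(" ".join(command))
print(config_text, end="")
```

Passing the argv list to `subprocess.run` would be one way to execute it; writing `config_text` to a file and naming that file as the CONFIGFILE argument is the other route.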
|
|
||
| The engine column indicates which OCR engine modes support a parameter: + | ||
| *Both* -- works with both the LSTM and Legacy engines; + | ||
| *LSTM* -- only applies to the neural-network LSTM engine (OEM 1 or 2); + | ||
| *Legacy* -- only applies to the legacy Tesseract engine (OEM 0 or 2). | ||
|
|
||
| OUTPUT FORMAT PARAMETERS | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| *tessedit_create_txt* (bool, default: 0) [Both]:: | ||
| Write plain-text output to a `.txt` file. Plain text is also the format | ||
| Tesseract falls back to when no config file or *-c* option selects another one. | ||
|
|
||
| *tessedit_create_hocr* (bool, default: 0) [Both]:: | ||
| Write hOCR output to a `.hocr` file. hOCR is an HTML-based format that | ||
| encodes the OCR results together with their bounding boxes and confidences. | ||
| Use the *hocr* config file to enable this format. | ||
|
Having a link here to the specification may be useful. |
||
|
|
||
| *hocr_font_info* (bool, default: 0) [Both]:: | ||
| Include font information in hOCR output. | ||
|
What information is included, and how reliable is it? I'd love to know that before enabling this. |
||
|
|
||
| *hocr_char_boxes* (bool, default: 0) [Both]:: | ||
| Add per-character bounding-box coordinates to hOCR output. | ||
|
My knowledge here might be out of date, but while using tesseract in the past I found that the char boxes only really worked with the legacy OCR engine. I had 26 failures out of 131, or a 20% failure rate, using the LSTM engine. Maybe worth noting that this is still under development? |
||
|
|
||
| *tessedit_create_alto* (bool, default: 0) [Both]:: | ||
| Write ALTO XML output to a `.xml` file. ALTO (Analyzed Layout and Text | ||
| Object) is a standard XML schema for describing the layout and content of | ||
| pages. Use the *alto* config file to enable this format. | ||
|
|
||
| *tessedit_create_page_xml* (bool, default: 0) [Both]:: | ||
| Write PAGE XML output to a `.page.xml` file. PAGE is a standard XML format | ||
| for ground truth and OCR results used in document image analysis competitions. | ||
|
PAGE XML: strange emphasis on competitions. It is a general-purpose format used in digital humanities projects. |
||
| Use the *page* config file to enable this format. | ||
|
|
||
| *page_xml_polygon* (bool, default: 1) [Both]:: | ||
| When writing PAGE XML output, create polygon outlines around text regions | ||
| instead of simple bounding boxes. | ||
|
|
||
| *page_xml_level* (int, default: 0) [Both]:: | ||
| Granularity of PAGE XML output: 0 = line level, 1 = word level. | ||
|
|
||
| *tessedit_create_tsv* (bool, default: 0) [Both]:: | ||
| Write tab-separated-values output to a `.tsv` file. Each recognized word is | ||
| output as one row with its bounding box, confidence and text. Use the *tsv* | ||
| config file to enable this format. | ||
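A small sketch of how the word rows of such a `.tsv` file could be consumed. It assumes the column header `level page_num block_num par_num line_num word_num left top width height conf text` and that word rows carry `level` 5; check both against the output of your Tesseract version. The sample data and the `read_words` helper are hypothetical.

```python
import csv
import io

# Illustrative sample only; real data comes from a .tsv file written by Tesseract.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t36\t92\t500\t30\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t30\t96.1\tHello\n"
    "5\t1\t1\t1\t1\t2\t110\t92\t70\t30\t42.5\twor1d\n"
)


def read_words(tsv_text, min_conf=0.0):
    """Return (text, confidence, (left, top, width, height)) for each word row."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        if row["level"] != "5":  # assumption: level 5 marks word rows
            continue
        conf = float(row["conf"])
        if conf < min_conf:
            continue
        box = tuple(int(row[key]) for key in ("left", "top", "width", "height"))
        words.append((row["text"], conf, box))
    return words


confident_words = read_words(SAMPLE_TSV, min_conf=50)
print(confident_words)
```

Quoting is disabled in the reader so that recognised text containing quote characters is passed through unchanged.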
|
|
||
| *tessedit_create_pdf* (bool, default: 0) [Both]:: | ||
| Write a searchable PDF to a `.pdf` file. The PDF contains the original image | ||
| with an invisible text layer for copy-paste and searching. Use the *pdf* | ||
| config file to enable this format. | ||
|
|
||
| *textonly_pdf* (bool, default: 0) [Both]:: | ||
| Write a text-only PDF (no image, only invisible text) to a `.pdf` file. | ||
|
|
||
| *tessedit_create_boxfile* (bool, default: 0) [Both]:: | ||
| Write a Tesseract box file (`.box`) that lists each recognized character with | ||
| its bounding box. Used mainly for legacy engine training. | ||
|
If it is used for legacy engine training, should it be marked Both? I'm not sure, but it is worth a thought. |
||
|
|
||
| *tessedit_create_wordstrbox* (bool, default: 0) [Both]:: | ||
| Write a WordStr-format box file (`.box`). Similar to *tessedit_create_boxfile* | ||
| but records whole words instead of individual characters. | ||
|
|
||
| *tessedit_create_lstmbox* (bool, default: 0) [LSTM]:: | ||
| Write an LSTM box file (`.box`) suitable for LSTM training. | ||
|
|
||
| *preserve_interword_spaces* (bool, default: 0) [Both]:: | ||
| Preserve multiple consecutive inter-word spaces in the output instead of | ||
| collapsing them to a single space. | ||
|
|
||
| CHARACTER SET PARAMETERS | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| *tessedit_char_whitelist* (string, default: "") [Both]:: | ||
| Restrict the set of characters that Tesseract will recognize to only those | ||
| listed in this string. For example, setting this to `0123456789` will | ||
| make Tesseract return only digits. An empty value (the default) means all | ||
| characters in the trained data are allowed. | ||
|
|
||
| *tessedit_char_blacklist* (string, default: "") [Both]:: | ||
| Prevent Tesseract from recognizing the characters listed in this string. | ||
| Blacklisted characters will never appear in the output. The blacklist is | ||
| applied after the whitelist. | ||
|
|
||
| *tessedit_char_unblacklist* (string, default: "") [Both]:: | ||
|
Is this a legacy parameter? It seems to be more about the original's multiple passes. It is a bit odd having an unblacklist, but that could be me and my thought process.
Member
@copilot, please try to avoid "blacklist" and "whitelist" anywhere in the documentation as far as possible, but of course not for the names of parameters. Use more inclusive language.
Contributor
Author
Done in commit 7e49f71. The descriptive text now uses "allowlist" and "exclusion list" instead of "whitelist"/"blacklist". All three parameter names (
Is it worth deprecating these terms and tying the same functionality to tessedit_chars_allowed? That way the old terms still work, so it is not a breaking change, but the documentation surfaces better language. It feels like this would address your comment better. |
||
| Re-allow specific characters that were excluded by *tessedit_char_blacklist*. | ||
| Characters in this list override the blacklist. | ||
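The order in which the three character-set parameters interact, as described above, can be modelled with plain sets. This is only a sketch of the documented behaviour, not Tesseract's implementation; `effective_charset` and the example character set are hypothetical.

```python
def effective_charset(model_chars, whitelist="", blacklist="", unblacklist=""):
    """Model which characters stay recognisable after all three parameters apply."""
    available = set(model_chars)
    # An empty whitelist (the default) leaves every trained character allowed.
    allowed = available & set(whitelist) if whitelist else set(available)
    # The blacklist is applied after the whitelist.
    allowed -= set(blacklist)
    # Unblacklisted characters override the blacklist.
    allowed |= available & set(unblacklist)
    return allowed


digits_only = effective_charset("0123456789abc", whitelist="0123456789")
without_01 = effective_charset(
    "0123456789abc", whitelist="0123456789", blacklist="01"
)
one_restored = effective_charset(
    "0123456789abc", whitelist="0123456789", blacklist="01", unblacklist="1"
)
print(sorted(digits_only), sorted(without_01), sorted(one_restored))
```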
|
|
||
| IMAGE PROCESSING PARAMETERS | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| *thresholding_method* (int, default: 0) [Both]:: | ||
| Select the algorithm used to convert a greyscale image to binary before OCR: | ||
| 0 = Otsu global thresholding (default); | ||
| 1 = LeptonicaOtsu (tiled Otsu, better for uneven lighting); | ||
| 2 = Sauvola local adaptive thresholding (best for heavily degraded documents). | ||
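For readers unfamiliar with method 0, here is a textbook version of global Otsu thresholding in pure Python. It illustrates the idea only (pick the grey level that maximises the between-class variance) and is not Tesseract's or Leptonica's code. The function name and the synthetic pixel data are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level t that best separates pixels <= t from pixels > t."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    weighted_total = sum(level * count for level, count in enumerate(histogram))

    best_level, best_variance = 0, -1.0
    count_low, sum_low = 0, 0
    for level in range(256):
        count_low += histogram[level]
        sum_low += level * histogram[level]
        count_high = total - count_low
        if count_low == 0:
            continue  # no pixels at or below this level yet
        if count_high == 0:
            break  # every pixel is already in the low class
        mean_low = sum_low / count_low
        mean_high = (weighted_total - sum_low) / count_high
        # Between-class variance up to a constant factor; Otsu maximises it.
        variance = count_low * count_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


pixels = [10] * 50 + [200] * 50  # synthetic dark and light grey levels
threshold = otsu_threshold(pixels)
binary = [0 if value <= threshold else 1 for value in pixels]
print(threshold)
```

Methods 1 and 2 differ in that they compute thresholds locally (per tile or per window) rather than once for the whole image.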
|
|
||
| *thresholding_window_size* (double, default: 0.33) [Both]:: | ||
| Window size (multiplied by image DPI) used to compute local statistics for | ||
| the Sauvola thresholding method (*thresholding_method* = 2). | ||
|
|
||
| *thresholding_kfactor* (double, default: 0.34) [Both]:: | ||
| Sensitivity factor for Sauvola thresholding (*thresholding_method* = 2). | ||
| Controls how much the local variance reduces the threshold. Typical range: | ||
| 0.2 -- 0.5. Higher values produce more aggressive thresholding. | ||
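To make the role of the two Sauvola parameters concrete, the sketch below uses the commonly published Sauvola formula, T = m * (1 + k * (s / R - 1)), where m and s are the local mean and standard deviation inside the window. The dynamic range R = 128 and the rounding of the window size are assumptions, and the function names are hypothetical. Tesseract may additionally clamp the window size.

```python
def sauvola_threshold(mean, std_dev, k=0.34, dynamic_range=128.0):
    """Local threshold from the window's mean and standard deviation."""
    return mean * (1.0 + k * (std_dev / dynamic_range - 1.0))


def window_pixels(dpi, window_size=0.33):
    """Window size in pixels: the parameter value multiplied by the image DPI."""
    return round(window_size * dpi)  # rounding is an assumption of this sketch


flat_region = sauvola_threshold(mean=100, std_dev=0)        # low local variance
textured_region = sauvola_threshold(mean=100, std_dev=128)  # high local variance
print(flat_region, textured_region, window_pixels(300))
```

In a flat region the threshold drops well below the local mean, so uniform background is less likely to be marked as ink. A larger `k` lowers the threshold further wherever the local variance is small.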
|
|
||
| *thresholding_tile_size* (double, default: 0.33) [Both]:: | ||
| Desired tile size (multiplied by image DPI) for the LeptonicaOtsu tiled | ||
| thresholding method (*thresholding_method* = 1). | ||
|
|
||
| *thresholding_smooth_kernel_size* (double, default: 0) [Both]:: | ||
| Kernel size for smoothing the threshold array produced by LeptonicaOtsu | ||
| (*thresholding_method* = 1). Use 0 for no smoothing. | ||
|
|
||
| *thresholding_score_fraction* (double, default: 0.1) [Both]:: | ||
| Fraction of the maximum Otsu score used by LeptonicaOtsu | ||
| (*thresholding_method* = 1). Use 0.0 for standard Otsu behaviour; | ||
| 0.1 is recommended for better robustness. | ||
|
|
||
| *tessedit_do_invert* (bool, default: 1) [Both]:: | ||
| Deprecated -- will be removed in a future release. When enabled, Tesseract | ||
| tries OCR on an inverted (white-on-black) copy of lines whose mean confidence | ||
| falls below *invert_threshold* and keeps the result with higher confidence. | ||
| To disable automatic inversion, set *invert_threshold* = 0 rather than | ||
| setting this parameter to 0. | ||
|
|
||
| *invert_threshold* (double, default: 0.7) [Both]:: | ||
| Mean confidence threshold below which Tesseract will also attempt OCR on | ||
| the inverted image. Lower values make inversion less likely. Set to 0 to | ||
| disable automatic inversion entirely (preferred over setting | ||
| *tessedit_do_invert* = 0, which is deprecated). | ||
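The decision described for *invert_threshold* can be sketched as follows. The `recognise` and `invert` callables are hypothetical stand-ins for the engine, and confidences are assumed to lie on a 0 to 1 scale; the sketch only models the documented rule.

```python
def recognise_line(recognise, line_image, invert, invert_threshold=0.7):
    """Retry on the inverted line when confidence is low; keep the better result."""
    text, conf = recognise(line_image)
    if conf >= invert_threshold:
        return text, conf  # confident enough, no inversion attempted
    inverted_text, inverted_conf = recognise(invert(line_image))
    if inverted_conf > conf:
        return inverted_text, inverted_conf
    return text, conf


# Hypothetical stand-ins: line images are just labels in this demo.
def fake_recognise(image):
    return ("TEXT", 0.9) if image == "black-on-white" else ("", 0.2)


def fake_invert(image):
    return "black-on-white" if image == "white-on-black" else "white-on-black"


with_inversion = recognise_line(fake_recognise, "white-on-black", fake_invert)
no_inversion = recognise_line(
    fake_recognise, "white-on-black", fake_invert, invert_threshold=0
)
print(with_inversion, no_inversion)
```

With a threshold of 0 no confidence can fall below it, which is why that setting disables the inverted pass.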
|
|
||
| *user_defined_dpi* (int, default: 0) [Both]:: | ||
| Override the resolution of the input image in DPI. Use this when the image | ||
| metadata contains an incorrect or missing DPI value. A value of 0 means | ||
| the resolution is read from the image metadata or guessed automatically. | ||
| This parameter is equivalent to the *--dpi* command-line option; when *--dpi* | ||
| is given on the command line it simply sets this parameter. | ||
|
|
||
| DICTIONARY PARAMETERS | ||
| ~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| *load_system_dawg* (bool, default: 1) [Both]:: | ||
| Load the main system word list (DAWG) from the traineddata file. Disabling | ||
| this can speed up recognition and may improve results when OCR-ing content | ||
| that does not resemble natural language (e.g. codes, identifiers). | ||
|
|
||
| *load_freq_dawg* (bool, default: 1) [Both]:: | ||
| Load the list of frequent words from the traineddata file. | ||
|
|
||
| *load_unambig_dawg* (bool, default: 1) [Legacy]:: | ||
| Load the list of unambiguous words from the traineddata file. | ||
|
|
||
| *load_punc_dawg* (bool, default: 1) [Legacy]:: | ||
| Load the dawg containing punctuation patterns from the traineddata file. | ||
|
|
||
| *load_number_dawg* (bool, default: 1) [Legacy]:: | ||
| Load the dawg containing number patterns from the traineddata file. | ||
|
|
||
| *load_bigram_dawg* (bool, default: 1) [Legacy]:: | ||
| Load the dawg containing special word bigrams from the traineddata file. | ||
|
|
||
| *user_words_file* (string, default: "") [Both]:: | ||
| Path to a plain-text file containing additional words (one per line) that | ||
| Tesseract should treat as valid dictionary words. | ||
|
|
||
| *user_words_suffix* (string, default: "") [Both]:: | ||
| Filename suffix (relative to the tessdata directory) for a per-language file | ||
| of additional valid words. For example, setting this to `user-words` causes | ||
| Tesseract to look for `eng.user-words` when using the English model. | ||
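A short sketch of the naming rule above, together with the one-word-per-line file format. The tessdata path and the word list are hypothetical; the helper only builds the path and the file contents and leaves writing the file to the caller.

```python
import os


def user_words_path(tessdata_dir, lang, suffix):
    """Build the per-language file name: <lang>.<suffix> inside tessdata."""
    return os.path.join(tessdata_dir, f"{lang}.{suffix}")


def render_word_list(words):
    """One word per line, as expected for a user words file."""
    return "".join(f"{word}\n" for word in words)


path = user_words_path("/usr/share/tessdata", "eng", "user-words")  # example path
contents = render_word_list(["Tesseract", "Leptonica"])
print(path)
print(contents, end="")
```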
|
|
||
| *user_patterns_file* (string, default: "") [Both]:: | ||
|
Should you wish to beef this up a tiny bit... Defined in dict/trie.h, but in simplified terms: A becomes uppercase letter. Other symbols match themselves. |
||
| Path to a plain-text file containing additional pattern strings that | ||
| Tesseract should accept as valid words. See `dict/trie.h` for the pattern | ||
| format. | ||
|
|
||
| *user_patterns_suffix* (string, default: "") [Both]:: | ||
| Filename suffix (relative to the tessdata directory) for a per-language file | ||
| of additional patterns. | ||
|
|
||
| LSTM ENGINE PARAMETERS | ||
| ~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| These parameters are only meaningful when using the LSTM OCR engine | ||
| (*--oem 1* or *--oem 2*). | ||
|
|
||
| *lstm_use_matrix* (bool, default: 1) [LSTM]:: | ||
| Use the ratings matrix and beam search during LSTM decoding. Disabling this | ||
| reverts to a simpler greedy decoding strategy. | ||
|
Might be worth mentioning that it is a little faster when turned off but requires very clean text? I can't imagine a use case where I would disable this one. |
||
|
|
||
| *lstm_choice_mode* (int, default: 0) [LSTM]:: | ||
| Enables alternative character hypotheses in hOCR output: | ||
| 0 = disabled (default); | ||
| 1 = include per-timestep alternative choices; | ||
| 2 = extract alternative choices from the CTC output mapped per character. | ||
|
Maybe add a 'see' reference back up to the hOCR character option, as these tend to go hand in hand. |
||
|
|
||
| *lstm_choice_iterations* (int, default: 5) [LSTM]:: | ||
| Number of cascading beam-search iterations used when *lstm_choice_mode* is | ||
| non-zero. | ||
|
|
||
| *lstm_rating_coefficient* (double, default: 5) [LSTM]:: | ||
| Scaling factor applied to LSTM character ratings. Smaller values produce | ||
| higher (better) confidence scores and preserve more information before the | ||
| zero cut-off. The default value is 5. | ||
|
|
||
| LEGACY ENGINE PARAMETERS | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| The following parameters apply only when using the legacy Tesseract engine | ||
| (*--oem 0* or *--oem 2*). This requires a traineddata file that includes the | ||
| legacy model, such as those from https://github.com/tesseract-ocr/tessdata. | ||
|
|
||
| *tessedit_enable_bigram_correction* (bool, default: 1) [Legacy]:: | ||
| Apply bigram-based correction to improve recognition of adjacent words that | ||
| form common pairs. | ||
|
E.g. 'is a' or 'lived at' (for those not familiar with bigram corrections). |
||
|
|
||
| *tessedit_enable_dict_correction* (bool, default: 0) [Legacy]:: | ||
| Use the dictionary to post-correct uncertain word hypotheses. | ||
|
|
||
| *tessedit_fix_fuzzy_spaces* (bool, default: 1) [Legacy]:: | ||
| Try to fix spaces that were ambiguously classified as inter-word or | ||
| inter-character gaps. | ||
|
|
||
| *language_model_penalty_non_dict_word* (double, default: 0.15) [Legacy]:: | ||
| Penalty added to the score of word hypotheses that do not appear in the | ||
| dictionary. Increase to bias recognition more strongly towards dictionary | ||
| words. | ||
|
|
||
| *language_model_penalty_non_freq_dict_word* (double, default: 0.1) [Legacy]:: | ||
| Additional penalty for words that are in the dictionary but not in the list | ||
| of frequent words. | ||
|
|
||
| *language_model_penalty_case* (double, default: 0.1) [Legacy]:: | ||
| Penalty applied when the capitalisation of a recognised word is inconsistent | ||
| with the surrounding context. | ||
|
|
||
| *language_model_penalty_script* (double, default: 0.5) [Legacy]:: | ||
| Penalty applied when a recognised character belongs to a different script | ||
| from the surrounding text. | ||
|
|
||
| *language_model_penalty_punc* (double, default: 0.2) [Legacy]:: | ||
| Penalty applied for punctuation usage that is inconsistent with the language | ||
| model. | ||
|
|
||
| *wordrec_enable_assoc* (bool, default: 1) [Legacy]:: | ||
| Enable the associator, which considers combinations of character fragments | ||
| when forming word hypotheses. Disabling may speed up recognition at the | ||
| cost of accuracy on fragmented characters. | ||
|
|
||
| DEBUG PARAMETERS | ||
| ~~~~~~~~~~~~~~~~ | ||
|
|
||
| *debug_file* (string, default: "") [Both]:: | ||
| Redirect Tesseract debug/diagnostic output to this file instead of stderr. | ||
| Set to `/dev/null` (or use the *quiet* config file) to suppress all debug | ||
| output. | ||
|
|
||
| *tessedit_write_params_to_file* (string, default: "") [Both]:: | ||
| If set to a filename, Tesseract will write the values of all its parameters | ||
| to that file when it starts up. Useful for capturing the effective | ||
| configuration for debugging or reproducibility. | ||
|
|
||
|
|
||
| ENVIRONMENT VARIABLES | ||
| --------------------- | ||
| *`TESSDATA_PREFIX`*:: | ||
|
|
||
Might be worth including textord_heavy_nr. It's a common one in noisy scans when you are cool with forsaking punctuation 😀