diff --git a/doc/tesseract.1.asc b/doc/tesseract.1.asc
index a0a0fc4fd5..a4bd5b374a 100644
--- a/doc/tesseract.1.asc
+++ b/doc/tesseract.1.asc
@@ -406,6 +406,318 @@
 one per line. The format of the latter is documented in 'dict/trie.h' on
 'read_pattern_list()'.
 
+[[PARAMETERS]]
+PARAMETERS
+----------
+
+Tesseract parameters control the behaviour of the OCR engine and can be set
+using the *-c* option (e.g. `-c tessedit_char_whitelist=0123456789`) or by
+placing them in a config file.
+The example above restricts recognized characters to digits only.
+Run *--print-parameters* to list all available parameters with their current
+values and short descriptions.
+
+The bracketed engine tag indicates which OCR engine modes support a parameter: +
+*Both* -- works with both the LSTM and Legacy engines; +
+*LSTM* -- only applies to the neural-network LSTM engine (OEM 1 or 2); +
+*Legacy* -- only applies to the legacy Tesseract engine (OEM 0 or 2).
+
+OUTPUT FORMAT PARAMETERS
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+*tessedit_create_txt* (bool, default: 0) [Both]::
+  Write plain-text output to a `.txt` file. This is the default output format
+  when no config file or *-c* option overrides it.
+
+*tessedit_create_hocr* (bool, default: 0) [Both]::
+  Write hOCR output to a `.hocr` file. hOCR is an HTML-based format that
+  encodes the OCR results together with their bounding boxes and confidences.
+  Use the *hocr* config file to enable this format.
+  See the hOCR specification at https://kba.github.io/hocr-spec/1.2/ for
+  details of the output format.
+
+*hocr_font_info* (bool, default: 0) [Both]::
+  Add per-word font metadata to hOCR output. When enabled, each word span
+  includes `x_font` (font name) and `x_fsize` (point size) attributes in its
+  `title` field. Font information is derived from the recognition results and
+  is most reliable with the Legacy OCR engine; the LSTM engine may produce
+  less precise font names.
+
+*hocr_char_boxes* (bool, default: 0) [Both]::
+  Add per-character bounding-box coordinates to hOCR output as `ocrx_cinfo`
+  spans. Note that character-level bounding boxes may be less precise when
+  using the LSTM engine, because LSTM operates at the line level and
+  character positions are approximated from the line result.
+  See also *lstm_choice_mode* for LSTM-specific character alternative output.
+
+*tessedit_create_alto* (bool, default: 0) [Both]::
+  Write ALTO XML output to a `.xml` file. ALTO (Analyzed Layout and Text
+  Object) is a standard XML schema for describing the layout and content of
+  pages. Use the *alto* config file to enable this format.
+
+*tessedit_create_page_xml* (bool, default: 0) [Both]::
+  Write PAGE XML output to a `.page.xml` file. PAGE (Page Analysis and
+  Ground Truth Elements) is a standard XML format widely used in digital
+  humanities projects, library and archive workflows, and document annotation
+  tools such as Transkribus and eScriptorium.
+  See https://github.com/PRImA-Research-Lab/PAGE-XML for the specification.
+  Use the *page* config file to enable this format.
+
+*page_xml_polygon* (bool, default: 1) [Both]::
+  When writing PAGE XML output, create polygon outlines around text regions
+  instead of simple bounding boxes.
+
+*page_xml_level* (int, default: 0) [Both]::
+  Granularity of PAGE XML output: 0 = line level, 1 = word level.
+
+*tessedit_create_tsv* (bool, default: 0) [Both]::
+  Write tab-separated-values output to a `.tsv` file. Each recognized word is
+  output as one row with its bounding box, confidence and text. Use the *tsv*
+  config file to enable this format.
+
+*tessedit_create_pdf* (bool, default: 0) [Both]::
+  Write a searchable PDF to a `.pdf` file. The PDF contains the original image
+  with an invisible text layer for copy-paste and searching. Use the *pdf*
+  config file to enable this format.
+
+*textonly_pdf* (bool, default: 0) [Both]::
+  Write a text-only PDF (no image, only invisible text) to a `.pdf` file.
+
+*tessedit_create_boxfile* (bool, default: 0) [Both]::
+  Write a Tesseract box file (`.box`) that lists each recognized character with
+  its bounding box, one per line. Can be produced by either engine. These
+  files are primarily used as ground truth for legacy engine training.
+
+*tessedit_create_wordstrbox* (bool, default: 0) [Both]::
+  Write a WordStr-format box file (`.box`). Similar to *tessedit_create_boxfile*
+  but records whole words instead of individual characters.
+
+*tessedit_create_lstmbox* (bool, default: 0) [LSTM]::
+  Write an LSTM box file (`.box`) suitable for LSTM training.
+
+*preserve_interword_spaces* (bool, default: 0) [Both]::
+  Preserve multiple consecutive inter-word spaces in the output instead of
+  collapsing them to a single space.
+
+CHARACTER SET PARAMETERS
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+*tessedit_char_whitelist* (string, default: "") [Both]::
+  Restrict the set of characters that Tesseract will recognize to only those
+  listed in this string (allowlist). For example, setting this to `0123456789`
+  will make Tesseract return only digits. An empty value (the default) means
+  all characters in the trained data are allowed.
+
+*tessedit_char_blacklist* (string, default: "") [Both]::
+  Prevent Tesseract from recognizing the characters listed in this string.
+  Characters in this list will never appear in the output. This exclusion list
+  is applied after the allowlist (*tessedit_char_whitelist*).
+
+*tessedit_char_unblacklist* (string, default: "") [Both]::
+  Re-allow specific characters that were excluded by *tessedit_char_blacklist*.
+  Characters in this list override the exclusion list and apply to both the
+  LSTM and Legacy engines.
+
+IMAGE PROCESSING PARAMETERS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+*thresholding_method* (int, default: 0) [Both]::
+  Select the algorithm used to convert a greyscale image to binary before OCR:
+  0 = Otsu global thresholding (default);
+  1 = LeptonicaOtsu (tiled Otsu, better for uneven lighting);
+  2 = Sauvola local adaptive thresholding (best for heavily degraded documents).
+
+*thresholding_window_size* (double, default: 0.33) [Both]::
+  Window size (multiplied by image DPI) used to compute local statistics for
+  the Sauvola thresholding method (*thresholding_method* = 2).
+
+*thresholding_kfactor* (double, default: 0.34) [Both]::
+  Sensitivity factor for Sauvola thresholding (*thresholding_method* = 2).
+  Controls how much the local variance reduces the threshold. Typical range:
+  0.2 -- 0.5. Higher values produce more aggressive thresholding.
+
+*thresholding_tile_size* (double, default: 0.33) [Both]::
+  Desired tile size (multiplied by image DPI) for the LeptonicaOtsu tiled
+  thresholding method (*thresholding_method* = 1).
+
+*thresholding_smooth_kernel_size* (double, default: 0) [Both]::
+  Kernel size for smoothing the threshold array produced by LeptonicaOtsu
+  (*thresholding_method* = 1). Use 0 for no smoothing.
+
+*thresholding_score_fraction* (double, default: 0.1) [Both]::
+  Fraction of the maximum Otsu score used by LeptonicaOtsu
+  (*thresholding_method* = 1). Use 0.0 for standard Otsu behaviour;
+  0.1 is recommended for better robustness.
+
+*tessedit_do_invert* (bool, default: 1) [Both]::
+  Deprecated -- will be removed in a future release. When enabled, Tesseract
+  tries OCR on an inverted (white-on-black) copy of lines whose mean confidence
+  falls below *invert_threshold* and keeps the result with higher confidence.
+  To disable automatic inversion, set *invert_threshold* = 0 rather than
+  setting this parameter to 0.
+
+*invert_threshold* (double, default: 0.7) [Both]::
+  Mean confidence threshold below which Tesseract will also attempt OCR on
+  the inverted image. Lower values make inversion less likely. Set to 0 to
+  disable automatic inversion entirely (preferred over setting
+  *tessedit_do_invert* = 0, which is deprecated).
+
+*user_defined_dpi* (int, default: 0) [Both]::
+  Override the resolution of the input image in DPI. Use this when the image
+  metadata contains an incorrect or missing DPI value. A value of 0 means
+  the resolution is read from the image metadata or guessed automatically.
+  This parameter is equivalent to the *--dpi* command-line option; when *--dpi*
+  is given on the command line it simply sets this parameter.
+
+*textord_heavy_nr* (bool, default: 0) [Both]::
+  Aggressively remove noise blobs during page layout analysis. When disabled
+  (the default), Tesseract applies a moderate noise threshold that preserves
+  most legitimate characters. Enabling this raises the threshold significantly,
+  which can improve results on scans with heavy speckle or background noise, but
+  may also remove small legitimate characters such as punctuation marks,
+  diacritics, or small symbols. Useful when OCR-ing degraded or low-quality
+  document scans where accuracy on punctuation is less important than overall
+  text extraction.
+
+DICTIONARY PARAMETERS
+~~~~~~~~~~~~~~~~~~~~~
+
+*load_system_dawg* (bool, default: 1) [Both]::
+  Load the main system word list (DAWG) from the traineddata file. Disabling
+  this can speed up recognition and may improve results when OCR-ing content
+  that does not resemble natural language (e.g. codes, identifiers).
+
+*load_freq_dawg* (bool, default: 1) [Both]::
+  Load the list of frequent words from the traineddata file.
+
+*load_unambig_dawg* (bool, default: 1) [Legacy]::
+  Load the list of unambiguous words from the traineddata file.
+
+*load_punc_dawg* (bool, default: 1) [Legacy]::
+  Load the dawg containing punctuation patterns from the traineddata file.
+
+*load_number_dawg* (bool, default: 1) [Legacy]::
+  Load the dawg containing number patterns from the traineddata file.
+
+*load_bigram_dawg* (bool, default: 1) [Legacy]::
+  Load the dawg containing special word bigrams from the traineddata file.
+
+*user_words_file* (string, default: "") [Both]::
+  Path to a plain-text file containing additional words (one per line) that
+  Tesseract should treat as valid dictionary words.
+
+*user_words_suffix* (string, default: "") [Both]::
+  Filename suffix (relative to the tessdata directory) for a per-language file
+  of additional valid words. For example, setting this to `user-words` causes
+  Tesseract to look for `eng.user-words` when using the English model.
+
+*user_patterns_file* (string, default: "") [Both]::
+  Path to a plain-text file containing additional pattern strings (one per
+  line) that Tesseract should accept as valid words. In the pattern language,
+  backslash-escaped sequences specify character classes:
+  `\d` = any digit;
+  `\c` = any letter;
+  `\a` = any lowercase letter;
+  `\A` = any uppercase letter;
+  `\n` = any alphanumeric character;
+  `\p` = any punctuation character.
+  All other characters match themselves. For example, `1-\d\d\d-GOOG-411`
+  matches a phone-number-like string.
+  These are structural templates, not regular expressions. For the full
+  pattern syntax, see `dict/trie.h` in the Tesseract source.
+
+*user_patterns_suffix* (string, default: "") [Both]::
+  Filename suffix (relative to the tessdata directory) for a per-language file
+  of additional patterns.
+
+LSTM ENGINE PARAMETERS
+~~~~~~~~~~~~~~~~~~~~~~
+
+These parameters are only meaningful when using the LSTM OCR engine
+(*--oem 1* or *--oem 2*).
+
+*lstm_use_matrix* (bool, default: 1) [LSTM]::
+  Use the ratings matrix and beam search during LSTM decoding. Disabling this
+  reverts to a simpler, faster greedy decoding strategy that may be adequate
+  for very clean, high-quality images but generally gives lower accuracy.
+
+*lstm_choice_mode* (int, default: 0) [LSTM]::
+  Enable alternative character hypotheses in hOCR output (requires
+  *tessedit_create_hocr* = 1):
+  0 = disabled (default);
+  1 = include per-timestep alternative choices;
+  2 = extract alternative choices from the CTC output mapped per character.
+  See also *hocr_char_boxes* for character bounding-box output.
+
+*lstm_choice_iterations* (int, default: 5) [LSTM]::
+  Number of cascading beam-search iterations used when *lstm_choice_mode* is
+  non-zero.
+
+*lstm_rating_coefficient* (double, default: 5) [LSTM]::
+  Scaling factor applied to LSTM character ratings. Smaller values produce
+  higher (better) confidence scores and preserve more information before the
+  zero cut-off. The default value is 5.
+
+LEGACY ENGINE PARAMETERS
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following parameters apply only when using the legacy Tesseract engine
+(*--oem 0* or *--oem 2*). This requires a traineddata file that includes the
+legacy model, such as those from https://github.com/tesseract-ocr/tessdata.
+
+*tessedit_enable_bigram_correction* (bool, default: 1) [Legacy]::
+  Apply bigram-based correction to improve recognition of adjacent word pairs
+  that commonly appear together (e.g. "is a", "in the", "New York"). The
+  correction uses a bigram dictionary from the traineddata file to re-score
+  word hypotheses in context.
+
+*tessedit_enable_dict_correction* (bool, default: 0) [Legacy]::
+  Use the dictionary to post-correct uncertain word hypotheses.
+
+*tessedit_fix_fuzzy_spaces* (bool, default: 1) [Legacy]::
+  Try to fix spaces that were ambiguously classified as inter-word or
+  inter-character gaps.
+
+*language_model_penalty_non_dict_word* (double, default: 0.15) [Legacy]::
+  Penalty added to the score of word hypotheses that do not appear in the
+  dictionary. Increase to bias recognition more strongly towards dictionary
+  words.
+
+*language_model_penalty_non_freq_dict_word* (double, default: 0.1) [Legacy]::
+  Additional penalty for words that are in the dictionary but not in the list
+  of frequent words.
+
+*language_model_penalty_case* (double, default: 0.1) [Legacy]::
+  Penalty applied when the capitalisation of a recognised word is inconsistent
+  with the surrounding context.
+
+*language_model_penalty_script* (double, default: 0.5) [Legacy]::
+  Penalty applied when a recognised character belongs to a different script
+  from the surrounding text.
+
+*language_model_penalty_punc* (double, default: 0.2) [Legacy]::
+  Penalty applied for punctuation usage that is inconsistent with the language
+  model.
+
+*wordrec_enable_assoc* (bool, default: 1) [Legacy]::
+  Enable the associator, which considers combinations of character fragments
+  when forming word hypotheses. Disabling may speed up recognition at the
+  cost of accuracy on fragmented characters.
+
+DEBUG PARAMETERS
+~~~~~~~~~~~~~~~~
+
+*debug_file* (string, default: "") [Both]::
+  Redirect Tesseract debug/diagnostic output to this file instead of stderr.
+  Set to `/dev/null` (or use the *quiet* config file) to suppress all debug
+  output.
+
+*tessedit_write_params_to_file* (string, default: "") [Both]::
+  If set to a filename, Tesseract will write the values of all its parameters
+  to that file when it starts up. Useful for capturing the effective
+  configuration for debugging or reproducibility.
+
+
 ENVIRONMENT VARIABLES
 ---------------------
 *`TESSDATA_PREFIX`*::
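
Note on the CHARACTER SET PARAMETERS entries in this hunk: the documented precedence (allowlist first, then exclusion list, then *tessedit_char_unblacklist*) can be modelled with plain set operations. The following is a minimal sketch, not Tesseract code. The function name `effective_charset` and its arguments are hypothetical, and it assumes that unblacklisted characters are re-enabled even when they are not in the allowlist.

```python
def effective_charset(trained, whitelist="", blacklist="", unblacklist=""):
    """Model which characters stay enabled after the three filters.

    Hypothetical helper for illustration only; `trained` stands in for the
    characters available in the traineddata file.
    """
    trained = set(trained)
    # An empty allowlist means every trained character starts out enabled.
    enabled = set(trained) if not whitelist else trained & set(whitelist)
    # The exclusion list is applied after the allowlist.
    enabled -= set(blacklist)
    # Unblacklisted characters are re-enabled, overriding the exclusion list
    # (assumption: this also holds for characters outside the allowlist).
    enabled |= trained & set(unblacklist)
    return enabled


if __name__ == "__main__":
    digits_only = effective_charset("0123456789abc", whitelist="0123456789")
    print(sorted(digits_only))
```

With an allowlist of `0123456789`, only digits remain enabled, which matches the `-c tessedit_char_whitelist=0123456789` example at the top of the new section.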