tesseract-ocr · Copilot · Mar 16, 2026 · Mar 18, 2026
diff --git a/doc/tesseract.1.asc b/doc/tesseract.1.asc
@@ -406,6 +406,318 @@ one per line.  The format of the latter is documented in 'dict/trie.h'
 on 'read_pattern_list()'.
 
 
+[[PARAMETERS]]
+PARAMETERS
+----------
+
+Tesseract parameters control the behaviour of the OCR engine and can be set
+using the *-c* option (e.g. `-c tessedit_char_whitelist=0123456789`) or by
+placing them in a <<CONFIGFILE,'CONFIGFILE'>>.
+The example above restricts recognized characters to digits only.
+Run *--print-parameters* to list all available parameters with their current
+values and short descriptions.
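+
+As a minimal sketch of the config-file route, the same restriction can be
+written as one `name value` pair per line.  The file name `digits.cfg` is a
+hypothetical example, not a config file shipped with Tesseract:
+
+```text
+tessedit_char_whitelist 0123456789
+```
+
+Passing the file after the output base (for example
+`tesseract image.png out ./digits.cfg`, with placeholder file names) should
+then have the same effect as the *-c* form shown above.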
+
+The engine tag in square brackets after each parameter indicates which OCR
+engine modes support it: +
+*Both* -- works with both the LSTM and Legacy engines; +
+*LSTM* -- only applies to the neural-network LSTM engine (OEM 1 or 2); +
+*Legacy* -- only applies to the legacy Tesseract engine (OEM 0 or 2).
+
+OUTPUT FORMAT PARAMETERS
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+*tessedit_create_txt* (bool, default: 0) [Both]::
+  Write plain-text output to a `.txt` file.  Although the parameter defaults
+  to 0, the command-line tool falls back to plain-text output when no config
+  file or *-c* option enables another output format.
+
+*tessedit_create_hocr* (bool, default: 0) [Both]::
+  Write hOCR output to a `.hocr` file.  hOCR is an HTML-based format that
+  encodes the OCR results together with their bounding boxes and confidences.
+  Use the *hocr* config file to enable this format.
+  See the hOCR specification at https://kba.github.io/hocr-spec/1.2/ for
+  details of the output format.
+
+*hocr_font_info* (bool, default: 0) [Both]::
+  Add per-word font metadata to hOCR output.  When enabled, each word span
+  includes `x_font` (font name) and `x_fsize` (point size) attributes in its
+  `title` field.  Font information is derived from the recognition results and
+  is most reliable with the Legacy OCR engine; the LSTM engine may produce
+  less precise font names.
+
+*hocr_char_boxes* (bool, default: 0) [Both]::
+  Add per-character bounding-box coordinates to hOCR output as `ocrx_cinfo`
+  spans.  Note that character-level bounding boxes may be less precise when
+  using the LSTM engine, because LSTM operates at the line level and
+  character positions are approximated from the line result.
+  See also *lstm_choice_mode* for LSTM-specific character alternative output.
+
+*tessedit_create_alto* (bool, default: 0) [Both]::
+  Write ALTO XML output to a `.xml` file.  ALTO (Analyzed Layout and Text
+  Object) is a standard XML schema for describing the layout and content of
+  pages.  Use the *alto* config file to enable this format.
+
+*tessedit_create_page_xml* (bool, default: 0) [Both]::
+  Write PAGE XML output to a `.page.xml` file.  PAGE (Page Analysis and
+  Ground Truth Elements) is a standard XML format widely used in digital
+  humanities projects, library and archive workflows, and document annotation
+  tools such as Transkribus and eScriptorium.
+  See https://github.com/PRImA-Research-Lab/PAGE-XML for the specification.
+  Use the *page* config file to enable this format.
+
+*page_xml_polygon* (bool, default: 1) [Both]::
+  When writing PAGE XML output, create polygon outlines around text regions
+  instead of simple bounding boxes.
+
+*page_xml_level* (int, default: 0) [Both]::
+  Granularity of PAGE XML output: 0 = line level, 1 = word level.
+
+*tessedit_create_tsv* (bool, default: 0) [Both]::
+  Write tab-separated-values output to a `.tsv` file.  Each recognized word is
+  output as one row with its bounding box, confidence and text.  Use the *tsv*
+  config file to enable this format.
+
+*tessedit_create_pdf* (bool, default: 0) [Both]::
+  Write a searchable PDF to a `.pdf` file.  The PDF contains the original image
+  with an invisible text layer for copy-paste and searching.  Use the *pdf*
+  config file to enable this format.
+
+*textonly_pdf* (bool, default: 0) [Both]::
+  Write a text-only PDF (no image, only invisible text) to a `.pdf` file.
+
+*tessedit_create_boxfile* (bool, default: 0) [Both]::
+  Write a Tesseract box file (`.box`) that lists each recognized character with
+  its bounding box, one per line.  Can be produced by either engine.  These
+  files are primarily used as ground truth for legacy engine training.
+
+*tessedit_create_wordstrbox* (bool, default: 0) [Both]::
+  Write a WordStr-format box file (`.box`).  Similar to *tessedit_create_boxfile*
+  but records whole words instead of individual characters.
+
+*tessedit_create_lstmbox* (bool, default: 0) [LSTM]::
+  Write an LSTM box file (`.box`) suitable for LSTM training.
+
+*preserve_interword_spaces* (bool, default: 0) [Both]::
+  Preserve multiple consecutive inter-word spaces in the output instead of
+  collapsing them to a single space.
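+
+As a hedged illustration of how these switches combine, a single config file
+(the name `multi-output.cfg` is hypothetical) could enable several output
+formats for one run.  The sketch assumes that each enabled format is written
+to its own file next to the chosen output base:
+
+```text
+tessedit_create_txt 1
+tessedit_create_hocr 1
+tessedit_create_tsv 1
+preserve_interword_spaces 1
+```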
+
+CHARACTER SET PARAMETERS
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+*tessedit_char_whitelist* (string, default: "") [Both]::
+  Restrict the set of characters that Tesseract will recognize to only those
+  listed in this string (allowlist).  For example, setting this to `0123456789`
+  will make Tesseract return only digits.  An empty value (the default) means
+  all characters in the trained data are allowed.
+
+*tessedit_char_blacklist* (string, default: "") [Both]::
+  Prevent Tesseract from recognizing the characters listed in this string.
+  Characters in this list will never appear in the output.  This exclusion list
+  is applied after the allowlist (*tessedit_char_whitelist*).
+
+*tessedit_char_unblacklist* (string, default: "") [Both]::
+  Re-allow specific characters that were excluded by *tessedit_char_blacklist*.
+  Characters in this list override the exclusion list.
+
+IMAGE PROCESSING PARAMETERS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+*thresholding_method* (int, default: 0) [Both]::
+  Select the algorithm used to convert a greyscale image to binary before OCR:
+  0 = Otsu global thresholding (default);
+  1 = LeptonicaOtsu (tiled Otsu, better for uneven lighting);
+  2 = Sauvola local adaptive thresholding (best for heavily degraded documents).
+
+*thresholding_window_size* (double, default: 0.33) [Both]::
+  Window size (multiplied by image DPI) used to compute local statistics for
+  the Sauvola thresholding method (*thresholding_method* = 2).
+
+*thresholding_kfactor* (double, default: 0.34) [Both]::
+  Sensitivity factor for Sauvola thresholding (*thresholding_method* = 2).
+  Controls how much the threshold is reduced in regions of low local
+  contrast.  Typical range: 0.2 -- 0.5.  Higher values lower the threshold
+  further in such regions, so fewer pixels are classified as foreground.
+
+*thresholding_tile_size* (double, default: 0.33) [Both]::
+  Desired tile size (multiplied by image DPI) for the LeptonicaOtsu tiled
+  thresholding method (*thresholding_method* = 1).
+
+*thresholding_smooth_kernel_size* (double, default: 0) [Both]::
+  Kernel size for smoothing the threshold array produced by LeptonicaOtsu
+  (*thresholding_method* = 1).  Use 0 for no smoothing.
+
+*thresholding_score_fraction* (double, default: 0.1) [Both]::
+  Fraction of the maximum Otsu score used by LeptonicaOtsu
+  (*thresholding_method* = 1).  Use 0.0 for standard Otsu behaviour;
+  0.1 is recommended for better robustness.
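+
+The thresholding parameters above are meant to be used together.  As a sketch
+only, a config fragment that selects Sauvola thresholding and spells out its
+two tuning values could look like the following; the numbers are just the
+defaults listed above, intended as a starting point rather than as
+recommended settings:
+
+```text
+thresholding_method 2
+thresholding_window_size 0.33
+thresholding_kfactor 0.34
+```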
+
+*tessedit_do_invert* (bool, default: 1) [Both]::
+  Deprecated -- will be removed in a future release.  When enabled, Tesseract
+  tries OCR on an inverted (white-on-black) copy of lines whose mean confidence
+  falls below *invert_threshold* and keeps the result with higher confidence.
+  To disable automatic inversion, set *invert_threshold* = 0 rather than
+  setting this parameter to 0.
+
+*invert_threshold* (double, default: 0.7) [Both]::
+  Mean confidence threshold below which Tesseract will also attempt OCR on
+  the inverted image.  Lower values make inversion less likely.  Set to 0 to
+  disable automatic inversion entirely (preferred over setting
+  *tessedit_do_invert* = 0, which is deprecated).
+
+*user_defined_dpi* (int, default: 0) [Both]::
+  Override the resolution of the input image in DPI.  Use this when the image
+  metadata contains an incorrect or missing DPI value.  A value of 0 means
+  the resolution is read from the image metadata or guessed automatically.
+  The *--dpi* command-line option sets this parameter.
+
+*textord_heavy_nr* (bool, default: 0) [Both]::
+  Aggressively remove noise blobs during page layout analysis.  When disabled
+  (the default), Tesseract applies a moderate noise threshold that preserves
+  most legitimate characters.  Enabling this raises the threshold significantly,
+  which can improve results on scans with heavy speckle or background noise, but
+  may also remove small legitimate characters such as punctuation marks,
+  diacritics, or small symbols.  Useful when OCR-ing degraded or low-quality
+  document scans where accuracy on punctuation is less important than overall
+  text extraction.
+
+DICTIONARY PARAMETERS
+~~~~~~~~~~~~~~~~~~~~~
+
+*load_system_dawg* (bool, default: 1) [Both]::
+  Load the main system word list (DAWG) from the traineddata file.  Disabling
+  this can speed up recognition and may improve results when OCR-ing content
+  that does not resemble natural language (e.g. codes, identifiers).
+
+*load_freq_dawg* (bool, default: 1) [Both]::
+  Load the list of frequent words from the traineddata file.
+
+*load_unambig_dawg* (bool, default: 1) [Legacy]::
+  Load the list of unambiguous words from the traineddata file.
+
+*load_punc_dawg* (bool, default: 1) [Legacy]::
+  Load the dawg containing punctuation patterns from the traineddata file.
+
+*load_number_dawg* (bool, default: 1) [Legacy]::
+  Load the dawg containing number patterns from the traineddata file.
+
+*load_bigram_dawg* (bool, default: 1) [Legacy]::
+  Load the dawg containing special word bigrams from the traineddata file.
+
+*user_words_file* (string, default: "") [Both]::
+  Path to a plain-text file containing additional words (one per line) that
+  Tesseract should treat as valid dictionary words.
+
+*user_words_suffix* (string, default: "") [Both]::
+  Filename suffix (relative to the tessdata directory) for a per-language file
+  of additional valid words.  For example, setting this to `user-words` causes
+  Tesseract to look for `eng.user-words` when using the English model.
+
+*user_patterns_file* (string, default: "") [Both]::
+  Path to a plain-text file containing additional pattern strings (one per
+  line) that Tesseract should accept as valid words.  In the pattern language,
+  backslash-escaped sequences specify character classes:
+  `\d` = any digit;
+  `\c` = any letter;
+  `\a` = any lowercase letter;
+  `\A` = any uppercase letter;
+  `\n` = any alphanumeric character;
+  `\p` = any punctuation character.
+  All other characters match themselves.  For example, `1-\d\d\d-GOOG-411`
+  matches a phone-number-like string.
+  These are structural templates, not regular expressions.  For the full
+  pattern syntax, see `dict/trie.h` in the Tesseract source.
+
+*user_patterns_suffix* (string, default: "") [Both]::
+  Filename suffix (relative to the tessdata directory) for a per-language file
+  of additional patterns.
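+
+A minimal sketch of a user patterns file, assuming only the escape classes
+listed under *user_patterns_file*.  The first line repeats the example given
+there; the second is a hypothetical invoice-number template (the literal
+prefix `INV-` followed by six digits):
+
+```text
+1-\d\d\d-GOOG-411
+INV-\d\d\d\d\d\d
+```
+
+The file could then be referenced with
+`-c user_patterns_file=/path/to/my.patterns`, where the path is a placeholder.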
+
+LSTM ENGINE PARAMETERS
+~~~~~~~~~~~~~~~~~~~~~~
+
+These parameters are only meaningful when using the LSTM OCR engine
+(*--oem 1* or *--oem 2*).
+
+*lstm_use_matrix* (bool, default: 1) [LSTM]::
+  Use the ratings matrix and beam search during LSTM decoding.  Disabling this
+  reverts to a simpler, faster greedy decoding strategy that may be adequate
+  for very clean, high-quality images but generally gives lower accuracy.
+
+*lstm_choice_mode* (int, default: 0) [LSTM]::
+  Enables alternative character hypotheses in hOCR output (requires
+  *tessedit_create_hocr* = 1):
+  0 = disabled (default);
+  1 = include per-timestep alternative choices;
+  2 = extract alternative choices from the CTC output mapped per character.
+  See also *hocr_char_boxes* for character bounding-box output.
+
+*lstm_choice_iterations* (int, default: 5) [LSTM]::
+  Number of cascading beam-search iterations used when *lstm_choice_mode* is
+  non-zero.
+
+*lstm_rating_coefficient* (double, default: 5) [LSTM]::
+  Scaling factor applied to LSTM character ratings.  Smaller values produce
+  higher (better) confidence scores, so less information is lost at the zero
+  cut-off.
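+
+Because *lstm_choice_mode* only affects hOCR output, it is normally set
+together with *tessedit_create_hocr*.  A hedged config sketch, using the
+values described above (the iteration count is simply the default):
+
+```text
+tessedit_create_hocr 1
+lstm_choice_mode 2
+lstm_choice_iterations 5
+```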
+
+LEGACY ENGINE PARAMETERS
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following parameters apply only when using the legacy Tesseract engine
+(*--oem 0* or *--oem 2*).  This requires a traineddata file that includes the
+legacy model, such as those from the
+https://github.com/tesseract-ocr/tessdata repository.
+
+*tessedit_enable_bigram_correction* (bool, default: 1) [Legacy]::
+  Apply bigram-based correction to improve recognition of adjacent word pairs
+  that commonly appear together (e.g. "is a", "in the", "New York").  The
+  correction uses a bigram dictionary from the traineddata file to re-score
+  word hypotheses in context.
+
+*tessedit_enable_dict_correction* (bool, default: 0) [Legacy]::
+  Use the dictionary to post-correct uncertain word hypotheses.
+
+*tessedit_fix_fuzzy_spaces* (bool, default: 1) [Legacy]::
+  Try to fix spaces that were ambiguously classified as inter-word or
+  inter-character gaps.
+
+*language_model_penalty_non_dict_word* (double, default: 0.15) [Legacy]::
+  Penalty added to the score of word hypotheses that do not appear in the
+  dictionary.  Increase to bias recognition more strongly towards dictionary
+  words.
+
+*language_model_penalty_non_freq_dict_word* (double, default: 0.1) [Legacy]::
+  Additional penalty for words that are in the dictionary but not in the list
+  of frequent words.
+
+*language_model_penalty_case* (double, default: 0.1) [Legacy]::
+  Penalty applied when the capitalisation of a recognised word is inconsistent
+  with the surrounding context.
+
+*language_model_penalty_script* (double, default: 0.5) [Legacy]::
+  Penalty applied when a recognised character belongs to a different script
+  from the surrounding text.
+
+*language_model_penalty_punc* (double, default: 0.2) [Legacy]::
+  Penalty applied for punctuation usage that is inconsistent with the language
+  model.
+
+*wordrec_enable_assoc* (bool, default: 1) [Legacy]::
+  Enable the associator, which considers combinations of character fragments
+  when forming word hypotheses.  Disabling may speed up recognition at the
+  cost of accuracy on fragmented characters.
+
+DEBUG PARAMETERS
+~~~~~~~~~~~~~~~~
+
+*debug_file* (string, default: "") [Both]::
+  Redirect Tesseract debug/diagnostic output to this file instead of stderr.
+  Set to `/dev/null` (or use the *quiet* config file) to suppress all debug
+  output.
+
+*tessedit_write_params_to_file* (string, default: "") [Both]::
+  If set to a filename, Tesseract will write the values of all its parameters
+  to that file when it starts up.  Useful for capturing the effective
+  configuration for debugging or reproducibility.
+
+
 ENVIRONMENT VARIABLES
 ---------------------
 *`TESSDATA_PREFIX`*::