312 changes: 312 additions & 0 deletions doc/tesseract.1.asc
@@ -406,6 +406,318 @@ one per line. The format of the latter is documented in 'dict/trie.h'
on 'read_pattern_list()'.


[[PARAMETERS]]
PARAMETERS
----------

Might be worth including textord_heavy_nr. It's a common one in noisy scans when you are cool with forsaking punctuation 😀

Tesseract parameters control the behaviour of the OCR engine and can be set
using the *-c* option or by placing them in a <<CONFIGFILE,'CONFIGFILE'>>.
For example, `-c tessedit_char_whitelist=0123456789` restricts recognized
characters to digits only.
Run *--print-parameters* to list all available parameters with their current
values and short descriptions.
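
As a rough illustration of the *-c* form described above, the following Python sketch assembles a command line that sets several parameters at once. The helper name `build_tesseract_command` and the file names `scan.png` and `out` are hypothetical; only the `-c name=value` syntax comes from the text.

```python
def build_tesseract_command(image, output_base, params):
    """Return an argv list that sets each parameter with its own -c option."""
    command = ["tesseract", image, output_base]
    for name, value in params.items():
        # Booleans are written as 1/0, matching the defaults listed in this section.
        if isinstance(value, bool):
            value = int(value)
        command += ["-c", f"{name}={value}"]
    return command


argv = build_tesseract_command(
    "scan.png",  # hypothetical input image
    "out",       # hypothetical output base name
    {"tessedit_char_whitelist": "0123456789", "preserve_interword_spaces": True},
)
print(" ".join(argv))
```

The resulting list can be handed to `subprocess.run` unchanged, which avoids shell quoting problems when a parameter value contains spaces or punctuation.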

The bracketed engine tag after each parameter indicates which OCR engine modes support it: +
*Both* -- works with both the LSTM and Legacy engines; +
*LSTM* -- only applies to the neural-network LSTM engine (OEM 1 or 2); +
*Legacy* -- only applies to the legacy Tesseract engine (OEM 0 or 2).

OUTPUT FORMAT PARAMETERS
~~~~~~~~~~~~~~~~~~~~~~~~

*tessedit_create_txt* (bool, default: 0) [Both]::
Write plain-text output to a `.txt` file. Although the parameter itself
defaults to 0, plain text is the output format used when no config file or
*-c* option selects another one.

*tessedit_create_hocr* (bool, default: 0) [Both]::
Write hOCR output to a `.hocr` file. hOCR is an HTML-based format that
encodes the OCR results together with their bounding boxes and confidences.
Use the *hocr* config file to enable this format.

Having a link here to the specification may be useful
https://kba.github.io/hocr-spec/1.2/

See the hOCR specification at https://kba.github.io/hocr-spec/1.2/ for
details of the output format.

*hocr_font_info* (bool, default: 0) [Both]::
Add per-word font metadata to hOCR output. When enabled, each word span
includes `x_font` (font name) and `x_fsize` (point size) attributes in its
`title` field. Font information is derived from the recognition results and
is most reliable with the Legacy OCR engine; the LSTM engine may produce
less precise font names.
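
To show how the extra attributes might be consumed, here is a minimal Python sketch that splits an hOCR `title` value into its properties. The sample string and its values are made up for illustration, and the helper name `parse_hocr_title` is hypothetical; the sketch assumes properties are separated by semicolons and that values contain no semicolons themselves.

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into a {property: value-string} dict."""
    properties = {}
    for part in title.split(";"):
        part = part.strip()
        if not part:
            continue
        # The first token is the property name, the rest is its value.
        name, _, value = part.partition(" ")
        properties[name] = value.strip()
    return properties


# Made-up title value of a word span with font information enabled.
sample_title = "bbox 36 92 96 116; x_wconf 95; x_font Arial; x_fsize 12"
props = parse_hocr_title(sample_title)
print(props["x_font"], props["x_fsize"])
```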

*hocr_char_boxes* (bool, default: 0) [Both]::
Add per-character bounding-box coordinates to hOCR output as `ocrx_cinfo`
spans. Note that character-level bounding boxes may be less precise when
using the LSTM engine, because LSTM operates at the line level and
character positions are approximated from the line result.
See also *lstm_choice_mode* for LSTM-specific character alternative output.

*tessedit_create_alto* (bool, default: 0) [Both]::
Write ALTO XML output to a `.xml` file. ALTO (Analyzed Layout and Text
Object) is a standard XML schema for describing the layout and content of
pages. Use the *alto* config file to enable this format.

*tessedit_create_page_xml* (bool, default: 0) [Both]::
Write PAGE XML output to a `.page.xml` file. PAGE (Page Analysis and
Ground Truth Elements) is a standard XML format widely used in digital
humanities projects, library and archive workflows, and document annotation
tools such as Transkribus and eScriptorium.
See https://github.com/PRImA-Research-Lab/PAGE-XML for the specification.
Use the *page* config file to enable this format.

*page_xml_polygon* (bool, default: 1) [Both]::
When writing PAGE XML output, create polygon outlines around text regions
instead of simple bounding boxes.

*page_xml_level* (int, default: 0) [Both]::
Granularity of PAGE XML output: 0 = line level, 1 = word level.

*tessedit_create_tsv* (bool, default: 0) [Both]::
Write tab-separated-values output to a `.tsv` file. Each recognized word is
output as one row with its bounding box, confidence and text. Use the *tsv*
config file to enable this format.
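
A short Python sketch of reading such a file follows. It assumes the header row uses the column names `level`, `left`, `top`, `width`, `height`, `conf` and `text`, and that word rows carry `level` 5; both are assumptions to check against the output of your Tesseract version. The sample data is invented.

```python
import csv
import io


def word_rows(tsv_text):
    """Yield (text, confidence, (left, top, width, height)) for word-level rows."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        if row["level"] != "5":  # assumption: level 5 marks word rows
            continue
        box = tuple(int(row[key]) for key in ("left", "top", "width", "height"))
        yield row["text"], float(row["conf"]), box


# Invented sample: one line-level row followed by one word-level row.
sample = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t36\t92\t200\t24\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96.5\tHello\n"
)
words = list(word_rows(sample))
print(words)
```

Quoting is disabled because recognized text may contain quote characters that should be kept verbatim.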

*tessedit_create_pdf* (bool, default: 0) [Both]::
Write a searchable PDF to a `.pdf` file. The PDF contains the original image
with an invisible text layer for copy-paste and searching. Use the *pdf*
config file to enable this format.

*textonly_pdf* (bool, default: 0) [Both]::
Write a text-only PDF (no image, only invisible text) to a `.pdf` file.

*tessedit_create_boxfile* (bool, default: 0) [Both]::
Write a Tesseract box file (`.box`) that lists each recognized character with
its bounding box, one per line. Can be produced by either engine. These
files are primarily used as ground truth for legacy engine training.

*tessedit_create_wordstrbox* (bool, default: 0) [Both]::
Write a WordStr-format box file (`.box`). Similar to *tessedit_create_boxfile*
but records whole words instead of individual characters.

*tessedit_create_lstmbox* (bool, default: 0) [LSTM]::
Write an LSTM box file (`.box`) suitable for LSTM training.

*preserve_interword_spaces* (bool, default: 0) [Both]::
Preserve multiple consecutive inter-word spaces in the output instead of
collapsing them to a single space.

CHARACTER SET PARAMETERS
~~~~~~~~~~~~~~~~~~~~~~~~

*tessedit_char_whitelist* (string, default: "") [Both]::
Restrict the set of characters that Tesseract will recognize to only those
listed in this string (allowlist). For example, setting this to `0123456789`
will make Tesseract return only digits. An empty value (the default) means
all characters in the trained data are allowed.

*tessedit_char_blacklist* (string, default: "") [Both]::
Prevent Tesseract from recognizing the characters listed in this string.
Characters in this list will never appear in the output. This exclusion list
is applied after the allowlist (*tessedit_char_whitelist*).

*tessedit_char_unblacklist* (string, default: "") [Both]::

Is this a legacy parameter? More about the original's multiple passes. Bit odd having an unblacklist but that could be me and my thought process.

Member

@copilot, please try to avoid "blacklist" and "whitelist" anywhere in the documentation as far as possible, but of course not for the name of parameters. Use more inclusive language.

Contributor Author

Done in commit 7e49f71. The descriptive text now uses "allowlist" and "exclusion list" instead of "whitelist"/"blacklist". All three parameter names (tessedit_char_whitelist, tessedit_char_blacklist, tessedit_char_unblacklist) are unchanged.

Is it worth deprecating these terms and having the same functionality tied to

tessedit_chars_allowed
tessedit_chars_excluded
tessedit_chars_unexcluded

That way the terms still work so it's not a breaking change but the documentation surfaces better language. Feels like this would address your comment better.

Re-allow specific characters that were excluded by *tessedit_char_blacklist*.
Characters in this list override the exclusion list and apply to both the
LSTM and Legacy engines.
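
The interaction of the three lists can be sketched with plain set operations. This is an illustration of the order described above (allowlist, then exclusion list, then re-allowed characters), not Tesseract's implementation; the function name `effective_charset` is hypothetical, and treating every Python character as one recognizable unit is a simplification.

```python
def effective_charset(trained_chars, whitelist="", blacklist="", unblacklist=""):
    """Return the set of characters that may appear in the output."""
    trained = set(trained_chars)
    # An empty allowlist means every character in the trained data is allowed.
    allowed = set(whitelist) & trained if whitelist else set(trained)
    # The exclusion list is applied after the allowlist.
    allowed -= set(blacklist)
    # Re-allowed characters override the exclusion list.
    allowed |= set(unblacklist) & trained
    return allowed


result = effective_charset(
    "0123456789abc", whitelist="0123456789", blacklist="01", unblacklist="1"
)
print(sorted(result))
```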

IMAGE PROCESSING PARAMETERS
~~~~~~~~~~~~~~~~~~~~~~~~~~~

*thresholding_method* (int, default: 0) [Both]::
Select the algorithm used to convert a greyscale image to binary before OCR:
0 = Otsu global thresholding (default);
1 = LeptonicaOtsu (tiled Otsu, better for uneven lighting);
2 = Sauvola local adaptive thresholding (best for heavily degraded documents).

*thresholding_window_size* (double, default: 0.33) [Both]::
Window size (multiplied by image DPI) used to compute local statistics for
the Sauvola thresholding method (*thresholding_method* = 2).

*thresholding_kfactor* (double, default: 0.34) [Both]::
Sensitivity factor for Sauvola thresholding (*thresholding_method* = 2).
Controls how much the local variance reduces the threshold. Typical range:
0.2 -- 0.5. Higher values produce more aggressive thresholding.
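
For orientation, the commonly published Sauvola formula computes a per-pixel threshold from the local mean and standard deviation. The Python sketch below uses that textbook form; the dynamic range of 128 is the conventional choice for 8-bit images and is an assumption here, as is the function name `sauvola_threshold`. It is not taken from the Tesseract source.

```python
import math


def sauvola_threshold(mean, stddev, k=0.34, dynamic_range=128.0):
    """Textbook Sauvola threshold: T = m * (1 - k * (1 - s / R))."""
    return mean * (1.0 - k * (1.0 - stddev / dynamic_range))


# In a flat region (no local variation) the threshold drops to (1 - k) * mean.
flat = sauvola_threshold(100.0, 0.0)
# When the local deviation equals the dynamic range, the threshold is the mean.
busy = sauvola_threshold(100.0, 128.0)
print(flat, busy)
```

In this form a larger k lowers the threshold most in low-contrast neighbourhoods, so fewer faint background pixels end up classified as text.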

*thresholding_tile_size* (double, default: 0.33) [Both]::
Desired tile size (multiplied by image DPI) for the LeptonicaOtsu tiled
thresholding method (*thresholding_method* = 1).

*thresholding_smooth_kernel_size* (double, default: 0) [Both]::
Kernel size for smoothing the threshold array produced by LeptonicaOtsu
(*thresholding_method* = 1). Use 0 for no smoothing.

*thresholding_score_fraction* (double, default: 0.1) [Both]::
Fraction of the maximum Otsu score used by LeptonicaOtsu
(*thresholding_method* = 1). Use 0.0 for standard Otsu behaviour;
0.1 is recommended for better robustness.

*tessedit_do_invert* (bool, default: 1) [Both]::
Deprecated -- will be removed in a future release. When enabled, Tesseract
tries OCR on an inverted (white-on-black) copy of lines whose mean confidence
falls below *invert_threshold* and keeps the result with higher confidence.
To disable automatic inversion, set *invert_threshold* = 0 rather than
setting this parameter to 0.

*invert_threshold* (double, default: 0.7) [Both]::
Mean confidence threshold below which Tesseract will also attempt OCR on
the inverted image. Lower values make inversion less likely. Set to 0 to
disable automatic inversion entirely (preferred over setting
*tessedit_do_invert* = 0, which is deprecated).

*user_defined_dpi* (int, default: 0) [Both]::
Override the resolution of the input image in DPI. Use this when the image
metadata contains an incorrect or missing DPI value. A value of 0 means
the resolution is read from the image metadata or guessed automatically.
This parameter is equivalent to the *--dpi* command-line option; when *--dpi*
is given on the command line it simply sets this parameter.

*textord_heavy_nr* (bool, default: 0) [Both]::
Aggressively remove noise blobs during page layout analysis. When disabled
(the default), Tesseract applies a moderate noise threshold that preserves
most legitimate characters. Enabling this raises the threshold significantly,
which can improve results on scans with heavy speckle or background noise, but
may also remove small legitimate characters such as punctuation marks,
diacritics, or small symbols. Useful when OCR-ing degraded or low-quality
document scans where accuracy on punctuation is less important than overall
text extraction.

DICTIONARY PARAMETERS
~~~~~~~~~~~~~~~~~~~~~

*load_system_dawg* (bool, default: 1) [Both]::
Load the main system word list (DAWG) from the traineddata file. Disabling
this can speed up recognition and may improve results when OCR-ing content
that does not resemble natural language (e.g. codes, identifiers).

*load_freq_dawg* (bool, default: 1) [Both]::
Load the list of frequent words from the traineddata file.

*load_unambig_dawg* (bool, default: 1) [Legacy]::
Load the list of unambiguous words from the traineddata file.

*load_punc_dawg* (bool, default: 1) [Legacy]::
Load the dawg containing punctuation patterns from the traineddata file.

*load_number_dawg* (bool, default: 1) [Legacy]::
Load the dawg containing number patterns from the traineddata file.

*load_bigram_dawg* (bool, default: 1) [Legacy]::
Load the dawg containing special word bigrams from the traineddata file.

*user_words_file* (string, default: "") [Both]::
Path to a plain-text file containing additional words (one per line) that
Tesseract should treat as valid dictionary words.

*user_words_suffix* (string, default: "") [Both]::
Filename suffix (relative to the tessdata directory) for a per-language file
of additional valid words. For example, setting this to `user-words` causes
Tesseract to look for `eng.user-words` when using the English model.

*user_patterns_file* (string, default: "") [Both]::

Should you wish to beef this up a tiny bit...

Defined in dict/trie.h, but in simplified terms:

A becomes uppercase letter
a becomes lowercase letter
0 becomes digit

Other symbols match themselves.
This is a structure template and not a regex.

Path to a plain-text file containing additional pattern strings (one per
line) that Tesseract should accept as valid words. In the pattern language,
backslash-escaped sequences specify character classes:
`\d` = any digit;
`\c` = any letter;
`\a` = any lowercase letter;
`\A` = any uppercase letter;
`\n` = any alphanumeric character;
`\p` = any punctuation character.
All other characters match themselves. For example, `1-\d\d\d-GOOG-411`
matches a phone-number-like string.
These are structural templates, not regular expressions. For the full
pattern syntax, see `dict/trie.h` in the Tesseract source.
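
To make the character classes concrete, the following Python sketch translates a pattern string into a regular expression and tests the phone-number example. It is only an approximation: it uses ASCII ranges, whereas Tesseract classifies characters through the model's own character set, and it covers just the six classes listed above. The name `pattern_to_regex` is hypothetical.

```python
import re
import string

# ASCII approximations of the character classes listed above.
_CLASSES = {
    "d": "[0-9]",
    "c": "[A-Za-z]",
    "a": "[a-z]",
    "A": "[A-Z]",
    "n": "[A-Za-z0-9]",
    "p": "[" + re.escape(string.punctuation) + "]",
}


def pattern_to_regex(pattern):
    """Compile a user-pattern string into an equivalent (approximate) regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern) and pattern[i + 1] in _CLASSES:
            parts.append(_CLASSES[pattern[i + 1]])
            i += 2
        else:
            # All other characters match themselves.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


phone = pattern_to_regex(r"1-\d\d\d-GOOG-411")
print(bool(phone.fullmatch("1-800-GOOG-411")))
print(bool(phone.fullmatch("1-80x-GOOG-411")))
```

Because literal characters are escaped, a `.` in a pattern matches only a dot, which reflects the point that patterns are templates rather than regular expressions.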

*user_patterns_suffix* (string, default: "") [Both]::
Filename suffix (relative to the tessdata directory) for a per-language file
of additional patterns.

LSTM ENGINE PARAMETERS
~~~~~~~~~~~~~~~~~~~~~~

These parameters are only meaningful when using the LSTM OCR engine
(*--oem 1* or *--oem 2*).

*lstm_use_matrix* (bool, default: 1) [LSTM]::
Use the ratings matrix and beam search during LSTM decoding. Disabling this
reverts to a simpler, faster greedy decoding strategy that may be adequate
for very clean, high-quality images but generally gives lower accuracy.

*lstm_choice_mode* (int, default: 0) [LSTM]::
Enables alternative character hypotheses in hOCR output (requires
*tessedit_create_hocr* = 1):
0 = disabled (default);
1 = include per-timestep alternative choices;
2 = extract alternative choices from the CTC output mapped per character.

Maybe have a 'see' back up to the hocr character option as these tend to go hand in hand.

See also *hocr_char_boxes* for character bounding-box output.

*lstm_choice_iterations* (int, default: 5) [LSTM]::
Number of cascading beam-search iterations used when *lstm_choice_mode* is
non-zero.

*lstm_rating_coefficient* (double, default: 5) [LSTM]::
Scaling factor applied to LSTM character ratings. Smaller values produce
higher (better) confidence scores and preserve more information before the
zero cut-off.

LEGACY ENGINE PARAMETERS
~~~~~~~~~~~~~~~~~~~~~~~~

The following parameters apply only when using the legacy Tesseract engine
(*--oem 0* or *--oem 2*). This requires a traineddata file that includes the
legacy model, such as those from https://github.com/tesseract-ocr/tessdata.

*tessedit_enable_bigram_correction* (bool, default: 1) [Legacy]::
Apply bigram-based correction to improve recognition of adjacent word pairs
that commonly appear together (e.g. "is a", "in the", "New York"). The
correction uses a bigram dictionary from the traineddata file to re-score
word hypotheses in context.

*tessedit_enable_dict_correction* (bool, default: 0) [Legacy]::
Use the dictionary to post-correct uncertain word hypotheses.

*tessedit_fix_fuzzy_spaces* (bool, default: 1) [Legacy]::
Try to fix spaces that were ambiguously classified as inter-word or
inter-character gaps.

*language_model_penalty_non_dict_word* (double, default: 0.15) [Legacy]::
Penalty added to the score of word hypotheses that do not appear in the
dictionary. Increase to bias recognition more strongly towards dictionary
words.

*language_model_penalty_non_freq_dict_word* (double, default: 0.1) [Legacy]::
Additional penalty for words that are in the dictionary but not in the list
of frequent words.

*language_model_penalty_case* (double, default: 0.1) [Legacy]::
Penalty applied when the capitalisation of a recognised word is inconsistent
with the surrounding context.

*language_model_penalty_script* (double, default: 0.5) [Legacy]::
Penalty applied when a recognised character belongs to a different script
from the surrounding text.

*language_model_penalty_punc* (double, default: 0.2) [Legacy]::
Penalty applied for punctuation usage that is inconsistent with the language
model.

*wordrec_enable_assoc* (bool, default: 1) [Legacy]::
Enable the associator, which considers combinations of character fragments
when forming word hypotheses. Disabling may speed up recognition at the
cost of accuracy on fragmented characters.

DEBUG PARAMETERS
~~~~~~~~~~~~~~~~

*debug_file* (string, default: "") [Both]::
Redirect Tesseract debug/diagnostic output to this file instead of stderr.
Set to `/dev/null` (or use the *quiet* config file) to suppress all debug
output.

*tessedit_write_params_to_file* (string, default: "") [Both]::
If set to a filename, Tesseract will write the values of all its parameters
to that file when it starts up. Useful for capturing the effective
configuration for debugging or reproducibility.


ENVIRONMENT VARIABLES
---------------------
*`TESSDATA_PREFIX`*::