Draft
Changes from 2 commits
271 changes: 271 additions & 0 deletions doc/tesseract.1.asc
@@ -406,6 +406,277 @@ one per line. The format of the latter is documented in 'dict/trie.h'
on 'read_pattern_list()'.


[[PARAMETERS]]
PARAMETERS
----------

Review comment:

Might be worth including textord_heavy_nr. It's a common one in noisy scans when you are cool with forsaking punctuation 😀

Tesseract parameters control the behaviour of the OCR engine and can be set
using the *-c* option (e.g. `-c tessedit_char_whitelist=0123456789`) or by
placing them in a <<CONFIGFILE,'CONFIGFILE'>>.
Run *--print-parameters* to list all available parameters with their current
values and short descriptions.
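
The two ways of setting parameters described above can be sketched in Python. This is only an illustration and not part of Tesseract: the helper names `build_command` and `config_file_text` are hypothetical, the file names are placeholders, and the sketch assumes the usual `tesseract IMAGE OUTPUTBASE [OPTIONS]` argument order and a config-file layout of one `name value` pair per line.

```python
def build_command(image, output_base, params):
    """Return an argv list that passes each parameter with its own -c option."""
    argv = ["tesseract", image, output_base]
    for name, value in params.items():
        argv += ["-c", f"{name}={value}"]
    return argv


def config_file_text(params):
    """Render the same parameters as config-file lines ("name value")."""
    return "".join(f"{name} {value}\n" for name, value in params.items())


# Placeholder inputs; only the parameter names come from the text above.
params = {"tessedit_char_whitelist": "0123456789", "preserve_interword_spaces": 1}
command = build_command("scan.png", "out", params)
print(" ".join(command))
print(config_file_text(params), end="")
```

The argv list can be handed to `subprocess.run` on a machine where Tesseract is installed; building it as a list avoids shell quoting problems when a character list contains spaces or quotes.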

The engine tag in square brackets after each parameter indicates which OCR engine modes support it: +
*Both* -- works with both the LSTM and Legacy engines; +
*LSTM* -- only applies to the neural-network LSTM engine (OEM 1 or 2); +
*Legacy* -- only applies to the legacy Tesseract engine (OEM 0 or 2).

OUTPUT FORMAT PARAMETERS
~~~~~~~~~~~~~~~~~~~~~~~~

*tessedit_create_txt* (bool, default: 0) [Both]::
Write plain-text output to a `.txt` file. Although this parameter defaults to
0, plain text is the output format used when no config file or *-c* option
selects a different one.

*tessedit_create_hocr* (bool, default: 0) [Both]::
Write hOCR output to a `.hocr` file. hOCR is an HTML-based format that
encodes the OCR results together with their bounding boxes and confidences.
Use the *hocr* config file to enable this format.

Review comment:

Having a link here to the specification may be useful
https://kba.github.io/hocr-spec/1.2/
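
To make the hOCR description above concrete, here is a small Python sketch that pulls the bounding box and confidence out of word elements. The sample fragment is made up in the style of hOCR word spans (`ocrx_word` class, `bbox` and `x_wconf` in the `title` attribute), and `parse_words` is a hypothetical helper. A regular expression is used only for brevity; it handles simple word spans, and real output with nested elements is better handled with an HTML parser.

```python
import re

# Made-up fragment in the style of hOCR word elements; real output is a full HTML page.
SAMPLE = (
    "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' id='word_1_2' title='bbox 109 92 171 116; x_wconf 91'>world</span>"
)

# Matches a word span whose title holds "bbox x0 y0 x1 y1; x_wconf N" and plain text content.
WORD_RE = re.compile(
    r"<span class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)['\"][^>]*>([^<]*)</span>"
)


def parse_words(hocr):
    """Return (text, (x0, y0, x1, y1), confidence) for each simple word span."""
    words = []
    for match in WORD_RE.finditer(hocr):
        x0, y0, x1, y1, conf = (int(g) for g in match.groups()[:5])
        words.append((match.group(6), (x0, y0, x1, y1), conf))
    return words


for text, bbox, conf in parse_words(SAMPLE):
    print(text, bbox, conf)
```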


*hocr_font_info* (bool, default: 0) [Both]::
Include font information in hOCR output.

Review comment:

What information, and how reliable is it? I'd love to know that before enabling this.


*hocr_char_boxes* (bool, default: 0) [Both]::
Add per-character bounding-box coordinates to hOCR output.

Review comment:

My knowledge here might be out of date, but when I used Tesseract in the past I found that the character boxes only really worked with the legacy engine.
So I wrote a quick Python script to extract the letter 'a' using LSTM and got this:

[Image]

and

[Image]

So I had 26 failures out of 131, roughly a 20% failure rate, using the LSTM engine. Maybe worth noting that this is still under development?


*tessedit_create_alto* (bool, default: 0) [Both]::
Write ALTO XML output to a `.xml` file. ALTO (Analyzed Layout and Text
Object) is a standard XML schema for describing the layout and content of
pages. Use the *alto* config file to enable this format.

*tessedit_create_page_xml* (bool, default: 0) [Both]::
Write PAGE XML output to a `.page.xml` file. PAGE is a standard XML format
for ground truth and OCR results used in document image analysis competitions.

Review comment:

PAGE XML: strange emphasis on competitions. It is a general-purpose format used in:

- Digital humanities projects
- Libraries and archives
- Annotation tools (e.g. Transkribus, eScriptorium)

https://github.com/PRImA-Research-Lab/PAGE-XML

Use the *page* config file to enable this format.

*page_xml_polygon* (bool, default: 1) [Both]::
When writing PAGE XML output, create polygon outlines around text regions
instead of simple bounding boxes.

*page_xml_level* (int, default: 0) [Both]::
Granularity of PAGE XML output: 0 = line level, 1 = word level.

*tessedit_create_tsv* (bool, default: 0) [Both]::
Write tab-separated-values output to a `.tsv` file. Each recognized word is
output as one row with its bounding box, confidence and text. Use the *tsv*
config file to enable this format.
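
As a sketch of how the TSV output described above might be consumed, the following Python snippet filters words by confidence. The column names and the convention that word rows have `level` 5 are stated here as assumptions about the usual layout, the sample rows are made up, and `confident_words` is a hypothetical helper; check the header of a real `.tsv` file before relying on it.

```python
import csv
import io

# Made-up sample using the assumed column layout; level 5 rows are taken to be words.
SAMPLE = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t95.5\tHello\n"
    "5\t1\t1\t1\t1\t2\t109\t92\t62\t24\t41.0\tw0rld\n"
)


def confident_words(tsv_text, min_conf=60.0):
    """Return the text of word rows whose confidence is at least min_conf."""
    # QUOTE_NONE keeps quote characters in recognised text from being treated as CSV quoting.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        row["text"]
        for row in reader
        if row["level"] == "5" and float(row["conf"]) >= min_conf
    ]


print(confident_words(SAMPLE))
```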

*tessedit_create_pdf* (bool, default: 0) [Both]::
Write a searchable PDF to a `.pdf` file. The PDF contains the original image
with an invisible text layer for copy-paste and searching. Use the *pdf*
config file to enable this format.

*textonly_pdf* (bool, default: 0) [Both]::
Write a text-only PDF (no image, only invisible text) to a `.pdf` file.

*tessedit_create_boxfile* (bool, default: 0) [Both]::
Write a Tesseract box file (`.box`) that lists each recognized character with
its bounding box. Used mainly for legacy engine training.

Review comment:

If it is used for legacy engine training should it be both? I'm not sure, but worth a thought.


*tessedit_create_wordstrbox* (bool, default: 0) [Both]::
Write a WordStr-format box file (`.box`). Similar to *tessedit_create_boxfile*
but records whole words instead of individual characters.

*tessedit_create_lstmbox* (bool, default: 0) [LSTM]::
Write an LSTM box file (`.box`) suitable for LSTM training.

*preserve_interword_spaces* (bool, default: 0) [Both]::
Preserve multiple consecutive inter-word spaces in the output instead of
collapsing them to a single space.

CHARACTER SET PARAMETERS
~~~~~~~~~~~~~~~~~~~~~~~~

*tessedit_char_whitelist* (string, default: "") [Both]::
Restrict the set of characters that Tesseract will recognize to only those
listed in this string. For example, setting this to `0123456789` will
make Tesseract return only digits. An empty value (the default) means all
characters in the trained data are allowed.

*tessedit_char_blacklist* (string, default: "") [Both]::
Prevent Tesseract from recognizing the characters listed in this string.
Excluded characters will never appear in the output. The exclusion list is
applied after the allowlist (*tessedit_char_whitelist*).

*tessedit_char_unblacklist* (string, default: "") [Both]::

Review comment:

Is this a legacy parameter? It seems to be more about the original's multiple passes. It's a bit odd having an unblacklist, but that could just be me and my thought process.

Review comment (Member):

@copilot, please try to avoid "blacklist" and "whitelist" anywhere in the documentation as far as possible, but of course not for the name of parameters. Use more inclusive language.

Reply (Contributor Author):

Done in commit 7e49f71. The descriptive text now uses "allowlist" and "exclusion list" instead of "whitelist"/"blacklist". All three parameter names (tessedit_char_whitelist, tessedit_char_blacklist, tessedit_char_unblacklist) are unchanged.

Review comment:

Is it worth deprecating these terms and tying the same functionality to

- tessedit_chars_allowed
- tessedit_chars_excluded
- tessedit_chars_unexcluded

That way the old terms still work, so it's not a breaking change, but the documentation surfaces better language. Feels like this would address your comment better.

Re-allow specific characters that were excluded by *tessedit_char_blacklist*.
Characters in this list override the exclusion list.
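
The way the three character-set parameters combine, as described above, can be modelled with plain set operations. This Python sketch is a simplified model of those descriptions, not Tesseract's implementation: `enabled_characters` and `MODEL_CHARS` are hypothetical names, and treating re-allowed characters as enabled regardless of the allowlist is an assumption.

```python
def enabled_characters(all_chars, whitelist="", blacklist="", unblacklist=""):
    """Simplified model of how the three character-set parameters combine."""
    known = set(all_chars)
    # An empty tessedit_char_whitelist means every character in the model is allowed.
    enabled = set(whitelist) & known if whitelist else set(known)
    # tessedit_char_blacklist is applied after the allowlist.
    enabled -= set(blacklist)
    # tessedit_char_unblacklist re-allows characters (assumed to override exclusions).
    enabled |= set(unblacklist) & known
    return enabled


MODEL_CHARS = "0123456789abc"  # stand-in for the character set of a traineddata file
print(sorted(enabled_characters(MODEL_CHARS, whitelist="0123456789", blacklist="01")))
print(sorted(enabled_characters(MODEL_CHARS, blacklist="abc", unblacklist="a")))
```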

IMAGE PROCESSING PARAMETERS
~~~~~~~~~~~~~~~~~~~~~~~~~~~

*thresholding_method* (int, default: 0) [Both]::
Select the algorithm used to convert a greyscale image to binary before OCR: +
0 = Otsu global thresholding (default); +
1 = LeptonicaOtsu (tiled Otsu, better for uneven lighting); +
2 = Sauvola local adaptive thresholding (best for heavily degraded documents).

*thresholding_window_size* (double, default: 0.33) [Both]::
Window size (multiplied by image DPI) used to compute local statistics for
the Sauvola thresholding method (*thresholding_method* = 2).

*thresholding_kfactor* (double, default: 0.34) [Both]::
Sensitivity factor for Sauvola thresholding (*thresholding_method* = 2).
Controls how much the local variance reduces the threshold. Typical range:
0.2 -- 0.5. Higher values produce more aggressive thresholding.
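
The role of the k factor can be illustrated with the textbook Sauvola formula, T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of the grey values in the local window. The Python sketch below is an illustration under assumptions: R = 128 is the value commonly used for 8-bit images, `sauvola_threshold` is a hypothetical helper, and Tesseract's Leptonica-based implementation may differ in detail.

```python
import statistics


def sauvola_threshold(window, k=0.34, r=128.0):
    """Textbook Sauvola threshold for one window of greyscale values (0-255)."""
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)
    # Low local contrast (small std) pulls the threshold below the mean, scaled by k.
    return mean * (1.0 + k * (std / r - 1.0))


flat = [100] * 9                                    # uniform region, no contrast
edge = [20, 20, 20, 200, 200, 200, 200, 200, 200]   # dark strokes on a light background

print(sauvola_threshold(flat), sauvola_threshold(edge))
print(sauvola_threshold(edge, k=0.2), sauvola_threshold(edge, k=0.5))
```

In this model a larger k lowers the threshold whenever the local standard deviation is below R, so fewer pixels in low-contrast regions end up classified as foreground.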

*thresholding_tile_size* (double, default: 0.33) [Both]::
Desired tile size (multiplied by image DPI) for the LeptonicaOtsu tiled
thresholding method (*thresholding_method* = 1).

*thresholding_smooth_kernel_size* (double, default: 0) [Both]::
Kernel size for smoothing the threshold array produced by LeptonicaOtsu
(*thresholding_method* = 1). Use 0 for no smoothing.

*thresholding_score_fraction* (double, default: 0.1) [Both]::
Fraction of the maximum Otsu score used by LeptonicaOtsu
(*thresholding_method* = 1). Use 0.0 for standard Otsu behaviour;
0.1 is recommended for better robustness.

*tessedit_do_invert* (bool, default: 1) [Both]::
Deprecated -- will be removed in a future release. When enabled, Tesseract
tries OCR on an inverted (white-on-black) copy of lines whose mean confidence
falls below *invert_threshold* and keeps the result with higher confidence.
To disable automatic inversion, set *invert_threshold* = 0 rather than
setting this parameter to 0.

*invert_threshold* (double, default: 0.7) [Both]::
Mean confidence threshold below which Tesseract will also attempt OCR on
the inverted image. Lower values make inversion less likely. Set to 0 to
disable automatic inversion entirely (preferred over setting
*tessedit_do_invert* = 0, which is deprecated).
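
The inversion rule described for *tessedit_do_invert* and *invert_threshold* can be sketched as a small decision function. This is a simplified Python model of the description above, not Tesseract's code: `recognize_with_inversion` and `fake_recognize` are hypothetical, and confidences are assumed to lie between 0 and 1.

```python
def recognize_with_inversion(recognize, line, invert_threshold=0.7):
    """Pick between normal and inverted recognition of one text line.

    recognize(line, inverted) is a hypothetical callable returning
    (text, mean_confidence).
    """
    text, conf = recognize(line, False)
    # A threshold of 0 can never exceed a confidence, so inversion is never tried.
    if conf < invert_threshold:
        inv_text, inv_conf = recognize(line, True)
        if inv_conf > conf:
            return inv_text, inv_conf
    return text, conf


def fake_recognize(line, inverted):
    # Stand-in recogniser: pretends the line is white-on-black,
    # so the inverted copy reads better.
    return ("INVOICE", 0.93) if inverted else ("lNV0lCE", 0.41)


print(recognize_with_inversion(fake_recognize, "line-image"))
print(recognize_with_inversion(fake_recognize, "line-image", invert_threshold=0))
```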

*user_defined_dpi* (int, default: 0) [Both]::
Override the resolution of the input image in DPI. Use this when the image
metadata contains an incorrect or missing DPI value. A value of 0 means
the resolution is read from the image metadata or guessed automatically.
This parameter is equivalent to the *--dpi* command-line option; when *--dpi*
is given on the command line it simply sets this parameter.

DICTIONARY PARAMETERS
~~~~~~~~~~~~~~~~~~~~~

*load_system_dawg* (bool, default: 1) [Both]::
Load the main system word list (DAWG) from the traineddata file. Disabling
this can speed up recognition and may improve results when OCR-ing content
that does not resemble natural language (e.g. codes, identifiers).

*load_freq_dawg* (bool, default: 1) [Both]::
Load the list of frequent words from the traineddata file.

*load_unambig_dawg* (bool, default: 1) [Legacy]::
Load the list of unambiguous words from the traineddata file.

*load_punc_dawg* (bool, default: 1) [Legacy]::
Load the dawg containing punctuation patterns from the traineddata file.

*load_number_dawg* (bool, default: 1) [Legacy]::
Load the dawg containing number patterns from the traineddata file.

*load_bigram_dawg* (bool, default: 1) [Legacy]::
Load the dawg containing special word bigrams from the traineddata file.

*user_words_file* (string, default: "") [Both]::
Path to a plain-text file containing additional words (one per line) that
Tesseract should treat as valid dictionary words.

*user_words_suffix* (string, default: "") [Both]::
Filename suffix (relative to the tessdata directory) for a per-language file
of additional valid words. For example, setting this to `user-words` causes
Tesseract to look for `eng.user-words` when using the English model.
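
The naming rule for *user_words_suffix* can be written out as a one-line helper. In this Python sketch `user_words_path` is a hypothetical function and `/usr/share/tessdata` is a placeholder directory; only the `LANG.SUFFIX` pattern comes from the description above.

```python
from pathlib import Path


def user_words_path(tessdata_dir, lang, suffix):
    """Return the per-language word-list path implied by user_words_suffix."""
    return Path(tessdata_dir) / f"{lang}.{suffix}"


# One suffix yields one file name per language model in use.
for lang in ("eng", "deu"):
    print(user_words_path("/usr/share/tessdata", lang, "user-words"))
```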

*user_patterns_file* (string, default: "") [Both]::

Review comment:

Should you wish to beef this up a tiny bit...

Defined in dict/trie.h, but in simplified terms:

- A becomes uppercase letter
- a becomes lowercase letter
- 0 becomes digit

Other symbols match themselves.
This is a structure template and not a regex.

Path to a plain-text file containing additional pattern strings that
Tesseract should accept as valid words. See `dict/trie.h` for the pattern
format.

*user_patterns_suffix* (string, default: "") [Both]::
Filename suffix (relative to the tessdata directory) for a per-language file
of additional patterns.

LSTM ENGINE PARAMETERS
~~~~~~~~~~~~~~~~~~~~~~

These parameters are only meaningful when using the LSTM OCR engine
(*--oem 1* or *--oem 2*).

*lstm_use_matrix* (bool, default: 1) [LSTM]::
Use the ratings matrix and beam search during LSTM decoding. Disabling this
reverts to a simpler greedy decoding strategy.

Review comment:

Might be worth mentioning that it's a little faster when turned off and requires very clean text? I can't imagine a use case where I'd disable this one.


*lstm_choice_mode* (int, default: 0) [LSTM]::
Enables alternative character hypotheses in hOCR output: +
0 = disabled (default); +
1 = include per-timestep alternative choices; +
2 = extract alternative choices from the CTC output mapped per character.

Review comment:

Maybe have a 'see' back up to the hocr character option as these tend to go hand in hand.


*lstm_choice_iterations* (int, default: 5) [LSTM]::
Number of cascading beam-search iterations used when *lstm_choice_mode* is
non-zero.

*lstm_rating_coefficient* (double, default: 5) [LSTM]::
Scaling factor applied to LSTM character ratings. Smaller values produce
higher (better) confidence scores and preserve more information before the
zero cut-off.

LEGACY ENGINE PARAMETERS
~~~~~~~~~~~~~~~~~~~~~~~~

The following parameters apply only when using the legacy Tesseract engine
(*--oem 0* or *--oem 2*). This requires a traineddata file that includes the
legacy model, such as those from https://github.com/tesseract-ocr/tessdata.

*tessedit_enable_bigram_correction* (bool, default: 1) [Legacy]::
Apply bigram-based correction to improve recognition of adjacent words that
form common pairs.

Review comment:

E.g. 'is a' or 'lived at' (for those not familiar with bigram correction).


*tessedit_enable_dict_correction* (bool, default: 0) [Legacy]::
Use the dictionary to post-correct uncertain word hypotheses.

*tessedit_fix_fuzzy_spaces* (bool, default: 1) [Legacy]::
Try to fix spaces that were ambiguously classified as inter-word or
inter-character gaps.

*language_model_penalty_non_dict_word* (double, default: 0.15) [Legacy]::
Penalty added to the score of word hypotheses that do not appear in the
dictionary. Increase to bias recognition more strongly towards dictionary
words.

*language_model_penalty_non_freq_dict_word* (double, default: 0.1) [Legacy]::
Additional penalty for words that are in the dictionary but not in the list
of frequent words.

*language_model_penalty_case* (double, default: 0.1) [Legacy]::
Penalty applied when the capitalisation of a recognised word is inconsistent
with the surrounding context.

*language_model_penalty_script* (double, default: 0.5) [Legacy]::
Penalty applied when a recognised character belongs to a different script
from the surrounding text.

*language_model_penalty_punc* (double, default: 0.2) [Legacy]::
Penalty applied for punctuation usage that is inconsistent with the language
model.

*wordrec_enable_assoc* (bool, default: 1) [Legacy]::
Enable the associator, which considers combinations of character fragments
when forming word hypotheses. Disabling may speed up recognition at the
cost of accuracy on fragmented characters.

DEBUG PARAMETERS
~~~~~~~~~~~~~~~~

*debug_file* (string, default: "") [Both]::
Redirect Tesseract debug/diagnostic output to this file instead of stderr.
Set to `/dev/null` (or use the *quiet* config file) to suppress all debug
output.

*tessedit_write_params_to_file* (string, default: "") [Both]::
If set to a filename, Tesseract will write the values of all its parameters
to that file when it starts up. Useful for capturing the effective
configuration for debugging or reproducibility.
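
For the reproducibility use case mentioned above, two parameter dumps can be compared to see which settings differ between runs. The Python sketch below assumes each line of the dump is tab-separated as name, value and an optional description; that layout is an assumption to verify against a real file, the sample dumps are made up, and `parse_params` and `changed_params` are hypothetical helpers.

```python
def parse_params(text):
    """Parse "name<TAB>value[<TAB>description]" lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        params[fields[0]] = fields[1] if len(fields) > 1 else ""
    return params


def changed_params(before, after):
    """Return {name: (old, new)} for parameters whose values differ."""
    old, new = parse_params(before), parse_params(after)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


# Made-up dumps from two runs; only the parameter names come from the text above.
RUN_A = "invert_threshold\t0.7\tdesc\nthresholding_method\t0\tdesc\n"
RUN_B = "invert_threshold\t0\tdesc\nthresholding_method\t0\tdesc\n"
print(changed_params(RUN_A, RUN_B))
```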


ENVIRONMENT VARIABLES
---------------------
*`TESSDATA_PREFIX`*::