6 changes: 6 additions & 0 deletions bindings/python/src/trainers.rs
@@ -425,6 +425,12 @@ impl PyBpeTrainer {

/// Trainer capable of training a WordPiece model
///
/// Note:
/// ``Tokenizer.train_new_from_iterator()`` always uses the BPE trainer
/// internally, even when the underlying model is WordPiece. To train a
/// true WordPiece tokenizer, use ``WordPieceTrainer`` with
/// ``tokenizer.train(...)`` directly, as shown in the example below.
///
/// Args:
/// vocab_size (:obj:`int`, `optional`):
/// The size of the final vocabulary, including all tokens and alphabet.
8 changes: 8 additions & 0 deletions docs/source-doc-builder/components.mdx
@@ -128,6 +128,14 @@ they are the only mandatory component of a Tokenizer.
| WordPiece | This is a subword tokenization algorithm quite similar to BPE, used mainly by Google in models like BERT. It uses a greedy algorithm, that tries to build long words first, splitting in multiple tokens when entire words don’t exist in the vocabulary. This is different from BPE that starts from characters, building bigger tokens as possible. It uses the famous `##` prefix to identify tokens that are part of a word (ie not starting a word). |
| Unigram | Unigram is also a subword tokenization algorithm, and works by trying to identify the best set of subword tokens to maximize the probability for a given sentence. This is different from BPE in the way that this is not deterministic based on a set of rules applied sequentially. Instead Unigram will be able to compute multiple ways of tokenizing, while choosing the most probable one. |

<Tip warning={true}>

`Tokenizer.train_new_from_iterator()` always uses the BPE trainer internally,
even when the underlying model is WordPiece. To train a true WordPiece
tokenizer, use `WordPieceTrainer` with `tokenizer.train(...)` directly.

</Tip>
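
A minimal sketch of the direct approach described in the tip, assuming the `tokenizers` Python package is installed. The toy corpus, the temporary file, the `vocab_size` value and the special-token list are illustrative assumptions, not part of the original text.

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Toy corpus written to a temporary file, because `Tokenizer.train` takes file paths.
corpus = [
    "tokenizers split text into subword units",
    "wordpiece marks word continuations with a prefix",
    "training a tokenizer builds its vocabulary",
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write("\n".join(corpus))
    corpus_path = f.name

# Start from an empty WordPiece model and split on whitespace and punctuation.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Pass the WordPiece trainer explicitly (values here are illustrative).
trainer = WordPieceTrainer(
    vocab_size=200,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

try:
    tokenizer.train([corpus_path], trainer)
finally:
    os.remove(corpus_path)

vocab = tokenizer.get_vocab()
encoding = tokenizer.encode("training tokenizers")
print(encoding.tokens)
```

Passing the trainer explicitly makes the choice of training algorithm visible at the call site instead of leaving it to a convenience wrapper.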

## Post-Processors

After the whole pipeline, we sometimes want to insert some special