Hugging Face tokenizers insert special tokens, notably
<s>
</s>
[CLS]
[SEP]
[UNK]
and some special characters to indicate word boundaries like
If these special tokens and characters occur in the input text itself, they are often not handled correctly. The bert-base-cased tokenizer, which uses [SEP], handles </s> correctly, producing
['[CLS]', '<', '/', 's', '>', '[SEP]']
[101, 133, 120, 188, 135, 102]
but roberta-base produces
['<s>', '</s>', '</s>']
[0, 2, 2]
in which it confuses the literal text with its own imaginary mark-up tokens. However, bert-base-cased gets confused in the same way by [SEP]:
['[CLS]', '[SEP]', '[SEP]']
[101, 102, 102]
This is already problematic with tokenization via Python, but the answers from the Rust tokenizer can differ from the Python ones. With roberta-base there is a Python result for <s>
['<s>', '<s>', '</s>']
[0, 0, 2]
and a Rust result
['<s>', 'Ġ', '<s>', '</s>']
[0, 1437, 0, 2]
These differences probably won't have a measurable effect, but they would cause reproducibility problems if Python and Rust results were compared byte by byte.
It looks like Hugging Face is using a single channel for both the data and the mark-up tokens, and is not escaping the data properly.
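Until this is handled by the library, one way to at least notice affected inputs is to scan the raw text for special-token literals before tokenizing. This is only a minimal sketch and not part of any Hugging Face API: the function name `find_special_token_collisions` is hypothetical, and the token list is simply the one from this report. A real check would have to use the special tokens of whichever tokenizer is in use.

```python
import re

# Special-token strings mentioned in this report (assumption: the real set
# depends on the tokenizer and should be taken from it, not hard-coded).
SPECIAL_TOKENS = ["<s>", "</s>", "[CLS]", "[SEP]", "[UNK]"]


def find_special_token_collisions(text, special_tokens=SPECIAL_TOKENS):
    """Return (offset, token) pairs where raw text contains a special-token literal."""
    # re.escape keeps '[' and ']' from being read as a character class.
    pattern = re.compile("|".join(re.escape(tok) for tok in special_tokens))
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    for sample in ["</s>", "a [SEP] b", "plain text"]:
        print(repr(sample), find_special_token_collisions(sample))
```

Any non-empty result marks text that a tokenizer may mistake for its own mark-up, so those inputs can be flagged or escaped by the caller before tokenization.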