
Special tokens in text aren't escaped properly to huggingface #33

Description

@kwalcock

Huggingface inserts special tokens, notably

<s>
</s>
[CLS]
[SEP]
[UNK]

and some special characters to indicate word boundaries like

▁
Ġ

If these special tokens and characters occur in the input text itself, they are often not handled correctly. The bert-base-cased tokenizer, which uses [SEP] rather than </s>, handles the literal text </s> correctly, producing

['[CLS]', '<', '/', 's', '>', '[SEP]']
[101, 133, 120, 188, 135, 102]

but roberta-base produces

['<s>', '</s>', '</s>']
[0, 2, 2]

in which it confuses the literal text with the special tokens it inserts itself. Conversely, bert-base-cased is confused by the literal text [SEP]:

['[CLS]', '[SEP]', '[SEP]']
[101, 102, 102]

This is already a problem when tokenizing via Python, and the Rust results can differ on top of that. With roberta-base, the Python result for the input <s> is

['<s>', '<s>', '</s>']
[0, 0, 2]

and the Rust result is

['<s>', 'Ġ', '<s>', '</s>']
[0, 1437, 0, 2]

These differences probably won't have a measurable effect, but they would cause reproducibility problems if Python and Rust results were compared byte for byte.
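To make such a comparison concrete, here is a minimal sketch that reports the first position at which two tokenizations disagree. The helper name `first_divergence` is hypothetical and not part of any library; the token and id lists are copied from the roberta-base results for <s> above.

```python
from itertools import zip_longest

# Sentinel used to pad the shorter sequence; it never equals a real token or id.
_MISSING = object()


def first_divergence(a, b):
    """Return the index of the first position where a and b differ, or None if they are identical.

    Hypothetical helper for illustration only.
    """
    for i, (x, y) in enumerate(zip_longest(a, b, fillvalue=_MISSING)):
        if x != y:
            return i
    return None


# Results quoted in this issue for the input "<s>" with roberta-base.
python_tokens = ["<s>", "<s>", "</s>"]
python_ids = [0, 0, 2]
rust_tokens = ["<s>", "Ġ", "<s>", "</s>"]
rust_ids = [0, 1437, 0, 2]

print(first_divergence(python_tokens, rust_tokens))
print(first_divergence(python_ids, rust_ids))
```

Both calls point at index 1, where the Rust result has the extra Ġ (id 1437) that the Python result lacks.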

It looks like Hugging Face uses a single channel for both the text and its own control tokens and does not escape the text properly.
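Until the text is escaped properly, one possible guard is to scan the input for reserved strings before it reaches the tokenizer, so the caller can decide what to do with them. The following is only a sketch under assumptions: `find_special_tokens` is a hypothetical helper, and the token lists are written out by hand from the examples above. With the transformers library the list could presumably be taken from the tokenizer's `all_special_tokens` attribute instead.

```python
import re


def find_special_tokens(text, special_tokens):
    """Return (offset, token) pairs for every reserved string found in text.

    Hypothetical helper for illustration; it only detects collisions and
    does not modify or escape the text.
    """
    # Try longer tokens first so a shorter token cannot shadow a longer one
    # that starts at the same offset.
    ordered = sorted(set(special_tokens), key=len, reverse=True)
    if not ordered:
        return []
    pattern = "|".join(re.escape(tok) for tok in ordered)
    return [(m.start(), m.group()) for m in re.finditer(pattern, text)]


# Hand-written lists based on the tokens named in this issue (an assumption,
# not read from the actual tokenizers).
ROBERTA_SPECIAL = ["<s>", "</s>"]
BERT_SPECIAL = ["[CLS]", "[SEP]", "[UNK]"]

print(find_special_tokens("</s>", ROBERTA_SPECIAL))
print(find_special_tokens("</s>", BERT_SPECIAL))
print(find_special_tokens("[SEP]", BERT_SPECIAL))
```

The word-boundary characters ▁ and Ġ could be added to the lists in the same way if they also need to be flagged.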
