
transformers AutoTokenizer.tokenize introducing extra characters

I am using the HuggingFace transformers AutoTokenizer to tokenize small segments of text. However, the tokenization splits words in the middle and introduces # characters into the tokens. I have tried several different models with the same results.

Here is an example of a piece of text and the tokens that were created from it.

CTO at TLR Communications Pty Ltd
['[CLS]', 'CT', '##O', 'at', 'T', '##LR', 'Communications', 'P', '##ty', 'Ltd', '[SEP]']

And here is the code I am using to generate the tokens

from transformers import AutoTokenizer
sequence = "CTO at TLR Communications Pty Ltd"
tokenizer = AutoTokenizer.from_pretrained("tokenizer_bert.json")
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))


Answer

This is not an error but a feature. BERT and other transformer models use the WordPiece tokenization algorithm, which splits a string into either (1) words that are in the tokenizer vocabulary, or (2) "word pieces" for words that are not.

In your example, the words "CTO", "TLR", and "Pty" are not in the tokenizer vocabulary, so WordPiece splits them into subwords. For example, the first subword is "CT" and the next is "##O", where the "##" prefix indicates that the subword is attached to the preceding one.
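
Here is a minimal sketch of the same behaviour. Using "bert-base-cased" is an assumption on my part, since you load a local tokenizer file, but any cased WordPiece/BERT tokenizer splits the string in essentially the same way:

from transformers import AutoTokenizer

# assumed stand-in for your local tokenizer file
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokens = tokenizer.tokenize("CTO at TLR Communications Pty Ltd")
print(tokens)
# roughly: ['CT', '##O', 'at', 'T', '##LR', 'Communications', 'P', '##ty', 'Ltd']
# tokens starting with '##' continue the previous token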

This is a useful feature because it allows the tokenizer to represent any string without out-of-vocabulary tokens.
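
Nothing is lost in the process: the word pieces can be merged back into the original text. A small sketch, again assuming the "bert-base-cased" stand-in:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint
sequence = "CTO at TLR Communications Pty Ltd"

# convert_tokens_to_string() joins the '##' pieces back onto their predecessors
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(sequence)))

# decode() after encode() does the same; skip_special_tokens drops [CLS]/[SEP]
print(tokenizer.decode(tokenizer.encode(sequence), skip_special_tokens=True))

Both calls print the original sentence, "CTO at TLR Communications Pty Ltd".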
