
transformers AutoTokenizer.tokenize introducing extra characters

I am using the HuggingFace transformers AutoTokenizer to tokenize small segments of text. However, the tokenization splits words in the middle and introduces # characters into the tokens. I have tried several different models with the same results.

Here is an example of a piece of text and the tokens that were created from it.

CTO at TLR Communications Pty Ltd
['[CLS]', 'CT', '##O', 'at', 'T', '##LR', 'Communications', 'P', '##ty', 'Ltd', '[SEP]']

And here is the code I am using to generate the tokens

from transformers import AutoTokenizer
sequence = "CTO at TLR Communications Pty Ltd"
tokenizer = AutoTokenizer.from_pretrained("tokenizer_bert.json")
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))


Answer

This is not an error but a feature. BERT and other transformer models use the WordPiece tokenization algorithm, which splits a string into either (1) words that are in the tokenizer vocabulary, or (2) "word pieces" for words that are not.

In your example, the words "CTO", "TLR", and "Pty" are not in the tokenizer vocabulary, so WordPiece splits them into subwords. For example, the first subword is "CT" and the next is "##O", where the "##" prefix indicates that the subword is attached to the preceding one.
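
Here is a minimal sketch of the same behaviour. Using "bert-base-cased" is an assumption on my part, since you load a local tokenizer file, but any cased WordPiece/BERT tokenizer splits the string in essentially the same way:

from transformers import AutoTokenizer

# assumed stand-in for your local tokenizer file
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokens = tokenizer.tokenize("CTO at TLR Communications Pty Ltd")
print(tokens)
# roughly: ['CT', '##O', 'at', 'T', '##LR', 'Communications', 'P', '##ty', 'Ltd']
# tokens starting with '##' continue the previous token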

This is a useful feature because it allows the tokenizer to represent any string without out-of-vocabulary tokens.
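
Nothing is lost in the process: the word pieces can be merged back into the original text. A small sketch, again assuming the "bert-base-cased" stand-in:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint
sequence = "CTO at TLR Communications Pty Ltd"

# convert_tokens_to_string() joins the '##' pieces back onto their predecessors
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(sequence)))

# decode() after encode() does the same; skip_special_tokens drops [CLS]/[SEP]
print(tokenizer.decode(tokenizer.encode(sequence), skip_special_tokens=True))

Both calls print the original sentence, "CTO at TLR Communications Pty Ltd".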
