I am using the Hugging Face transformers AutoTokenizer to tokenize small segments of text. However, this tokenization splits in the middle of words and introduces # characters into the tokens. I have tried several different models with the same results. Here is an example of a piece of text and the tokens that were created from it…
Tag: huggingface-tokenizers
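This splitting is expected behaviour rather than a bug: BERT-style checkpoints use WordPiece tokenization, which breaks words that are not in the vocabulary into subword pieces and marks every word-internal piece with a "##" prefix. A minimal sketch of what this looks like, assuming the bert-base-uncased checkpoint (any WordPiece-based model shows the same effect):

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption; any WordPiece-based model behaves this way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers split uncommon words such as electroencephalography"
tokens = tokenizer.tokenize(text)
print(tokens)
# Word-internal pieces carry a "##" prefix, e.g. ['token', '##izer', '##s', ...]

# The pieces reassemble into readable text, so the "##" markers are harmless.
print(tokenizer.convert_tokens_to_string(tokens))
```

Since convert_tokens_to_string recovers the original words, downstream code rarely needs to handle the "##" markers by hand.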
BERT get sentence embedding
I am replicating code from this page. I have downloaded the BERT model to my local system and am generating sentence embeddings. I have around 500,000 sentences for which I need embeddings, and it is taking a lot of time. Is there a way to expedite the process? Would sending batches of sentences rather than one sentence at a time help?
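Sending batches rather than single sentences is usually the biggest speed-up, since the model then runs one forward pass per batch instead of per sentence (a GPU helps further). A rough sketch of batched embedding, assuming bert-base-uncased and mean pooling over the last hidden state; the page being replicated may pool differently:

```python
import torch
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Checkpoint name is an assumption; swap in whichever BERT model you downloaded.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device).eval()

def embed(sentences, batch_size=64):
    """Mean-pooled sentence embeddings, computed one batch at a time."""
    chunks = []
    with torch.no_grad():
        for i in range(0, len(sentences), batch_size):
            batch = sentences[i:i + batch_size]
            enc = tokenizer(batch, padding=True, truncation=True,
                            max_length=128, return_tensors="pt").to(device)
            hidden = model(**enc).last_hidden_state             # (B, T, H)
            mask = enc["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
            # Average only over real (non-padding) tokens.
            chunks.append(((hidden * mask).sum(1) / mask.sum(1)).cpu())
    return torch.cat(chunks)

embeddings = embed(["first sentence", "second sentence"])
print(embeddings.shape)  # (2, 768) for a bert-base model
```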
Can BERT output be fixed in shape, irrespective of string size?
I am confused about using Hugging Face BERT models and about how to make them yield a prediction of a fixed shape, regardless of input size (i.e., input string length). I tried calling the tokenizer with the parameters padding=True, truncation=True, max_length=15, but the prediction output dimensions for inputs = ["a", "a"*20, "a"*100, "abcede"*20000] are not fixed. What am I doing wrong?
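The likely cause is that padding=True only pads each batch to its own longest sequence, so the sequence dimension changes from call to call. Padding to a fixed length instead gives a constant output shape. A small sketch assuming bert-base-uncased, with the longest test string shortened just to keep the demo quick:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# "abcede" repeated 200 times here instead of 20000, purely for speed.
inputs = ["a", "a" * 20, "a" * 100, "abcede" * 200]

# padding="max_length" (rather than padding=True) pads every sequence to
# exactly max_length tokens, and truncation=True trims longer ones down to it.
enc = tokenizer(inputs, padding="max_length", truncation=True,
                max_length=15, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)
print(out.last_hidden_state.shape)  # torch.Size([4, 15, 768]) -- constant
```

With this combination every encoded sequence is exactly max_length tokens long, so last_hidden_state is always (batch, 15, hidden_size) regardless of the input string lengths.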
Hugging Face: NameError: name 'sentences' is not defined
I am following this tutorial here: https://huggingface.co/transformers/training.html, though I am coming across an error, and I think the tutorial is missing an import, but I do not know which. These are my current imports: Current code: The error: Answer The error states that you do not have a variable called sentences in scope. I believe the tutorial presumes you have already defined sentences yourself.
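In other words, no import is missing; the tutorial expects you to supply the data yourself. A minimal sketch with placeholder text (the checkpoint name and the example strings are illustrative, not taken from the tutorial):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# The tutorial assumes a variable named `sentences` already exists;
# define it yourself with whatever text you want to process.
sentences = [
    "This is the first placeholder sentence.",
    "And this is a second one.",
]

batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)
```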
Transformers v4.x: Convert slow tokenizer to fast tokenizer
I'm following the Transformers pretrained model xlm-roberta-large-xnli example and I get the following error. I'm using Transformers version '4.1.1'. Answer According to the Transformers v4.0.0 release notes, sentencepiece was removed as a required dependency. This means that "The tokenizers that depend on the SentencePiece library will not be available with a standard transformers installation", including the XLMRobertaTokenizer. However, sentencepiece can be installed separately to restore them.
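A sketch of the fix, assuming the joeddav/xlm-roberta-large-xnli checkpoint from the zero-shot classification example (the exact model id may differ in your setup, and depending on your environment you may also need protobuf): install sentencepiece, then load the tokenizer as usual.

```python
# The missing dependency is installed from the shell, not from Python:
#   pip install sentencepiece

from transformers import AutoTokenizer

# Model id assumed from the zero-shot classification example; with sentencepiece
# available, the SentencePiece-based slow tokenizer loads and transformers can
# convert it to its fast (Rust-backed) counterpart.
tokenizer = AutoTokenizer.from_pretrained("joeddav/xlm-roberta-large-xnli",
                                          use_fast=True)
print(tokenizer.is_fast)  # True once the conversion succeeds
```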