I am using HuggingFace transformers AutoTokenizer to tokenize small segments of text. However, this tokenization splits text in the middle of words and introduces # characters into the tokens. I have tried several different models with the same results. Here is an example of a piece of text and the…
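The # characters described here are consistent with WordPiece continuation markers: BERT-style tokenizers prefix a piece with `##` when it continues the previous word. That is an assumption, since the excerpt is cut off before its example. A minimal sketch of joining such pieces back into whole words follows; `merge_wordpieces` and the sample token list are hypothetical and not part of the transformers library.

```python
def merge_wordpieces(tokens):
    """Join WordPiece continuation pieces (prefixed with '##') back into words."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            # Continuation piece: append it to the previous word without the marker.
            words[-1] += tok[2:]
        else:
            # A new word (or a stray continuation piece with nothing before it).
            words.append(tok)
    return words


# Hypothetical token list in the style a BERT-like tokenizer might produce.
pieces = ["token", "##ization", "splits", "un", "##common", "words"]
print(merge_wordpieces(pieces))
```

This only rebuilds words for display. The model itself still expects the subword pieces as input.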
Tag: huggingface-tokenizers
BERT get sentence embedding
I am replicating code from this page. I have downloaded the BERT model to my local system and am getting sentence embeddings. I have around 500,000 sentences for which I need sentence embeddings, and it is taking a lot of time. Is there a way to expedite the process? Would sending batches of sentences rather than o…
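Batching is usually the first thing to try for a workload like this, because one forward pass over many sentences amortizes per-call overhead. The sketch below shows only the batching logic, assuming a caller-supplied function that embeds a list of sentences at once. The names `batched`, `embed_all`, and `fake_embed_batch` are hypothetical, and the stand-in embedder exists only so the example runs without a model.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_all(
    sentences: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed all sentences by calling `embed_batch` once per batch, preserving order."""
    embeddings: List[List[float]] = []
    for batch in batched(sentences, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real model call (assumption): one-dimensional 'embedding' per sentence."""
    return [[float(len(sentence))] for sentence in batch]


vectors = embed_all(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)
```

In practice `embed_batch` would tokenize the batch with padding and run the model once. A good batch size depends on available memory and sentence length.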
Can BERT output be fixed in shape, irrespective of string size?
I am confused about using huggingface BERT models and about how to make them yield a prediction at a fixed shape, regardless of input size (i.e., input string length). I tried to call the tokenizer with the parameters padding=True, truncation=True, max_length = 15, but the prediction output dimensions for inp…
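One likely cause, offered as an assumption because the excerpt is truncated: in Hugging Face tokenizers, `padding=True` pads only to the longest sequence in the batch, whereas `padding="max_length"` pads every input to `max_length`. The sketch below illustrates the fixed-length behaviour in plain Python; `pad_or_truncate` is a hypothetical helper and `pad_id=0` is an assumed padding id.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly `max_length` long."""
    # Truncate inputs that are too long.
    clipped = list(token_ids[:max_length])
    n_pad = max_length - len(clipped)
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(clipped) + [0] * n_pad
    # Pad inputs that are too short.
    padded = clipped + [pad_id] * n_pad
    return padded, attention_mask


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(20)), max_length=15)
print(short_ids, short_mask)
print(len(long_ids), len(long_mask))
```

With every input padded or truncated to the same length, the token-level output has the same sequence dimension regardless of the input string.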
Hugging Face: NameError: name ‘sentences’ is not defined
I am following this tutorial: https://huggingface.co/transformers/training.html. However, I am coming across an error, and I think the tutorial is missing an import, but I do not know which. Answer: The error states that you do not have a variab…
Transformers v4.x: Convert slow tokenizer to fast tokenizer
I’m following the Transformers example for the pretrained model xlm-roberta-large-xnli and I get an error. I’m using Transformers version 4.1.1. Answer: According to the Transformers v4.0.0 release notes, sentencepiece was removed as a required dependency. This means that “The tok…