I am confused about how to use Hugging Face BERT models so that they yield a prediction of a fixed shape, regardless of the input size (i.e., the length of the input string).
I tried calling the tokenizer with the parameters padding=True, truncation=True, max_length=15, but the prediction output dimensions for inputs = ["a", "a"*20, "a"*100, "abcede"*20000] are not fixed. What am I missing here?
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = ["a", "a"*20, "a"*100, "abcede"*20000]

for input in inputs:
    inputs = tokenizer(input, padding=True, truncation=True, max_length=15, return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape, input, len(input))
Output:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
torch.Size([1, 3, 768]) a 1
torch.Size([1, 12, 768]) aaaaaaaaaaaaaaaaaaaa 20
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 100
torch.Size([1, 3, 768]) abcedeabcede...abcede 120000
Answer
When you call the tokenizer with only a single sentence and padding=True, truncation=True, max_length=15, it pads the output sequence to the length of the longest sequence in the batch and truncates when required. Since you are providing only one sentence, the tokenizer cannot pad anything, because that sentence is already the longest sequence in its batch. That means you can achieve what you want in two ways:
- Provide a batch:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = ["a", "a"*20, "a"*100, "abcede"*200]
inputs = tokenizer(inputs, padding=True, truncation=True, max_length=15, return_tensors="pt")
print(inputs["input_ids"])

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
Output:
tensor([[  101,  1037,   102,     0,     0,     0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,  2050,   102,     0,     0,     0],
        [  101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,   102],
        [  101,   100,   102,     0,     0,     0,     0,     0,     0,     0,     0,     0,     0,     0,     0]])
torch.Size([4, 15, 768])
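As a side note, the attention_mask returned by the tokenizer records which positions in those padded rows are real tokens and which are padding, so the model can ignore the padded slots. A minimal sketch to inspect it (same checkpoint as above; the two-sentence batch is just an illustrative example, not the question's data):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a small example batch with the same settings as above.
batch = tokenizer(["a", "a"*20], padding=True, truncation=True,
                  max_length=15, return_tensors="pt")

# 1 marks a real token, 0 marks a padded position.
print(batch["attention_mask"])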
- Set padding="max_length":
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = ["a", "a"*20, "a"*100, "abcede"*200]

for i in inputs:
    inputs = tokenizer(i, padding='max_length', truncation=True, max_length=15, return_tensors="pt")
    print(inputs["input_ids"])
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape, i, len(i))
Output:
tensor([[  101,  1037,   102,     0,     0,     0,     0,     0,     0,     0,     0,     0,     0,     0,     0]])
torch.Size([1, 15, 768]) a 1
tensor([[  101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,  2050,   102,     0,     0,     0]])
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaa 20
tensor([[  101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,   102]])
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 100
tensor([[  101,   100,   102,     0,     0,     0,     0,     0,     0,     0,     0,     0,     0,     0,     0]])
torch.Size([1, 15, 768]) abcedeabcede...abcede 1200
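If it helps, here is a minimal sketch that wraps the second approach in a small helper (encode_fixed is my own name, not from the code above) so that every input, whatever its length, comes back with the same [1, 15, 768] shape:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def encode_fixed(text, max_length=15):
    # Pad or truncate every input to exactly max_length tokens.
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=max_length, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        out = model(**enc)
    return out.last_hidden_state  # shape: [1, max_length, 768]

for text in ["a", "a"*20, "a"*100, "abcede"*20000]:
    hidden = encode_fixed(text)
    print(hidden.shape, len(text))  # always torch.Size([1, 15, 768])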