Can BERT output be fixed in shape, irrespective of string size?

I am confused about how to use Hugging Face BERT models so that they yield an output of fixed shape, regardless of input size (i.e., input string length).

I tried calling the tokenizer with padding=True, truncation=True, max_length=15, but the output dimensions for inputs = ["a", "a"*20, "a"*100, "abcede"*20000] are not fixed. What am I missing here?

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = ["a", "a"*20, "a"*100, "abcede"*20000]
for input in inputs:
  inputs = tokenizer(input, padding=True, truncation=True, max_length = 15, return_tensors="pt")
  outputs = model(**inputs)
  print(outputs.last_hidden_state.shape, input, len(input))

Output:

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
torch.Size([1, 3, 768]) a 1
torch.Size([1, 12, 768]) aaaaaaaaaaaaaaaaaaaa 20
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 100
torch.Size([1, 3, 768]) abcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcededeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeab....deabbcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcede 120000


Answer

When you call the tokenizer with only one sentence and padding=True, truncation=True, max_length=15, it pads the output sequence to the longest sequence in the batch and truncates if required. Since you provide only a single sentence per call, there is nothing to pad, because that sentence is already the longest sequence in its batch. That means you can achieve what you want in two ways:

  1. Provide a batch:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = ["a", "a"*20, "a"*100, "abcede"*200]
inputs = tokenizer(inputs, padding=True, truncation=True, max_length = 15, return_tensors="pt")
print(inputs["input_ids"])
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Output:

tensor([[  101,  1037,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0],
        [  101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
          2050,   102,     0,     0,     0],
        [  101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
         11057, 11057, 11057, 11057,   102],
        [  101,   100,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0]])
torch.Size([4, 15, 768])
  2. Set padding="max_length":
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = ["a", "a"*20, "a"*100, "abcede"*200]
for i in inputs:
  inputs = tokenizer(i, padding='max_length', truncation=True, max_length = 15, return_tensors="pt")
  print(inputs["input_ids"])
  outputs = model(**inputs)
  print(outputs.last_hidden_state.shape, i, len(i))

Output:

tensor([[ 101, 1037,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0]])
torch.Size([1, 15, 768]) a 1
tensor([[  101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
          2050,   102,     0,     0,     0]])
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaa 20
tensor([[  101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
         11057, 11057, 11057, 11057,   102]])
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 100
tensor([[101, 100, 102,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0]])
torch.Size([1, 15, 768]) abcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcede 1200
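
You can also combine the two approaches: pass the whole batch and set padding="max_length", so every call returns the same shape no matter how the strings are grouped. A minimal sketch using the same model and inputs as above; the shape in the comment is what the outputs above imply:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = ["a", "a"*20, "a"*100, "abcede"*200]
# padding="max_length" pads every sequence to exactly max_length tokens,
# and truncation=True cuts longer sequences down to that length.
batch = tokenizer(inputs, padding="max_length", truncation=True, max_length=15, return_tensors="pt")
outputs = model(**batch)
print(outputs.last_hidden_state.shape)  # expected: torch.Size([4, 15, 768])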