I am replicating code from this page. I have downloaded the BERT model to my local system and am generating sentence embeddings.
I have around 500,000 sentences for which I need sentence embeddings, and it is taking a lot of time.
- Is there a way to expedite the process?
- Would sending batches of sentences rather than one sentence at a time help?
```python
#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

corpa = ["i am a boy", "i live in a city"]

storage = []  # list to store all embeddings
for text in corpa:
    # Add the special tokens.
    marked_text = "[CLS] " + text + " [SEP]"

    # Split the sentence into tokens.
    tokenized_text = tokenizer.tokenize(marked_text)

    # Map the token strings to their vocabulary indices.
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    segments_ids = [1] * len(tokenized_text)

    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers.
    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensors)

        # Evaluating the model will return a different number of objects based on
        # how it's configured in the `from_pretrained` call earlier. In this case,
        # because we set `output_hidden_states = True`, the third item will be the
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]

    # `hidden_states` has shape [13 x 1 x 22 x 768].
    # `token_vecs` is a tensor with shape [22 x 768].
    token_vecs = hidden_states[-2][0]

    # Calculate the average of all 22 token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    storage.append((text, sentence_embedding))
```
Update 1
I modified my code based on the answer provided. It is still not doing full batch processing:
```python
#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)

storage = []  # list to store all embeddings
for i, text in enumerate(encoded_inputs['input_ids']):
    tokens_tensor = torch.tensor([encoded_inputs['input_ids'][i]])
    segments_tensors = torch.tensor([encoded_inputs['attention_mask'][i]])
    print(tokens_tensor)
    print(segments_tensors)

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers.
    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensors)

        # Evaluating the model will return a different number of objects based on
        # how it's configured in the `from_pretrained` call earlier. In this case,
        # because we set `output_hidden_states = True`, the third item will be the
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]

    # `hidden_states` has shape [13 x 1 x seq_len x 768].
    # `token_vecs` is a tensor with shape [seq_len x 768].
    token_vecs = hidden_states[-2][0]

    # Calculate the average of all token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    print(sentence_embedding[:10])
    storage.append((text, sentence_embedding))
```
I could replace the first two lines of the for loop with the lines below, but they work only if all sentences have the same length after tokenization:
```python
tokens_tensor = torch.tensor([encoded_inputs['input_ids']])
segments_tensors = torch.tensor([encoded_inputs['attention_mask']])
```
Moreover, in that case `outputs = model(tokens_tensor, segments_tensors)` fails.
How could I fully perform batch processing in such a case?
Answer
About your original question: there is not much you can do. BERT is a computationally demanding model. Your best shot is to use `BertTokenizerFast` instead of the regular `BertTokenizer`. The "fast" version is much more efficient, and you will see the difference with large amounts of text.
That said, I have to warn you that averaging BERT word embeddings does not create good embeddings for the sentence. See this post. From your question I assume you want to do some kind of semantic similarity search; try one of the open-source models mentioned there.
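To address the batching question from your update: the usual approach is to pad every sentence in the batch to a common length (`padding=True`) and then mask out the pad tokens when averaging, so they do not dilute the mean. Below is a minimal sketch, not the one true implementation; the helper name `embed_batch` is mine, and the use of the second-to-last hidden layer simply mirrors your code above:

```python
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

def embed_batch(sentences):
    # Tokenize the whole batch at once; padding=True pads every sentence
    # to the length of the longest one so they fit in a single tensor.
    encoded = tokenizer(sentences, padding=True, truncation=True,
                        return_tensors='pt')
    with torch.no_grad():
        outputs = model(**encoded)

    # Second-to-last layer, shape [batch, seq_len, 768]
    # (equivalent to outputs[2][-2] in the older tuple-style API).
    token_vecs = outputs.hidden_states[-2]

    # Zero out padding positions before averaging, then divide by the
    # number of real tokens in each sentence.
    mask = encoded['attention_mask'].unsqueeze(-1).float()  # [batch, seq_len, 1]
    summed = (token_vecs * mask).sum(dim=1)                 # [batch, 768]
    counts = mask.sum(dim=1)                                # [batch, 1]
    return summed / counts

batch = ["Hello I'm a single sentence",
         "And another sentence",
         "And the very very last one"]
embeddings = embed_batch(batch)
print(embeddings.shape)  # torch.Size([3, 768])
```

For 500,000 sentences you would call `embed_batch` on chunks of, say, 32 or 64 sentences at a time rather than all at once. Note also that passing the encoded inputs by keyword (`model(**encoded)`) avoids any ambiguity about the positional argument order of `model.forward`.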