BERT get sentence embedding

I am replicating the code from this page. I have downloaded the BERT model to my local system and am generating sentence embeddings.

I have around 500,000 sentences for which I need sentence embeddings, and it is taking a lot of time.

  1. Is there a way to expedite the process?
  2. Would sending batches of sentences rather than one sentence at a time help?



Update 1

I modified my code based on the answer provided, but it is still not doing full batch processing.


I could replace the first two lines of the for loop with the lines below, but they work only if all sentences have the same length after tokenization.


Moreover, in that case outputs = model(tokens_tensor, segments_tensors) fails.

How could I fully perform batch processing in such a case?

Advertisement

Answer

About your original question: there is not much you can do. BERT is a computationally demanding model. Your best shot is to use BertTokenizerFast instead of the regular BertTokenizer. The “fast” version is much more efficient, and you will see the difference on large amounts of text.

That said, I have to warn you that averaging BERT word embeddings does not produce good sentence embeddings. See this post. From your questions I assume you want to do some kind of semantic similarity search. Try using one of those open-sourced models.

User contributions licensed under: CC BY-SA