I’m going over the Hugging Face tutorial, which shows how tokens can be fed into a model to generate hidden representations:
```python
import torch
from transformers import RobertaTokenizer
from transformers import RobertaModel

checkpoint = 'roberta-base'
tokenizer = RobertaTokenizer.from_pretrained(checkpoint)
model = RobertaModel.from_pretrained(checkpoint)

sequences = ["I've been waiting for a HuggingFace course my whole life."]

tokens = tokenizer(sequences, padding=True)
out = model(torch.tensor(tokens['input_ids']))
out.last_hidden_state
```
But how can I input word embeddings directly instead of tokens? That is, I have another model that generates word embeddings, and I need to feed those into this model.
Answer
Most (if not all) Hugging Face encoder models support this via the parameter inputs_embeds:
```python
import torch
from transformers import RobertaModel

m = RobertaModel.from_pretrained("roberta-base")

# Shape: (batch_size, sequence_length, hidden_size); 768 is roberta-base's hidden size
my_input = torch.rand(2, 5, 768)

outputs = m(inputs_embeds=my_input)
```
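As a sanity check, here is a sketch showing that feeding embeddings is equivalent to feeding token ids, assuming you look the embeddings up from the model's own input embedding layer (the names below follow the standard `transformers` API):

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

m = RobertaModel.from_pretrained("roberta-base")
tok = RobertaTokenizer.from_pretrained("roberta-base")

enc = tok(["Hello world"], return_tensors="pt")

# Look up the same word embeddings the model would compute internally
embeds = m.embeddings.word_embeddings(enc["input_ids"])

out_ids = m(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
out_emb = m(inputs_embeds=embeds, attention_mask=enc["attention_mask"])

# The hidden states should match (up to floating-point tolerance)
print(torch.allclose(out_ids.last_hidden_state, out_emb.last_hidden_state, atol=1e-5))
```

If your embeddings come from a different model, they still need the shape (batch, seq_len, 768) for roberta-base; otherwise you'll need a projection layer.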
P.S.: Don’t forget the attention mask in case this is required.
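For completeness, a minimal sketch of passing an attention mask together with inputs_embeds (the mask values below are hypothetical, marking the last two positions of the second sequence as padding):

```python
import torch
from transformers import RobertaModel

m = RobertaModel.from_pretrained("roberta-base")

# Hypothetical embeddings: batch of 2 sequences, 5 positions, hidden size 768
my_input = torch.rand(2, 5, 768)

# 1 = real position, 0 = padding; here the second sequence has only 3 real positions
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]])

outputs = m(inputs_embeds=my_input, attention_mask=mask)
print(outputs.last_hidden_state.shape)  # torch.Size([2, 5, 768])
```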