I am trying to align two embeddings (textual and visual). For the visual embedding, I am using VGG as the encoder; its output is a 1x1000 embedding. For the textual encoder, I am using a Transformer whose output is shaped 1x712. What I want is to convert both of these vectors to the same dimension, 512.
```python
img_features.shape, txt_features.shape  # (1, 1000), (1, 712)
```
How can I do this in PyTorch? Should I add a final layer to each architecture that maps its output to 512?
Answer
One option is to apply a differentiable PCA operator such as `torch.pca_lowrank`.
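Note that PCA estimates its components from many samples, so a single `(1, 1000)` vector is not enough to fit it. Here is a minimal sketch of the PCA route, assuming you have a batch of image features to fit on (the batch size and tensor names are placeholders):

```python
import torch

# Hypothetical batch of image features; PCA needs many samples
# to estimate components, so a single (1, 1000) vector won't do.
N = 2048
img_features = torch.randn(N, 1000)  # placeholder for real VGG outputs

# Low-rank PCA: V has shape (1000, 512); its columns are principal directions.
U, S, V = torch.pca_lowrank(img_features, q=512, center=True)

# Project the centered features onto the 512 principal components.
img_reduced = (img_features - img_features.mean(dim=0)) @ V[:, :512]
print(img_reduced.shape)  # torch.Size([2048, 512])
```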
Alternatively, an easier solution is to use two fully connected adapter layers to learn two mappings: one for your image features (`1000 -> n`), the other for the textual features (`712 -> n`). Then you can choose a fusion strategy to combine the two features shaped `(1, n)`: either point-wise addition/multiplication (in those cases `n` should be equal to `512`), or concatenation followed by a final learned mapping `n*2 -> 512`, as sketched below.
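Here is a minimal sketch of the adapter approach covering both fusion strategies. The class and argument names are illustrative, not part of any library:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Maps both modalities to a shared dimension n, then fuses them."""
    def __init__(self, img_dim=1000, txt_dim=712, n=512, fusion="add"):
        super().__init__()
        self.img_adapter = nn.Linear(img_dim, n)  # 1000 -> n
        self.txt_adapter = nn.Linear(txt_dim, n)  # 712  -> n
        self.fusion = fusion
        if fusion == "concat":
            # Concatenation yields a (1, 2n) vector, so learn a
            # final mapping n*2 -> 512 as described above.
            self.final = nn.Linear(2 * n, 512)

    def forward(self, img_features, txt_features):
        img_emb = self.img_adapter(img_features)  # (1, n)
        txt_emb = self.txt_adapter(txt_features)  # (1, n)
        if self.fusion == "add":
            return img_emb + txt_emb  # point-wise addition: requires n == 512
        return self.final(torch.cat([img_emb, txt_emb], dim=-1))

head = FusionHead(fusion="concat")
img_features, txt_features = torch.randn(1, 1000), torch.randn(1, 712)
print(head(img_features, txt_features).shape)  # torch.Size([1, 512])
```

Both adapters (and the optional final layer) are trained jointly with whatever loss you use to align the two modalities.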