I'm working with PyTorch's nn.TransformerEncoder module. My input samples have (as usual) the shape (batch-size, seq-len, emb-dim). All samples in a batch are zero-padded to the length of the longest sample in that batch, so I want attention over the all-zero padding positions to be ignored.
The documentation says to pass an argument src_key_padding_mask to the forward function of the nn.TransformerEncoder module. This mask should be a tensor of shape (batch-size, seq-len) with True at each padded position and False everywhere else.
I achieved that by doing:
. . .
def forward(self, x):
    # x.size -> i.e.: (200, 28, 200)
    mask = (x == 0).cuda().reshape(x.shape[0], x.shape[1])
    # mask.size -> i.e.: (200, 20)
    x = self.embed(x.type(torch.LongTensor).to(device=device))
    x = self.pe(x)
    x = self.transformer_encoder(x, src_key_padding_mask=mask)
. . .
Everything works fine when I don't set src_key_padding_mask. But when I do, I get the following error:
File "/home/me/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/functional.py", line 4282, in multi_head_attention_forward assert key_padding_mask.size(0) == bsz AssertionError
It seems like it is comparing the first dimension of the mask, which is the batch size, with bsz, which presumably also stands for the batch size. But why is it failing then? Help is very much appreciated!
Answer
I ran into the same issue, and it is not a bug: PyTorch's Transformer implementation requires the input x to be (seq-len, batch-size, emb-dim), while yours seems to be (batch-size, seq-len, emb-dim).
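Here is a minimal sketch of the two usual fixes, not the asker's actual model; the tensor sizes and the nhead value are assumptions for illustration. Either permute the embedded input to (seq-len, batch-size, emb-dim) before calling the encoder, or, on PyTorch 1.9 and later, construct the encoder layer with batch_first=True. In both cases src_key_padding_mask keeps the shape (batch-size, seq-len), with True marking padded positions.

import torch
import torch.nn as nn

# Assumed sizes matching the question: batch=200, seq_len=28, emb_dim=200
batch_size, seq_len, emb_dim = 200, 28, 200

x = torch.randn(batch_size, seq_len, emb_dim)              # (batch, seq, emb)
mask = torch.zeros(batch_size, seq_len, dtype=torch.bool)  # True where padded

# Option 1: permute to the (seq, batch, emb) layout the encoder expects by default
encoder_layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
out = encoder(x.permute(1, 0, 2), src_key_padding_mask=mask)
out = out.permute(1, 0, 2)                                  # back to (batch, seq, emb)

# Option 2 (PyTorch >= 1.9): keep (batch, seq, emb) by setting batch_first=True
encoder_layer_bf = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=8, batch_first=True)
encoder_bf = nn.TransformerEncoder(encoder_layer_bf, num_layers=2)
out_bf = encoder_bf(x, src_key_padding_mask=mask)           # (batch, seq, emb)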