PyTorch’s nn.TransformerEncoder “src_key_padding_mask” not functioning as expected

I'm working with PyTorch's nn.TransformerEncoder module. My input samples have the usual shape (batch-size, seq-len, emb-dim). All samples in a batch are zero-padded to the length of the longest sample in that batch, so I want attention over the all-zero (padding) positions to be ignored.

The documentation says to pass an argument src_key_padding_mask to the forward function of the nn.TransformerEncoder module. This mask should be a tensor of shape (batch-size, seq-len), with True at the zero-padded positions and False everywhere else.
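For illustration, such a mask can be built directly from the zero-padded input; a minimal, hypothetical sketch (the token values and sizes below are made up, not taken from my model):

import torch

# a zero-padded batch of token indices: (batch-size=2, seq-len=5)
tokens = torch.tensor([[5, 9, 3, 0, 0],
                       [7, 2, 0, 0, 0]])

# True at padding positions, False elsewhere -> shape (batch-size, seq-len)
src_key_padding_mask = tokens == 0
# tensor([[False, False, False,  True,  True],
#         [False, False,  True,  True,  True]])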

I achieved that by doing:

. . .

def forward(self, x):
    # x.size -> e.g.: (200, 28, 200)

    mask = (x == 0).cuda().reshape(x.shape[0], x.shape[1])
    # mask.size -> e.g.: (200, 28)

    x = self.embed(x.type(torch.LongTensor).to(device=device))
    x = self.pe(x)

    x = self.transformer_encoder(x, src_key_padding_mask=mask)

    . . .

Everything works fine when I don't set src_key_padding_mask, but when I do, I get the following error:

File "/home/me/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/functional.py", line 4282, in multi_head_attention_forward
    assert key_padding_mask.size(0) == bsz
AssertionError

It seems like it is comparing the first dimension of the mask, which is the batch size, with bsz, which presumably stands for batch size. But why is it failing then? Help is very much appreciated!


Answer

I ran into the same issue, and it is not a bug: PyTorch's Transformer implementation requires the input x to have shape (seq-len, batch-size, emb-dim), while yours seems to be (batch-size, seq-len, emb-dim).
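In other words, either permute the input to (seq-len, batch-size, emb-dim) before calling the encoder, or (on PyTorch 1.9 and later) construct the layers with batch_first=True and keep the batch-first layout. A minimal sketch of the first option, using made-up dimensions rather than the original poster's model:

import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=200, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

x = torch.randn(200, 28, 200)                  # (batch-size, seq-len, emb-dim)
mask = torch.zeros(200, 28, dtype=torch.bool)  # (batch-size, seq-len), True = padding

# The encoder's default layout is (seq-len, batch-size, emb-dim), so permute
# the input; src_key_padding_mask stays (batch-size, seq-len).
out = encoder(x.permute(1, 0, 2), src_key_padding_mask=mask)
out = out.permute(1, 0, 2)                     # back to (batch-size, seq-len, emb-dim)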
