I’m trying to recreate a transformer that was written in PyTorch and port it to TensorFlow. Everything was going pretty well until each version of MultiHeadAttention started giving extremely different outputs. Both are implementations of multi-headed attention as described in the paper “Attention Is All You Need”, so they should be able to achieve the same output.
I’m converting
self_attn = nn.MultiheadAttention(dModel, nheads, dropout=dropout)
to
self_attn = MultiHeadAttention(num_heads=nheads, key_dim=dModel, dropout=dropout)
For my tests, dropout is 0.
I’m calling them with:
self_attn(x,x,x)
where x is a tensor with shape=(10, 128, 50)
As expected from the documentation, the PyTorch version returns a tuple of two tensors (which the docs describe in terms of the target sequence length and embedding dimension), both with dimensions [10, 128, 50].
I’m having trouble getting the TensorFlow version to do the same thing. With TensorFlow I only get one tensor back (size [10, 128, 50]), and it looks like neither of the two tensors from PyTorch. Based on the TensorFlow documentation, I should be getting something comparable.
How can I get them to operate the same way? I’m guessing I’m doing something wrong with TensorFlow, but I can’t figure out what.
Answer
nn.MultiheadAttention outputs by default a tuple with two tensors:
attn_output — result of the self-attention operation
attn_output_weights — attention weights averaged(!) over heads
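For example, a minimal sketch with the shapes from the question (nheads = 5 is an assumption here; it just has to divide the embedding dimension of 50):

import torch
import torch.nn as nn

seq_len, batch_size, embed_dim, n_heads = 10, 128, 50, 5  # n_heads assumed

self_attn = nn.MultiheadAttention(embed_dim, n_heads, dropout=0.0)
x = torch.rand(seq_len, batch_size, embed_dim)  # (seq_length, batch_size, embed_dim) by default

attn_output, attn_output_weights = self_attn(x, x, x)
print(attn_output.shape)          # torch.Size([10, 128, 50])
print(attn_output_weights.shape)  # torch.Size([128, 10, 10]), averaged over the 5 heads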
At the same time, tf.keras.layers.MultiHeadAttention outputs by default only one tensor, attention_output (which corresponds to attn_output in PyTorch). The attention weights of all heads will also be returned if the parameter return_attention_scores is set to True, like:
output, scores = self_attn(x, x, x, return_attention_scores=True)
The scores tensor should also be averaged over heads to achieve full correspondence with PyTorch:
scores = tf.math.reduce_mean(scores, 1)
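Here axis 1 is the heads axis: the Keras scores tensor has shape (batch_size, num_heads, target_seq_len, source_seq_len), so averaging over it gives (batch_size, target_seq_len, source_seq_len), the same layout as PyTorch’s averaged attn_output_weights. A minimal shape check, assuming a batch-first input and placeholder num_heads and key_dim values (the layout difference itself is covered just below):

import tensorflow as tf

self_attn = tf.keras.layers.MultiHeadAttention(num_heads=5, key_dim=10, dropout=0.0)
x = tf.random.uniform((128, 10, 50))  # (batch_size, seq_length, embed_dim)

output, scores = self_attn(x, x, x, return_attention_scores=True)
print(scores.shape)                               # (128, 5, 10, 10)
print(tf.math.reduce_mean(scores, axis=1).shape)  # (128, 10, 10)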
While rewriting, keep in mind that by default (as in the snippet in the question) nn.MultiheadAttention expects input in the form (seq_length, batch_size, embed_dim), whereas tf.keras.layers.MultiHeadAttention expects it in the form (batch_size, seq_length, embed_dim).
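Putting it together, here is a sketch of feeding the same data to both layers. The two layers are initialized independently, so only the shapes will line up, not the values; note also that key_dim in Keras is the size of each head, so embed_dim // n_heads is used here to mirror PyTorch’s per-head dimension (the snippet in the question passes the full dModel), and n_heads = 5 is again an assumption:

import numpy as np
import torch
import torch.nn as nn
import tensorflow as tf

seq_len, batch_size, embed_dim, n_heads = 10, 128, 50, 5

x = np.random.rand(seq_len, batch_size, embed_dim).astype(np.float32)

# PyTorch: expects (seq_length, batch_size, embed_dim) by default
torch_attn = nn.MultiheadAttention(embed_dim, n_heads, dropout=0.0)
torch_out, torch_weights = torch_attn(torch.from_numpy(x),
                                      torch.from_numpy(x),
                                      torch.from_numpy(x))
print(torch_out.shape, torch_weights.shape)    # (10, 128, 50) and (128, 10, 10)

# TensorFlow: expects (batch_size, seq_length, embed_dim), so transpose first
x_tf = tf.transpose(tf.constant(x), perm=[1, 0, 2])  # (128, 10, 50)
tf_attn = tf.keras.layers.MultiHeadAttention(
    num_heads=n_heads, key_dim=embed_dim // n_heads, dropout=0.0)
tf_out, tf_scores = tf_attn(x_tf, x_tf, x_tf, return_attention_scores=True)
tf_scores = tf.math.reduce_mean(tf_scores, axis=1)    # average over heads, as above

# Transpose the output back to the PyTorch layout for comparison
tf_out = tf.transpose(tf_out, perm=[1, 0, 2])
print(tf_out.shape, tf_scores.shape)           # (10, 128, 50) and (128, 10, 10)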