Masking layer vs attention_mask parameter in MultiHeadAttention

Question

I use a MultiHeadAttention layer in my transformer model (my model is very similar to named entity recognition models). Because my sequences have different lengths, I pad them and use the attention_mask parameter of MultiHeadAttention to mask the padding. If I used a Masking layer before MultiHeadAttention, would it have the same effect as the attention_mask parameter? Or should I use both?

Accepted Answer

The TensorFlow documentation on "Masking and padding with Keras" may be helpful. The following is an excerpt from that document:

"When using the Functional API or the Sequential API, a mask generated by an Embedding or Masking layer will be propagated through the network for any layer that is capable of using them (for example, RNN layers). Keras will automatically fetch the mask corresponding to an input and pass it to any layer that knows how to use it."

tf.keras.layers.MultiHeadAttention also supports automatic mask propagation as of TF 2.10.0. From the release notes:

"Improved masking support for tf.keras.layers.MultiHeadAttention. Implicit masks for query, key and value inputs will automatically be used to compute a correct attention mask for the layer. These padding masks will be combined with any attention_mask passed in directly when calling the layer. This can be used with tf.keras.layers.Embedding with mask_zero=True to automatically infer a correct padding mask."

"Added a use_causal_mask call time argument to the layer. Passing use_causal_mask=True will compute a causal attention mask, and optionally combine it with any attention_mask passed in directly when calling the layer."
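To make the combination of masks concrete, here is a minimal plain-Python sketch of how a padding mask and a causal mask can be merged into a single attention mask. It is an illustration, not the Keras implementation: the helper names (padding_attention_mask, causal_mask, combine) are hypothetical, and it assumes the masks are merged with a logical AND and that padding is marked by token id 0, as with mask_zero=True.

```python
# Illustrative sketch only: plain Python, not the Keras implementation.
# True means "may attend", matching the attention_mask convention in Keras.

def padding_attention_mask(query_valid, key_valid):
    """Build a [T, S] mask from per-token validity flags of query and key."""
    return [[q and k for k in key_valid] for q in query_valid]

def causal_mask(num_query, num_key):
    """Lower-triangular [T, S] mask: position t may attend to positions <= t."""
    return [[s <= t for s in range(num_key)] for t in range(num_query)]

def combine(*masks):
    """Element-wise logical AND of same-shaped 2-D masks (assumed merge rule)."""
    return [[all(cells) for cells in zip(*rows)] for rows in zip(*masks)]

# One sequence of length 4 whose last token is padding (token id 0).
token_ids = [5, 8, 3, 0]
valid = [tok != 0 for tok in token_ids]  # what mask_zero=True would infer

pad_mask = padding_attention_mask(valid, valid)
full_mask = combine(pad_mask, causal_mask(4, 4))

for row in full_mask:
    print([int(cell) for cell in row])
```

In this sketch no position can attend to the padded token, and the padded token's own row is fully masked. An explicitly passed attention_mask would be one more argument to combine, which is why supplying both an implicit mask and attention_mask narrows the allowed positions rather than one replacing the other.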

Answer