Skip to content
Advertisement

Masking layer vs attention_mask parameter in MultiHeadAttention

I use MultiHeadAttention layer in my transformer model (my model is very similar to the named entity recognition models). Because my data comes with different lengths, I use padding and attention_mask parameter in MultiHeadAttention to mask padding. If I would use the Masking layer before MultiHeadAttention, will it have the same effect as attention_mask parameter? Or should I use both: attention_mask and Masking layer?

Advertisement

Answer

The Tensoflow documentation on Masking and padding with keras may be helpful.
The following is an excerpt from the document.

When using the Functional API or the Sequential API, a mask generated by an Embedding or Masking layer will be propagated through the network for any layer that is capable of using them (for example, RNN layers). Keras will automatically fetch the mask corresponding to an input and pass it to any layer that knows how to use it.

tf.keras.layers.MultiHeadAttention also supports automatic mask propagation in TF2.10.0.

Improved masking support for tf.keras.layers.MultiHeadAttention.

  • Implicit masks for query, key and value inputs will automatically be used to compute a correct attention mask for the layer. These padding masks will be combined with any attention_mask passed in directly when calling the layer. This can be used with tf.keras.layers.Embedding with mask_zero=True to automatically infer a correct padding mask.
  • Added a use_causal_mask call time arugment to the layer. Passing use_causal_mask=True will compute a causal attention mask, and optionally combine it with any attention_mask passed in directly when calling the layer.
Advertisement