I have an exploding-gradient problem that I couldn't solve after several days of trying. I implemented a custom message passing graph neural network in TensorFlow, which is used to predict a continuous value from graph data. Each graph is associated with one target value. Each node of a graph is represented by a node attribute vector, and each edge between nodes is represented by an edge attribute vector.
Within a message passing layer, node attributes are updated in a certain way (e.g., by aggregating other node/edge attributes), and these updated node attributes are returned.
Now, I managed to figure out where the gradient problem occurs in my code. I have the below snippet.
```python
to_concat = [neighbors_mean, e]
z = K.concatenate(to_concat, axis=-1)
output = self.Net(z)
```
Here, `neighbors_mean` is the element-wise mean of the two node attribute vectors v_i and v_j that form the edge with edge attribute `e`. `Net` is a single-layer feed-forward network. With this, the training loss suddenly jumps to NaN after about 30 epochs with a batch size of 32. With a batch size of 128, the gradients still explode, just later, after about 200 epochs.
I found that, in this case, the gradients explode because of the edge attribute `e`. If I didn't concatenate `neighbors_mean` with `e` and just used the code below, there would be no gradient explosion.

```python
output = self.Net(neighbors_mean)
```
I can also avoid the gradient explosion by sending `e` through a sigmoid function, as follows. But this degrades the performance (final MAE) because the values in `e` are mapped to the 0-1 range non-linearly. Note that a Rectified Linear Unit (ReLU) instead of sigmoid didn't work.

```python
to_concat = [neighbors_mean, tf.math.sigmoid(e)]
z = K.concatenate(to_concat, axis=-1)
output = self.Net(z)
```
Just to mention that `e` carries a single value relating to the distance between the two corresponding nodes, and this distance is always in the range 0.5-4. There are no large values or NaNs in `e`.
I have a custom loss function to train this model, but I found that the loss is not the problem (other losses led to the same issue). Below is my custom loss function. Note that although this is a single-output regression network, the final layer of my NN has two neurons, corresponding to the mean and log(sigma) of the prediction.
```python
def robust_loss(y_true, y_pred):
    """Computes the robust loss between labels and predictions."""
    mean, sigma = tf.split(y_pred, 2, axis=-1)
    # tried limiting 'sigma' with sigma = tf.clip_by_value(sigma, -4, 1.0) but the gradients still explode
    loss = np.sqrt(2.0) * K.abs(mean - y_true) * K.exp(-sigma) + sigma
    return K.mean(loss)
```
I basically tried everything suggested online to avoid gradient explosion.
- Applied gradient clipping, with `Adam(lr, clipnorm=1, clipvalue=5)` and also with `tf.clip_by_global_norm(gradients, 1.0)` in a custom training step (see the sketch after this list)
- My target variables are always scaled
- Weights are initialized with the `glorot_uniform` distribution
- Applied regularisation to the weights
- Tried larger batch sizes (up to 256; the gradient explosion is delayed but still happens at some point)
- Tried a reduced learning rate
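For reference, this is a minimal sketch of how `tf.clip_by_global_norm` fits into a custom training step; `model`, `optimizer` and the loss name are placeholders, not my exact code:

```python
import tensorflow as tf

@tf.function
def train_step(x, y_true):
    # `model`, `optimizer` and `robust_loss` are placeholders for my actual objects
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)
        loss = robust_loss(y_true, y_pred)
    gradients = tape.gradient(loss, model.trainable_variables)
    # clip the global norm of all gradients to 1.0 before applying them
    gradients, _ = tf.clip_by_global_norm(gradients, 1.0)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
```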
What am I missing here? I definitely know it has something to do with concatenating `e`. But given that 0.5 < e < 4, why do the gradients explode in this case? This feature `e` is important to me. What else can I do to avoid numerical overflow in my model?
Answer
I solved the problem thanks to this cool debugging tool, `tf.debugging.check_numerics`.
I initially identified that concatenating `e` was the problem, and then realised that the values passed in as `e` are considerably larger than the values in `neighbors_mean`, which `e` is concatenated with. Once they are concatenated and sent through a neural network (`Net()` in my code), some outputs are in the order of hundreds, slowly reaching thousands as training progresses.
This is problematic because I have a softmax operation within the message passing layer. Note that softmax computes an exponential: softmax(x_i) = exp(x_i) / Σ_j exp(x_j). Anything above exp(709) results in a numerical overflow in Python. This was producing `inf` values, and eventually everything becoming `nan` was the problem in my code. So, this is technically not an exploding-gradient problem, which is why it couldn't be solved with gradient clipping.
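The overflow is easy to see in isolation (a small standalone check, not part of my model):

```python
import numpy as np

# float64 can represent values up to ~1.8e308, so exp() overflows just above 709.78
# (float32, which TF often uses, overflows even earlier, just above exp(88.7))
print(np.exp(709.0))     # ~8.2e+307, still finite
print(np.exp(710.0))     # inf (RuntimeWarning: overflow encountered in exp)
print(np.inf / np.inf)   # nan -- an overflowed softmax numerator/denominator does exactly this
```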
How did I track the issue?
I put `tf.debugging.check_numerics()` snippets under several layers/tensors I thought were producing nan values. Something like this:

```python
tf.debugging.check_numerics(layerN, "LayerN is producing nans!")
```
This produces an `InvalidArgumentError` as soon as the layer outputs become `inf` or `nan` during training.
```
Traceback (most recent call last):
  File "trainer.py", line 506, in <module>
    worker.train_model()
  File "trainer.py", line 211, in train_model
    l, tmae = train_step(*batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 855, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: LayerN is producing nans! : Tensor had NaN values
```
Now we know where the problem is.
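For context, here is a minimal sketch of where such a check can sit inside a subclassed Keras layer; the dense layer is just a placeholder, not my actual message passing layer:

```python
import tensorflow as tf
from tensorflow.keras import layers

class CheckedLayer(layers.Layer):
    """Placeholder layer showing where the numeric check goes."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.dense = layers.Dense(units)

    def call(self, inputs):
        out = self.dense(inputs)
        # Raises InvalidArgumentError the moment `out` contains inf or nan
        out = tf.debugging.check_numerics(out, "LayerN is producing nans!")
        return out
```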
How to solve the issue
I applied a kernel constraint to the weights of the layer whose output gets passed to the softmax function.
```python
layers.Dense(x, name="layer1",
             kernel_regularizer=regularizers.l2(1e-6),
             kernel_constraint=min_max_norm(min_value=1e-30, max_value=1.0))
```
This constrains the norm of the layer's incoming weight vectors to at most 1, so the layer cannot produce very large outputs. This resolved the problem without degrading the performance.
Alternatively, one could use the numerically stable implementation of the softmax function.
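A minimal sketch of that idea: subtracting the maximum logit before exponentiating changes nothing mathematically, but keeps every exponent non-positive so exp() cannot overflow. The segment-wise version needed inside a message passing layer would follow the same pattern.

```python
import tensorflow as tf

def stable_softmax(logits, axis=-1):
    # exp(x - max(x)) has exponents <= 0, so it can never overflow,
    # and the result is identical to the naive exp(x) / sum(exp(x)).
    shifted = logits - tf.reduce_max(logits, axis=axis, keepdims=True)
    exp = tf.exp(shifted)
    return exp / tf.reduce_sum(exp, axis=axis, keepdims=True)
```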