
What are the main reasons why some network parameters might become NaN after calling optimizer.step() in PyTorch?

I am trying to understand why one or two parameters in my PyTorch neural network occasionally become NaN after calling optimizer.step().

I have already checked the gradients after calling .backward() and just before calling the optimizer: they contain no NaNs and are not unusually large. I am doing gradient clipping, but I don't think that is responsible, since the gradients still look fine after clipping. I am using single-precision floats everywhere.
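For context, the check looks roughly like this; the model, loss function, magnitude threshold, and clipping norm below are placeholders rather than my actual code:

```python
import torch

def training_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Diagnostic: inspect gradients after backward() and before the optimizer step
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        if torch.isnan(p.grad).any():
            print(f"NaN gradient in {name}")
        if p.grad.abs().max() > 1e3:  # illustrative threshold for "very large"
            print(f"Large gradient in {name}: {p.grad.abs().max().item():.3e}")

    # Gradient clipping (max_norm is illustrative)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```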

This happens at random, roughly once every hundred thousand epochs, which makes it very difficult to debug. Unfortunately the code is too long to reproduce here, and I haven't been able to replicate the problem in a smaller example.

If anyone can suggest possible issues I haven’t mentioned above, that would be super helpful.

Thanks!


Answer

This ended up being ignorance on my part: there were Infs in the gradients that were evading my diagnostic code, because I didn't realize that PyTorch's .isnan() method doesn't detect them.
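For anyone hitting the same thing: torch.isnan() returns False for +/-Inf, so a NaN-only check can miss the Infs that later produce NaN parameters after the update step. A minimal sketch of a check that catches both (model here is any nn.Module, not my specific network):

```python
import torch

def check_gradients(model):
    # torch.isfinite() is False for NaN, +Inf and -Inf,
    # so this also catches the Infs that torch.isnan() lets through.
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"Non-finite gradient in {name}")
```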
