CuDNN crash in TF 2.x after many epochs of training

I’m getting more and more desperate with my TensorFlow project. It took many hours of installing TensorFlow before I figured out that PyCharm, Python 3.7 and TF 2.x are somehow not compatible. Now it runs, but I get a really unspecific cuDNN error after many epochs of training. Do you know whether my code is wrong or whether there is, for example, an installation error? Could you please point me in the right direction? Searching didn’t turn up anything specific either.

My setup [in brackets what I also tried]:

  • HW: i7-4790K, 32 GB RAM and GeForce 2070 Super 8GB
  • OS: Windows 10 64bit
  • Python: 3.6.8 [and 3.7 (where TF failed to install)]
  • IDE: PyCharm 2020.1.1 [and 2020.1]
  • Driver: Latest “Studio” driver 442.92 [and also latest “gaming” driver]
  • CUDA: 10.1 + latest cuDNN DLLs for this version [I also tried 10.2, but TF doesn’t detect it]
  • TF: 2.2.0 RC4 [also 2.0.x and 2.1.5]. All packages were installed via PyCharm (and therefore pip)
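
A setup list like the one above can also be collected from inside the interpreter that PyCharm actually uses, which rules out a mismatch between the environment described and the one that runs. The following is only a minimal sketch: the function name `describe_environment` is hypothetical, and it assumes TF 2.1 or newer, where `tf.config.list_physical_devices` is available.

```python
import platform


def describe_environment():
    """Collect version details worth pasting into a bug report (hypothetical helper)."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "tensorflow": None,       # stays None if TensorFlow cannot be imported
        "built_with_cuda": None,
        "gpus": [],
    }
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is missing or one of its DLLs failed to load.
        return info

    info["tensorflow"] = tf.__version__
    info["built_with_cuda"] = tf.test.is_built_with_cuda()
    # An empty list here means TF does not see the GPU at all.
    info["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `gpus` comes back empty, the problem is in detection (CUDA or cuDNN DLLs not found) rather than in a crash during training.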

This error occurs after ~3h of training. In other cases (or with other parametrisations of the net) the error occurs much earlier. Here you can see the full output of the code snippet below:


Here is some code that should run and produce the output above:


Answer

For those who come after me:

I played around a lot with different versions. I even tried to get CUDA 10.2 to work by symlinking the new DLLs to the old names, but even that did not fix the bug.

I finally got it to work by removing everything from NVIDIA (including the drivers) and installing the newest CUDA 10.1 release (from the end of 2019) together with the studio driver that ships with that release. That is version 431.86, instead of the latest studio release 441.66.
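
To check which driver is actually installed before and after such a reinstall, something like the sketch below may help. It assumes `nvidia-smi` is on the `PATH` (on some Windows installs it is not), and the helper names and the `KNOWN_GOOD_DRIVER` constant are made up for illustration. The value 431.86 is simply the version that worked in this one case, not an official compatibility bound.

```python
import shutil
import subprocess

# Assumption: the driver version that worked in this particular case.
KNOWN_GOOD_DRIVER = "431.86"


def parse_version(text):
    """Turn a string such as '431.86' into a comparable tuple (431, 86)."""
    return tuple(int(part) for part in text.strip().split("."))


def installed_driver_version():
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6, unlike capture_output
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None
    # One line per GPU; the driver version is the same for all of them.
    return result.stdout.splitlines()[0].strip()


def is_newer_than_known_good(version):
    """True if the given driver is newer than the one that worked here."""
    return parse_version(version) > parse_version(KNOWN_GOOD_DRIVER)


if __name__ == "__main__":
    version = installed_driver_version()
    if version is None:
        print("nvidia-smi not found or returned no driver version")
    elif is_newer_than_known_good(version):
        print(f"Driver {version} is newer than {KNOWN_GOOD_DRIVER}")
    else:
        print(f"Driver {version} is not newer than {KNOWN_GOOD_DRIVER}")
```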

I don’t think the previous installations were faulty, so my guess is that the driver version was the problem all along…

User contributions licensed under: CC BY-SA