
CUDA Error: out of memory – Python process utilizes all GPU memory

Even after rebooting the machine, more than 95% of the GPU memory is used by a python3 process (the system-wide interpreter). Note that the memory stays allocated even when no training scripts are running, and I’ve never used keras/tensorflow in the system environment, only inside a venv or in a docker container.

UPDATED: The last activity was running an NN test script with the following configuration:

tensorflow==1.14.0
Keras==2.0.3

import tensorflow as tf
from keras import backend as K

tf.autograph.set_verbosity(1)
tf.set_random_seed(1)

# Limit CPU thread pools and let the GPU allocator grow on demand
session_conf = tf.ConfigProto(intra_op_parallelism_threads=8,
                              inter_op_parallelism_threads=8)
session_conf.gpu_options.allow_growth = True
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)

$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.26       Driver Version: 440.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   53C    P3    N/A /  N/A |   3981MiB /  4042MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4105      G   /usr/lib/xorg/Xorg                           145MiB |
|    0      4762      C   /usr/bin/python3                            3631MiB |
|    0      4764      G   /usr/bin/gnome-shell                          88MiB |
|    0      5344      G   ...quest-channel-token=8947774662807822104    61MiB |
|    0      6470      G   ...Charm-P/ch-0/191.6605.12/jre64/bin/java     5MiB |
|    0      7200      C   python                                        45MiB |
+-----------------------------------------------------------------------------+


After rebooting into recovery mode, I tried running nvidia-smi -r, but it didn’t solve the issue.


Answer

By default, TensorFlow allocates GPU memory for the lifetime of the process, not for the lifetime of the session object, so the memory can linger long after the session is gone. That is why memory is still held after you stop the program. Setting gpu_options.allow_growth = True is more flexible, but it only changes how memory is acquired: instead of grabbing nearly everything up front, TensorFlow allocates as much GPU memory as the process ends up needing, and it still keeps that memory until the process exits.
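As a rough illustration (a minimal sketch assuming TensorFlow 1.x and a visible GPU, not code from the question), you can see this behavior by closing a session and checking nvidia-smi from another terminal while the Python process is still alive; the allocation only disappears once the process exits:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # grow on demand instead of grabbing ~all memory

sess = tf.Session(config=config)
# Run something on the GPU so memory actually gets allocated
a = tf.random_normal([2048, 2048])
b = tf.random_normal([2048, 2048])
sess.run(tf.matmul(a, b))
sess.close()                             # the session object is gone ...

# ... but the GPU memory is still held by this process:
# check `nvidia-smi` in another terminal before the script exits
input("Session closed - inspect nvidia-smi, then press Enter to exit")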

To prevent tf.Session from using all of your GPU memory, you can cap the total amount the process is allowed to allocate. Replace gpu_options.allow_growth = True with a fixed memory fraction (let’s use 50%, since your program seems to need a lot of memory):

session_conf.gpu_options.per_process_gpu_memory_fraction = 0.5

This should keep the process from reaching the upper limit and cap it at roughly 2 GB (since it looks like you have a 4 GB GPU).
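Putting it together, a minimal sketch of the adjusted configuration (assuming the same TF 1.x / Keras setup as in the question) would look like:

import tensorflow as tf
from keras import backend as K

session_conf = tf.ConfigProto(intra_op_parallelism_threads=8,
                              inter_op_parallelism_threads=8)
# Cap this process at ~50% of total GPU memory instead of letting it grow unbounded
session_conf.gpu_options.per_process_gpu_memory_fraction = 0.5
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)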

User contributions licensed under: CC BY-SA