Even after rebooting the machine, more than 95% of the GPU memory is used by a python3
process (the system-wide interpreter).
Note that the memory stays allocated even though no training scripts are running, and I’ve never used Keras/TensorFlow
in the system environment, only inside a venv
or a Docker container.
UPDATE: The last activity was the execution of an NN test script with the following configuration:
tensorflow==1.14.0
Keras==2.0.3
tf.autograph.set_verbosity(1)
tf.set_random_seed(1)
session_conf = tf.ConfigProto(intra_op_parallelism_threads=8, inter_op_parallelism_threads=8)
session_conf.gpu_options.allow_growth = True
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.26       Driver Version: 440.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   53C    P3    N/A /  N/A |   3981MiB /  4042MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4105      G   /usr/lib/xorg/Xorg                           145MiB |
|    0      4762      C   /usr/bin/python3                            3631MiB |
|    0      4764      G   /usr/bin/gnome-shell                          88MiB |
|    0      5344      G   ...quest-channel-token=8947774662807822104    61MiB |
|    0      6470      G   ...Charm-P/ch-0/191.6605.12/jre64/bin/java     5MiB |
|    0      7200      C   python                                        45MiB |
+-----------------------------------------------------------------------------+
After rebooting into recovery mode, I tried running nvidia-smi -r,
but it didn’t solve the issue.
Answer
By default, TensorFlow allocates GPU memory for the lifetime of the process, not the lifetime of the session object, so the memory can linger long after the session is gone. That is why memory is still occupied after you stop your script: it stays reserved for as long as the owning process is alive. Setting gpu_options.allow_growth = True
makes the allocation flexible, but it only delays it; TensorFlow will still claim as much GPU memory as the process ends up needing and will not hand it back while the process is running.
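If you need the memory back as soon as a run finishes, one way to exploit the process-lifetime behaviour (a minimal sketch, not part of the original setup) is to run the TensorFlow work in a child process; the train() function below is a hypothetical placeholder for your test script's workload:

import multiprocessing as mp

def train():
    # Hypothetical entry point: import TensorFlow, build the graph,
    # create the tf.Session and run the workload inside the child,
    # so the GPU allocation is owned by the child process.
    import tensorflow as tf
    ...

if __name__ == "__main__":
    p = mp.Process(target=train)
    p.start()
    p.join()  # once the child exits, its GPU memory is returned to the driver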
To prevent tf.Session
from consuming all of your GPU memory, you can cap how much the process may allocate by replacing gpu_options.allow_growth = True
with a fixed memory fraction (let’s use 50%, since your program seems to use a lot of memory):
session_conf.gpu_options.per_process_gpu_memory_fraction = 0.5
This should stop you from hitting the card’s limit and cap the process at roughly 2 GB (since your GPU has about 4 GB).
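Putting that together with the session setup from the question (TF 1.14 with standalone Keras), a minimal sketch might look like this:

import tensorflow as tf
from keras import backend as K

session_conf = tf.ConfigProto(intra_op_parallelism_threads=8,
                              inter_op_parallelism_threads=8)
# Cap this process at ~50% of the card's memory instead of growing on demand
session_conf.gpu_options.per_process_gpu_memory_fraction = 0.5
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)

Any model built or compiled after K.set_session(sess) will then run inside this capped session.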