I am running a very simple piece of code that reads txt files and adds their contents to an existing dictionary. With htop I see that the used memory increases linearly until I run out of memory. Here is a simplified version of the code:
import numpy as np

data = np.load(path_dictionary, allow_pickle=True)
dic = data.item()

for ids in dic:
    output = np.loadtxt(filename)
    array = output[:,1]
    dic[ids][new_info] = array
I tried deleting the output and running the garbage collector inside the loop, but it has not helped:
del output
del array
gc.collect()
I used a function from this post to get the size of the dictionary before and after 100 iterations. The original dictionary is 9GB and its size increases by about 13MB, while according to htop the used memory increases by 10GB. The script is supposed to read around 70K files.
Can someone help me with what is causing the memory leak, and possible solutions for it?
Answer
When you call array = output[:,1], NumPy just creates a view. That means it keeps a reference to the whole (presumably large) output array, plus the information that array is the column at index 1. When you then store this reference in dic, a reference to the whole output array still exists, so the garbage collector cannot free its memory.
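You can verify this with a quick, self-contained check (the output array below is just a stand-in for one np.loadtxt result; its shape is made up for illustration):

import numpy as np

output = np.zeros((4, 3))      # stands in for the 2-D array returned by np.loadtxt
array = output[:, 1]           # basic slicing returns a view, not a copy

print(array.base is output)    # True: the view holds a reference to the parent array
print(array.flags.owndata)     # False: the view has no data buffer of its own

del output                     # the full (4, 3) buffer is NOT freed here,
                               # because array still references it via .base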
To work around this issue, just instruct NumPy to create a copy:
array = output[:,1].copy()
That way array will contain its own copy of the data (which is slower than creating the view), but the point is that once you delete output (either explicitly via del output or by overwriting it in the next iteration), there are no more references to it and the memory will be freed.
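Applied to the loop from the question, it would look roughly like this (a minimal sketch; path_dictionary, filename and new_info are the placeholders from the question and are assumed to be defined elsewhere):

import numpy as np

data = np.load(path_dictionary, allow_pickle=True)
dic = data.item()

for ids in dic:
    output = np.loadtxt(filename)               # full 2-D array for this file
    dic[ids][new_info] = output[:, 1].copy()    # store an independent copy, not a view
    # output is overwritten on the next iteration, so its buffer can be freed;
    # no explicit del or gc.collect() is needed

With the copy, only one file's worth of data stays referenced through output at any time, so memory use stays roughly constant instead of growing with every file.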