I have a very large pandas DataFrame and want to sample rows from it for modeling, but I ran into an out-of-memory error like this:
MemoryError: Unable to allocate 6.59 GiB for an array with shape (40, 22117797) and data type float64
This error is strange, since I should not need to allocate such a large amount of memory: my sampled DataFrame is only 1% of the original data. My code is below.
Specifically, the original data has about 20 million rows and most columns are np.float64. After loading the data from a parquet file using pyarrow, the Jupyter kernel takes about 3 GB of memory. After the column assignments "d0['r_%s' % t] = d0.col0", the kernel takes 6 GB. However, once I run the sampling command "d0s = d0.iloc[id1, :]", memory usage goes up to 13 GB and the program stops with the out-of-memory error above.
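For reference, the failed allocation in the traceback is exactly the size of one contiguous float64 array of that shape, which hints at what pandas is trying to build internally:

40 * 22117797 * 8 / 2**30   # bytes for a (40, 22117797) float64 array: ~6.59 GiB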
The code below is a minimal working example that reproduces the error on a machine with 16 GB of memory, using pandas 1.2.3.
import pandas as pd
import numpy as np

d0 = pd.DataFrame(np.random.rand(22117797, 12))
# append 30 extra columns, each a copy of column 0
for t in range(30):
    d0['r_%s' % t] = d0[0]
# sample 1% of the rows (with replacement)
id1 = np.random.randint(low=0, high=d0.shape[0], size=round(d0.shape[0] * 0.01))
d0s = d0.iloc[id1, :]
Note that the following code does not generate the error if I directly create the big DataFrame:
import pandas as pd
import numpy as np

d0 = pd.DataFrame(np.random.rand(22117797, 42))
id1 = np.random.randint(low=0, high=d0.shape[0], size=round(d0.shape[0] * 0.01))
d0s = d0.iloc[id1, :]
Answer
I found that the error is due to a consolidation operation that pandas performs. Specifically, after the column assignments "d0['r_%s' % t] = d0[0]", d0 is stored as 13 blocks, i.e. 13 separate contiguous memory regions, and this can be checked with the command
d0._data.nblocks
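Note that _data is an internal attribute; on newer pandas releases it is deprecated in favor of _mgr, so the equivalent check there would presumably be:

d0._mgr.nblocks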
Once I run the command "d0s = d0.iloc[id1, :]", pandas tries to consolidate the 13 blocks into a single block. Building this new one-block version of d0 requires a second, full-size contiguous allocation while the old blocks are still held, so my RAM is used up and the out-of-memory error pops up. The one-block consolidation can be verified with a smaller array, as below:
import pandas as pd
import numpy as np

d0 = pd.DataFrame(np.random.rand(22117797, 12))
for t in range(10):
    d0['r_%s' % t] = d0[0]
d0._data.nblocks   # more than one block after the column assignments
id1 = np.random.randint(low=0, high=d0.shape[0], size=round(d0.shape[0] * 0.01))
d0s = d0.iloc[id1, :]
d0._data.nblocks   # 1 block: the .iloc call triggered consolidation
To solve this problem, I used another way to take the values, one that does not trigger the consolidation, as below:
d0s = pd.concat([d0.iloc[id1, [i]] for i in range(d0.shape[1])], axis=1)   # one column at a time, so no full-width block is built
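(Note that .iloc needs positional column indexers, which is why the list comprehension iterates over range(d0.shape[1]) rather than the column labels.) This copies one column at a time, so the peak extra memory is roughly the size of the 1% sample rather than a full-width block. If you control how the frame is built in the first place, another option is to materialize all columns in one NumPy array up front, so the DataFrame starts life as a single block and .iloc never has to consolidate. A sketch of that idea, mirroring the 42-column example above:

import pandas as pd
import numpy as np

base = np.random.rand(22117797, 12)
# replicate column 0 thirty times in one allocation instead of
# appending 30 separate blocks to the DataFrame
full = np.hstack([base, np.repeat(base[:, [0]], 30, axis=1)])
del base   # free the intermediate to keep peak memory down

d0 = pd.DataFrame(full, columns=list(range(12)) + ['r_%s' % t for t in range(30)])
id1 = np.random.randint(low=0, high=d0.shape[0], size=round(d0.shape[0] * 0.01))
d0s = d0.iloc[id1, :]   # d0 is already one block, so no consolidation copy

This still needs one full-size allocation for the array itself, but it avoids the second full-size copy during sampling, which is what exhausted the RAM here.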
Hope this will help others encountering similar problems.