pandas out of memory error after variable assignment

I have a very large pandas dataframe and want to sample rows from it for modeling, but I encountered an out-of-memory error like this:

MemoryError: Unable to allocate 6.59 GiB for an array with shape (40, 22117797) and data type float64

This error is weird, since I shouldn't need to allocate such a large amount of memory: my sampled dataframe is only 1% of the original data. Below is my code.

Specifically, the original data has 20 million rows, and most columns hold np.float64 data. After loading the data from a parquet file using pyarrow, the Jupyter kernel takes about 3 GB of memory. After the variable assignment "d0['r_%s'%(t)] = d0.col0", the kernel takes 6 GB. However, once I run the sampling command "d0s = d0.iloc[id1,:]", memory usage climbs to 13 GB and the program stops with the out-of-memory error above.

The code below is a minimal working example that reproduces the error on a 16 GB machine using pandas 1.2.3.

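Something like the following (I generate random data in place of the parquet file; the exact column names and the split of 28 loaded plus 12 assigned columns are placeholders chosen to match the (40, 22117797) shape in the error message):

    import numpy as np
    import pandas as pd

    n = 22_117_797  # row count taken from the error message

    # Stand-in for the parquet load: 28 float64 columns in one block.
    d0 = pd.DataFrame(np.random.rand(n, 28),
                      columns=['col%d' % i for i in range(28)])

    # Each assignment appends a new single-column block.
    for t in range(12):
        d0['r_%s' % t] = d0.col0

    # Sample 1% of the rows; the MemoryError is raised here.
    id1 = np.random.choice(n, size=n // 100, replace=False)
    d0s = d0.iloc[id1, :]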

Note that the following code won't generate the error if I directly generate a big dataframe:

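For example (with the same n and naming assumptions as above):

    # Creating the frame in one shot stores all 40 columns in a single
    # block, so .iloc needs no consolidation copy and the sample succeeds.
    d0 = pd.DataFrame(np.random.rand(n, 40),
                      columns=['col%d' % i for i in range(40)])
    id1 = np.random.choice(n, size=n // 100, replace=False)
    d0s = d0.iloc[id1, :]  # no MemoryError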


Answer

I found that the error is due to the consolidation operation pandas performs. Specifically, after the variable assignment "d0['r_%s'%(t)] = d0[0]", d0 is stored in 13 blocks, i.e. 13 separate contiguous memory regions. This can be checked with the command below:

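The block count is exposed on pandas' internal BlockManager (an internal API; it was called _data in older pandas versions):

    print(d0._mgr.nblocks)  # 13: one block from the load + 12 assigned columns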

Once I run the command "d0s = d0.iloc[id1,:]", pandas tries to consolidate those 13 blocks into 1 block, so the new one-block version of d0 requires a huge contiguous allocation and my RAM is used up, hence the out-of-memory error above. The one-block consolidation can be observed with a smaller array, as below:

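A small self-contained demonstration (as I understand the behavior in pandas 1.2.3, .iloc with an integer array consolidates d0 in place before taking the rows):

    import numpy as np
    import pandas as pd

    d0 = pd.DataFrame(np.random.rand(1000, 3),
                      columns=['col0', 'col1', 'col2'])
    d0['r_0'] = d0.col0
    print(d0._mgr.nblocks)  # 2: the assigned column sits in its own block

    id1 = np.random.choice(1000, size=10, replace=False)
    d0s = d0.iloc[id1, :]
    print(d0._mgr.nblocks)  # 1: .iloc consolidated d0 in place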

To solve this problem, I used another way to take the values without triggering the consolidation operation, like below:

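One way to do this (the exact workaround may differ) is to build the sample column by column from the underlying numpy arrays, which never touches d0's BlockManager:

    # Index each column's numpy array directly; only the 1% sample is
    # allocated, and d0 itself is never consolidated.
    d0s = pd.DataFrame({c: d0[c].values[id1] for c in d0.columns})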

Hope this will help others encountering similar problems.

User contributions licensed under: CC BY-SA