
pandas out of memory error after variable assignment

I have a very large pandas dataframe and want to sample rows from it for modeling, but I ran into an out-of-memory error like this:

MemoryError: Unable to allocate 6.59 GiB for an array with shape (40, 22117797) and data type float64
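For reference, 6.59 GiB is exactly the size of a single float64 array of that shape:

22117797 * 40 * 8 / 1024**3   # about 6.59 GiB for a (40, 22117797) float64 array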

This error is weird, since I should not need to allocate such a large amount of memory: my sampled dataframe is only 1% of the original data. Below is my code.

Specifically, the original data has about 22 million rows and most of the columns are np.float64. After loading the data from a parquet file using pyarrow, the Jupyter kernel takes about 3 GB of memory. After the variable assignments d0['r_%s'%(t)] = d0.col0, the kernel takes 6 GB. However, once I run the sampling command d0s = d0.iloc[id1,:], memory usage goes up to 13 GB and the program stops with the out-of-memory error above.
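As a rough cross-check of these numbers, the size pandas itself reports for the frame can be printed like this (it will not match the figures above exactly, since the kernel's footprint also includes temporary copies and interpreter overhead):

print(d0.memory_usage(deep=True).sum() / 1024**3)   # data held by the frame, in GiB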

The code below is a minimal working example that reproduces the error on a machine with 16 GB of memory, using pandas 1.2.3.

import pandas as pd
import numpy as np

d0 = pd.DataFrame(np.random.rand(22117797, 12))

for t in range(30):
    d0['r_%s'%(t)] = d0[0]

id1 = np.random.randint(low = 0, high = d0.shape[0], size = round(d0.shape[0]*0.01))

d0s = d0.iloc[id1,:]

Note that the following code does not generate the error if I directly create a dataframe of the same final size:

import pandas as pd
import numpy as np

d0 = pd.DataFrame(np.random.rand(22117797, 42))

id1 = np.random.randint(low = 0, high = d0.shape[0], size = round(d0.shape[0]*0.01))

d0s = d0.iloc[id1,:]


Answer

I found that the error is due to a consolidation operation that pandas performs. Specifically, after the variable assignments d0['r_%s'%(t)] = d0[0], d0 is stored in 13 blocks, i.e. 13 separate contiguous memory regions, and this can be checked with the command

d0._data.nblocks
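On newer pandas versions the same internal block manager is exposed as _mgr rather than _data; both are private attributes, so the exact spelling may change between releases:

d0._mgr.nblocks   # same block count via the newer private attribute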

Once I run the command d0s = d0.iloc[id1,:], pandas tries to consolidate the 13 blocks into a single block. Building this new one-block version of d0 requires another large contiguous allocation on top of what is already held, so my RAM is used up and the out-of-memory error pops up. The one-block consolidation can be checked with a smaller array as below:

import pandas as pd
import numpy as np

d0 = pd.DataFrame(np.random.rand(22117797, 12))

for t in range(10):
    d0['r_%s'%(t)] = d0[0]

d0._data.nblocks   # one block for the original columns plus one block per assigned column

id1 = np.random.randint(low = 0, high = d0.shape[0], size = round(d0.shape[0]*0.01))

d0s = d0.iloc[id1,:]

d0._data.nblocks   # 1: the iloc call above consolidated d0 into a single block
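This also explains why the directly generated 42-column dataframe in the question does not fail: a dataframe built from a single NumPy array already lives in one block, so iloc has nothing to consolidate. A quick check with a smaller shape:

d1 = pd.DataFrame(np.random.rand(1000, 42))

d1._data.nblocks   # 1: already a single block, so sampling needs no extra consolidation copy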

To solve the problem, I used another way to take the values that does not trigger the consolidation, like below:

d0s = pd.concat([d0.iloc[id1, i] for i in range(d0.shape[1])], axis=1)   # take one column at a time so the blocks are never merged
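Another variant that avoids the consolidation is to pull each column out as a NumPy array and build the sample from those (a sketch; I have not benchmarked it against the concat version above):

# Take the sampled values column by column from the underlying arrays,
# so d0 itself is never asked to merge its blocks.
d0s = pd.DataFrame({col: d0[col].to_numpy()[id1] for col in d0.columns}, index=d0.index[id1])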

Hope this will help others encountering similar problems.

User contributions licensed under: CC BY-SA