
How to create a pivot table on extremely large dataframes in Pandas

I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and Python eventually crashes.

How can I do a pivot on data this large with a limited amount of RAM?

EDIT: adding sample code

The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3 instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not present in every chunk that sum(wawa) looks at, it doesn't show up in the output.

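A minimal sketch that reproduces the issue described: the names segMax, shipmentid, and wawa come from the description above, while the toy data, the chunk size of three rows, and the column names catid and qty are assumptions.

    import numpy as np
    import pandas as pd

    segMin, segMax = 0, 4   # with segMax = 3 every chunk contains the same shipmentids
    segSize = 3             # rows per chunk in this toy example

    # Toy stand-in for the real ~60M-row dataset: shipmentid 4 appears only in the last chunk.
    frame = pd.DataFrame({'shipmentid': [1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 4, 4],
                          'catid':      [1, 2, 3] * 4,
                          'qty':        np.random.randint(1, 5, 12)})

    # Pivot each chunk separately (this part fits in memory).
    pivotedChunks = []
    for i in range(segMin, segMax):
        chunk = frame.iloc[i * segSize:(i + 1) * segSize]
        pivotedChunks.append(chunk.pivot_table(values='qty', index='shipmentid',
                                               columns='catid', aggfunc='sum',
                                               fill_value=0))

    # Recombine by summing the per-chunk results.  Any shipmentid that is not present
    # in every chunk comes out as NaN, because summing DataFrames with non-matching
    # indexes does not align the way we want; with segMax = 3 the problem is hidden.
    wawa = (p.groupby(level='shipmentid').sum() for p in pivotedChunks)
    print(sum(wawa))   # the last print -- the output we actually care about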


Answer

You could do the appending with HDF5/pytables. This keeps it out of RAM.

Use the table format:

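A sketch of the idea; the file name 'store.h5', the key 'df', the group key 'someKey', and the toy chunks below are placeholders.

    import numpy as np
    import pandas as pd

    store = pd.HDFStore('store.h5')

    # Append each pivoted chunk as it is produced; HDFStore.append writes in the
    # table format, so the combined data accumulates on disk rather than in RAM.
    for _ in range(4):                               # stand-in for the real chunk loop
        chunk = pd.DataFrame({'someKey': [1, 2, 3],
                              'a': np.random.randint(0, 5, 3),
                              'b': np.random.randint(0, 5, 3)})
        store.append('df', chunk)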

Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):

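For example (same placeholder key as above):

    df = store['df']            # or store.select('df')
    print(df.groupby('someKey').sum())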

You can also query the store to read back only subsections of the DataFrame.

Aside: You should also buy more RAM, it’s cheap.


Edit: you can groupby/sum from the store iteratively since this “map-reduces” over the chunks:

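Something along these lines, with the same placeholder names; the chunksize of 50,000 rows is just an example.

    # "Map": groupby/sum each chunk as it is read; "reduce": sum the partial results.
    result = sum(df.groupby('someKey').sum()
                 for df in store.select('df', chunksize=50000))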

Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); instead, you can use reduce with add:

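A sketch of that, with the same placeholder names:

    from functools import reduce   # needed on Python 3, see the note below

    result = reduce(lambda x, y: x.add(y, fill_value=0),
                    (df.groupby('someKey').sum()
                     for df in store.select('df', chunksize=50000)))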

In Python 3 you must import reduce from functools.

Perhaps it’s more pythonic/readable to write this as:

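For instance, with the same placeholder names as above:

    chunks = (df.groupby('someKey').sum()
              for df in store.select('df', chunksize=50000))

    res = next(chunks)                    # partial result from the first chunk
    for c in chunks:
        res = res.add(c, fill_value=0)    # fill_value=0 keeps keys missing from a chunk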

If performance is poor, or if there are a large number of new groups in each chunk, then it may be preferable to start res as zeros of the correct size (by first collecting the unique group keys, e.g. by looping through the chunks), and then add in place.
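A rough sketch of that approach, again with placeholder names; it takes one extra pass over the store to collect the keys.

    # First pass: collect every group key and the value columns.
    keys = set()
    columns = None
    for df in store.select('df', chunksize=50000):
        keys.update(df['someKey'].unique())
        columns = df.columns.drop('someKey')

    # Start from a zero result of the full size, then add each chunk's sums in place.
    res = pd.DataFrame(0.0, index=sorted(keys), columns=columns)
    for df in store.select('df', chunksize=50000):
        g = df.groupby('someKey').sum()
        res.loc[g.index] += g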

User contributions licensed under: CC BY-SA