I need to create a pivot table of 2,000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and Python eventually crashes.

How can I do a pivot on data this large with a limited amount of RAM?
EDIT: adding sample code
The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3 instead of 4, the code produces a false positive for correct output. The main issue is that if a shipmentid entry is not present in each and every chunk that the final sum() looks at, it does not show up in the output.
import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os

pd.set_option('io.hdf.default_format', 'table')

# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid'] = [1, 2, 3, 1, 2, 3, 1, 2, 3]  # evenly distributing shipmentid values for testing purposes
    frame['qty'] = np.random.randint(1, 5, 9)           # random quantity is ok for this test
    frame['catid'] = np.random.randint(1, 5, 9)         # random category is ok for this test
    return frame

def pivotSegment(segmentNumber, passedFrame):
    segmentSize = 3  # take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)]  # slice the input DF

    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF
    # with all possible category values
    span = pd.DataFrame()
    span['catid'] = range(1, 5+1)
    span['shipmentid'] = 1
    span['qty'] = 0

    frame = frame.append(span)

    return frame.pivot_table(['qty'], index=['shipmentid'], columns='catid',
                             aggfunc='sum', fill_value=0).reset_index()

def createStore():
    store = pd.HDFStore('testdata.h5')
    return store

segMin = 0
segMax = 4

store = createStore()
frame = loadFrame()

print('Printing Frame')
print(frame)
print(frame.info())

for i in range(segMin, segMax):
    segment = pivotSegment(i, frame)
    store.append('data', frame[(i*3):(i*3 + 3)])
    store.append('pivotedData', segment)

print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])

print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid', level=0).sum())
print('**************')

print('$$$')
for df in store.select('pivotedData', chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid', level=0).sum())
print('$$$')

store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid', level=0).sum()
                                 for df in store.select('pivotedData', chunksize=3)))

print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])

store.close()
os.remove('testdata.h5')
print('closed')
Answer
You could do the appending with HDF5/PyTables, which keeps the combined result on disk rather than in RAM.
Use the table format:
store = pd.HDFStore('store.h5')

for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
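For instance, here is a minimal sketch of that loop; the CSV source, the file names, the chunk size and the 'someKey' column are all placeholder assumptions, not part of the original question:

import pandas as pd

store = pd.HDFStore('store.h5')

# hypothetical source: stream the raw dataset in 100,000-row chunks
for chunk in pd.read_csv('rawdata.csv', chunksize=100000):
    # each chunk is written to the on-disk table instead of accumulating in RAM
    store.append('df', chunk, data_columns=['someKey'])

store.close()

Passing data_columns when appending lets you filter on that column later with store.select (see below).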
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):
df = store['df']
You can also query the store to read back only subsections of the DataFrame.
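For example, assuming 'someKey' was declared as a data column when appending (as in the sketch above) and that there is a 'qty' column, a query might look like this:

# read only the rows where someKey equals 1, and only two of the columns
subset = store.select('df', where="someKey == 1", columns=['someKey', 'qty'])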
Aside: You should also buy more RAM, it’s cheap.
Edit: you can groupby/sum from the store iteratively since this “map-reduces” over the chunks:
# note: this doesn't work, see below
sum(df.groupby('someKey').sum() for df in store.select('df', chunksize=50000))

# equivalent to (but doesn't read in the entire frame)
store['df'].groupby('someKey').sum()
Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); instead you can use reduce with add:
reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby('someKey').sum() for df in store.select('df', chunksize=50000)))
In Python 3 you must import reduce from functools.
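That is, put this at the top of the script:

from functools import reduce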
Perhaps it’s more pythonic/readable to write this as:
chunks = (df.groupby('someKey').sum() for df in store.select('df', chunksize=50000))
res = next(chunks)  # will raise StopIteration if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)
If performance is poor, or if each chunk introduces a large number of new groups, then it may be preferable to start res as zeros of the correct size (by first collecting the unique group keys, e.g. by looping through the chunks) and then add in place.
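Here is a rough sketch of that idea; it assumes the grouping column is called 'someKey' and that every non-key column is numeric (both assumptions, not part of the original answer):

import pandas as pd

store = pd.HDFStore('store.h5')

# first pass: collect every group key (and the non-key columns) across the chunks
keys = set()
value_columns = None
for df in store.select('df', chunksize=50000):
    keys.update(df['someKey'].unique())
    value_columns = df.columns.drop('someKey')

# preallocate the result as zeros of the final shape
res = pd.DataFrame(0.0, index=sorted(keys), columns=value_columns)
res.index.name = 'someKey'

# second pass: sum each chunk and add it into the preallocated frame in place
for df in store.select('df', chunksize=50000):
    partial = df.groupby('someKey').sum()
    res.loc[partial.index, partial.columns] += partial

store.close()

This trades a second pass over the store for never holding more than one summed chunk plus the final result in memory.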