How to quickly subset many dataframes?

I have 180 DataFrame objects, each with 3130 rows and about 300 KB in memory. The index is a DatetimeIndex of business days from 2000-01-03 to 2011-12-31:

from datetime import datetime
import pandas as pd
freq = pd.tseries.offsets.BDay()

index = pd.date_range(datetime(2000,1,3), datetime(2011,12,31), freq=freq)

df = pd.DataFrame(index=index)
df['A'] = 1000.0
df['B'] = 2000.0
df['C'] = 3000.0
df['D'] = 4000.0
df['E'] = 5000.0
df['F'] = True
df['G'] = 1.0
df['H'] = 100.0

I preprocess all the data taking advantage of numpy/pandas vectorization, then I have to loop through the dataframes day by day. To prevent 'look-ahead bias' (accessing data from the future), each day I must return only the subset of each DataFrame up to that datapoint. For example: if the current datapoint is datetime(2010,5,15), I need data from datetime(2000,1,3) to datetime(2010,5,15), and no data more recent than datetime(2010,5,15) should be accessible. With this subset I make further computations that I can't vectorize because they are path dependent.
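As a small illustration of the requirement (a toy DataFrame, not the real 180): label-based slicing with .loc is inclusive of the endpoint, so slicing up to the current datapoint returns exactly the history available on that day.

```python
import pandas as pd

# Toy stand-in for one of the DataFrames: 10 business days, one column.
index = pd.date_range("2000-01-03", "2000-01-14", freq="B")
df = pd.DataFrame({"A": range(len(index))}, index=index)

cutoff = pd.Timestamp("2000-01-10")
subset = df.loc[:cutoff]  # label slicing includes the endpoint itself

# Nothing more recent than the cutoff is visible in the subset.
assert subset.index.max() == cutoff
```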

I modified my original loop like this:

def get_data(datapoint):
    return df.loc[:datapoint]
    
calendar = df.index

for datapoint in calendar:
    x = get_data(datapoint)   

This kind of code is painfully slow. What is my best option to improve its speed? Without the look-ahead check my production code takes about 3 minutes to run, but that is too risky. With code like this it takes 13 minutes, which is unacceptable.

A slightly faster option is to use iloc instead of loc, but it is still slow. Note that iloc slices by position and excludes the endpoint, so the position must be advanced by one to match the inclusive loc slice:

def get_data2(datapoint):
    idx = df.index.get_loc(datapoint)
    return df.iloc[:idx + 1]  # +1: iloc excludes the endpoint, loc includes it

%%timeit
for datapoint in calendar:
    x = get_data(datapoint)

371 ms ± 23.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
for datapoint in calendar:
    x = get_data2(datapoint)

327 ms ± 7.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The original code, which did not try to prevent look-ahead bias, simply returned the whole DataFrame for each datapoint. In this example it is about 100 times faster; in the real code it is 4 times faster.

def get_data_no_check():
    return df

%%timeit
for datapoint in calendar:
    x = get_data_no_check()

2.87 ms ± 89.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Answer

I solved it like this: first I preprocess all the data in the DataFrames to take advantage of pandas vectorization, then I convert them into a dict of dicts and iterate over that, preventing any 'look-ahead bias'. Since the data are already preprocessed I avoid the DataFrame overhead. The increase in processing speed in production code left me speechless: down from more than 30 minutes to 40 seconds!

# Convert each DataFrame into a dict of dicts keyed by timestamp
for s, data in self._data.items():
    self._data[s] = data.to_dict(orient='index')
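A minimal, self-contained sketch of this approach (the DataFrame here is a small stand-in for the ones built above; the `history` dict and the loop body are illustrative, not from the original code): after to_dict(orient='index') each timestamp maps to a plain dict of column values, so the day-by-day loop does cheap dict lookups instead of repeated DataFrame slicing, and by construction it never touches rows after the current datapoint.

```python
import pandas as pd

# Small stand-in for one preprocessed DataFrame.
index = pd.date_range("2000-01-03", periods=5, freq="B")
df = pd.DataFrame({"A": 1000.0, "B": 2000.0}, index=index)

# Convert once: {Timestamp: {"A": ..., "B": ...}, ...}
data = df.to_dict(orient="index")

# Iterate day by day; only rows at or before the current datapoint are read,
# so look-ahead bias is impossible by construction.
history = {}
for datapoint in index:
    history[datapoint] = data[datapoint]  # plain dict lookup, no DataFrame overhead
    # ...path-dependent computations on `history` would go here...
```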
User contributions licensed under: CC BY-SA