I have 180 DataFrame objects; each one has 3130 rows and takes about 300KB in memory. The index is a DatetimeIndex of business days from 2000-01-03 to 2011-12-31:
from datetime import datetime
import pandas as pd

freq = pd.tseries.offsets.BDay()
index = pd.date_range(datetime(2000, 1, 3), datetime(2011, 12, 31), freq=freq)
df = pd.DataFrame(index=index)
df['A'] = 1000.0
df['B'] = 2000.0
df['C'] = 3000.0
df['D'] = 4000.0
df['E'] = 5000.0
df['F'] = True
df['G'] = 1.0
df['H'] = 100.0
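For context, the full dataset can be pictured as a mapping from names to frames like this one; a minimal sketch (the frames name and its keys are hypothetical, only meant to mirror the 180-object setup):

# Hypothetical container mirroring the 180 DataFrames described above;
# every frame shares the same business-day index and columns as df.
frames = {f'asset_{i}': df.copy() for i in range(180)}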
I preprocess all the data taking advantage of numpy/pandas vectorization, then I have to loop through the DataFrames day by day. To prevent 'look ahead bias', i.e. reading data from the future, each day I must return only a subset of each DataFrame, up to that datapoint. For example: if the current datapoint being processed is datetime(2010,5,15), I need the data from datetime(2000,1,3) to datetime(2010,5,15), and nothing more recent than datetime(2010,5,15) should be accessible. With this subset I make other computations that I can't vectorize because they are path dependent.
I modified my original loop like this:
def get_data(datapoint):
    return df.loc[:datapoint]

calendar = df.index

for datapoint in calendar:
    x = get_data(datapoint)
This kind of code is painfully slow. What is my best option to improve its speed? Without the look-ahead protection my production code takes about 3 minutes to run, but that shortcut is too risky. With code like this it takes 13 minutes, which is unacceptable.
A slightly faster option is using iloc instead of loc, but it is still slow:

def get_data2(datapoint):
    idx = df.index.get_loc(datapoint)
    return df.iloc[:idx + 1]  # +1 because iloc excludes the endpoint while loc includes it

%%timeit
for datapoint in calendar:
    x = get_data(datapoint)

371 ms ± 23.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
for datapoint in calendar:
    x = get_data2(datapoint)

327 ms ± 7.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
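Since calendar is df.index itself, the get_loc lookup can be skipped entirely by enumerating positions; a minimal sketch of that variant (get_data3 is not from the original code, and constructing the slice still dominates the cost, which is what motivates the dict approach in the answer below):

def get_data3(idx):
    # idx is already the integer position, so no index lookup is needed;
    # +1 keeps the current datapoint, matching loc's inclusive slicing
    return df.iloc[:idx + 1]

for i, datapoint in enumerate(calendar):
    x = get_data3(i)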
The original code, which did not try to prevent look ahead bias, simply returned the whole DataFrame on each call, so no new object had to be built per datapoint. In this example it is about 100 times faster; in the real code it is 4 times faster.
def get_data_no_check():
    return df

%%timeit
for datapoint in calendar:
    x = get_data_no_check()
2.87 ms ± 89.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Answer
I solved it like this: first I preprocess all my data in DataFrame form to take advantage of pandas vectorization, then I convert each DataFrame into a dict of dicts and iterate over that, preventing any 'look ahead bias'. Since the data are already preprocessed, I can avoid the DataFrame overhead. The speedup in production code left me speechless: from more than 30 minutes down to 40 seconds!
# Convert each DataFrame into a dict of dicts, keyed by timestamp
for s, data in self._data.items():
    self._data[s] = data.to_dict(orient='index')
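For illustration, the day-by-day loop over the converted structure might look like the sketch below (data_by_name and process are hypothetical names, not from the original post); walking the calendar in order and reading only the current day's record keeps future data out of reach:

# Same conversion as above, but on a plain dict instead of self._data
data_by_name = {s: frame.to_dict(orient='index') for s, frame in frames.items()}

for datapoint in calendar:
    for s, records in data_by_name.items():
        row = records.get(datapoint)       # plain dict lookup instead of a DataFrame slice
        if row is not None:
            process(s, datapoint, row)     # hypothetical path-dependent computation on a {column: value} dict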