I have a huge data-frame (>20m rows) with each row containing a timestamp and a numeric variable X. I want to assign a new column where for each row the value in this new column is the average of X in the previous rows within a specified time window e.g the average of all rows with time stamps no more than 5 min ago. Since the time stamps aren’t at regular intervals I can’t just take fixed slices
Currently the best approach I’ve found is a “caterpillar” algorithm which loops through the data frame once and moves the start/end index depending on whether the timeframe window has been breached. However this is proving infeasibly slow, and I was wondering if there was a clever vectorised way to do this? Perhaps involving np arrays instead?
Thanks
Advertisement
Answer
Set your timestamp to the index and use the rolling() function. You can set the window to a time offset. For example:
# creating the dataset import numpy as np np.random.seed(113) df = pd.DataFrame({'timestamp': pd.date_range('2021-01-01', '2021-01-02', freq='1S'), 'value': np.random.randint(1,10,86401)}) # This randomly drops rows to make the dataset fragmented df = df.drop(np.random.choice(range(1,86401), 85000, replace=False)) df.head(10)
timestamp value 0 2021-01-01 00:00:00 6 58 2021-01-01 00:00:58 7 237 2021-01-01 00:03:57 9 390 2021-01-01 00:06:30 7 481 2021-01-01 00:08:01 8 575 2021-01-01 00:09:35 4 580 2021-01-01 00:09:40 9 735 2021-01-01 00:12:15 7 894 2021-01-01 00:14:54 2 927 2021-01-01 00:15:27 5
Now use the rolling() function with a window of 5 minutes.
df['rolling_mean'] = df.set_index('timestamp') .rolling('5T', closed='left')['value'].mean().values df.head(10)
timestamp value rolling_mean 0 2021-01-01 00:00:00 6 NaN 58 2021-01-01 00:00:58 7 6.000000 237 2021-01-01 00:03:57 9 6.500000 390 2021-01-01 00:06:30 7 9.000000 481 2021-01-01 00:08:01 8 8.000000 575 2021-01-01 00:09:35 4 7.500000 580 2021-01-01 00:09:40 9 6.333333 735 2021-01-01 00:12:15 7 7.000000 894 2021-01-01 00:14:54 2 7.000000 927 2021-01-01 00:15:27 5 4.500000
This should give you mean within the previous 5 minutes, with the current time not included.