I have a huge data-frame (>20m rows) with each row containing a timestamp and a numeric variable X. I want to assign a new column where for each row the value in this new column is the average of X in the previous rows within a specified time window e.g the average of all rows with time stamps no more than 5 min ago. Since the time stamps aren’t at regular intervals I can’t just take fixed slices
Currently the best approach I’ve found is a “caterpillar” algorithm which loops through the data frame once and moves the start/end index depending on whether the timeframe window has been breached. However this is proving infeasibly slow, and I was wondering if there was a clever vectorised way to do this? Perhaps involving np arrays instead?
Thanks
Set your timestamp to the index and use the rolling() function. You can set the window to a time offset. For example:
# creating the dataset import numpy as np np.random.seed(113) df = pd.DataFrame({'timestamp': pd.date_range('2021-01-01', '2021-01-02', freq='1S'), 'value': np.random.randint(1,10,86401)}) # This randomly drops rows to make the dataset fragmented df = df.drop(np.random.choice(range(1,86401), 85000, replace=False)) df.head(10)
timestamp value 0 2021-01-01 00:00:00 6 58 2021-01-01 00:00:58 7 237 2021-01-01 00:03:57 9 390 2021-01-01 00:06:30 7 481 2021-01-01 00:08:01 8 575 2021-01-01 00:09:35 4 580 2021-01-01 00:09:40 9 735 2021-01-01 00:12:15 7 894 2021-01-01 00:14:54 2 927 2021-01-01 00:15:27 5
Now use the rolling() function with a window of 5 minutes.
df['rolling_mean'] = df.set_index('timestamp') .rolling('5T', closed='left')['value'].mean().values df.head(10)
timestamp value rolling_mean 0 2021-01-01 00:00:00 6 NaN 58 2021-01-01 00:00:58 7 6.000000 237 2021-01-01 00:03:57 9 6.500000 390 2021-01-01 00:06:30 7 9.000000 481 2021-01-01 00:08:01 8 8.000000 575 2021-01-01 00:09:35 4 7.500000 580 2021-01-01 00:09:40 9 6.333333 735 2021-01-01 00:12:15 7 7.000000 894 2021-01-01 00:14:54 2 7.000000 927 2021-01-01 00:15:27 5 4.500000
This should give you mean within the previous 5 minutes, with the current time not included.
Recent Comments