Pandas average of previous rows fulfilling condition

Tags: , , , ,



I have a huge data-frame (>20m rows) with each row containing a timestamp and a numeric variable X. I want to assign a new column where for each row the value in this new column is the average of X in the previous rows within a specified time window e.g the average of all rows with time stamps no more than 5 min ago. Since the time stamps aren’t at regular intervals I can’t just take fixed slices

Currently the best approach I’ve found is a “caterpillar” algorithm which loops through the data frame once and moves the start/end index depending on whether the timeframe window has been breached. However this is proving infeasibly slow, and I was wondering if there was a clever vectorised way to do this? Perhaps involving np arrays instead?

Thanks

Answer

Set your timestamp to the index and use the rolling() function. You can set the window to a time offset. For example:

# creating the dataset
import numpy as np

np.random.seed(113)

df = pd.DataFrame({'timestamp': pd.date_range('2021-01-01', '2021-01-02', freq='1S'),
                 'value': np.random.randint(1,10,86401)})

# This randomly drops rows to make the dataset fragmented
df = df.drop(np.random.choice(range(1,86401), 85000, replace=False))
df.head(10)
              timestamp  value
0   2021-01-01 00:00:00      6
58  2021-01-01 00:00:58      7
237 2021-01-01 00:03:57      9
390 2021-01-01 00:06:30      7
481 2021-01-01 00:08:01      8
575 2021-01-01 00:09:35      4
580 2021-01-01 00:09:40      9
735 2021-01-01 00:12:15      7
894 2021-01-01 00:14:54      2
927 2021-01-01 00:15:27      5

Now use the rolling() function with a window of 5 minutes.

df['rolling_mean'] = df.set_index('timestamp') 
  .rolling('5T', closed='left')['value'].mean().values

df.head(10)
              timestamp  value  rolling_mean
0   2021-01-01 00:00:00      6           NaN
58  2021-01-01 00:00:58      7      6.000000
237 2021-01-01 00:03:57      9      6.500000
390 2021-01-01 00:06:30      7      9.000000
481 2021-01-01 00:08:01      8      8.000000
575 2021-01-01 00:09:35      4      7.500000
580 2021-01-01 00:09:40      9      6.333333
735 2021-01-01 00:12:15      7      7.000000
894 2021-01-01 00:14:54      2      7.000000
927 2021-01-01 00:15:27      5      4.500000

This should give you mean within the previous 5 minutes, with the current time not included.



Source: stackoverflow