Skip to content
Advertisement

Pandas average of previous rows fulfilling condition

I have a huge data-frame (>20m rows) with each row containing a timestamp and a numeric variable X. I want to assign a new column where for each row the value in this new column is the average of X in the previous rows within a specified time window e.g the average of all rows with time stamps no more than 5 min ago. Since the time stamps aren’t at regular intervals I can’t just take fixed slices

Currently the best approach I’ve found is a “caterpillar” algorithm which loops through the data frame once and moves the start/end index depending on whether the timeframe window has been breached. However this is proving infeasibly slow, and I was wondering if there was a clever vectorised way to do this? Perhaps involving np arrays instead?

Thanks

Advertisement

Answer

Set your timestamp to the index and use the rolling() function. You can set the window to a time offset. For example:

# creating the dataset
import numpy as np

np.random.seed(113)

df = pd.DataFrame({'timestamp': pd.date_range('2021-01-01', '2021-01-02', freq='1S'),
                 'value': np.random.randint(1,10,86401)})

# This randomly drops rows to make the dataset fragmented
df = df.drop(np.random.choice(range(1,86401), 85000, replace=False))
df.head(10)
              timestamp  value
0   2021-01-01 00:00:00      6
58  2021-01-01 00:00:58      7
237 2021-01-01 00:03:57      9
390 2021-01-01 00:06:30      7
481 2021-01-01 00:08:01      8
575 2021-01-01 00:09:35      4
580 2021-01-01 00:09:40      9
735 2021-01-01 00:12:15      7
894 2021-01-01 00:14:54      2
927 2021-01-01 00:15:27      5

Now use the rolling() function with a window of 5 minutes.

df['rolling_mean'] = df.set_index('timestamp') 
  .rolling('5T', closed='left')['value'].mean().values

df.head(10)
              timestamp  value  rolling_mean
0   2021-01-01 00:00:00      6           NaN
58  2021-01-01 00:00:58      7      6.000000
237 2021-01-01 00:03:57      9      6.500000
390 2021-01-01 00:06:30      7      9.000000
481 2021-01-01 00:08:01      8      8.000000
575 2021-01-01 00:09:35      4      7.500000
580 2021-01-01 00:09:40      9      6.333333
735 2021-01-01 00:12:15      7      7.000000
894 2021-01-01 00:14:54      2      7.000000
927 2021-01-01 00:15:27      5      4.500000

This should give you mean within the previous 5 minutes, with the current time not included.

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement