I’d like to calculate a lagged rolling average on a complicated time-series dataset. Consider the toy example as follows:
import numpy as np import pandas as pd np.random.seed(101) fruit = ['apples', 'apples', 'apples', 'oranges', 'apples', 'oranges', 'oranges', 'oranges', 'apples', 'oranges', 'apples', 'apples'] people = ['alice']*6+['bob']*6 date = ['2022-01-01', '2022-01-03', '2022-01-04', '2022-01-04', '2022-01-11', '2022-01-11', '2022-01-04', '2022-01-05', '2022-01-05', '2022-01-20', '2022-01-20', '2022-01-25'] count = np.random.poisson(4,size=12) weight_per = np.round(np.random.uniform(1,3,size=12),2) df = pd.DataFrame({'date':date, 'people':people, 'fruit':fruit, 'count':count, 'weight':weight_per*count}) df['date'] = pd.to_datetime(df.date)
This results in the following DataFrame:
date people fruit count weight 0 2022-01-01 alice apples 2 2.72 1 2022-01-03 alice apples 6 11.28 2 2022-01-04 alice apples 5 13.80 3 2022-01-04 alice oranges 3 8.70 4 2022-01-11 alice apples 2 3.92 5 2022-01-11 alice oranges 3 5.76 6 2022-01-04 bob oranges 8 18.16 7 2022-01-05 bob oranges 5 8.25 8 2022-01-05 bob apples 5 6.20 9 2022-01-20 bob oranges 4 4.40 10 2022-01-20 bob apples 2 4.56 11 2022-01-25 bob apples 2 5.24
Now I’d like to add a column representing the average weight per fruit for the previous 7 days: wgt_per_frt_prev_7d
. It should be defined as the sum of all the fruit weights divided by the sum of all the fruit counts for the past 7 days, not including the current day. While there are many ways to brute force this answer, I’m looking for something with relatively good time complexity. If I were to calculate this column by hand, these would be the calculations and expected results:
df['wgt_per_frt_prev_7d'] = np.nan df.loc[1, 'wgt_per_frt_prev_7d'] = 2.72/2 # row 0 df.loc[2, 'wgt_per_frt_prev_7d'] = (2.72+11.28)/(2+6) # row 0 and 1 df.loc[3, 'wgt_per_frt_prev_7d'] = (2.72+11.28)/(2+6) df.loc[4, 'wgt_per_frt_prev_7d'] = (8.70+13.80+6.20+8.25+18.16)/(3+5+5+5+8) # row 2,3,6,7,8 df.loc[5, 'wgt_per_frt_prev_7d'] = (8.70+13.80+6.20+8.25+18.16)/(3+5+5+5+8) df.loc[6, 'wgt_per_frt_prev_7d'] = (2.72+11.28)/(2+6) # row 0,1 df.loc[7, 'wgt_per_frt_prev_7d'] = (8.70+13.80+2.72+11.28+18.16)/(3+5+6+2+8) # row 0,1,2,3,6 df.loc[8, 'wgt_per_frt_prev_7d'] = (8.70+13.80+2.72+11.28+18.16)/(3+5+6+2+8) df.loc[11, 'wgt_per_frt_prev_7d'] = (4.40+4.56)/(2+4) # row 9,10
Final DF:
date people fruit count weight wgt_per_frt_prev_7d 0 2022-01-01 alice apples 2 2.72 NaN 1 2022-01-03 alice apples 6 11.28 1.360000 2 2022-01-04 alice apples 5 13.80 1.750000 3 2022-01-04 alice oranges 3 8.70 1.750000 4 2022-01-11 alice apples 2 3.92 2.119615 5 2022-01-11 alice oranges 3 5.76 2.119615 6 2022-01-04 bob oranges 8 18.16 1.750000 7 2022-01-05 bob oranges 5 8.25 2.277500 8 2022-01-05 bob apples 5 6.20 2.277500 9 2022-01-20 bob oranges 4 4.40 NaN 10 2022-01-20 bob apples 2 4.56 NaN 11 2022-01-25 bob apples 2 5.24 1.493333
EDIT
The final column I’d like to add is wgt_per_apl_prev_7d
, which only considers the apple weights when calculating this field, but still applies to all rows, even rows with just oranges. The output of this calculation should be as follows:
date people fruit count weight wgt_per_frt_prev_7d wgt_per_apl_prev_7d 0 2022-01-01 alice apples 2 2.72 NaN NaN 1 2022-01-03 alice apples 6 11.28 1.360000 1.360000 2 2022-01-04 alice apples 5 13.80 1.750000 1.750000 3 2022-01-04 alice oranges 3 8.70 1.750000 1.750000 4 2022-01-11 alice apples 2 3.92 2.119615 2.000000 5 2022-01-11 alice oranges 3 5.76 2.119615 2.000000 6 2022-01-04 bob oranges 8 18.16 1.750000 1.750000 7 2022-01-05 bob oranges 5 8.25 2.277500 2.138462 8 2022-01-05 bob apples 5 6.20 2.277500 2.138462 9 2022-01-20 bob oranges 4 4.40 NaN NaN 10 2022-01-20 bob apples 2 4.56 NaN NaN 11 2022-01-25 bob apples 2 5.24 1.493333 2.280000
Advertisement
Answer
Try this
df2 = df[['date', 'count', 'weight']].groupby('date').sum() df2 = df2.rolling('8D').apply(np.sum, raw=True) - df2 df = df.merge((df2['weight']/df2['count']).rename('avg').to_frame().reset_index(), on='date', how='left') df2 = df[df['fruit'] == 'apples'][['date', 'count', 'weight']].groupby('date').sum() df2 = df2.rolling('8D').apply(np.sum, raw=True) - df2 df = df.merge((df2['weight']/df2['count']).rename('avg_apple').to_frame().reset_index(), on='date', how='left')
Output
date people fruit count weight avg avg_apple 0 2022-01-01 alice apples 2 2.72 NaN NaN 1 2022-01-03 alice apples 6 11.28 1.360000 1.360000 2 2022-01-04 alice apples 5 13.80 1.750000 1.750000 3 2022-01-04 alice oranges 3 8.70 1.750000 1.750000 4 2022-01-11 alice apples 2 3.92 2.119615 2.000000 5 2022-01-11 alice oranges 3 5.76 2.119615 2.000000 6 2022-01-04 bob oranges 8 18.16 1.750000 1.750000 7 2022-01-05 bob oranges 5 8.25 2.277500 2.138462 8 2022-01-05 bob apples 5 6.20 2.277500 2.138462 9 2022-01-20 bob oranges 4 4.40 NaN NaN 10 2022-01-20 bob apples 2 4.56 NaN NaN 11 2022-01-25 bob apples 2 5.24 1.493333 2.280000