I have a dataframe with a time series of sales of different items, together with customer analytics. For each item and a given day I want to compute:
- the share of my best customer in the total sales of the last 2 days
- the share of my top customers (from a list) in the total sales of the last 2 days
I’ve tried the solutions provided here:
- for rolling sum: Pandas sum over a date range for each category separately
- for conditioned groupby: Pandas groupby with identification of an element with max value in another column
An example dataframe can be generated by:
import pandas as pd

df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
df_3 = pd.DataFrame()

# Create datetimes and data
df_1['item'] = [1, 1, 1, 2, 2, 2, 2]
df_1['date'] = pd.date_range('1/1/2018', periods=7, freq='D')
df_1['customer'] = ['a', 'b', 'c', 'a', 'b', 'b', 'c']
df_1['sales'] = [2, 4, 1, 5, 7, 2, 3]

df_2['item'] = [1, 1, 1, 2, 2, 2, 2]
df_2['date'] = pd.date_range('1/1/2018', periods=7, freq='D')
df_2['customer'] = ['b', 'b', 'c', 'a', 'a', 'c', 'a']
df_2['sales'] = [2, 3, 4, 2, 3, 5, 6]

df_3['item'] = [1, 1, 1, 2, 2, 2, 2]
df_3['date'] = pd.date_range('1/1/2018', periods=7, freq='D')
df_3['customer'] = ['b', 'c', 'a', 'c', 'b', 'a', 'b']
df_3['sales'] = [6, 5, 2, 3, 4, 5, 6]

df = pd.concat([df_1, df_2, df_3])
df = df.sort_values(['item', 'date'])
df = df.reset_index(drop=True)
and looks like this:
item | date | customer | sales |
---|---|---|---|
1 | 2018-01-01 | a | 2 |
1 | 2018-01-01 | b | 2 |
1 | 2018-01-01 | b | 6 |
1 | 2018-01-02 | b | 4 |
1 | 2018-01-02 | b | 3 |
1 | 2018-01-02 | c | 5 |
1 | 2018-01-03 | c | 1 |
1 | 2018-01-03 | c | 4 |
1 | 2018-01-03 | a | 2 |
2 | 2018-01-04 | a | 5 |
2 | 2018-01-04 | a | 2 |
2 | 2018-01-04 | c | 3 |
2 | 2018-01-05 | b | 7 |
2 | 2018-01-05 | a | 3 |
2 | 2018-01-05 | b | 4 |
2 | 2018-01-06 | b | 2 |
2 | 2018-01-06 | c | 5 |
2 | 2018-01-06 | a | 5 |
2 | 2018-01-07 | c | 3 |
2 | 2018-01-07 | a | 6 |
2 | 2018-01-07 | b | 6 |
I expect the following results:
item | date | sales_at_day | sales_last_2_days | a_share | top_share |
---|---|---|---|---|---|
1 | 2018-01-01 | 10 | NaN | NaN | NaN |
1 | 2018-01-02 | 12 | 10 | 0.20 | 0.20 |
1 | 2018-01-03 | 7 | 22 | 0.09 | 0.32 |
2 | 2018-01-04 | 10 | NaN | NaN | NaN |
2 | 2018-01-05 | 14 | 10 | 0.70 | 1.00 |
2 | 2018-01-06 | 12 | 24 | 0.42 | 0.54 |
2 | 2018-01-07 | 15 | 26 | 0.31 | 0.50 |
where,
a_share
is the share of sales of customer 'a' in the total sales of the last 2 days (not including the present day), and
top_share
is the share of sales of the customers in the
top_cust = ['a', 'c']
list in the total sales of the last 2 days (not including the present day).
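For example, for item 1 on 2018-01-02 the window covers only 2018-01-01, where total sales are 2 + 2 + 6 = 10 and customer 'a' bought 2, so a_share = 2 / 10 = 0.20; customer 'c' bought nothing that day, so top_share is also 0.20.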
Any ideas? Many thanks in advance :)
Andy
Answer
Use:
# custom rolling sum, shifted one row so the current day is excluded
f = lambda x: x.rolling(2, min_periods=1).sum().shift()

# aggregate sum per item and day
df1 = df.groupby(['item','date'], as_index=False)['sales'].sum()

# apply custom rolling per group
df1['sales_last_2_days'] = df1.groupby('item')['sales'].apply(f).reset_index(drop=True, level=0)

# filter customer a and aggregate sum
a = df[df['customer'].eq('a')].groupby(['item','date'])['sales'].sum().rename('a_share')

# add new column to original
df1 = df1.join(a, on=['item','date'])

# apply custom rolling per group and divide
df1['a_share'] = df1.groupby('item')['a_share'].apply(f).reset_index(drop=True, level=0) / df1['sales_last_2_days']

# very similar to before, only test membership with isin
top_cust = ['a', 'c']
a = df[df['customer'].isin(top_cust)].groupby(['item','date'])['sales'].sum().rename('top_share')
df1 = df1.join(a, on=['item','date'])
df1['top_share'] = df1.groupby('item')['top_share'].apply(f).reset_index(drop=True, level=0) / df1['sales_last_2_days']

print(df1)
   item       date  sales  sales_last_2_days   a_share  top_share
0     1 2018-01-01     10                NaN       NaN        NaN
1     1 2018-01-02     12               10.0  0.200000   0.200000
2     1 2018-01-03      7               22.0  0.090909   0.318182
3     2 2018-01-04     10                NaN       NaN        NaN
4     2 2018-01-05     14               10.0  0.700000   1.000000
5     2 2018-01-06     12               24.0  0.416667   0.541667
6     2 2018-01-07     15               26.0  0.307692   0.500000
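The key detail is the shift() inside f: rolling(2, min_periods=1).sum() ends its window at the current row, so shifting by one row pushes the result down and each day only sees the two previous days. A minimal sketch of what f does, on an illustrative series of daily totals:

import pandas as pd

s = pd.Series([10, 12, 7])  # illustrative daily totals for one item
print(s.rolling(2, min_periods=1).sum())          # 10.0, 22.0, 19.0 (includes current day)
print(s.rolling(2, min_periods=1).sum().shift())  # NaN, 10.0, 22.0 (previous 2 days only)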
If you want to use a time-based rolling window over days, it is more complicated:
df1 = df.groupby(['item','date'], as_index=False)['sales'].sum()

sales1 = (df1.set_index('date')
             .groupby('item')['sales']
             .rolling('2D', min_periods=1)
             .sum()
             .groupby('item')
             .shift()
             .rename('sales_last_2_days')
         )
df1 = df1.join(sales1, on=['item','date'])
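This variant needs a DatetimeIndex, hence the set_index('date'); rolling('2D', min_periods=1) then sums everything inside a 2-calendar-day window ending at each timestamp, and the group-wise shift() again excludes the current day. A small illustration on a plain daily series:

import pandas as pd

s = pd.Series([10, 12, 7], index=pd.date_range('2018-01-01', periods=3, freq='D'))
print(s.rolling('2D', min_periods=1).sum().shift())  # NaN, 10.0, 22.0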
df2 = df[df['customer'].eq('a')].groupby(['item','date'], as_index=False)['sales'].sum()

a = (df2.set_index('date')
        .groupby('item')[['sales']]
        .apply(lambda x: x.asfreq('D'))
        .reset_index(level=0)
        .groupby('item')['sales']
        .rolling('2D', min_periods=1)
        .sum()
        .groupby('item')
        .shift()
        .rename('a_share')
    )
df1 = df1.join(a, on=['item','date'])
df1['a_share'] /= df1['sales_last_2_days']
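The asfreq('D') step matters here because customer 'a' may have no sales at all on some days; without reindexing to a daily frequency, the row-wise shift() would jump over the gap rather than over one calendar day. A minimal illustration of what asfreq('D') does:

import pandas as pd

s = pd.Series([5.0, 6.0], index=pd.to_datetime(['2018-01-04', '2018-01-06']))
print(s.asfreq('D'))
# 2018-01-04    5.0
# 2018-01-05    NaN
# 2018-01-06    6.0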
top_cust = ['a', 'c']
df3 = df[df['customer'].isin(top_cust)].groupby(['item','date'], as_index=False)['sales'].sum()

b = (df3.set_index('date')
        .groupby('item')[['sales']]
        .apply(lambda x: x.asfreq('D'))
        .reset_index(level=0)
        .groupby('item')['sales']
        .rolling('2D', min_periods=1)
        .sum()
        .groupby('item')
        .shift()
        .rename('top_share')
    )
df1 = df1.join(b, on=['item','date'])
df1['top_share'] /= df1['sales_last_2_days']
print(df1)
   item       date  sales  sales_last_2_days   a_share  top_share
0     1 2018-01-01     10                NaN       NaN        NaN
1     1 2018-01-02     12               10.0  0.200000   0.200000
2     1 2018-01-03      7               22.0  0.090909   0.318182
3     2 2018-01-04     10                NaN       NaN        NaN
4     2 2018-01-05     14               10.0  0.700000   1.000000
5     2 2018-01-06     12               24.0  0.416667   0.541667
6     2 2018-01-07     15               26.0  0.307692   0.500000
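Since the a_share and top_share steps of the first solution differ only in the customer filter, the repetition can be factored out. This is only a sketch: the helper name add_share and its signature are mine, not part of the answer, and it assumes a freshly aggregated df1 that already holds sales_last_2_days but not yet the share columns:

def add_share(df1, df, customers, col):
    # rolling 2-row sum per item, shifted so the current day is excluded
    f = lambda x: x.rolling(2, min_periods=1).sum().shift()
    # daily sales of the selected customers, per item and date
    s = (df[df['customer'].isin(customers)]
           .groupby(['item', 'date'])['sales'].sum()
           .rename(col))
    df1 = df1.join(s, on=['item', 'date'])
    # share of the rolling customer sales in the rolling total sales
    df1[col] = (df1.groupby('item')[col].apply(f)
                   .reset_index(drop=True, level=0)
                / df1['sales_last_2_days'])
    return df1

df1 = add_share(df1, df, ['a'], 'a_share')
df1 = add_share(df1, df, ['a', 'c'], 'top_share')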