I am trying to create a lookup reference table in Python that calculates the cumulative mean
of a Player’s previous (by datetime
) games scores, grouped by venue. However, for my specific need, a player should have previously played a minimum of 2 times at the relevant Venue for a 'Venue Preference'
cumulative mean
calculation.
df
format looks like the following:
DateTime | Player | Venue | Score |
---|---|---|---|
2021-09-25 17:15:00 | Tim | Stadium A | 20 |
2021-09-27 10:00:00 | Blake | Stadium B | 30 |
My existing code that works perfectly, but unfortunately is very slow, is as follows:
JavaScript
x
11
11
1
import numpy as np
2
import pandas as pd
3
4
VenueSum = pd.DataFrame(df.groupby(['DateTime', 'Player', 'Venue'])['Score'].sum().reset_index(name = 'Sum'))
5
VenueSum['Cumulative Sum'] = VenueSum.sort_values('DateTime').groupby(['Player', 'Venue'])['Sum'].cumsum()
6
VenueCount = pd.DataFrame(df.groupby(['DateTime', 'Player', 'Venue'])['Score'].count().reset_index(name = 'Count'))
7
VenueCount['Cumulative Count'] = VenueCount.sort_values('DateTime').groupby(['Player', 'Venue'])['Count'].cumsum()
8
VenueLookup = VenueSum.merge(VenueCount, how = 'outer', on = ['DateTime', 'Player', 'Venue'])
9
VenueLookup['Venue Preference'] = np.where(VenueLookup['Cumulative Count'] >= 2, VenueLookup['Cumulative Sum'] / VenueLookup['Cumulative Count'], np.nan)
10
VenueLookup = VenueLookup.drop(['Sum', 'Cumulative Sum', 'Count', 'Cumulative Count'], axis = 1)
11
I am sure there is a way to calculate the cumulative mean
in one step without first calculating the cumulative sum
and cumulative count
, but unfortunately I couldn’t get that to work.
Advertisement
Answer
IIUC remove 2 groupby by aggregate by sum
and size
first and then cumulative sum by both columns:
JavaScript
1
9
1
df1 = df.groupby(['DateTime', 'Player', 'Venue'])['Score'].agg(['sum','count'])
2
df1 = df1.groupby(['Player', 'Venue'])[['sum', 'count']].cumsum().reset_index()
3
df1['Venue Preference'] = np.where(df1['count'] >= 2, df1['sum'] / df1['count'], np.nan)
4
df1 = df1.drop(['sum', 'count'], axis=1)
5
print (df1)
6
DateTime Player Venue Venue Preference
7
0 2021-09-25 17:15:00 Tim Stadium A NaN
8
1 2021-09-27 10:00:00 Blake Stadium B NaN
9