I have a Pandas Dataframe that contains a series of Airbnb Prices grouped by neighbourhood group neighbourhood and room_type. My objective is to return the Maximum Average Price for each room_type per Neighbourhood and return only this.
My approach to this was to use .groupby and .idxmax() to get the maximum values w.r.t to the Index, and then iterate through this accordingly and append the index positions to a list, which is used for filtering purposes. (This link was helpful: Python Pandas Dataframe select row by max value in group)
df = airbnb.loc[airbnb['neighbourhood'].isin(list(combined['neighbourhood']))].groupby(['neighbourhood','room_type']).mean().sort_values(by=['Revenues'],ascending=False)['Revenues'].reset_index().sort_values(by=['neighbourhood','room_type']).reset_index() tmp_list = [] test = pd.DataFrame(df.groupby(['neighbourhood','room_type'])['Revenues'].idxmax(axis=0)).reset_index() #Re-Assign Revenues back to the Dataframe for Reference test['Actual_Revenues'] = df['Revenues'] #The Maximum Revenue Value for each Sub Grouping is Returned by Index for i in neighbourhood_list: print(test[test['neighbourhood']==i]) max_index_list = test[test['neighbourhood']==i].sort_values(by='Actual_Revenues', ascending=False).head(1)['Revenues'] print(max_index_list) tmp_list.append(list(max_index_list)) tmp_list = list(np.concatenate(tmp_list).flat) df[df.index.isin(tmp_list)]
Now this works smoothly, but I don’t believe this is very performant (esp. with my loop).
How might I seek to optimise my code further to reduce any redundancies and make my code more Pythonic?
Our expected output shows the below, with only the maximum room type and pricing shown per Neighbourhood. (i.e. One unique entry per Neighbourhood).
Thank you very much.
Here is a reproducible dataset you can use to test the code: https://docs.google.com/spreadsheets/d/1x-ktbfJouPzI0hokKw0fSNXAwmD8Q8Yd/edit?usp=sharing&ouid=101578147330059746959&rtpof=true&sd=true
Here is a picture of my output (using the complete dataset as opposed to the partial shown – this is just to show the concept of the maximum price shown per Neighbourhood. If a neighbourhood has multiple room types (which they will), it will return the maximum price and the associated room type.)
Advertisement
Answer
Approach using Series.nlargest()
:
# Get mean price per room_type per neighbourhood means = df.groupby(['neighbourhood', 'room_type'])['price'].mean() # Get the maximum mean price per room_type per neighbourhood max_means = (means.groupby(level=0, group_keys=False) .nlargest(1) .reset_index())
First five rows of max_means
:
neighbourhood | room_type | price |
---|---|---|
Bedford-Stuyvesant | Entire home/apt | 120 |
Chelsea | Private room | 140 |
Chinatown | Entire home/apt | 150 |
Clinton Hill | Entire home/apt | 89 |
Crown Heights | Entire home/apt | 99 |