I have a dataset with more than 50,000 rows. Some of the data is missing, so I’m using a nested loop and the loc function to fill in the missing values.
Basically, for the second row I find the mean of all ratings for usa and the mean of all ratings for name 1, divide their sum by 2, and use that as the rating. In this case it would be 3.25. The code I’ve written:
for i in data[data.rate.isna()]['name']:
    for j in data.location.unique():
        data.loc[(data.rate.isna()) & (data.name == i) & (data.location == j), 'rate'] = \
            (data[data.name == i].rate.mean() + data[data.location == j].rate.mean()) / 2
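For reference, here is a minimal toy frame consistent with the numbers above (assumed for illustration only; the real dataset is much larger):

import pandas as pd
import numpy as np

# toy data matching the 3.25 example in the question (assumed, not the real dataset)
data = pd.DataFrame({
    'name':     [1, 1, 2, 4, 1],
    'rate':     [3.0, np.nan, 4.0, 2.0, 4.0],
    'location': ['uk', 'usa', 'usa', 'usa', 'france'],
})

# mean rate for usa:    (4.0 + 2.0) / 2 = 3.0
# mean rate for name 1: (3.0 + 4.0) / 2 = 3.5
# fill value for row 1: (3.0 + 3.5) / 2 = 3.25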
Is there any way to optimize this? It takes a lot of time. I found a function called at, but it works with indexes rather than conditions; I also found np.where, but how would it fit in this case (it either takes a lot of time or gives me an error)?
Answer
Calculate the mean per location and per name separately first, then join them back onto the original DataFrame.
mean_location = df.groupby('location')['rating'].mean().rename('rating_location')
mean_name = df.groupby('name')['rating'].mean().rename('rating_name')

df = (df.join(mean_location, on='location')
        .join(mean_name, on='name'))
print(df)

   name  rating location  rating_location  rating_name
0     1     3.0       uk              3.0          3.5
1     1     NaN      usa              3.0          3.5
2     2     4.0      usa              3.0          4.0
3     4     2.0      usa              3.0          2.0
4     1     4.0   france              4.0          3.5

df['rating_fill'] = (df['rating_location'] + df['rating_name']) / 2

# use combine_first to fill na
df['rating'] = df['rating'].combine_first(df['rating_fill'])
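combine_first keeps the existing rating values and only takes values from rating_fill where rating is NaN; df['rating'].fillna(df['rating_fill']) would behave the same way here.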
result:
df
   name  rating location  rating_location  rating_name  rating_fill
0     1    3.00       uk              3.0          3.5         3.25
1     1    3.25      usa              3.0          3.5         3.25
2     2    4.00      usa              3.0          4.0         3.50
3     4    2.00      usa              3.0          2.0         2.50
4     1    4.00   france              4.0          3.5         3.75
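As a follow-up, the same fill can be expressed without the helper columns by using groupby().transform, which returns each group's mean aligned to the original index. This is a sketch using the same df and column names as above, and it assumes it is run on the original DataFrame (before the joins); it should produce the same fill values:

df['rating'] = df['rating'].fillna(
    (df.groupby('location')['rating'].transform('mean')
     + df.groupby('name')['rating'].transform('mean')) / 2
)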