Pandas optimize performance loc function

Tags: ,



I’ve a dataset with > 50000 rows. Some of the data is missing, for that I’m using a nested loop and loc function to fill in the missing values.

Dataset enter image description here

So what I’m doing basically is for the second row, I’ll find the mean of all the rating for usa and mean for all rating for 1, divide by 2 and use it as rating. In this case it will be (3.25). Code I’ve written:

    for i in data[data.rate.isna()]['name']:
      for j in data.location.unique():
        data.loc[(data.rate.isna()) & (data.name == i) & (data.location == j), rate] = (data[data.name == i].rate.mean() + data[data.location == j].mean())/2

Is there any way to optimize this? This takes a lot of time, I found a finction called at but it uses indexes not conditions, np.where but how will it fit in this case (takes a lot of time or gives me error)?

Answer

calculate the mean of location and name first separately. then join the origin dataframe.

mean_location = df.groupby('location')['rating'].mean().rename('rating_location')
mean_name = df.groupby('name')['rating'].mean().rename('rating_name')

df = (df.join(mean_location, on='location')
        .join(mean_name, on='name'))

print(df)

       name  rating location  rating_location  rating_name
    0     1     3.0       uk              3.0          3.5
    1     1     NaN      usa              3.0          3.5
    2     2     4.0      usa              3.0          4.0
    3     4     2.0      usa              3.0          2.0
    4     1     4.0   france              4.0          3.5


df['reating_fill'] = (df['rating_location'] + df['rating_name'])/2
# use combine_first to fill na
df['rating'] = df['rating'].combine_first(df['reating_fill'])

result:

df

       name  rating location  rating_location  rating_name  reating_fill
    0     1    3.00       uk              3.0          3.5          3.25
    1     1    3.25      usa              3.0          3.5          3.25
    2     2    4.00      usa              3.0          4.0          3.50
    3     4    2.00      usa              3.0          2.0          2.50
    4     1    4.00   france              4.0          3.5          3.75



Source: stackoverflow