I have a dataset with more than 50,000 rows. Some of the data is missing, so I’m using a nested loop and the loc function to fill in the missing values.
Basically, for the second row I find the mean of all ratings for usa and the mean of all ratings for name 1, divide their sum by 2, and use that as the rating. In this case it would be 3.25. The code I’ve written:
for i in data[data.rate.isna()]['name']:
    for j in data.location.unique():
        data.loc[(data.rate.isna()) & (data.name == i) & (data.location == j), 'rate'] = \
            (data[data.name == i].rate.mean() + data[data.location == j].rate.mean()) / 2
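For reference, here is a minimal toy frame consistent with the numbers above (assumed for illustration only; the real dataset is much larger):

import pandas as pd
import numpy as np

# toy data matching the 3.25 example in the question (assumed, not the real dataset)
data = pd.DataFrame({
    'name':     [1, 1, 2, 4, 1],
    'rate':     [3.0, np.nan, 4.0, 2.0, 4.0],
    'location': ['uk', 'usa', 'usa', 'usa', 'france'],
})

# mean rate for usa:    (4.0 + 2.0) / 2 = 3.0
# mean rate for name 1: (3.0 + 4.0) / 2 = 3.5
# fill value for row 1: (3.0 + 3.5) / 2 = 3.25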
Is there any way to optimize this? It takes a lot of time. I found a function called at, but it works with indexes rather than conditions; I also found np.where, but how would it fit in this case (it either takes a lot of time or gives me an error)?
Answer
Calculate the mean per location and per name separately first, then join them back onto the original DataFrame.
mean_location = df.groupby('location')['rating'].mean().rename('rating_location')
mean_name = df.groupby('name')['rating'].mean().rename('rating_name')

df = (df.join(mean_location, on='location')
        .join(mean_name, on='name'))
print(df)

   name  rating location  rating_location  rating_name
0     1     3.0       uk              3.0          3.5
1     1     NaN      usa              3.0          3.5
2     2     4.0      usa              3.0          4.0
3     4     2.0      usa              3.0          2.0
4     1     4.0   france              4.0          3.5

df['rating_fill'] = (df['rating_location'] + df['rating_name']) / 2

# use combine_first to fill na
df['rating'] = df['rating'].combine_first(df['rating_fill'])
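combine_first keeps the existing rating values and only takes values from rating_fill where rating is NaN; df['rating'].fillna(df['rating_fill']) would behave the same way here.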
result:
df
   name  rating location  rating_location  rating_name  rating_fill
0     1    3.00       uk              3.0          3.5         3.25
1     1    3.25      usa              3.0          3.5         3.25
2     2    4.00      usa              3.0          4.0         3.50
3     4    2.00      usa              3.0          2.0         2.50
4     1    4.00   france              4.0          3.5         3.75
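As a follow-up, the same fill can be expressed without the helper columns by using groupby().transform, which returns each group's mean aligned to the original index. This is a sketch using the same df and column names as above, and it assumes it is run on the original DataFrame (before the joins); it should produce the same fill values:

df['rating'] = df['rating'].fillna(
    (df.groupby('location')['rating'].transform('mean')
     + df.groupby('name')['rating'].transform('mean')) / 2
)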