
Pandas DataFrame: Fill NA values based on group mean

I would like to update the NA values of a Pandas DataFrame column with the values in a groupby object.

Let’s illustrate with an example:

We have a DataFrame with month and day columns, plus a column holding the temperature readings.

We’re simply measuring temperature multiple times a day for many months. Now, let’s assume that for some of our records, the temperature reading failed and we have a NA.

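As a small illustration of the situation described above, here is a hypothetical sample frame. The column name temperature and all of the values are assumptions made up for this sketch, not the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: several temperature readings per day, some of them missing.
df = pd.DataFrame({
    "month": [1, 1, 1, 1, 2, 2, 2],
    "day": [1, 1, 1, 2, 1, 1, 1],
    "temperature": [10.0, np.nan, 14.0, 8.0, 20.0, 22.0, np.nan],
})

# Count how many readings failed (are NA).
n_missing = int(df["temperature"].isna().sum())
print(n_missing)
```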

We could just use pandas' .fillna(), however we want to be a little more sophisticated. Since there are multiple readings per day (there could be hundreds per day), we’d like to take the daily average and use that as our fill value.

We can get the daily averages with a simple groupby (assuming the readings are in a column named temperature):

avg_temp_by_month_day = df.groupby(['month', 'day'])['temperature'].mean()

This gives us the mean temperature for each day of each month. The question is, how best to fill the NA values with the groupby values?
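
To make the grouping concrete, here is a minimal sketch on made-up data, grouping by both month and day. The temperature column name and the values are assumptions for illustration only. Note that mean() skips NA values by default, so a day with some failed readings still gets an average.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the original question).
df = pd.DataFrame({
    "month": [1, 1, 1, 1, 2, 2, 2],
    "day": [1, 1, 1, 2, 1, 1, 1],
    "temperature": [10.0, np.nan, 14.0, 8.0, 20.0, 22.0, np.nan],
})

# One mean per (month, day) pair; NA readings are ignored by mean().
daily_mean = df.groupby(["month", "day"])["temperature"].mean()
print(daily_mean)
```

The result is a Series indexed by (month, day), which is why it cannot be passed directly to .fillna() on the original row index without first being aligned to the rows.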

We could use an apply(), however this is really slow (1M+ records).
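
A row-wise apply() of the kind described above might look roughly like the following. This is a sketch under assumptions: the helper fill_row and the temperature column are hypothetical names. It calls a Python function once per row, which is what makes this style slow on large frames.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the original question).
df = pd.DataFrame({
    "month": [1, 1, 1, 1, 2, 2, 2],
    "day": [1, 1, 1, 2, 1, 1, 1],
    "temperature": [10.0, np.nan, 14.0, 8.0, 20.0, 22.0, np.nan],
})

daily_mean = df.groupby(["month", "day"])["temperature"].mean()

def fill_row(row):
    """Return the row's reading, or its day's mean if the reading is NA."""
    if pd.isna(row["temperature"]):
        # Rows arrive as float Series here, so cast the keys back to int.
        key = (int(row["month"]), int(row["day"]))
        return daily_mean.loc[key]
    return row["temperature"]

# One Python-level function call per row.
filled = df.apply(fill_row, axis=1)
print(filled.tolist())
```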

Is there a vectorized approach, perhaps using np.where(), or maybe creating another Series and merging?
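
For reference, the merge plus np.where() idea could be sketched as below. The column names temperature and daily_mean and the sample values are assumptions for illustration; this shows the shape of the approach, not a benchmarked solution.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the original question).
df = pd.DataFrame({
    "month": [1, 1, 1, 1, 2, 2, 2],
    "day": [1, 1, 1, 2, 1, 1, 1],
    "temperature": [10.0, np.nan, 14.0, 8.0, 20.0, 22.0, np.nan],
})

# Daily means as a flat table with one row per (month, day).
daily_mean = (
    df.groupby(["month", "day"], as_index=False)["temperature"]
      .mean()
      .rename(columns={"temperature": "daily_mean"})
)

# Attach each row's daily mean, then pick it wherever the reading is NA.
merged = df.merge(daily_mean, on=["month", "day"], how="left")
merged["temperature"] = np.where(
    merged["temperature"].isna(),
    merged["daily_mean"],
    merged["temperature"],
)
print(merged["temperature"].tolist())
```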

What’s a more efficient way to perform this operation?

Thank you!


Answer

I’m not sure this is the fastest approach, but instead of taking ~1 hour with apply, it takes ~20 sec for 1M+ records. The code has been updated to work on one or many columns.

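One common vectorized way to do this kind of fill is groupby with transform, sketched below for one or many columns. This is an illustration under assumptions and not necessarily the exact code this answer used: the column names temperature and humidity, the list name cols, and the sample values are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two measurement columns.
df = pd.DataFrame({
    "month": [1, 1, 1, 1, 2, 2, 2],
    "day": [1, 1, 1, 2, 1, 1, 1],
    "temperature": [10.0, np.nan, 14.0, 8.0, 20.0, 22.0, np.nan],
    "humidity": [50.0, 55.0, np.nan, 60.0, 40.0, np.nan, 44.0],
})

# Columns to fill; a single-element list works the same way.
cols = ["temperature", "humidity"]

# transform("mean") returns a frame shaped like df[cols], where every row
# holds its own (month, day) group mean, so it lines up with the rows to fill.
group_means = df.groupby(["month", "day"])[cols].transform("mean")

# Fill only the NA cells; existing readings are left untouched.
df[cols] = df[cols].fillna(group_means)
print(df)
```

Because the group means are computed with a built-in aggregation rather than a Python function per row, this tends to be much faster than a row-wise apply(). If every reading in a group is NA, that group's mean is also NA and those cells stay unfilled.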

If anyone finds a more efficient way to do this (whether in processing time or just in readability), I’ll unmark this answer and mark yours. Thank you!

User contributions licensed under: CC BY-SA