Skip to content
Advertisement

Pandas DataFrame: Fill NA values based on group mean

I would like to update the NA values of a Pandas DataFrame column with the values in a groupby object.

Let’s illustrate with an example:

We have the following DataFrame columns:

|--------|-------|-----|-------------|
| row_id | Month | Day | Temperature |
|--------|-------|-----|-------------|
| 1      | 1     | 1   | 14.3        |
| 2      | 1     | 1   | 14.8        |
| 3      | 1     | 2   | 13.1        |
|--------|-------|-----|-------------|

We’re simply measuring temperature multiple times a day for many months. Now, let’s assume that for some of our records, the temperature reading failed and we have a NA.

|--------|-------|-----|-------------|
| row_id | Month | Day | Temperature |
|--------|-------|-----|-------------|
| 1      | 1     | 1   | 14.3        |
| 2      | 1     | 1   | 14.8        |
| 3      | 1     | 2   | 13.1        |
| 4      | 1     | 2   | NA          |
| 5      | 1     | 3   | 14.8        |
| 6      | 1     | 4   | NA          |
|--------|-------|-----|-------------|

We could just use panda’s .fillna(), however we want to be a little more sophisticated. Since there are multiple readings per day (there could be 100’s per day), we’d like to take the daily average and use that as our fill value.

we can get the daily averages with a simple groupby:

avg_temp_by_month_day = df.groupby(['month'])['day'].mean()

Which gives us the means for each day by month. The question is, how best to fill the NA values with the groupby values?

We could use an apply(),

df['temperature'] = df.apply(
    lambda row: avg_temp_by_month_day.loc[r['month'], r['day']] if pd.isna(r['temperature']) else r['temperature'], 
    axis=1
)

however this is really slow (1M+ records).

Is there a vectorized approach, perhaps using np.where(), or maybe creating another Series and merging.

What’s the a more efficient way to perform this operation?

Thank you!

Advertisement

Answer

I’m not sure if this is the fastest, however instead of taking ~1 hour for apply, it takes ~20 sec for +1M records. The below code has been updated to work on 1 or many columns.

local_avg_cols = ['temperature'] # can work with multiple columns

# Create groupby's to get local averages
local_averages = df.groupby(['month', 'day'])[local_avg_cols].mean()

# Convert to DataFrame and prepare for merge
local_averages = pd.DataFrame(local_averages, columns=local_avg_cols).reset_index()

# Merge into original dataframe
df = df.merge(local_averages, on=['month', 'day'], how='left', suffixes=('', '_avg'))

# Now overwrite na values with values from new '_avg' col
for col in local_avg_cols:
    df[col] = df[col].mask(df[col].isna(), df[col+'_avg'])
    
# Drop new avg cols
df = df.drop(columns=[col+'_avg' for col in local_avg_cols])

If anyone finds a more efficient way to do this, (efficient in processing time, or in just readability), I’ll unmark this answer and mark yours. Thank you!

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement