I would like to update the NA values of a Pandas DataFrame column with the values in a groupby object.
Let’s illustrate with an example:
We have the following DataFrame columns:
|--------|-------|-----|-------------| | row_id | Month | Day | Temperature | |--------|-------|-----|-------------| | 1 | 1 | 1 | 14.3 | | 2 | 1 | 1 | 14.8 | | 3 | 1 | 2 | 13.1 | |--------|-------|-----|-------------|
We’re simply measuring temperature multiple times a day for many months. Now, let’s assume that for some of our records, the temperature reading failed and we have a NA
.
|--------|-------|-----|-------------| | row_id | Month | Day | Temperature | |--------|-------|-----|-------------| | 1 | 1 | 1 | 14.3 | | 2 | 1 | 1 | 14.8 | | 3 | 1 | 2 | 13.1 | | 4 | 1 | 2 | NA | | 5 | 1 | 3 | 14.8 | | 6 | 1 | 4 | NA | |--------|-------|-----|-------------|
We could just use panda’s .fillna()
, however we want to be a little more sophisticated. Since there are multiple readings per day (there could be 100’s per day), we’d like to take the daily average and use that as our fill value.
we can get the daily averages with a simple groupby:
avg_temp_by_month_day = df.groupby(['month'])['day'].mean()
Which gives us the means for each day by month. The question is, how best to fill the NA values with the groupby values?
We could use an apply()
,
df['temperature'] = df.apply( lambda row: avg_temp_by_month_day.loc[r['month'], r['day']] if pd.isna(r['temperature']) else r['temperature'], axis=1 )
however this is really slow (1M+ records).
Is there a vectorized approach, perhaps using np.where()
, or maybe creating another Series and merging.
What’s the a more efficient way to perform this operation?
Thank you!
Advertisement
Answer
I’m not sure if this is the fastest, however instead of taking ~1 hour for apply
, it takes ~20 sec for +1M records. The below code has been updated to work on 1 or many columns.
local_avg_cols = ['temperature'] # can work with multiple columns # Create groupby's to get local averages local_averages = df.groupby(['month', 'day'])[local_avg_cols].mean() # Convert to DataFrame and prepare for merge local_averages = pd.DataFrame(local_averages, columns=local_avg_cols).reset_index() # Merge into original dataframe df = df.merge(local_averages, on=['month', 'day'], how='left', suffixes=('', '_avg')) # Now overwrite na values with values from new '_avg' col for col in local_avg_cols: df[col] = df[col].mask(df[col].isna(), df[col+'_avg']) # Drop new avg cols df = df.drop(columns=[col+'_avg' for col in local_avg_cols])
If anyone finds a more efficient way to do this, (efficient in processing time, or in just readability), I’ll unmark this answer and mark yours. Thank you!