I have a dataframe with 2 columns, date and values. I want to replace the NaN values in the dataframe with mean values, but under a specific condition.
    import pandas as pd

    dates = ["2022-02-01", "2022-02-02", "2022-02-03", "2022-02-04", "2022-02-05", "2022-02-06",
             "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04", "2021-02-05", "2021-02-06"]
    values = [3, 1, 6, 2, 5, 7, 3, None, 3, None, None, None]

    df = pd.DataFrame({"date": dates, "values": values})
    # reset_index so the printed index runs 0..11, as in the output below
    df = df.sort_values(by='date', ascending=False).reset_index(drop=True)
    print(df)

              date  values
    0   2022-02-06     7.0
    1   2022-02-05     5.0
    2   2022-02-04     2.0
    3   2022-02-03     6.0
    4   2022-02-02     1.0
    5   2022-02-01     3.0
    6   2021-02-06     NaN
    7   2021-02-05     NaN
    8   2021-02-04     NaN
    9   2021-02-03     3.0
    10  2021-02-02     NaN
    11  2021-02-01     3.0
NaN values should be replaced with the mean of the values from the same period (+/- 1 day) in the year that has those values.
The value for 2021-02-04 should be (6 + 2 + 5) / 3 = 4.33, because the dates "2022-02-03", "2022-02-04", and "2022-02-05" have values of 6, 2, and 5.
I know how to fill NaN values with just the overall mean, but I do not know how to handle the +/- one day window.
Answer
Really hard to say what exactly you want to do, but given your data:
    import pandas as pd
    import numpy as np

    dates = ["2022-02-01", "2022-02-02", "2022-02-03", "2022-02-04", "2022-02-05", "2022-02-06",
             "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04", "2021-02-05", "2021-02-06"]
    values = [3, 1, 6, 2, 5, 7, 3, None, 3, None, None, None]

    df = pd.DataFrame({"date": dates, "values": values})
    df = df.sort_values(by='date', ascending=False).reset_index(drop=True)
You can try something like this:
    def process_data(dates, values):
        new_values = np.copy(values)
        # positions of the missing values
        indices = np.argwhere(np.isnan(values))
        # "MM-DD" part of every date, so that the year is ignored
        dates_without_year = ['-'.join(d.split('-')[1:]) for d in dates.astype(str)]
        for i, d in enumerate(dates):
            if i in indices:
                # the day before, the day itself and the day after
                possible_dates = np.array(
                    pd.Series(pd.date_range(d - pd.Timedelta(days=1), d + pd.Timedelta(days=1))),
                    dtype='datetime64[D]')
                possible_dates = ['-'.join(p.split('-')[1:]) for p in possible_dates.astype(str)]
                # all values (from any year) whose "MM-DD" falls into that window
                mean_values = values[np.argwhere(np.isin(dates_without_year, possible_dates))]
                # average only the non-NaN ones
                new_values[i] = np.mean(mean_values[~np.isnan(mean_values)])
        return new_values

    df['values'] = process_data(np.array(df['date'].values, dtype='datetime64[D]'),
                                df['values'].to_numpy())
              date    values
    0   2022-02-06  7.000000
    1   2022-02-05  5.000000
    2   2022-02-04  2.000000
    3   2022-02-03  6.000000
    4   2022-02-02  1.000000
    5   2022-02-01  3.000000
    6   2021-02-06  6.000000
    7   2021-02-05  4.666667
    8   2021-02-04  4.000000
    9   2021-02-03  3.000000
    10  2021-02-02  3.200000
    11  2021-02-01  3.000000
Take a close look at, for example, 2021-02-04, which had a NaN value. I disregard the years (as mentioned in the comments) and just look at the months and days, resulting in (6 + 2 + 5 + 3) / 4 = 4.0, since "2022-02-03", "2022-02-04", "2022-02-05", and "2021-02-03" have values of 6, 2, 5, and 3.
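If you would rather get the result described in the question (only values from the other year, so 4.33 for 2021-02-04), a pandas-only variant could look roughly like the sketch below. The function name `fill_nan_with_window_mean` and its parameters `window_days` and `exclude_same_year` are made up for this illustration, and it assumes that comparing on month and day, as above, is what you are after.

```python
import pandas as pd

dates = ["2022-02-01", "2022-02-02", "2022-02-03", "2022-02-04", "2022-02-05", "2022-02-06",
         "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04", "2021-02-05", "2021-02-06"]
values = [3, 1, 6, 2, 5, 7, 3, None, 3, None, None, None]

df = pd.DataFrame({"date": dates, "values": values})
df = df.sort_values(by="date", ascending=False).reset_index(drop=True)


def fill_nan_with_window_mean(df, window_days=1, exclude_same_year=True):
    """Hypothetical helper: fill each NaN in 'values' with the mean of the
    known values whose month-day lies within +/- window_days of that date."""
    out = df.copy()
    parsed = pd.to_datetime(out["date"])
    month_day = parsed.dt.strftime("%m-%d")
    # Only originally known values are averaged, so fills never feed each other.
    known = df["values"].notna()

    for idx in out.index[~known]:
        target = parsed[idx]
        window = pd.date_range(target - pd.Timedelta(days=window_days),
                               target + pd.Timedelta(days=window_days))
        mask = known & month_day.isin(list(window.strftime("%m-%d")))
        if exclude_same_year:
            # Assumption: only other years should contribute (the question's rule).
            mask = mask & (parsed.dt.year != target.year)
        if mask.any():
            out.loc[idx, "values"] = df.loc[mask, "values"].mean()
    return out


filled = fill_nan_with_window_mean(df)
print(filled)
```

With the default `exclude_same_year=True`, 2021-02-04 becomes (6 + 2 + 5) / 3, i.e. about 4.33. Passing `exclude_same_year=False` ignores the year entirely and gives the same numbers as the table above (4.0 for 2021-02-04 and 3.2 for 2021-02-02). A NaN with no known neighbour in the window is left as NaN.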