I want to impute missing data using multivariate imputation. In the data set below, column A has some missing values, and columns A and B have a correlation of 0.70. I want to use a regression-style relationship between column A and column B to impute the missing values in Python.
N.B.: I can do it using the mean, median, or mode, but I want to use the relationship with another column to fill in the missing values.
How should I approach this problem? Please share your solution.
import numpy as np
import pandas as pd
# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22 and is
# not used below, so that import is dropped here.

# Assign the data as a dict of lists.
data = {'Date': ['9/19/14', '9/20/14', '9/21/14', '9/21/14', '9/19/14', '9/20/14', '9/21/14', '9/21/14', '9/19/14', '9/20/14', '9/21/14', '9/21/14', '9/21/14'],
        'A': [77.13, 39.58, 33.70, np.nan, np.nan, 39.66, 64.625, 80.04, np.nan, np.nan, 19.43, 54.375, 38.41],
        'B': [19.5, 21.61, 22.25, 25.05, 24.20, 23.55, 5.70, 2.675, 2.05, 4.06, -0.80, 0.45, -0.90],
        'C': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'c', 'c']}

# Create the DataFrame.
df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])

# Print the output.
print(df)
Answer
Use:
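from sklearn.neural_network import MLPRegressor

# Split into rows where A is known (training) and rows where A is missing.
dfreg = df[df['A'].notna()]
dfimp = df[df['A'].isna()]

# Fit a regressor that predicts A from B on the known rows.
regr = MLPRegressor(random_state=1, max_iter=200).fit(dfreg['B'].values.reshape(-1, 1), dfreg['A'])

# R^2 of the fit on the training rows.
regr.score(dfreg['B'].values.reshape(-1, 1), dfreg['A'])

# Predicted A values for the rows where A is missing.
regr.predict(dfimp['B'].values.reshape(-1, 1))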
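Since the question asks for a regression-style relationship between the two columns, a plain linear model is a simpler option than a neural network. The following is only a minimal sketch, assuming scikit-learn's `LinearRegression` is acceptable here; the names `df_ab`, `linreg`, and `df_filled` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Columns A and B copied from the question's data (illustrative name df_ab).
df_ab = pd.DataFrame({
    'A': [77.13, 39.58, 33.70, np.nan, np.nan, 39.66, 64.625,
          80.04, np.nan, np.nan, 19.43, 54.375, 38.41],
    'B': [19.5, 21.61, 22.25, 25.05, 24.20, 23.55, 5.70,
          2.675, 2.05, 4.06, -0.80, 0.45, -0.90],
})

known = df_ab['A'].notna()   # rows where A is observed
missing = ~known             # rows where A must be imputed

# Fit A as a linear function of B, using the observed rows only.
linreg = LinearRegression().fit(df_ab.loc[known, ['B']], df_ab.loc[known, 'A'])

# Predict A from B for the missing rows and write the values into a copy.
df_filled = df_ab.copy()
df_filled.loc[missing, 'A'] = linreg.predict(df_ab.loc[missing, ['B']])

print(df_filled)
```

The observed values of A are left untouched; only the rows that were NaN receive predictions.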
Note that in the provided data the correlation between columns A and B is very low (less than 0.05). To replace the empty cells with the imputed values:
# Index of the rows where A is missing.
s = df[df['A'].isna()]['A'].index

# Fill those rows with the model's predictions (not its R^2 score).
df.loc[s, 'A'] = regr.predict(dfimp['B'].values.reshape(-1, 1))
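For the multivariate imputation the question mentions, scikit-learn also offers `IterativeImputer`, which models each column with missing values as a function of the other columns. This is a hedged sketch rather than part of the original answer: it assumes a scikit-learn version where the estimator is still experimental and has to be enabled explicitly, and the names `df_ab`, `imputer`, and `df_imputed` are illustrative.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental; this import must come before the next one.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Columns A and B copied from the question's data (illustrative name df_ab).
df_ab = pd.DataFrame({
    'A': [77.13, 39.58, 33.70, np.nan, np.nan, 39.66, 64.625,
          80.04, np.nan, np.nan, 19.43, 54.375, 38.41],
    'B': [19.5, 21.61, 22.25, 25.05, 24.20, 23.55, 5.70,
          2.675, 2.05, 4.06, -0.80, 0.45, -0.90],
})

# With default settings the imputer regresses A on B to fill the gaps in A.
imputer = IterativeImputer(random_state=0)

df_imputed = df_ab.copy()
df_imputed[['A', 'B']] = imputer.fit_transform(df_ab[['A', 'B']])

print(df_imputed)
```

With only one predictor column this behaves much like a single regression of A on B; the approach becomes more useful when several numeric columns are available.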
Output: