The class is composed of a set of attributes and functions, including:
Attributes:
- df: a pandas dataframe.
- numerical_feature_names: df columns with a numeric value.
- label_column_names: df string columns to be grouped.
Functions:
- mean(nums): takes a list of numbers as input and returns the mean.
- fill_na(df, numerical_feature_names, label_columns): takes class attributes as inputs and returns a transformed df.
And here’s the class:
    class PLUMBER():
        def __init__(self):
            # attributes
            self.df = df
            # specify label and numerical feature names:
            self.numerical_feature_names = numerical_feature_names
            self.label_column_names = label_column_names

        # mean
        def mean(self, nums):
            total = 0.0
            for num in nums:
                total = total + num
            return total / len(nums)

        # fill the numerical features
        def fill_na(self, df, numerical_feature_names, label_column_names):
            # declaring parameters:
            df = self.df
            numerical_feature_names = self.numerical_feature_names
            label_column_names = self.label_column_names
            # now replacing NaN with group mean
            for numerical_feature_name in numerical_feature_names:
                df[numerical_feature_name] = df.groupby([label_column_names]).transform(lambda x: x.fillna(self.mean(x)))
            return df
When trying to apply it to a pandas df:
    if __name__ == "__main__":
        # initialize class
        plumber = PLUMBER()
        # replace NaN with group mean
        df = plumber.fill_na(df=df, numerical_feature_names=numerical_feature_names, label_column_names=label_column_names)

The following error arises:
ValueError: Grouper and axis must be same length
Data and class parameters:

    import numpy as np
    import pandas as pd

    d = {'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
         'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
         'level': ['A01', 'A01', 'A01', 'A00', 'A00', 'A00'],
         'job title': ['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
         'number': [np.nan, 450, 299, np.nan, 19, 29],
         'age': [np.nan, 30, 28, np.nan, 29, 18]}
    df = pd.DataFrame(d)

    # headers
    column_names = df.columns.values.tolist()
    column_names = [column_name.strip() for column_name in column_names]

    # label_column_names (to be grouped)
    label_column_names = ['country', 'level', 'job title']

    # numerical_features:
    numerical_feature_names = [x for x in column_names if x not in label_column_names]
    numerical_feature_names.remove('month')
How could I change the class in order to get the transformed df (i.e. the one that replaces np.nan with its group mean)?
Answer
First, the error occurs because label_column_names is already a list, so you don't need the [] around it in the groupby. It should be df.groupby(label_column_names)... instead of df.groupby([label_column_names])...
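The difference between the two calls can be seen on a cut-down frame. This is only a sketch: the three-column df below is a reduced copy of the question's data, and the exact wording of the exception may differ between pandas versions, so it is captured and printed rather than assumed.

```python
import numpy as np
import pandas as pd

# Reduced copy of the question's data (an assumption made for brevity).
df = pd.DataFrame({
    "country": ["Japan", "Japan", "Japan", "Poland", "Poland", "Poland"],
    "level": ["A01", "A01", "A01", "A00", "A00", "A00"],
    "number": [np.nan, 450, 299, np.nan, 19, 29],
})
label_column_names = ["country", "level"]

# Flat list: every entry is read as a column name to group by.
n_groups = df.groupby(label_column_names).ngroups
print(n_groups)  # 2 -> (Japan, A01) and (Poland, A00)

# Nested list: the inner list is treated as a single key holding
# per-row values, and it has 2 entries for a 6-row frame.
try:
    df.groupby([label_column_names])
    nested_error = None
except Exception as exc:  # message text can vary by pandas version
    nested_error = f"{type(exc).__name__}: {exc}"
print(nested_error)
```

With the flat list, pandas looks the names up as columns; with the extra brackets it no longer does, which is why the length check fails.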
Now, to actually solve your problem: in the fill_na function of your class, replace the for loop (you don't actually need it) with
    df[numerical_feature_names] = (
        df[numerical_feature_names]
        .fillna(
            df.groupby(label_column_names)[numerical_feature_names].transform('mean')
        )
    )
Here, fillna fills the NaN values in the numerical_feature_names columns with the result of groupby.transform('mean'), i.e. the mean of each of these columns within its group.
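Putting both points together, one possible version of the class looks like the sketch below. Two things in it are assumptions rather than part of the original code: the constructor receives the dataframe and the column lists as arguments instead of reading them from module-level names, and fill_na works on a copy so the input frame is left untouched. The custom mean method is dropped because transform('mean') already skips NaN.

```python
import numpy as np
import pandas as pd


class PLUMBER:
    # Assumption: attributes are passed in explicitly instead of being
    # picked up from global variables as in the question.
    def __init__(self, df, numerical_feature_names, label_column_names):
        self.df = df
        self.numerical_feature_names = numerical_feature_names
        self.label_column_names = label_column_names

    def fill_na(self):
        # Work on a copy so the caller's dataframe is not modified.
        df = self.df.copy()
        # Per-group mean of each numeric column, aligned to the original rows.
        group_means = (
            df.groupby(self.label_column_names)[self.numerical_feature_names]
            .transform("mean")
        )
        # Replace NaN with the mean of the row's own group.
        df[self.numerical_feature_names] = (
            df[self.numerical_feature_names].fillna(group_means)
        )
        return df


if __name__ == "__main__":
    d = {
        "month": ["01/01/2020", "01/02/2020", "01/03/2020"] * 2,
        "country": ["Japan"] * 3 + ["Poland"] * 3,
        "level": ["A01"] * 3 + ["A00"] * 3,
        "job title": ["Insights Manager"] * 3 + ["Sales Director"] * 3,
        "number": [np.nan, 450, 299, np.nan, 19, 29],
        "age": [np.nan, 30, 28, np.nan, 29, 18],
    }
    df = pd.DataFrame(d)
    plumber = PLUMBER(
        df,
        numerical_feature_names=["number", "age"],
        label_column_names=["country", "level", "job title"],
    )
    filled = plumber.fill_na()
    print(filled[["country", "number", "age"]])
```

On the sample data, the missing Japan row gets number 374.5 and age 29.0, and the missing Poland row gets number 24.0 and age 23.5.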