I have a dataframe containing a column with categorical variables, which also includes NaNs.
   Category
1  A
2  A
3  NaN
4  B

I'd like to use sklearn.compose.make_column_transformer() to prepare the df in a clean way. I tried to impute the NaN values and one-hot encode the column with the following code:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer

transformer = make_column_transformer(
    (SimpleImputer(missing_values=np.nan, strategy='most_frequent'), ['Category']),
    (OneHotEncoder(sparse=False), ['Category'])
)
Running the transformer on my training data raises ValueError: Input contains NaN:

transformer.fit(X_train)
X_train_trans = transformer.transform(X_train)
The desired output would be something like this:

   A  B
1  1  0
2  1  0
3  1  0
4  0  1
That raises two questions:
1. Does the transformer compute both the SimpleImputer and the OneHotEncoder in parallel on the original data, or in the order I introduced them in the transformer?
2. How can I change my code so that the OneHotEncoder gets the imputed values as an input? I know that I can solve it outside of the transformer with pandas in two different steps, but I'd like to have the code in a clean pipeline format.
Answer
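On the first question: as far as I can tell, ColumnTransformer hands each transformer the original columns independently and concatenates the outputs side by side, so nothing is chained. Here is a minimal sketch to check this, using two imputers with different strategies on the same column. The "missing" placeholder and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# Same toy column as in the question (object dtype, one missing value).
df = pd.DataFrame({"Category": ["A", "A", np.nan, "B"]}, dtype=object)

# Two transformers on the SAME column. If they were chained, the second
# imputer would find nothing left to fill.
ct = make_column_transformer(
    (SimpleImputer(strategy="most_frequent"), ["Category"]),
    (SimpleImputer(strategy="constant", fill_value="missing"), ["Category"]),
)
out = ct.fit_transform(df)

# Row 3 (index 2) held the NaN: the first output column is filled with the
# mode, the second with the placeholder, so both saw the original data.
print(out[2])
```

The second output column still has something to fill, which suggests that the OneHotEncoder in the question also received the raw column, NaN included.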
You should use a sklearn Pipeline to apply the transforms sequentially, and pass that pipeline to the ColumnTransformer:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

s = pd.DataFrame(data={'Category': ['A', 'A', np.nan, 'B']})

# Impute first, then one-hot encode the imputed values.
category_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    # On scikit-learn >= 1.2 use sparse_output=False instead of sparse=False.
    ('ohe', OneHotEncoder(sparse=False))
])

transformer = ColumnTransformer(transformers=[
    ('category', category_pipeline, ['Category'])
])

transformer.fit_transform(s)
# array([[1., 0.],
#        [1., 0.],
#        [1., 0.],
#        [0., 1.]])
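Since the question asked for make_column_transformer() specifically, the same idea can be sketched with the make_* helpers, plus a labelled DataFrame like the desired output. This is only one way to do it and rests on a few assumptions: the helpers name the steps automatically after the lowercased class names ('pipeline', 'onehotencoder'), and sparse_threshold=0 is used to force a dense result so the sketch does not depend on the sparse / sparse_output argument of OneHotEncoder.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Category": ["A", "A", np.nan, "B"]}, dtype=object)

# Impute, then encode: make_pipeline runs its steps in sequence.
category_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(),
)

# sparse_threshold=0 asks the column transformer to always return a dense array.
ct = make_column_transformer(
    (category_pipeline, ["Category"]),
    sparse_threshold=0,
)
encoded = ct.fit_transform(df)

# Recover the category labels from the fitted encoder to name the columns.
encoder = ct.named_transformers_["pipeline"].named_steps["onehotencoder"]
result = pd.DataFrame(encoded, columns=encoder.categories_[0], index=df.index)
print(result)
```

The column labels come from the fitted encoder's categories_ attribute, so they follow whatever categories are present after imputation.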