sklearn.compose.make_column_transformer(): using SimpleImputer() and OneHotEncoder() in one step on one dataframe column

I have a dataframe with a column of categorical values that also contains NaNs.

  Category
1 A
2 A
3 NaN
4 B


I’d like to use sklearn.compose.make_column_transformer() to prepare the df in a clean way. I tried to impute the NaN values and one-hot encode the column with the following code:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer

transformer = make_column_transformer(
    (SimpleImputer(missing_values=np.nan, strategy='most_frequent'), ['Category']),
    (OneHotEncoder(sparse=False), ['Category'])
)


Running the transformer on my training data raises

ValueError: Input contains NaN

transformer.fit(X_train)
X_train_trans = transformer.transform(X_train)


The desired output would be something like this:

  A B
1 1 0
2 1 0
3 1 0
4 0 1

That raises two questions:

Does the transformer compute the SimpleImputer and the OneHotEncoder in parallel on the original data, or in the order I listed them in the transformer?
How can I change my code so that the OneHotEncoder receives the imputed values as input? I know that I can solve it outside of the transformer with pandas in two separate steps, but I’d like to keep the code in a clean pipeline format.

Answer

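ColumnTransformer applies each of its transformers independently to the original columns and concatenates the results, so the OneHotEncoder in your code still receives the NaNs. To apply the transforms sequentially, wrap them in a sklearn Pipeline and pass that Pipeline to the ColumnTransformer: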

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

s = pd.DataFrame(data={'Category': ['A', 'A', np.nan, 'B']})

# Impute first, then one-hot encode the imputed values
category_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    # scikit-learn >= 1.2 names this parameter sparse_output; older versions use sparse=False
    ('ohe', OneHotEncoder(sparse_output=False))
])

transformer = ColumnTransformer(transformers=[
    ('category', category_pipeline, ['Category'])
])

transformer.fit_transform(s)
array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])


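
Since the question asks for make_column_transformer() specifically, the same idea can probably be written with the make_* helpers as well. This is a sketch rather than the only way to do it: make_pipeline() chains the imputer and the encoder, and that pipeline is passed as a single transformer for the column. The names df, encoded and dense are illustrative, and the encoder is left at its defaults so the snippet does not depend on the sparse / sparse_output parameter name, which differs between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Category': ['A', 'A', np.nan, 'B']})

# The pipeline runs its steps in order: impute, then one-hot encode.
transformer = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder()),
     ['Category']),
)

encoded = transformer.fit_transform(df)

# Depending on the output density the result can be sparse or dense,
# so convert to a dense array before printing.
dense = encoded.toarray() if hasattr(encoded, 'toarray') else np.asarray(encoded)
print(dense)
```

The design point is that ordering lives inside the pipeline, while the column transformer only decides which columns each pipeline receives.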