
Missing categorical data should be encoded with an all-zero one-hot vector

I am working on a machine learning project with very sparsely labeled data. There are several categorical features, resulting in roughly one hundred different classes across the features.

For example:

JavaScript

After I put these through scikit-learn's OneHotEncoder, I expect the missing data to be encoded as 00, since the docs state that handle_unknown='ignore' causes the encoder to return an all-zero array. Substituting another value, such as with SimpleImputer, is not an option for me.

What I expect:

JavaScript

Instead OneHotEncoder treats the missing values as another category.

What I get:

JavaScript
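A minimal sketch that reproduces this behavior, assuming scikit-learn 0.24 or later and a hypothetical toy DataFrame `colors` (not my real data). As far as I can tell, handle_unknown='ignore' only applies to categories that were not seen during fit, so a NaN that is present during fit is learned as a category of its own:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the data from the question.
colors = pd.DataFrame({"color": ["red", "blue", np.nan, "red"]})

enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(colors)  # sparse matrix

# NaN was present during fit, so it shows up among the learned categories
# and gets its own output column.
learned = enc.categories_[0]
n_columns = encoded.shape[1]
print(learned, n_columns)
```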

I have seen the related question "How to handle missing values (NaN) in categorical data when using scikit-learn OneHotEncoder?", but the solutions do not work for me. I explicitly require a zero vector.


Answer

I have never really worked with sparse matrices, but one way is to remove the column corresponding to your nan value. Get categories_ from your fitted encoder, create a Boolean mask that is True where the category is not nan (I use pd.Series.notna, but there are probably other ways), and use it to create a new sparse matrix (or reassign the existing one). Basically, add this to your code:

JavaScript
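A minimal sketch of this masking approach, assuming scikit-learn 0.24 or later and a single categorical column. The DataFrame `colors` and the variable names are hypothetical, chosen only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one missing value.
colors = pd.DataFrame({"color": ["red", "blue", np.nan, "red"]})

enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(colors)  # sparse, one column per learned category

# Boolean mask over the learned categories: True where the category is not NaN.
mask = pd.Series(enc.categories_[0]).notna().to_numpy()

# Keep only the non-NaN columns; rows that were NaN become all-zero.
encoded_no_nan = encoded[:, np.flatnonzero(mask)]
print(encoded_no_nan.toarray())
```

With several features, the same mask would have to be built per feature from each entry of categories_ and concatenated in column order.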

Edit: get_dummies also gives a similar result: pd.get_dummies(color_cat[0], sparse=True)
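As a small illustration of the get_dummies route, here is a sketch on a hypothetical Series (not the asker's color_cat). By default get_dummies does not create an indicator column for NaN, so the missing row comes out as all zeros:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
color = pd.Series(["red", "blue", np.nan, "red"], name="color")

# dummy_na defaults to False, so NaN gets no column of its own.
# dtype=int is used here only to make the output easy to read;
# sparse=True can be passed as well to get sparse-backed columns.
dummies = pd.get_dummies(color, dtype=int)
print(dummies)
```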

EDIT: After looking a bit more, you can specify the categories parameter in OneHotEncoder, so if you do:

JavaScript
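One way this could look, as a sketch under the assumption of scikit-learn 0.24 or later and a hypothetical DataFrame `colors`: pass only the non-missing values as categories, so that NaN counts as unknown and handle_unknown='ignore' encodes it as zeros.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one missing value.
colors = pd.DataFrame({"color": ["red", "blue", np.nan, "red"]})

# List the known categories explicitly, leaving NaN out.
known = sorted(colors["color"].dropna().unique())

# NaN is not in `known`, so it is treated as an unknown category
# and ignored, which yields an all-zero row.
enc = OneHotEncoder(categories=[known], handle_unknown="ignore")
encoded = enc.fit_transform(colors)
dense = encoded.toarray()
print(dense)
```

This avoids the extra masking step, at the cost of having to list the categories for every feature up front.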
User contributions licensed under: CC BY-SA