I am working on a machine learning project with very sparsely labeled data. There are several categorical features, resulting in roughly one hundred different classes across the features.
For example:
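
    0    red
    1    blue
    2    <missing>

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    color_cat = pd.DataFrame(['red', 'blue', np.nan])
    # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`.
    color_enc = OneHotEncoder(sparse=True, handle_unknown='ignore')
    color_one_hot = color_enc.fit_transform(color_cat)
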
After I put these through scikit-learn's `OneHotEncoder`, I am expecting the missing data to be encoded as `00`, since the docs state that `handle_unknown='ignore'` causes the encoder to return an all-zero array. Substituting another value, such as with `SimpleImputer`, is not an option for me.
What I expect:

    0    10
    1    01
    2    00

Instead, `OneHotEncoder` treats the missing values as another category.
What I get:

    0    100
    1    010
    2    001

I have seen the related question "How to handle missing values (NaN) in categorical data when using scikit-learn OneHotEncoder?", but the solutions there do not work for me. I explicitly require a zero vector.
Answer
I have never really worked with sparse matrices, but one way is to remove the column corresponding to your `nan` value. Get the `categories_` from your fitted encoder, create a Boolean mask that is true where the category is not `nan` (I use `pd.Series.notna`, but there are probably other ways), and use it to create a new sparse matrix (or reassign the existing one). Basically, add this to your code:

    # currently you have
    color_one_hot
    # <3x3 sparse matrix of type '<class 'numpy.float64'>'
    #     with 3 stored elements in Compressed Sparse Row format>

    # line of code to add
    new_color_one_hot = color_one_hot[:, pd.Series(color_enc.categories_[0]).notna().to_numpy()]

    # and now you have
    new_color_one_hot
    # <3x2 sparse matrix of type '<class 'numpy.float64'>'
    #     with 2 stored elements in Compressed Sparse Row format>

    # and
    new_color_one_hot.todense()
    # matrix([[0., 1.],
    #         [1., 0.],
    #         [0., 0.]])

Edit: `get_dummies` also gives a similar result: `pd.get_dummies(color_cat[0], sparse=True)`
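To make the `get_dummies` remark concrete, here is a small sketch of my own (the `sparse=True` flag is left out so the values are easy to inspect). It relies on the pandas default `dummy_na=False`, under which a missing value gets no column of its own and therefore ends up as an all-zero row:

```python
import numpy as np
import pandas as pd

color_cat = pd.DataFrame(['red', 'blue', np.nan])

# dummy_na defaults to False, so NaN does not get its own column.
dummies = pd.get_dummies(color_cat[0])

# Cast to int so the values look the same whether pandas returns
# bool or uint8 dummy columns.
dummy_columns = dummies.columns.tolist()
dummy_values = dummies.astype(int).to_numpy()

print(dummy_columns)   # ['blue', 'red']
print(dummy_values)
# [[0 1]
#  [1 0]
#  [0 0]]
```

One caveat: `get_dummies` does not remember the categories it saw, so the columns depend on whatever data you pass in. If you need to encode new data consistently later, a fitted `OneHotEncoder` is probably the safer choice.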
EDIT: After looking a bit more, you can specify the parameter `categories` in `OneHotEncoder`, so if you do:

    color_cat = pd.DataFrame(['red', 'blue', np.nan])
    color_enc = OneHotEncoder(categories=[color_cat[0].dropna().unique()],  ## here
                              sparse=True, handle_unknown='ignore')
    color_one_hot = color_enc.fit_transform(color_cat)
    color_one_hot.todense()
    # matrix([[1., 0.],
    #         [0., 1.],
    #         [0., 0.]])

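Since the question mentions several categorical features, the same `categories` idea can probably be extended by building one list of non-missing values per column. This is only a sketch under assumptions: the `size` column and its values are made up for illustration, and I leave out the sparse flag because the encoder returns a sparse matrix by default.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical two-column frame; "size" is an invented second feature.
df = pd.DataFrame({
    "color": ["red", "blue", np.nan],
    "size": ["S", np.nan, "L"],
})

# One list of non-missing categories per column, in column order.
categories = [df[col].dropna().unique().tolist() for col in df.columns]

# NaN is not among the listed categories, so handle_unknown='ignore'
# encodes it as all zeros within that feature's block of columns.
enc = OneHotEncoder(categories=categories, handle_unknown="ignore")
encoded = enc.fit_transform(df)  # sparse by default

dense = encoded.toarray()
print(dense)
# [[1. 0. 1. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 0. 1.]]
```

The columns follow the order of `categories` (here `red`, `blue`, `S`, `L`), so each missing value shows up as zeros only in its own feature's columns.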