Missing categorical data should be encoded with an all-zero one-hot vector

Question

I am working on a machine learning project with very sparsely labeled data. There are several categorical features, with roughly one hundred distinct classes across them. For example: After I put these through scikit-learn's OneHotEncoder, I expect the missing data to be encoded as 00, since the docs state that handle_unknown='ignore' causes the encoder to return an

Accepted Answer

I have never really worked with sparse matrices, but one way is to remove the column corresponding to your nan value. Get categories_ from your fitted encoder, build a Boolean mask of the entries that are not nan (I use pd.Series.notna, but there are probably other ways), and use it to create a new sparse matrix (or reassign the existing one). Basically, add to your code:

    # currently you have
    color_one_hot
    # <3x3 sparse matrix of type '<class 'numpy.float64'>'
    #     with 3 stored elements in Compressed Sparse Row format>

    # line of code to add
    new_color_one_hot = color_one_hot[:, pd.Series(color_enc.categories_[0]).notna().to_numpy()]

    # and now you have
    new_color_one_hot
    # <3x2 sparse matrix of type '<class 'numpy.float64'>'
    #     with 2 stored elements in Compressed Sparse Row format>

    # and
    new_color_one_hot.todense()
    # matrix([[0., 1.],
    #         [1., 0.],
    #         [0., 0.]])

Edit: get_dummies also gives a similar result:

    pd.get_dummies(color_cat[0], sparse=True)

EDIT: After looking a bit more, you can specify the categories parameter in OneHotEncoder, so if you do:

    color_cat = pd.DataFrame(['red', 'blue', np.nan])
    color_enc = OneHotEncoder(categories=[color_cat[0].dropna().unique()],  ## here
                              sparse=True,  # renamed to sparse_output in scikit-learn 1.2
                              handle_unknown='ignore')
    color_one_hot = color_enc.fit_transform(color_cat)
    color_one_hot.todense()
    # matrix([[1., 0.],
    #         [0., 1.],
    #         [0., 0.]])
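For reference, here is a self-contained sketch that puts both approaches from this answer side by side. It assumes a reasonably recent scikit-learn (one that accepts nan as a category when fitting), and it uses a hypothetical column name "color" and an explicit object dtype, neither of which appears in the original post. The sparse/sparse_output argument is left at its default so the snippet does not depend on which name your version uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data mirroring the answer: two colors and one missing value.
# dtype=object is an assumption made here to keep the column a plain object column.
color_cat = pd.DataFrame({"color": ["red", "blue", np.nan]}, dtype=object)

# Approach 1: fit on everything, then drop the column that belongs to nan.
enc_all = OneHotEncoder(handle_unknown="ignore")
encoded_all = enc_all.fit_transform(color_cat)  # sparse, one column per category incl. nan
keep = pd.Series(enc_all.categories_[0]).notna().to_numpy()  # False for the nan category
dropped_nan = encoded_all[:, keep]

# Approach 2: tell the encoder which categories exist, leaving nan out,
# so nan is treated as unknown and encoded as an all-zero row.
known = [color_cat["color"].dropna().unique()]
enc_known = OneHotEncoder(categories=known, handle_unknown="ignore")
encoded_known = enc_known.fit_transform(color_cat)

print(dropped_nan.toarray())    # columns follow the fitted order: blue, red
print(encoded_known.toarray())  # columns follow the order given: red, blue
```

In both results the last row (the missing value) should come out as all zeros. Note that the column order differs between the two: the fitted encoder sorts its categories, while the explicit list keeps the order returned by unique().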

Answer