What should be the format of one-hot-encoded features for scikit-learn?

Question

I am trying to use the regressors/classifiers of the scikit-learn library. I am a bit confused about the format of one-hot-encoded features, since I can send either a DataFrame or NumPy arrays to the model. Say I have categorical features named 'a', 'b' and 'c'. Should I give them in separate columns (with pandas.get_dummies()), like below:

a b c
1 1 1

Accepted Answer

You can't pass a feature containing a merged list directly to the model. You should one-hot encode into separate columns first.

If you just want something quick and easy, get_dummies is fine for development, but the following approaches are generally preferred by most sources I've read.

If you want to encode your input data, use OneHotEncoder (OHE) to encode one or more columns, then merge the result with your other features. OHE gives good control over the output format, stores the fitted categories for reuse, and has error handling for unknown values. Good for production.

If you need to encode a single column, typically but not limited to labels, use LabelBinarizer to one-hot encode a column with a single value per row, or use MultiLabelBinarizer to one-hot encode a column with multiple values per row.

Once you have your one-hot encoded data/labels, you don't need to "tell" the model that certain features are one-hot. You just train the model on the data set using clf.fit(X_train, y_train) and make predictions using clf.predict(X_test).

OHE example

    from sklearn.preprocessing import OneHotEncoder
    import pandas as pd

    X = [['Male', 1], ['Female', 3], ['Female', 2]]
    ohe = OneHotEncoder(handle_unknown='ignore')
    X_enc = ohe.fit_transform(X).toarray()

    # Convert to dataframe if you need to merge this with other features.
    # get_feature_names_out() replaces get_feature_names(), which was
    # removed in scikit-learn 1.2.
    df = pd.DataFrame(X_enc, columns=ohe.get_feature_names_out())

MLB example

    from sklearn.preprocessing import MultiLabelBinarizer
    import pandas as pd

    df = pd.DataFrame({
        'style': ['Folk', 'Rock', 'Classical'],
        'instruments': [['guitar', 'vocals'], ['guitar', 'bass', 'drums', 'vocals'], ['piano']]
    })

    mlb = MultiLabelBinarizer()
    encoded = mlb.fit_transform(df['instruments'])
    encoded_df = pd.DataFrame(encoded, columns=mlb.classes_, index=df['instruments'].index)

    # Drop old column and merge new encoded columns
    df = df.drop('instruments', axis=1)
    df = pd.concat([df, encoded_df], axis=1, sort=False)
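To make the last point concrete, here is a minimal end-to-end sketch of encoding one categorical column, merging it with a numeric feature, and training a model without marking any column as one-hot. The toy data, the column names "color" and "size", and the choice of LogisticRegression are assumptions made for illustration only; they do not come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column and one numeric column.
train = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue"],
    "size": [1.0, 2.0, 3.0, 2.5, 1.5, 3.5],
})
y_train = [0, 1, 1, 1, 0, 1]

# Fit the encoder on the training data so the column layout is fixed.
ohe = OneHotEncoder(handle_unknown="ignore")
color_train = ohe.fit_transform(train[["color"]]).toarray()

# Merge the one-hot columns with the remaining numeric feature.
X_train = np.hstack([color_train, train[["size"]].to_numpy()])

clf = LogisticRegression()
clf.fit(X_train, y_train)  # nothing marks the one-hot columns as special

# Reuse the fitted encoder at prediction time. With handle_unknown="ignore",
# the unseen category "purple" becomes an all-zero row instead of an error.
test = pd.DataFrame({"color": ["green", "purple"], "size": [2.0, 1.0]})
color_test = ohe.transform(test[["color"]]).toarray()
X_test = np.hstack([color_test, test[["size"]].to_numpy()])
pred = clf.predict(X_test)
```

Reusing the same fitted encoder for training and test data is what keeps the columns aligned. Calling get_dummies separately on each set can produce different columns when a category is missing from one of them.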

merged
1,1,1
1,0,1
0,0,1
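
If the data already arrives in the merged form shown above, one possible way to get separate columns is to split each value on the comma. This is a sketch under the assumption that the merged column holds comma-separated strings; the column names 'a', 'b' and 'c' are taken from the feature names in the question.

```python
import pandas as pd

# The merged column from the table above, assumed to be stored as strings.
df = pd.DataFrame({"merged": ["1,1,1", "1,0,1", "0,0,1"]})

# Split on commas into one column per indicator and convert to integers.
split = df["merged"].str.split(",", expand=True).astype(int)
split.columns = ["a", "b", "c"]

# A plain 2-D numeric array (or the DataFrame itself) can be passed to fit().
X = split.to_numpy()
```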

Answer