What should be the format of one-hot-encoded features for scikit-learn?

I am trying to use the regressors/classifiers of the scikit-learn library. I am a bit confused about the format of one-hot-encoded features, since I can pass either a DataFrame or a NumPy array to the model. Say I have categorical features named ‘a’, ‘b’ and ‘c’. Should I give them in separate columns (with pandas.get_dummies()), like below:

a b c
1 1 1
1 0 1
0 0 1

or like this (all merged into one column):

merged
1,1,1
1,0,1
0,0,1

And how do I tell the scikit-learn model that these are one-hot-encoded categorical features?

Answer

You can’t pass a feature containing a merged list directly to the model. You should one-hot encode into separate columns first:

  • If you just want something quick and easy, get_dummies is fine for development, but the following approaches are generally preferred by most sources I’ve read.
  • If you want to encode your input data, use OneHotEncoder (OHE) to encode one or more columns, then merge with your other features. OHE gives good control over the output format, stores the fitted categories so the same encoding can be reapplied to new data, and can handle categories it has not seen before (see handle_unknown). Good for production.
  • If you need to encode a single column, typically but not limited to labels, use LabelBinarizer to one-hot encode a column where each row holds a single value, or use MultiLabelBinarizer to one-hot encode a column where each row holds multiple values.
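The LabelBinarizer option from the list above has no example further down, so here is a minimal sketch of it. The `colors` data is made up for illustration. Note that with only two distinct classes LabelBinarizer returns a single 0/1 column rather than two.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical single-valued column: exactly one category per row.
colors = ['red', 'green', 'blue', 'green']

lb = LabelBinarizer()
colors_enc = lb.fit_transform(colors)

# lb.classes_ gives the column order of the encoded array (sorted labels).
print(lb.classes_)
print(colors_enc)
```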

Once you have your one-hot encoded data/labels, you don’t need to “tell” the model that certain features are one-hot. You just train the model on the data set using clf.fit(X_train, y_train) and make predictions using clf.predict(X_test).
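As a rough sketch of that last step: the toy data, the choice of DecisionTreeClassifier and the column names below are assumptions for illustration only. The point is that the model just receives ordinary 0/1 numeric columns.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical feature and a binary target.
train = pd.DataFrame({'color': ['red', 'green', 'blue', 'green'],
                      'target': [1, 0, 1, 0]})
test = pd.DataFrame({'color': ['blue', 'green']})

# One 0/1 column per category; nothing marks them as "one-hot" for the model.
X_train = pd.get_dummies(train['color'], dtype=int)
y_train = train['target']

# Give the test set the same columns as the training set;
# categories absent from the test data become all-zero columns.
X_test = pd.get_dummies(test['color'], dtype=int).reindex(
    columns=X_train.columns, fill_value=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(pred)
```

The reindex step is one reason the answer recommends OneHotEncoder outside of quick experiments: a fitted encoder remembers the training categories, so you don't have to align columns by hand.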

OHE example

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

X = [['Male', 1], ['Female', 3], ['Female', 2]]
ohe = OneHotEncoder(handle_unknown='ignore')
X_enc = ohe.fit_transform(X).toarray()

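# Convert to a DataFrame if you need to merge this with other features.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0).
df = pd.DataFrame(X_enc, columns=ohe.get_feature_names_out())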

MLB example

from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

df = pd.DataFrame({
   'style': ['Folk', 'Rock', 'Classical'],
   'instruments': [['guitar', 'vocals'], ['guitar', 'bass', 'drums', 'vocals'], ['piano']]
})

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(df['instruments'])
encoded_df = pd.DataFrame(encoded, columns=mlb.classes_, index=df['instruments'].index)

# Drop old column and merge new encoded columns
df = df.drop('instruments', axis=1)
df = pd.concat([df, encoded_df], axis=1, sort=False)
User contributions licensed under: CC BY-SA