
How to use get_dummies or one hot encoding to encode a categorical feature with multiple elements?

I’m working on a dataset that has a feature called categories. Each observation’s value for that feature is a semicolon-delimited list, e.g.:

Rows categories
Row 1 “categorya;categoryb;categoryc”
Row 2 “categorya;categoryb”
Row 3 “categoryc”
Row 4 “categoryb;categoryc”

If I try pd.get_dummies(df,columns=['categories'])

I get back columns named after the entire string, e.g. a column called categorya;categoryb;categoryc.
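For reference, a minimal sketch that reproduces this behaviour on the four sample rows. The DataFrame construction is an assumption based on the table above. pandas also prefixes each dummy column with the source column name, so the actual labels look like categories_categorya;categoryb;categoryc.

```python
import pandas as pd

# Sample data mirroring the four rows in the question (assumed layout).
df = pd.DataFrame({
    "categories": [
        "categorya;categoryb;categoryc",
        "categorya;categoryb",
        "categoryc",
        "categoryb;categoryc",
    ]
})

# get_dummies treats each full string as a single category label,
# so every distinct string becomes its own column.
whole_string = pd.get_dummies(df, columns=["categories"], dtype=int)
print(list(whole_string.columns))
```

Each row ends up with exactly one 1, because each row holds exactly one distinct string.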

If I try

[code snippet not preserved]

I get individual column names e.g. categorya, categoryb.

But I’ll only get a 1 in one column, e.g. if the original value was “categoryb;categoryc”, I’d get a 1 in the categoryb column but not in the categoryc column.

Beyond the coding issue, am I making a fundamental error in my approach?


Answer

It looks to me like you are changing the shape of the data structure so that it no longer matches the original DataFrame.

[code snippet not preserved]

and

[code snippet not preserved]
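To illustrate the kind of shape mismatch described here, a minimal sketch follows. The original snippets are not available, so the split-and-explode step is an assumption about what produced the extra rows, and the names df and per_item_dummies are illustrative only.

```python
import pandas as pd

# Sample data mirroring the four rows in the question (assumed layout).
df = pd.DataFrame({
    "categories": [
        "categorya;categoryb;categoryc",
        "categorya;categoryb",
        "categoryc",
        "categoryb;categoryc",
    ]
})

# Splitting and exploding yields one row per item, not one row per observation.
per_item = df["categories"].str.split(";").explode()
per_item_dummies = pd.get_dummies(per_item, dtype=int)

print(df.shape)                # shape of the original DataFrame
print(per_item_dummies.shape)  # more rows than the original
```

Because each exploded row holds a single item, each row of the dummy frame has a 1 in only one column, and the row count no longer lines up with the original DataFrame.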

If you know the categories beforehand you could do something like:

[code snippet not preserved]
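One way this could look when the categories are known up front is sketched below. This is not necessarily the answer's original code; the list known_categories and the membership test are assumptions made for illustration.

```python
import pandas as pd

# Sample data mirroring the four rows in the question (assumed layout).
df = pd.DataFrame({
    "categories": [
        "categorya;categoryb;categoryc",
        "categorya;categoryb",
        "categoryc",
        "categoryb;categoryc",
    ]
})

# Hypothetical list of categories known in advance.
known_categories = ["categorya", "categoryb", "categoryc"]

items = df["categories"].str.split(";")
encoded_known = df.copy()
for cat in known_categories:
    # Test membership in the split list rather than substring matching,
    # so "categorya" would not accidentally match a longer name.
    encoded_known[cat] = items.apply(lambda row_items, c=cat: int(c in row_items))

print(encoded_known)
```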

Or if you don’t know the categories beforehand you could do:

[code snippet not preserved]
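When the categories are not known beforehand, they can be discovered from the data first. The sketch below is an assumption about how that might be done (the original snippet is not available); discovered and encoded_discovered are illustrative names.

```python
import pandas as pd

# Sample data mirroring the four rows in the question (assumed layout).
df = pd.DataFrame({
    "categories": [
        "categorya;categoryb;categoryc",
        "categorya;categoryb",
        "categoryc",
        "categoryb;categoryc",
    ]
})

items = df["categories"].str.split(";")

# Collect every distinct item that appears anywhere in the column.
discovered = sorted(items.explode().unique())

encoded_discovered = df.copy()
for cat in discovered:
    # 1 if the row's list contains this category, else 0.
    encoded_discovered[cat] = items.apply(lambda row_items, c=cat: int(c in row_items))

print(discovered)
print(encoded_discovered)
```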

You can also group by the major (original row) index level and sum the dummy columns, which collapses the per-item rows back into one row per observation.

Grouped get_dummies

Like this:

[code snippet not preserved]
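A minimal sketch of the grouped approach, assuming the items are separated with Series.explode so that the original row label is repeated for each item (the answer's own snippet is not available and may have built the index differently):

```python
import pandas as pd

# Sample data mirroring the four rows in the question (assumed layout).
df = pd.DataFrame({
    "categories": [
        "categorya;categoryb;categoryc",
        "categorya;categoryb",
        "categoryc",
        "categoryb;categoryc",
    ]
})

# One row per item; the index keeps the original row label.
per_item = df["categories"].str.split(";").explode()

# Dummy-encode each item, then sum the dummies within each original row.
grouped = pd.get_dummies(per_item, dtype=int).groupby(level=0).sum()

# Attach the encoded columns back to the original frame.
result = df.join(grouped)
print(result)
```

Summing gives counts, so a category repeated within one row would produce a value above 1; clipping to 1 afterwards is an option if strict indicators are needed.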

Then the simplest:

[code snippet not preserved]
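One candidate for a simplest form, offered as an assumption since the original snippet is not available, is the string accessor's own get_dummies method, which splits on a separator and encodes in a single call:

```python
import pandas as pd

# Sample data mirroring the four rows in the question (assumed layout).
df = pd.DataFrame({
    "categories": [
        "categorya;categoryb;categoryc",
        "categorya;categoryb",
        "categoryc",
        "categoryb;categoryc",
    ]
})

# Series.str.get_dummies splits each string on `sep` and returns
# one indicator column per distinct item.
simple = df["categories"].str.get_dummies(sep=";")
print(simple)
```

The result keeps one row per observation, so it can be joined back to the original DataFrame with df.join(simple).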
User contributions licensed under: CC BY-SA