Skip to content
Advertisement

How to use get_dummies or one hot encoding to encode a categorical feature with multiple elements?

I’m working on a dataset which has a feature called categories. The data for each observation in that feature consists of semi-colon delimited list eg.

Rows categories
Row 1 “categorya;categoryb;categoryc”
Row 2 “categorya;categoryb”
Row 3 “categoryc”
Row 4 “categoryb;categoryc”

If I try pd.get_dummies(df,columns=['categories'])

I get back columns with the entirety of the data as the column named e.g a column called categorya;categoryb;categoryc

If I try

pd.get_dummies(df.categories.str.split(";").apply(pd.Series).stack(),columns=['categories'])

I get individual column names e.g. categorya, categoryb.

But I’ll only get a 1 in one column e.g. if the original category value was “categoryb;categoryc” I’d only get a 1 in the b rather than c value.

I get the feeling that beyond the issue of coding I may be making a fundamental error in my approach?

Advertisement

Answer

It looks to me like you are changing the shape of the data structure such that it does not match the DF.

df.categories.str.split(";").apply(pd.Series).stack()

0  0    categorya
   1    categoryb
   2    categoryc
1  0    categorya
   1    categoryb
2  0    categoryc
3  0    categoryb
   1    categoryc

and

pd.get_dummies(df.categories.str.split(";").apply(pd.Series).stack())

     categorya  categoryb  categoryc
0 0          1          0          0
  1          0          1          0
  2          0          0          1
1 0          1          0          0
  1          0          1          0
2 0          0          0          1
3 0          0          1          0
  1          0          0          1

If you know the categories beforehand you could do something like:

df['categorya'] = np.where(df['categories'].str.contains('categorya'),1,0)

                      categories  categorya
0  categorya;categoryb;categoryc          1
1            categorya;categoryb          1
2                      categoryc          0
3            categoryb;categoryc          0

Or if you don’t know the categories beforehand you could do:

for s in df.categories.str.split(";").apply(pd.Series).stack().unique():
    df[s] = np.where(df['categories'].str.contains(s),1,0)

   categorya  categoryb  categoryc
0          1          1          1
1          1          1          0
2          0          0          1
3          0          1          1

Also, you can aggregate by major index and sum on the categorical (dummies) columns to get what you are looking for.

Grouped get_dummies

Like this:

pd.get_dummies(df.categories.str.split(";").apply(pd.Series).stack()) 
    .groupby(level=0).sum()

   categorya  categoryb  categoryc
0          1          1          1
1          1          1          0
2          0          0          1
3          0          1          1

Then the simplest:

df['categories'].str.get_dummies(sep=';')

       categories  catA  catB  catC
0  catA;catB;catC     1     1     1
1       catA;catB     1     1     0
2            catC     0     0     1
3       catB;catC     0     1     1
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement