How to use get_dummies or one hot encoding to encode a categorical feature with multiple elements?

Question

I’m working on a dataset which has a feature called categories. The data for each observation in that feature consists of a semicolon-delimited list, e.g. Row 1: “categorya;categoryb;categoryc”, Row 2: “categorya;categoryb”, Row 3: “categoryc”, Row 4: “cat…

Accepted Answer

It looks to me like you are changing the shape of the data structure such that it does not match the DataFrame.

    df.categories.str.split(";").apply(pd.Series).stack()

    0  0    categorya
       1    categoryb
       2    categoryc
    1  0    categorya
       1    categoryb
    2  0    categoryc
    3  0    categoryb
       1    categoryc

and

    pd.get_dummies(df.categories.str.split(";").apply(pd.Series).stack())

         categorya  categoryb  categoryc
    0 0          1          0          0
      1          0          1          0
      2          0          0          1
    1 0          1          0          0
      1          0          1          0
    2 0          0          0          1
    3 0          0          1          0
      1          0          0          1

If you know the categories beforehand you could do something like:

    df['categorya'] = np.where(df['categories'].str.contains('categorya'), 1, 0)

                          categories  categorya
    0  categorya;categoryb;categoryc          1
    1            categorya;categoryb          1
    2                      categoryc          0
    3            categoryb;categoryc          0

Or if you don’t know the categories beforehand you could do:

    for s in df.categories.str.split(";").apply(pd.Series).stack().unique():
        df[s] = np.where(df['categories'].str.contains(s), 1, 0)

       categorya  categoryb  categoryc
    0          1          1          1
    1          1          1          0
    2          0          0          1
    3          0          1          1

Also, you can aggregate by the major index and sum on the categorical (dummies) columns to get what you are looking for. Like this:

    pd.get_dummies(df.categories.str.split(";").apply(pd.Series).stack()).groupby(level=0).sum()

       categorya  categoryb  categoryc
    0          1          1          1
    1          1          1          0
    2          0          0          1
    3          0          1          1

Then the simplest:

    df['categories'].str.get_dummies(sep=';')

           categories  catA  catB  catC
    0  catA;catB;catC     1     1     1
    1       catA;catB     1     1     0
    2            catC     0     0     1
    3       catB;catC     0     1     1
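
For anyone who wants to try the last approach end to end, here is a minimal self-contained sketch. It assumes the data look like the four rows in the question; the variable names `dummies` and `encoded` are only illustrative, and joining the indicator columns back onto the original frame is one possible way to keep the raw `categories` column alongside them, not something the answer itself prescribes.

```python
import pandas as pd

# Sample data mirroring the four rows from the question.
df = pd.DataFrame({
    "categories": [
        "categorya;categoryb;categoryc",
        "categorya;categoryb",
        "categoryc",
        "categoryb;categoryc",
    ]
})

# Split each string on ";" and build one 0/1 indicator column per distinct
# label. The result keeps the original index, one row per observation.
dummies = df["categories"].str.get_dummies(sep=";")

# Illustrative choice: attach the indicator columns to the original frame.
# DataFrame.join aligns on the index, so the rows line up with the input.
encoded = df.join(dummies)

print(encoded)
```

Because `str.get_dummies` splits on the separator and compares whole labels, it should also sidestep a pitfall of the `str.contains` loop: `str.contains` does substring (and by default regex) matching, so a label that is a prefix of another label would be flagged for both.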

Rows	categories
Row 1	“categorya;categoryb;categoryc”
Row 2	“categorya;categoryb”
Row 3	“categoryc”
Row 4	“categoryb;categoryc”

Answer