I’m working on a dataset which has a feature called categories. The data for each observation in that feature consists of semi-colon delimited list eg.
Rows | categories |
---|---|
Row 1 | “categorya;categoryb;categoryc” |
Row 2 | “categorya;categoryb” |
Row 3 | “categoryc” |
Row 4 | “categoryb;categoryc” |
If I try pd.get_dummies(df,columns=['categories'])
I get back columns with the entirety of the data as the column named e.g a column called categorya;categoryb;categoryc
If I try
pd.get_dummies(df.categories.str.split(";").apply(pd.Series).stack(),columns=['categories'])
I get individual column names e.g. categorya, categoryb.
But I’ll only get a 1 in one column e.g. if the original category value was “categoryb;categoryc” I’d only get a 1 in the b rather than c value.
I get the feeling that beyond the issue of coding I may be making a fundamental error in my approach?
Advertisement
Answer
It looks to me like you are changing the shape of the data structure such that it does not match the DF.
df.categories.str.split(";").apply(pd.Series).stack() 0 0 categorya 1 categoryb 2 categoryc 1 0 categorya 1 categoryb 2 0 categoryc 3 0 categoryb 1 categoryc
and
pd.get_dummies(df.categories.str.split(";").apply(pd.Series).stack()) categorya categoryb categoryc 0 0 1 0 0 1 0 1 0 2 0 0 1 1 0 1 0 0 1 0 1 0 2 0 0 0 1 3 0 0 1 0 1 0 0 1
If you know the categories beforehand you could do something like:
df['categorya'] = np.where(df['categories'].str.contains('categorya'),1,0) categories categorya 0 categorya;categoryb;categoryc 1 1 categorya;categoryb 1 2 categoryc 0 3 categoryb;categoryc 0
Or if you don’t know the categories beforehand you could do:
for s in df.categories.str.split(";").apply(pd.Series).stack().unique(): df[s] = np.where(df['categories'].str.contains(s),1,0) categorya categoryb categoryc 0 1 1 1 1 1 1 0 2 0 0 1 3 0 1 1
Also, you can aggregate by major index and sum on the categorical (dummies) columns to get what you are looking for.
Like this:
pd.get_dummies(df.categories.str.split(";").apply(pd.Series).stack()) .groupby(level=0).sum() categorya categoryb categoryc 0 1 1 1 1 1 1 0 2 0 0 1 3 0 1 1
Then the simplest:
df['categories'].str.get_dummies(sep=';') categories catA catB catC 0 catA;catB;catC 1 1 1 1 catA;catB 1 1 0 2 catC 0 0 1 3 catB;catC 0 1 1