I’m working on a dataset which has a feature called categories. The data for each observation in that feature consists of semi-colon delimited list eg.
Rows | categories |
---|---|
Row 1 | “categorya;categoryb;categoryc” |
Row 2 | “categorya;categoryb” |
Row 3 | “categoryc” |
Row 4 | “categoryb;categoryc” |
If I try pd.get_dummies(df,columns=['categories'])
I get back columns with the entirety of the data as the column named e.g a column called categorya;categoryb;categoryc
If I try
pd.get_dummies(df.categories.str.split(";").apply(pd.Series).stack(),columns=['categories'])
I get individual column names e.g. categorya, categoryb.
But I’ll only get a 1 in one column e.g. if the original category value was “categoryb;categoryc” I’d only get a 1 in the b rather than c value.
I get the feeling that beyond the issue of coding I may be making a fundamental error in my approach?
Advertisement
Answer
It looks to me like you are changing the shape of the data structure such that it does not match the DF.
df.categories.str.split(";").apply(pd.Series).stack()
0 0 categorya
1 categoryb
2 categoryc
1 0 categorya
1 categoryb
2 0 categoryc
3 0 categoryb
1 categoryc
and
pd.get_dummies(df.categories.str.split(";").apply(pd.Series).stack())
categorya categoryb categoryc
0 0 1 0 0
1 0 1 0
2 0 0 1
1 0 1 0 0
1 0 1 0
2 0 0 0 1
3 0 0 1 0
1 0 0 1
If you know the categories beforehand you could do something like:
df['categorya'] = np.where(df['categories'].str.contains('categorya'),1,0)
categories categorya
0 categorya;categoryb;categoryc 1
1 categorya;categoryb 1
2 categoryc 0
3 categoryb;categoryc 0
Or if you don’t know the categories beforehand you could do:
for s in df.categories.str.split(";").apply(pd.Series).stack().unique():
df[s] = np.where(df['categories'].str.contains(s),1,0)
categorya categoryb categoryc
0 1 1 1
1 1 1 0
2 0 0 1
3 0 1 1
Also, you can aggregate by major index and sum on the categorical (dummies) columns to get what you are looking for.
Like this:
pd.get_dummies(df.categories.str.split(";").apply(pd.Series).stack())
.groupby(level=0).sum()
categorya categoryb categoryc
0 1 1 1
1 1 1 0
2 0 0 1
3 0 1 1
Then the simplest:
df['categories'].str.get_dummies(sep=';')
categories catA catB catC
0 catA;catB;catC 1 1 1
1 catA;catB 1 1 0
2 catC 0 0 1
3 catB;catC 0 1 1