I have a DataFrame with a Race column that has four categories. I would like to reduce it to three by combining the last two. This is what my current df looks like:
Year |State| Race |Sex | population
-----|-----|------|----|----------
2006 | CA  | 1    |1   | 504893
2006 | CA  | 1    |2   | 783760
2006 | CA  | 2    |1   | 800062
2006 | CA  | 2    |2   | 768300
2006 | CA  | 3    |1   | 347827
2006 | CA  | 3    |2   | 344672
2006 | CA  | 4    |1   | 565344
2006 | CA  | 4    |2   | 356779
I want to consolidate Race == 3 and Race == 4 into one value (which would be Race == 3). So my new df output would look something like this:
Year |State| Race |Sex | population
-----|-----|------|----|----------
2006 | CA  | 1    |1   | 504893
2006 | CA  | 1    |2   | 783760
2006 | CA  | 2    |1   | 800062
2006 | CA  | 2    |2   | 768300
2006 | CA  | 3    |1   | 913171
2006 | CA  | 3    |2   | 701451
Notice that the last two rows in this df are the sums, by sex, of the last four rows of the previous df. How can I accomplish this?
Answer
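Replace Race 4 with 3, then group by Year, State, Race and Sex and sum the population. Year and State have to be among the grouping keys, otherwise they are dropped from the result:

    # Recode category 4 as category 3
    df.loc[df['Race'] == 4, 'Race'] = 3
    # Sum population within each Year/State/Race/Sex group
    df = df.groupby(['Year', 'State', 'Race', 'Sex'], as_index=False)['population'].sum()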
You get
    Year State  Race  Sex  population
    2006    CA     1    1      504893
    2006    CA     1    2      783760
    2006    CA     2    1      800062
    2006    CA     2    2      768300
    2006    CA     3    1      913171
    2006    CA     3    2      701451
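For reference, here is a self-contained sketch of the same approach that can be run as is. It rebuilds the sample data from the question and wraps the two steps in a helper. The helper name `merge_race_categories` and its `mapping` argument are made up for this illustration and are not part of pandas. It uses `Series.replace` instead of `.loc` assignment, which should give the same result here and leaves the original df untouched.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Year": [2006] * 8,
    "State": ["CA"] * 8,
    "Race": [1, 1, 2, 2, 3, 3, 4, 4],
    "Sex": [1, 2, 1, 2, 1, 2, 1, 2],
    "population": [504893, 783760, 800062, 768300,
                   347827, 344672, 565344, 356779],
})


def merge_race_categories(frame, mapping):
    """Hypothetical helper: recode Race values per `mapping`, then re-aggregate."""
    out = frame.copy()  # work on a copy so the caller's DataFrame is not modified
    out["Race"] = out["Race"].replace(mapping)
    # Group on every identifying column so Year and State survive the aggregation
    return out.groupby(
        ["Year", "State", "Race", "Sex"], as_index=False
    )["population"].sum()


# Fold category 4 into category 3
result = merge_race_categories(df, {4: 3})
print(result)
```

Passing a dict such as `{4: 3, 5: 3}` would fold several categories in one call, assuming those codes exist in the data.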