I’m trying to use groupby and agg() function for this data processing step:
Input:
ID Fruit 1 Apples 1 Oranges 2 Strawberries 3 Blueberries 4 Blueberries 4 Apples
I plan to aggregate the data by ID. The requirement is if apples and oranges show up for the same ID, keep ‘Apples’; for other combinations, keep the first observation for each ID. So wanted output:
ID Fruit 1 Apples 2 Strawberries 3 Blueberries 4 Blueberries
I could pivot the table and use np.where; however, in this particular situation, I’d prefer to do it in a long format. Please help me with this data processing step. Thank you.
Advertisement
Answer
Use:
#sample data print (df) ID Fruit 0 1 Apples 1 1 Oranges 2 2 Strawberries 3 3 Blueberries 4 4 Blueberries 5 4 Apples 6 5 Blueberries 7 5 Oranges 8 5 Apples
You can aggregate set and GroupBy.first:
L = ['Apples','Oranges']
df1 = df.groupby('ID', sort=False)['Fruit'].agg([set, 'first']).reset_index()
print (df1)
ID set first
0 1 {Oranges, Apples} Apples
1 2 {Strawberries} Strawberries
2 3 {Blueberries} Blueberries
3 4 {Blueberries, Apples} Blueberries
4 5 {Oranges, Blueberries, Apples} Blueberries
And then test if all categories from list L exist in column set by Series.ge also possible another categories like ID=5 – then is assigned Apple else values from column first:
df1['Fruit'] = np.where(df1.pop('set').ge(set(L)), 'Apple', df1.pop('first'))
print (df1)
ID Fruit
0 1 Apple
1 2 Strawberries
2 3 Blueberries
3 4 Blueberries
4 5 Apple
Or if need test if exist only unique values from list converted to sets use Series.eq:
L = ['Apples','Oranges']
df1 = df.groupby('ID', sort=False)['Fruit'].agg([set, 'first']).reset_index()
print (df1)
ID set first
0 1 {Oranges, Apples} Apples
1 2 {Strawberries} Strawberries
2 3 {Blueberries} Blueberries
3 4 {Blueberries, Apples} Blueberries
4 5 {Oranges, Blueberries, Apples} Blueberries
df1['Fruit'] = np.where(df1.pop('set').eq(set(L)), 'Apple', df1.pop('first'))
print (df1)
ID Fruit
0 1 Apple
1 2 Strawberries
2 3 Blueberries
3 4 Blueberries
4 5 Blueberries