I’m trying to use groupby and agg() function for this data processing step:
Input:
ID Fruit 1 Apples 1 Oranges 2 Strawberries 3 Blueberries 4 Blueberries 4 Apples
I plan to aggregate the data by ID. The requirement is if apples and oranges show up for the same ID, keep ‘Apples’; for other combinations, keep the first observation for each ID. So wanted output:
ID Fruit 1 Apples 2 Strawberries 3 Blueberries 4 Blueberries
I could pivot the table and use np.where; however, in this particular situation, I’d prefer to do it in a long format. Please help me with this data processing step. Thank you.
Advertisement
Answer
Use:
#sample data print (df) ID Fruit 0 1 Apples 1 1 Oranges 2 2 Strawberries 3 3 Blueberries 4 4 Blueberries 5 4 Apples 6 5 Blueberries 7 5 Oranges 8 5 Apples
You can aggregate set
and GroupBy.first
:
L = ['Apples','Oranges'] df1 = df.groupby('ID', sort=False)['Fruit'].agg([set, 'first']).reset_index() print (df1) ID set first 0 1 {Oranges, Apples} Apples 1 2 {Strawberries} Strawberries 2 3 {Blueberries} Blueberries 3 4 {Blueberries, Apples} Blueberries 4 5 {Oranges, Blueberries, Apples} Blueberries
And then test if all categories from list L
exist in column set
by Series.ge
also possible another categories like ID=5
– then is assigned Apple
else values from column first
:
df1['Fruit'] = np.where(df1.pop('set').ge(set(L)), 'Apple', df1.pop('first')) print (df1) ID Fruit 0 1 Apple 1 2 Strawberries 2 3 Blueberries 3 4 Blueberries 4 5 Apple
Or if need test if exist only unique values from list converted to sets use Series.eq
:
L = ['Apples','Oranges'] df1 = df.groupby('ID', sort=False)['Fruit'].agg([set, 'first']).reset_index() print (df1) ID set first 0 1 {Oranges, Apples} Apples 1 2 {Strawberries} Strawberries 2 3 {Blueberries} Blueberries 3 4 {Blueberries, Apples} Blueberries 4 5 {Oranges, Blueberries, Apples} Blueberries df1['Fruit'] = np.where(df1.pop('set').eq(set(L)), 'Apple', df1.pop('first')) print (df1) ID Fruit 0 1 Apple 1 2 Strawberries 2 3 Blueberries 3 4 Blueberries 4 5 Blueberries