In a dataset like the one below, I’m trying to group the rows by attr_1 and attr_2, and if the sum of the count column exceeds a threshold (in this case 100), I want to keep the original rows.
account | attr_1 | attr_2 | count |
---|---|---|---|
ABC | X1 | Y1 | 25 |
DEF | X1 | Y1 | 100 |
ABC | X2 | Y2 | 150 |
DEF | X2 | Y2 | 0 |
ABC | X3 | Y3 | 10 |
DEF | X3 | Y3 | 15 |
I am using the messy approach below, but I’d like to see if there is a cleaner way that I could handle this.
```python
import pandas as pd

df = pd.DataFrame({'account': ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF'],
                   'attr_1': ['X1', 'X1', 'X2', 'X2', 'X3', 'X3'],
                   'attr_2': ['Y1', 'Y1', 'Y2', 'Y2', 'Y3', 'Y3'],
                   'count': [25, 100, 150, 0, 10, 15]
                   })

min_count = 100
# sum only the numeric 'count' column, so the string 'account' column is not aggregated
groups = df.groupby(by=['attr_1', 'attr_2'])['count'].sum()
group_count = groups[groups >= min_count]

# find row indices of the groups that meet the threshold
keep_index = []
for ix in group_count.index:
    keep_index.extend(df.query(f'attr_1=="{ix[0]}" & attr_2=="{ix[1]}"').index.values)

# filter dataframe
output_df = df[df.index.isin(keep_index)]
```
Answer
You can use `groupby` + `filter`. The lambda passed to `filter` must return a single boolean per group, and every row of a group is kept when it returns `True`:
```python
df.groupby(['attr_1', 'attr_2']).filter(lambda g: g['count'].sum() >= min_count)

  account attr_1 attr_2  count
0     ABC     X1     Y1     25
1     DEF     X1     Y1    100
2     ABC     X2     Y2    150
3     DEF     X2     Y2      0
```
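To see why whole groups survive, it can help to look at the per-group totals that the lambda compares against the threshold. The following is a minimal sketch that rebuilds the question's data; the names `totals` and `kept` are only illustrative and do not appear in the original post.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({'account': ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF'],
                   'attr_1': ['X1', 'X1', 'X2', 'X2', 'X3', 'X3'],
                   'attr_2': ['Y1', 'Y1', 'Y2', 'Y2', 'Y3', 'Y3'],
                   'count': [25, 100, 150, 0, 10, 15]})
min_count = 100

# One total per (attr_1, attr_2) group: this is the scalar the lambda tests
totals = df.groupby(['attr_1', 'attr_2'])['count'].sum()

# filter() keeps every original row of each group whose lambda returns True
kept = df.groupby(['attr_1', 'attr_2']).filter(
    lambda g: g['count'].sum() >= min_count
)
```

Here the group totals are 125, 150 and 25, so only the (X3, Y3) rows are dropped, and the surviving rows keep their original index labels.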
Or use `groupby` + `transform` to build a boolean mask that has one entry per row and is aligned with the original DataFrame's index:
```python
df[df.groupby(['attr_1', 'attr_2'])['count'].transform('sum').ge(min_count)]

  account attr_1 attr_2  count
0     ABC     X1     Y1     25
1     DEF     X1     Y1    100
2     ABC     X2     Y2    150
3     DEF     X2     Y2      0
```
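The one-liner above can be split into steps to show what `transform` contributes. This is a rough sketch on the question's data; `group_sum`, `mask` and `result` are illustrative names, not part of the original answer.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({'account': ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF'],
                   'attr_1': ['X1', 'X1', 'X2', 'X2', 'X3', 'X3'],
                   'attr_2': ['Y1', 'Y1', 'Y2', 'Y2', 'Y3', 'Y3'],
                   'count': [25, 100, 150, 0, 10, 15]})
min_count = 100

# transform('sum') broadcasts each group's total back onto its rows,
# so the result has the same length and index as df
group_sum = df.groupby(['attr_1', 'attr_2'])['count'].transform('sum')

# Element-wise comparison gives a boolean mask aligned with df
mask = group_sum.ge(min_count)

# Ordinary boolean indexing then selects the rows of qualifying groups
result = df[mask]
```

Because the mask is an ordinary boolean Series, it can be combined with other row-level conditions using `&` or `|` before indexing, which `filter` does not allow as directly.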