I'm new to Python. I have a DataFrame (DF), for example:
id | type |
---|---|
1 | A |
1 | B |
2 | C |
2 | B |
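I would like to add a column, for example A_flag, computed per id: it should be 1 on every row of an id that has at least one row with type A, and 0 otherwise. In the end I want this DataFrame (DF):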
id | type | A_flag |
---|---|---|
1 | A | 1 |
1 | B | 1 |
2 | C | 0 |
2 | B | 0 |
I can do this in two steps:
DF['A_flag_tmp'] = [1 if x.type=='A' else 0 for x in DF.itertuples()]
DF['A_flag'] = DF.groupby(['id'])['A_flag_tmp'].transform(np.max)
It works, but it is very slow on a big DataFrame. Is there any way to optimize this? Thanks for your help.
Answer
Replace the slow row-by-row loop with fast vectorized code. For the first step, generate a boolean Series with a pandas built-in method, e.g.:
df['type'].eq('A')
Then you can chain it into the groupby for the second step, as follows:
df['A_flag'] = df['type'].eq('A').groupby(df['id']).transform('max').astype(int)
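For reference, here is a minimal self-contained sketch of that one-liner applied to the sample data from the question. The intermediate name `is_a` is only for illustration, and the frame is rebuilt by hand here, so treat it as a sketch rather than a benchmark.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"id": [1, 1, 2, 2], "type": ["A", "B", "C", "B"]})

# Step 1: boolean Series, True where the row's type is 'A' (no Python-level loop).
is_a = df["type"].eq("A")

# Step 2: group that Series by id and broadcast the group maximum back to
# every row; the maximum is True when any row of the id has type 'A'.
df["A_flag"] = is_a.groupby(df["id"]).transform("max").astype(int)

print(df)
```

Grouping the boolean Series by `df['id']` directly means no temporary column such as `A_flag_tmp` has to be added to the frame.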
Result
print(df)

   id type  A_flag
0   1    A       1
1   1    B       1
2   2    C       0
3   2    B       0
In general, if you have more complicated conditions, you can define them in a vectorized way too, e.g. define the boolean Series m by:
m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)
Then, use it in step 2 as follows:
m.groupby(df['id']).transform('max').astype(int)
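As a rough illustration of the compound-condition case, the sketch below uses made-up data: the columns `type1` and `type2`, their values, and the output column name `flag` are assumptions for the example and do not come from the question. The parentheses only make explicit the grouping that Python already applies, since `&` binds tighter than `|`.

```python
import pandas as pd

# Hypothetical data: 'type1' and 'type2' are illustrative columns only.
df = pd.DataFrame({
    "id":    [1, 1, 2, 2, 3, 3],
    "type":  ["A", "B", "C", "B", "A", "C"],
    "type1": [2, 0, 5, 1, 1, 3],
    "type2": [0, 0, 0, 0, 0, 7],
})

# Row-level mask: (type == 'A' and type1 > 1) or (type2 != 0).
m = (df["type"].eq("A") & df["type1"].gt(1)) | df["type2"].ne(0)

# Group-level flag: 1 on every row of an id where at least one row matches.
df["flag"] = m.groupby(df["id"]).transform("max").astype(int)

print(df)
```

Here id 1 is flagged through the first branch of the condition, id 3 through the `type2` branch, and id 2 matches neither.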