I’m new to Python. I have a data frame (DF), for example:
id | type |
---|---|
1 | A |
1 | B |
2 | C |
2 | B |
I would like to add a column, e.g. A_flag, computed per id group. In the end the data frame (DF) should look like:
id | type | A_flag |
---|---|---|
1 | A | 1 |
1 | B | 1 |
2 | C | 0 |
2 | B | 0 |
I can do this in two steps:
DF['A_flag_tmp'] = [1 if x.type=='A' else 0 for x in DF.itertuples()]
DF['A_flag'] = DF.groupby(['id'])['A_flag_tmp'].transform(np.max)
It works, but it’s very slow for a big data frame. Is there any way to optimize this? Thanks for the help.
Answer
Replace the slow iterative code with fast vectorized code. For the first step, generate the boolean series with a built-in pandas method, e.g.
df['type'].eq('A')
Then feed it into the groupby for the second step, as follows:
df['A_flag'] = df['type'].eq('A').groupby(df['id']).transform('max').astype(int)
Result
print(df)

   id type  A_flag
0   1    A       1
1   1    B       1
2   2    C       0
3   2    B       0
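For reference, here is the whole answer as a self-contained script on the sample data from the question:

```python
import pandas as pd

# Sample data frame from the question
df = pd.DataFrame({'id': [1, 1, 2, 2], 'type': ['A', 'B', 'C', 'B']})

# Step 1: vectorized boolean series, True where type == 'A'
is_a = df['type'].eq('A')

# Step 2: broadcast the per-group max back to every row, cast to int
df['A_flag'] = is_a.groupby(df['id']).transform('max').astype(int)

print(df)
```

Note that the grouping key is passed as the series df['id'] rather than a column name, because is_a is a standalone Series, not a column of df.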
In general, if you have more complicated conditions, you can also define them in a vectorized way, e.g. define the boolean series m by:
m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)
Then, use it in step 2 as follows:
m.groupby(df['id']).transform('max').astype(int)
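As a quick sketch of that generalization, using made-up columns type1 and type2 (they are hypothetical here, just to make the combined condition runnable):

```python
import pandas as pd

# Hypothetical extra columns type1/type2 to illustrate a combined condition
df = pd.DataFrame({
    'id':    [1, 1, 2, 2],
    'type':  ['A', 'B', 'C', 'B'],
    'type1': [2, 0, 0, 1],
    'type2': [0, 0, 0, 0],
})

# & binds tighter than |, so this reads: (type == 'A' and type1 > 1) or type2 != 0
m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)

# Same step 2 as before: per-group max of the boolean series, cast to int
df['flag'] = m.groupby(df['id']).transform('max').astype(int)
```

Here only the first row satisfies the condition, so every row of group id=1 gets flag 1 and group id=2 gets 0.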