
Pandas: Dataframe itertuples boolean series groupby optimization

I’m new to Python. I have a data frame (DF), for example:

id type
1 A
1 B
2 C
2 B

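I would like to add a column, for example A_flag, computed per id group: it should be 1 for every row of an id that has at least one row with type A, and 0 otherwise. In the end I want this data frame (DF):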

id type A_flag
1 A 1
1 B 1
2 C 0
2 B 0

I can do this in two steps:

  • DF['A_flag_tmp'] = [1 if x.type=='A' else 0 for x in DF.itertuples()]
  • DF['A_flag'] = DF.groupby(['id'])['A_flag_tmp'].transform(np.max)

It works, but it is very slow for a big data frame. Is there any way to optimize this case? Thanks for your help.


Answer

Replace the slow row-by-row iteration in your first step with a vectorized Pandas built-in that produces a boolean series, e.g.

df['type'].eq('A')

Then chain the groupby from your second step directly onto that boolean series, so no temporary column is needed:

df['A_flag'] = df['type'].eq('A').groupby(df['id']).transform('max').astype(int)

Result

print(df)


   id type  A_flag
0   1    A       1
1   1    B       1
2   2    C       0
3   2    B       0
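
As a sanity check, here is a minimal, self-contained sketch that rebuilds the sample data from the question and compares the original two-step approach with the vectorized one-liner. The variable names `slow` and `fast` are only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"id": [1, 1, 2, 2], "type": ["A", "B", "C", "B"]})

# Original approach: Python-level loop over rows, then a per-group max
tmp = pd.Series(
    [1 if x.type == "A" else 0 for x in df.itertuples()],
    index=df.index,
)
slow = tmp.groupby(df["id"]).transform("max")

# Vectorized approach: boolean series, per-group max, cast to int
fast = df["type"].eq("A").groupby(df["id"]).transform("max").astype(int)

df["A_flag"] = fast
print(df)
```

Taking the max of a boolean series within each group acts like "any row in this group is True", which is why the flag is broadcast to every row of the same id.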

In general, if you have more complicated conditions, you can define them in a vectorized way too, e.g. define the boolean series m by:

m = (df['type'].eq('A') & df['type1'].gt(1)) | (df['type2'] != 0)

Then, use it in step 2 as follows:

m.groupby(df['id']).transform('max').astype(int)    
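
A minimal sketch of the combined-condition case follows. The columns `type1` and `type2` and all of their values are made-up assumptions for illustration only; they are not part of the question's data.

```python
import pandas as pd

# Hypothetical data: type1 and type2 are invented for this example
df = pd.DataFrame({
    "id":    [1, 1, 2, 2, 3, 3],
    "type":  ["A", "B", "C", "B", "A", "C"],
    "type1": [2, 0, 5, 5, 1, 0],
    "type2": [0, 0, 0, 0, 0, 3],
})

# & binds tighter than |, so this is (type == 'A' AND type1 > 1) OR (type2 != 0);
# the parentheses only make that grouping explicit
m = (df["type"].eq("A") & df["type1"].gt(1)) | (df["type2"] != 0)

# Flag every row of an id if at least one row in that id satisfies m
df["flag"] = m.groupby(df["id"]).transform("max").astype(int)
print(df)
```

Here id 1 is flagged by its first row (type A with type1 > 1), id 3 is flagged by its second row (type2 != 0), and id 2 has no matching row.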
User contributions licensed under: CC BY-SA