Pandas: Dataframe itertuples boolean series groupby optimization

I’m new to Python. I have a data frame (DF), for example:

id type
1 A
1 B
2 C
2 B

I would like to add a column, for example A_flag, computed per id group: 1 for every row whose id has at least one row of type A, and 0 otherwise. In the end I want this data frame (DF):

id type A_flag
1 A 1
1 B 1
2 C 0
2 B 0

I can do this in two steps:

  • DF['A_flag_tmp'] = [1 if x.type=='A' else 0 for x in DF.itertuples()]
  • DF['A_flag'] = DF.groupby(['id'])['A_flag_tmp'].transform(np.max)

It works, but it’s very slow for a big data frame. Is there any way to optimize this case? Thanks for the help.

Answer

Replace the slow row-by-row iteration with fast vectorized code: in your first step, generate a boolean series with Pandas built-in functions, e.g.

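A minimal sketch of such a boolean series, assuming the sample data from the question. The name m is only an illustrative choice, and the comparison shown is one possible way to write the condition:

```python
import pandas as pd

# Sample frame from the question.
DF = pd.DataFrame({"id": [1, 1, 2, 2], "type": ["A", "B", "C", "B"]})

# Vectorized comparison: one boolean per row, with no Python-level loop.
m = DF["type"] == "A"
```

The comparison runs over the whole column at once, which is what replaces the itertuples() list comprehension.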

Then, you can use this boolean series in the groupby statement for the second step, as follows:

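One way the second step could look, again assuming the sample data and a boolean series named m (both rebuilt here so the snippet runs on its own). Taking the per-group maximum of a boolean series gives True when any row in the group is True:

```python
import pandas as pd

DF = pd.DataFrame({"id": [1, 1, 2, 2], "type": ["A", "B", "C", "B"]})
m = DF["type"] == "A"

# Group the boolean series by id, broadcast each group's max back to
# every row of that group, then cast True/False to 1/0.
DF["A_flag"] = m.groupby(DF["id"]).transform("max").astype(int)
```

No temporary A_flag_tmp column is needed, since the series is grouped directly by DF["id"].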

Result

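To check the result on the sample data, a small end-to-end sketch can compare the vectorized version against the loop-based version from the question. The names tmp and slow are hypothetical helpers used only for this cross-check:

```python
import pandas as pd

DF = pd.DataFrame({"id": [1, 1, 2, 2], "type": ["A", "B", "C", "B"]})

# Vectorized version: boolean series, then per-id max broadcast to every row.
m = DF["type"] == "A"
DF["A_flag"] = m.groupby(DF["id"]).transform("max").astype(int)

# Loop-based version from the question, kept only as a cross-check.
tmp = pd.Series([1 if x.type == "A" else 0 for x in DF.itertuples()], index=DF.index)
slow = tmp.groupby(DF["id"]).transform("max")

print(DF)  # A_flag is 1, 1, 0, 0 for the sample data
print((DF["A_flag"] == slow).all())
```

Both approaches should agree row for row; only the way the intermediate flag is built differs.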

In general, if you have more complicated conditions, you can also define them in a vectorized way, e.g. define the boolean series m by:

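As an illustration only, a compound condition might be combined with & (and), | (or) and ~ (not). The condition below is hypothetical and was chosen just to show the syntax on the sample columns:

```python
import pandas as pd

DF = pd.DataFrame({"id": [1, 1, 2, 2], "type": ["A", "B", "C", "B"]})

# Hypothetical compound condition: type is A or C, and id is greater than 1.
# Each comparison needs its own parentheses because & and | bind tighter than ==.
m = DF["type"].isin(["A", "C"]) & (DF["id"] > 1)
```

Use &, | and ~ rather than the Python keywords and, or and not, which do not work element-wise on a series.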

Then, use it in step 2 as follows:

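A sketch of step 2 with such a mask, assuming a hypothetical compound condition and a hypothetical output column named flag. The groupby part is the same as before, only the mask changes:

```python
import pandas as pd

DF = pd.DataFrame({"id": [1, 1, 2, 2], "type": ["A", "B", "C", "B"]})

# Hypothetical compound condition, used only for illustration.
m = DF["type"].isin(["A", "C"]) & (DF["id"] > 1)

# Same second step as before: per-id max of the mask, cast to 0/1.
DF["flag"] = m.groupby(DF["id"]).transform("max").astype(int)
```

Here only id 2 has a row that satisfies the condition, so both of its rows get 1 and the rows of id 1 get 0.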
User contributions licensed under: CC BY-SA