Dataframe increase speed of for loop for set value of column

Tags: , ,



I have dataframe from pandas (import pandas as pd)

print(df)
        C1  C2
    0    0   0
    1    0   0
    2    1   0
    3    1   1
    4    1   1
    5    0   1
    6    0   0
    7    0   0
    8    0   0
    9    1   0
    10   1   1
    11   1   1
    12   0   1
    13   0   0

I want count +1 in ‘C3’ after rising edges (rising edges start when C1 =1 and C2=0) I tried with iterrow()

count=1
df['C3'] = 0
for index, row in df.iterrows():
        if (row.C1 == 1) and (row.C2 == 0):
               count += 1 
               df.at[index, 'C3'] = count
        else:
               df.at[index, 'C3'] = count

    print(df)
         C1  C2  C3
    0    0   0   1
    1    0   0   1
    2    1   0   2
    3    1   1   2
    4    1   1   2
    5    0   1   2
    6    0   0   2
    7    0   0   2
    8    0   0   2
    9    1   0   3
    10   1   1   3
    11   1   1   3
    12   0   1   3
    13   0   0   3

for a dataframe with 300000 row, it’s little bit slow, does it have a simple way to make it more faster?

Thanks a lot for your help!

Answer

You can:

  • create a Series counts that is a boolean mask for the condition you want (counts);
  • add C3 to the original df with value 1 + counts.cumsum()

Note: pandas joins the series to the dataframe based on index values, not order. Doing some intermediate operation that mangles the series of the df or counts will produce unexpected results.

code:

counts = (df.C1 == 1) & (df.C2 == 0)
df["C3"] = 1 + counts.cumsum()

result:

    C1  C2  C3
0    0   0   1
1    0   0   1
2    1   0   2
3    1   1   2
4    1   1   2
5    0   1   2
6    0   0   2
7    0   0   2
8    0   0   2
9    1   0   3
10   1   1   3
11   1   1   3
12   0   1   3
13   0   0   3

Performance

Lets compare the performance of three options: iterrows, df.apply, and the vectorized solution above:

df = pd.DataFrame(dict(C1=np.random.choice(2,size=100000), C2=np.random.choice(2,size=100000)))

df1 = df.copy(deep=True)
df2 = df.copy(deep=True)
df3 = df.copy(deep=True)

def use_iterrows():
    count=1
    df1['C3'] = 0
    for index, row in df1.iterrows():
        if (row.C1 == 1) and (row.C2 == 0):
               count += 1 
               df1.at[index, 'C3'] = count
        else:
               df1.at[index, 'C3'] = count

def use_apply():
    df2['C3'] = df2.apply(lambda x: x['C1']==1 and x['C2']==0, axis=1).cumsum()+1

def use_vectorized():
    counts = (df3.C1 == 1) & (df3.C2 == 0)
    df3["C3"] = 1 + counts.cumsum()

%timeit use_iterrows()
# 8.23 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    

%timeit use_apply()
# 1.54 s ± 27.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit use_vectorized()
# 1.28 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Summary: using vectorized functions is by far the fastest (~1000x(!) faster for a df of 100k rows). I like to keep in the habit of using vectorized solutions when possible. The strength of df.apply is that it is very flexible and will work in situations that are difficult for a vectorized operation. I don’t think I’ve ever needed iterrows.



Source: stackoverflow