I have a pandas DataFrame (import pandas as pd):
print(df)

    C1  C2
0    0   0
1    0   0
2    1   0
3    1   1
4    1   1
5    0   1
6    0   0
7    0   0
8    0   0
9    1   0
10   1   1
11   1   1
12   0   1
13   0   0
I want to count +1 in 'C3' after each rising edge (a rising edge starts when C1 = 1 and C2 = 0). I tried with iterrows():
count = 1
df['C3'] = 0
for index, row in df.iterrows():
    if (row.C1 == 1) and (row.C2 == 0):
        count += 1
        df.at[index, 'C3'] = count
    else:
        df.at[index, 'C3'] = count

print(df)

    C1  C2  C3
0    0   0   1
1    0   0   1
2    1   0   2
3    1   1   2
4    1   1   2
5    0   1   2
6    0   0   2
7    0   0   2
8    0   0   2
9    1   0   3
10   1   1   3
11   1   1   3
12   0   1   3
13   0   0   3
For a DataFrame with 300,000 rows this is a little bit slow. Is there a simple way to make it faster?
Thanks a lot for your help!
Answer
You can:
- create a Series counts that is a boolean mask for the condition you want;
- add C3 to the original df with value 1 + counts.cumsum().

Note: pandas joins the Series to the DataFrame based on index values, not order. Doing some intermediate operation that mangles the index of the df or of counts will produce unexpected results (a short demonstration follows the result below).
code:
counts = (df.C1 == 1) & (df.C2 == 0)
df["C3"] = 1 + counts.cumsum()
result:
    C1  C2  C3
0    0   0   1
1    0   0   1
2    1   0   2
3    1   1   2
4    1   1   2
5    0   1   2
6    0   0   2
7    0   0   2
8    0   0   2
9    1   0   3
10   1   1   3
11   1   1   3
12   0   1   3
13   0   0   3
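Here is the demonstration of the index-alignment caveat mentioned above, as a minimal sketch on toy data made up for illustration. Reversing the mask does not break the row alignment of the final assignment, but cumsum accumulates in the Series' current order, so the running count comes out wrong:

import pandas as pd

toy = pd.DataFrame({"C1": [0, 1, 1, 0], "C2": [0, 0, 1, 0]})
mask = (toy.C1 == 1) & (toy.C2 == 0)   # True only at row 1

# cumsum runs in the Series' current order, so reversing the mask
# changes the running total, even though the assignment below
# realigns values to rows by index label:
reversed_mask = mask.iloc[::-1]
toy["C3"] = 1 + reversed_mask.cumsum()
print(toy["C3"].tolist())              # [2, 2, 1, 1], not the expected [1, 2, 2, 2]

One more side note on semantics: the mask increments the count on every row where C1 == 1 and C2 == 0, which is exactly what the loop in the question does. If the data could contain two such rows back to back and each rising edge should count only once, one possible refinement (a sketch, assuming an edge should fire only on the first row of a run) compares each row against the previous one with shift():

cond = (df.C1 == 1) & (df.C2 == 0)
# an edge starts where the condition is True now but was False on the
# previous row; fill_value=False lets row 0 count if the condition holds
edges = cond & ~cond.shift(1, fill_value=False)
df["C3"] = 1 + edges.cumsum()

On the sample data this produces the same C3 as above, since no two consecutive rows both satisfy the condition.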
Performance
Let's compare the performance of three options: iterrows, df.apply, and the vectorized solution above:
import numpy as np

df = pd.DataFrame(dict(C1=np.random.choice(2, size=100000),
                       C2=np.random.choice(2, size=100000)))
df1 = df.copy(deep=True)
df2 = df.copy(deep=True)
df3 = df.copy(deep=True)

def use_iterrows():
    count = 1
    df1['C3'] = 0
    for index, row in df1.iterrows():
        if (row.C1 == 1) and (row.C2 == 0):
            count += 1
            df1.at[index, 'C3'] = count
        else:
            df1.at[index, 'C3'] = count

def use_apply():
    df2['C3'] = df2.apply(lambda x: x['C1'] == 1 and x['C2'] == 0, axis=1).cumsum() + 1

def use_vectorized():
    counts = (df3.C1 == 1) & (df3.C2 == 0)
    df3["C3"] = 1 + counts.cumsum()

%timeit use_iterrows()
# 8.23 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit use_apply()
# 1.54 s ± 27.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit use_vectorized()
# 1.28 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Summary: the vectorized solution is by far the fastest (~1000x(!) faster for a df of 100k rows). I like to stay in the habit of using vectorized solutions when possible. The strength of df.apply is that it is very flexible and will work in situations that are difficult for a vectorized operation. I don't think I've ever needed iterrows.
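For instance, apply shines when the per-row logic is arbitrary Python with no obvious vectorized equivalent. In this sketch, tricky_python_logic is a hypothetical stand-in for such logic:

def tricky_python_logic(c1, c2):
    # imagine branching, string parsing, or a call into a
    # third-party library that only accepts scalars
    return (c1 == 1 and c2 == 0) or (c1 + c2) % 3 == 2

df["flag"] = df.apply(lambda row: tricky_python_logic(row["C1"], row["C2"]), axis=1)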