Given the following dataframe df:
import pandas as pd
df = pd.DataFrame({'A':['Tony', 'Mike', 'Jen', 'Anna'], 'B': ['no', 'yes', 'no', 'yes']})
    A    B
0   Tony no 
1   Mike yes
2   Jen  no
3   Anna yes
I want to add another column that keeps a running count of the rows where df['B'] == 'yes' and is 0 for every other row:
      A    B  C
0  Tony   no  0
1  Mike  yes  1
2   Jen   no  0
3  Anna  yes  2
How can I do this?
Answer
You can use numpy.where with the cumsum of a boolean mask:
m = df['B']=='yes'
df['C'] = np.where(m, m.cumsum(), 0)
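To see why this works, here is a minimal sketch that spells out the intermediate values on the sample data from the question. The variable name running is only for illustration and is not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['Tony', 'Mike', 'Jen', 'Anna'],
                   'B': ['no', 'yes', 'no', 'yes']})

# Boolean mask: True where B is 'yes' -> False, True, False, True
m = df['B'] == 'yes'

# cumsum treats True as 1 and False as 0, so this is a running
# count of 'yes' rows seen so far -> 0, 1, 1, 2
running = m.cumsum()

# Keep the running count on 'yes' rows and put 0 everywhere else
df['C'] = np.where(m, running, 0)
print(df)
```

Note that the running count alone is not enough: row 2 ('no') still carries the value 1 from the previous 'yes', which is why np.where is needed to reset the non-matching rows to 0.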
Another solution is to filter the boolean mask down to its True values, take their cumulative sum, and then add 0 for the remaining rows with reindex:
m = df['B']=='yes'
df['C'] = m[m].cumsum().reindex(df.index, fill_value=0)
print (df)
      A    B  C
0  Tony   no  0
1  Mike  yes  1
2   Jen   no  0
3  Anna  yes  2
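The reindex approach can be broken into its individual steps as follows. This is just a sketch on the question's sample data; only_yes and counts are hypothetical names introduced here for readability.

```python
import pandas as pd

df = pd.DataFrame({'A': ['Tony', 'Mike', 'Jen', 'Anna'],
                   'B': ['no', 'yes', 'no', 'yes']})

m = df['B'] == 'yes'

# Boolean indexing with the mask itself keeps only the True entries,
# so only the 'yes' rows (index labels 1 and 3) remain
only_yes = m[m]

# Cumulative sum over those rows numbers them 1, 2, ...
counts = only_yes.cumsum()

# reindex brings back the dropped index labels (0 and 2)
# and fills them with 0 instead of NaN
df['C'] = counts.reindex(df.index, fill_value=0)
print(df)
```

Passing fill_value=0 matters here: without it the missing rows would become NaN and the column would be converted to float.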
Performance (results on real data may differ, so it is best to test on your own data first):
np.random.seed(123)
N = 10000
L = ['yes','no']
df = pd.DataFrame({'B': np.random.choice(L, N)})
print (df)
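The timings that follow were taken with the IPython %%timeit magic. Outside IPython, a comparable measurement could be sketched with the standard library timeit module, as below. The helper functions with_where and with_reindex are hypothetical wrappers, they return the result instead of assigning it to df['C'], and the absolute numbers will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
df = pd.DataFrame({'B': np.random.choice(['yes', 'no'], N)})

def with_where():
    # np.where over the mask and its running count
    m = df['B'] == 'yes'
    return np.where(m, m.cumsum(), 0)

def with_reindex():
    # cumsum over the filtered mask, then reindex with 0 fill
    m = df['B'] == 'yes'
    return m[m].cumsum().reindex(df.index, fill_value=0)

runs = 100
t_where = timeit.timeit(with_where, number=runs) / runs
t_reindex = timeit.timeit(with_reindex, number=runs) / runs

print(f"np.where: {t_where * 1e3:.3f} ms per call")
print(f"reindex:  {t_reindex * 1e3:.3f} ms per call")
```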
In [150]: %%timeit
     ...: m = df['B']=='yes'
     ...: df['C'] = np.where(m, m.cumsum(), 0)
     ...: 
1.57 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [151]: %%timeit
     ...: m = df['B']=='yes'
     ...: df['C'] = m[m].cumsum().reindex(df.index, fill_value=0)
     ...: 
2.53 ms ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [152]: %%timeit
     ...: df['C'] = df.groupby('B').cumcount() + 1
     ...: df['C'].where(df['B'] == 'yes', 0, inplace=True)
4.49 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
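The third variant numbers the rows within each group of B and then zeroes out the 'no' rows. Calling where(..., inplace=True) on df['C'] is a chained in-place operation, which may not update df when pandas copy-on-write is enabled, so the sketch below assigns the result in a single expression instead. This is a rewritten form of the timed code, not the exact statement that was benchmarked.

```python
import pandas as pd

df = pd.DataFrame({'A': ['Tony', 'Mike', 'Jen', 'Anna'],
                   'B': ['no', 'yes', 'no', 'yes']})

# cumcount numbers rows within each group starting at 0,
# so adding 1 gives a 1-based position inside the 'yes' and 'no' groups
position = df.groupby('B').cumcount() + 1

# Keep the position only where B is 'yes'; use 0 for all other rows
df['C'] = position.where(df['B'] == 'yes', 0)
print(df)
```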
 
						