My problem is closely related to "Groupby count only when a certain value is present in one of the column in pandas".
Let’s say I have a dataframe which is sorted by not_unique_id and date_of_call.
import pandas as pd

rng = pd.date_range('2015-02-24', periods=8, freq='D')
df = pd.DataFrame(
    {
        "unique_id": ["K0", "K1", "K2", "K3", "K4", "K5", "K6", "K7"],
        "not_unique_id": ["A000", "A111", "A222", "A222", "A222", "A222", "A222", "A333"],
        "date_of_call": rng,
        "customer_reached": [1, 0, 0, 1, 1, 1, 1, 1],
    }
)
df.sort_values(['not_unique_id','date_of_call'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True) # reset index
df.head(10)
Now I want to add a new column that tells me how often the customer was reached successfully in the past. In other words: count how many times the customer was reached before each call and save the result in a new column. (For the provided example, the desired result is the new column shown in the output below.)
How would you do it?
Answer
If all datetimes are unique and already sorted, you can reverse the row order with iloc first, then apply a custom function per group that combines Series.shift with a cumulative sum via Series.cumsum; finally, replace the missing values and convert the column to integers:
df['new'] = (df.iloc[::-1]
               .groupby('not_unique_id')['customer_reached']
               .apply(lambda x: x.shift().cumsum())
               .fillna(0)
               .astype(int))
print (df)
  unique_id not_unique_id date_of_call  customer_reached  new
0        K7          A333   2015-03-03                 1    0
1        K6          A222   2015-03-02                 1    3
2        K5          A222   2015-03-01                 1    2
3        K4          A222   2015-02-28                 1    1
4        K3          A222   2015-02-27                 1    0
5        K2          A222   2015-02-26                 0    0
6        K1          A111   2015-02-25                 0    0
7        K0          A000   2015-02-24                 1    0
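To make the shift/cumsum logic easier to follow, here is a minimal sketch of the intermediate values for the A222 group alone, assuming the df built above (the names rev and grp are only for illustration):

rev = df.iloc[::-1]   # rows in chronological order
grp = rev.loc[rev['not_unique_id'].eq('A222'), 'customer_reached']

print(grp.tolist())                   # [0, 1, 1, 1, 1]  actual outcomes per call
print(grp.shift().tolist())           # [nan, 0.0, 1.0, 1.0, 1.0]  exclude the current call
print(grp.shift().cumsum().tolist())  # [nan, 0.0, 1.0, 2.0, 3.0]  successes strictly before each call

The shift is what keeps the current call out of its own count; fillna(0) then handles the first call of each group.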
Or, if changing the row order is acceptable, sort ascending first:
df.sort_values(['not_unique_id','date_of_call'], inplace=True)
df['new'] = (df.groupby('not_unique_id')['customer_reached']
               .apply(lambda x: x.shift().cumsum())
               .fillna(0)
               .astype(int))
print (df)
  unique_id not_unique_id date_of_call  customer_reached  new
0        K0          A000   2015-02-24                 1    0
1        K1          A111   2015-02-25                 0    0
2        K2          A222   2015-02-26                 0    0
3        K3          A222   2015-02-27                 1    0
4        K4          A222   2015-02-28                 1    1
5        K5          A222   2015-03-01                 1    2
6        K6          A222   2015-03-02                 1    3
7        K7          A333   2015-03-03                 1    0
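As a side note, the same per-group running count can be written without apply by chaining GroupBy.shift with a grouped cumulative sum. This is only a sketch of an equivalent variant, assuming the ascending-sorted df from the second example:

# shift within each group (fill_value=0 replaces the leading NaN and keeps
# integer dtype), then take the cumulative sum of the shifted values per group
shifted = df.groupby('not_unique_id')['customer_reached'].shift(fill_value=0)
df['new'] = shifted.groupby(df['not_unique_id']).cumsum()

This avoids the Python-level lambda, which can matter on large frames, but it relies on the rows already being in chronological order within each not_unique_id group.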