I have a dataframe (df
). It contains predicted
daily data from a model, up until the end of 2020. As each day passes in the year, actual
and id
data is added to the row. There are multiple names for each day
+------+-----+-----------+--------+------------+ | NAME | ID | PREDICTED | ACTUAL | YYYY_MM_DD | +------+-----+-----------+--------+------------+ | Nir | 215 | 100 | 400 | 2020-01-01 | | Nir | 215 | 200 | 400 | 2020-01-02 | | Nir | 215 | 100 | 400 | 2020-01-03 | | Nir | 215 | 200 | 400 | 2020-01-04 | | Nir | 215 | 100 | 400 | 2020-01-05 | | Nir | 215 | 200 | 400 | 2020-01-06 | | Nir | 215 | 100 | 400 | 2020-01-07 | | Nir | 215 | 200 | 400 | 2020-01-08 | | Nir | 215 | 100 | 400 | 2020-01-09 | | Nir | 215 | 200 | 400 | 2020-01-10 | | Nir | 215 | 100 | 400 | 2020-01-11 | | Nir | 215 | 200 | 400 | 2020-01-12 | | Nir | 215 | 100 | 400 | 2020-01-13 | | Nir | 215 | 200 | 400 | 2020-01-14 | | Nir | 215 | 100 | 400 | 2020-01-15 | | Nir | 215 | 200 | 400 | 2020-01-16 | | Nir | 215 | 100 | 400 | 2020-01-17 | | Nir | 215 | 200 | 400 | 2020-01-18 | | Nir | 215 | 100 | 400 | 2020-01-19 | | Nir | 215 | 200 | 400 | 2020-01-20 | | Nir | 215 | 100 | 400 | 2020-01-21 | | Nir | 215 | 200 | 400 | 2020-01-22 | | Nir | 215 | 100 | 400 | 2020-01-23 | | Nir | Nan | 100 | Nan | 2020-01-24 | | Nir | Nan | 100 | Nan | 2020-01-25 | | Nir | Nan | 100 | Nan | 2020-01-26 | | Nir | Nan | 100 | Nan | 2020-01-27 | | Nir | Nan | 100 | Nan | 2020-01-28 | | Nir | Nan | 100 | Nan | 2020-01-29 | | Nir | Nan | 100 | Nan | 2020-01-30 | | Nir | Nan | 100 | Nan | 2020-01-31 | | Xyc | 40 | 800 | 500 | 2020-01-01 | | Xyc | 40 | 100 | 500 | 2020-01-02 | | Xyc | 40 | 100 | 500 | 2020-01-03 | | Xyc | 40 | 100 | 500 | 2020-01-04 | | ... | ... | ... | ... | ... | | ... | ... | ... | ... | ... | +------+-----+-----------+--------+------------+
I want to add an additional column named payout
. The payout
should be 0 unless the sum of actual
, month to date has passed the sum of predicted
.
I.e., for Nir
, we can see the sum of predicted
is 4200
. So the payout
should be 0 until the sum of actual
passes 4200. Once that threshold is passed, then the payout
should be 1% of actual-predicted
. With the above data, the output would look like this:
+------+-----+-----------+--------+---------------+--------+------------+ | NAME | ID | PREDICTED | ACTUAL | MONTH_TO_DATE | PAYOUT | YYYY_MM_DD | +------+-----+-----------+--------+---------------+--------+------------+ | Nir | 215 | 100 | 400 | 400 | 0 | 2020-01-01 | | Nir | 215 | 200 | 400 | 800 | 0 | 2020-01-02 | | Nir | 215 | 100 | 400 | 1200 | 0 | 2020-01-03 | | Nir | 215 | 200 | 400 | 1600 | 0 | 2020-01-04 | | Nir | 215 | 100 | 400 | 2000 | 0 | 2020-01-05 | | Nir | 215 | 200 | 400 | 2400 | 0 | 2020-01-06 | | Nir | 215 | 100 | 400 | 2800 | 0 | 2020-01-07 | | Nir | 215 | 200 | 400 | 3200 | 0 | 2020-01-08 | | Nir | 215 | 100 | 400 | 3600 | 0 | 2020-01-09 | | Nir | 215 | 200 | 400 | 4000 | 0 | 2020-01-10 | | Nir | 215 | 100 | 400 | 4400 | 3 | 2020-01-11 | | Nir | 215 | 200 | 400 | ... | 2 | 2020-01-12 | | Nir | 215 | 100 | 400 | ... | 3 | 2020-01-13 | | Nir | 215 | 200 | 400 | ... | 2 | 2020-01-14 | | Nir | 215 | 100 | 400 | ... | 3 | 2020-01-15 | | Nir | 215 | 200 | 400 | ... | 2 | 2020-01-16 | | Nir | 215 | 100 | 400 | ... | 3 | 2020-01-17 | | Nir | 215 | 200 | 400 | ... | 2 | 2020-01-18 | | Nir | 215 | 100 | 400 | ... | 3 | 2020-01-19 | | Nir | 215 | 200 | 400 | ... | 2 | 2020-01-20 | | Nir | 215 | 100 | 400 | ... | 3 | 2020-01-21 | | Nir | 215 | 200 | 400 | ... | 2 | 2020-01-22 | | Nir | 215 | 100 | 400 | ... | 3 | 2020-01-23 | | Nir | Nan | 100 | Nan | | | 2020-01-24 | | Nir | Nan | 100 | Nan | | | 2020-01-25 | | Nir | Nan | 100 | Nan | | | 2020-01-26 | | Nir | Nan | 100 | Nan | | | 2020-01-27 | | Nir | Nan | 100 | Nan | | | 2020-01-28 | | Nir | Nan | 100 | Nan | | | 2020-01-29 | | Nir | Nan | 100 | Nan | | | 2020-01-30 | | Nir | Nan | 100 | Nan | | | 2020-01-31 | | Xyc | 40 | 800 | 500 | 500 | 0 | 2020-01-01 | | Xyc | 40 | 100 | 500 | 1000 | 0 | 2020-01-02 | | Xyc | 40 | 100 | 500 | 1500 | 4 | 2020-01-03 | | Xyc | 40 | 100 | 500 | 2000 | 4 | 2020-01-04 | | ... | ... | ... | ... | | | ... | | ... | ... | ... | ... | | | ... | +------+-----+-----------+--------+---------------+--------+------------+
In the above output, Xyc
has a total predicted 2000, so payout
should be 0 until the sum of actual
passes 2000 also. In the real dataframe, there is daily data for ~70 name
s, so I feel like a grouping may be needed.
I’ve tried:
new_sum = [df.actual.values[0]] for i in range(1, len(df.index)): new_sum.append(new_sum[i-1]+df.actual.values[i]) df['actual_sum'] = new_sum
However, that simply gave me a running total of actual
. I also tried this:
df['inc'] = df['actual'] - df['predicted'] df['payout'] = np.where(df['inc']>=1, (df['inc'] / 100) * 1, 0)
But the above doesn’t make sure the month to date >= total for the month before attributing the 1%.
Advertisement
Answer
First You need to remove NaN rows from data.
Here You go:
import pandas as pd import numpy as np df = pd.DataFrame({'Name':['Nir','Nir','Nir','Nir','Xyc','Xyc','Xyc'],'PREDICTED':[100,200,100,200,100,200,300], 'ACTUAL':[400,400,400,400,500,500,500], 'YYYY_MM_DD':['2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-01','2020-01-02','2020-01-03']}) def calculate(item): # select name data = df[df['Name'] == item] # calculate sum sum = data['PREDICTED'].sum() # remove NaN rows data = data.dropna() # calculate and insert month to date column values month_to_date = [] value = 0 for index, row in data.iterrows(): value += row['ACTUAL'] month_to_date.append(value) data.insert(3, "MONTH_TO_DATE", month_to_date, True) # calculate and instert payout values conditions = [ (data['MONTH_TO_DATE'] < sum), (data['MONTH_TO_DATE'] >= sum) ] choices = [0, ((data['ACTUAL'] - data['PREDICTED'])/100).astype(int)] data.insert(5, "PAYOUT", np.select(conditions, choices), True) return data # collect results results = pd.DataFrame(columns=['Name','PREDICTED','ACTUAL','MONTH_TO_DATE','YYYY_MM_DD','PAYOUT']) for item in df['Name'].unique(): df2 = calculate(item) results = results.append(df2)
Result:
Name PREDICTED ACTUAL MONTH_TO_DATE YYYY_MM_DD PAYOUT 0 Nir 100 400 400 2020-01-01 0 1 Nir 200 400 800 2020-01-02 2 2 Nir 100 400 1200 2020-01-03 3 3 Nir 200 400 1600 2020-01-04 2 4 Xyc 100 500 500 2020-01-01 0 5 Xyc 200 500 1000 2020-01-02 3 6 Xyc 300 500 1500 2020-01-03 2