I have a dataframe (df). It contains predicted daily data from a model, up until the end of 2020. As each day passes in the year, actual and id data is added to the row. There are multiple names for each day
+------+-----+-----------+--------+------------+ | NAME | ID | PREDICTED | ACTUAL | YYYY_MM_DD | +------+-----+-----------+--------+------------+ | Nir | 215 | 100 | 400 | 2020-01-01 | | Nir | 215 | 200 | 400 | 2020-01-02 | | Nir | 215 | 100 | 400 | 2020-01-03 | | Nir | 215 | 200 | 400 | 2020-01-04 | | Nir | 215 | 100 | 400 | 2020-01-05 | | Nir | 215 | 200 | 400 | 2020-01-06 | | Nir | 215 | 100 | 400 | 2020-01-07 | | Nir | 215 | 200 | 400 | 2020-01-08 | | Nir | 215 | 100 | 400 | 2020-01-09 | | Nir | 215 | 200 | 400 | 2020-01-10 | | Nir | 215 | 100 | 400 | 2020-01-11 | | Nir | 215 | 200 | 400 | 2020-01-12 | | Nir | 215 | 100 | 400 | 2020-01-13 | | Nir | 215 | 200 | 400 | 2020-01-14 | | Nir | 215 | 100 | 400 | 2020-01-15 | | Nir | 215 | 200 | 400 | 2020-01-16 | | Nir | 215 | 100 | 400 | 2020-01-17 | | Nir | 215 | 200 | 400 | 2020-01-18 | | Nir | 215 | 100 | 400 | 2020-01-19 | | Nir | 215 | 200 | 400 | 2020-01-20 | | Nir | 215 | 100 | 400 | 2020-01-21 | | Nir | 215 | 200 | 400 | 2020-01-22 | | Nir | 215 | 100 | 400 | 2020-01-23 | | Nir | Nan | 100 | Nan | 2020-01-24 | | Nir | Nan | 100 | Nan | 2020-01-25 | | Nir | Nan | 100 | Nan | 2020-01-26 | | Nir | Nan | 100 | Nan | 2020-01-27 | | Nir | Nan | 100 | Nan | 2020-01-28 | | Nir | Nan | 100 | Nan | 2020-01-29 | | Nir | Nan | 100 | Nan | 2020-01-30 | | Nir | Nan | 100 | Nan | 2020-01-31 | | Xyc | 40 | 800 | 500 | 2020-01-01 | | Xyc | 40 | 100 | 500 | 2020-01-02 | | Xyc | 40 | 100 | 500 | 2020-01-03 | | Xyc | 40 | 100 | 500 | 2020-01-04 | | ... | ... | ... | ... | ... | | ... | ... | ... | ... | ... | +------+-----+-----------+--------+------------+
I want to add an additional column named payout. The payout should be 0 unless the sum of actual, month to date has passed the sum of predicted.
I.e., for Nir, we can see the sum of predicted is 4200. So the payout should be 0 until the sum of actual passes 4200. Once that threshold is passed, then the payout should be 1% of actual-predicted. With the above data, the output would look like this:
+------+-----+-----------+--------+---------------+--------+------------+ | NAME | ID | PREDICTED | ACTUAL | MONTH_TO_DATE | PAYOUT | YYYY_MM_DD | +------+-----+-----------+--------+---------------+--------+------------+ | Nir | 215 | 100 | 400 | 400 | 0 | 2020-01-01 | | Nir | 215 | 200 | 400 | 800 | 0 | 2020-01-02 | | Nir | 215 | 100 | 400 | 1200 | 0 | 2020-01-03 | | Nir | 215 | 200 | 400 | 1600 | 0 | 2020-01-04 | | Nir | 215 | 100 | 400 | 2000 | 0 | 2020-01-05 | | Nir | 215 | 200 | 400 | 2400 | 0 | 2020-01-06 | | Nir | 215 | 100 | 400 | 2800 | 0 | 2020-01-07 | | Nir | 215 | 200 | 400 | 3200 | 0 | 2020-01-08 | | Nir | 215 | 100 | 400 | 3600 | 0 | 2020-01-09 | | Nir | 215 | 200 | 400 | 4000 | 0 | 2020-01-10 | | Nir | 215 | 100 | 400 | 4400 | 3 | 2020-01-11 | | Nir | 215 | 200 | 400 | ... | 2 | 2020-01-12 | | Nir | 215 | 100 | 400 | ... | 3 | 2020-01-13 | | Nir | 215 | 200 | 400 | ... | 2 | 2020-01-14 | | Nir | 215 | 100 | 400 | ... | 3 | 2020-01-15 | | Nir | 215 | 200 | 400 | ... | 2 | 2020-01-16 | | Nir | 215 | 100 | 400 | ... | 3 | 2020-01-17 | | Nir | 215 | 200 | 400 | ... | 2 | 2020-01-18 | | Nir | 215 | 100 | 400 | ... | 3 | 2020-01-19 | | Nir | 215 | 200 | 400 | ... | 2 | 2020-01-20 | | Nir | 215 | 100 | 400 | ... | 3 | 2020-01-21 | | Nir | 215 | 200 | 400 | ... | 2 | 2020-01-22 | | Nir | 215 | 100 | 400 | ... | 3 | 2020-01-23 | | Nir | Nan | 100 | Nan | | | 2020-01-24 | | Nir | Nan | 100 | Nan | | | 2020-01-25 | | Nir | Nan | 100 | Nan | | | 2020-01-26 | | Nir | Nan | 100 | Nan | | | 2020-01-27 | | Nir | Nan | 100 | Nan | | | 2020-01-28 | | Nir | Nan | 100 | Nan | | | 2020-01-29 | | Nir | Nan | 100 | Nan | | | 2020-01-30 | | Nir | Nan | 100 | Nan | | | 2020-01-31 | | Xyc | 40 | 800 | 500 | 500 | 0 | 2020-01-01 | | Xyc | 40 | 100 | 500 | 1000 | 0 | 2020-01-02 | | Xyc | 40 | 100 | 500 | 1500 | 4 | 2020-01-03 | | Xyc | 40 | 100 | 500 | 2000 | 4 | 2020-01-04 | | ... | ... | ... | ... | | | ... | | ... | ... | ... | ... | | | ... | +------+-----+-----------+--------+---------------+--------+------------+
In the above output, Xyc has a total predicted 2000, so payout should be 0 until the sum of actual passes 2000 also. In the real dataframe, there is daily data for ~70 names, so I feel like a grouping may be needed.
I’ve tried:
new_sum = [df.actual.values[0]] for i in range(1, len(df.index)):
new_sum.append(new_sum[i-1]+df.actual.values[i])
df['actual_sum'] = new_sum
However, that simply gave me a running total of actual. I also tried this:
df['inc'] = df['actual'] - df['predicted'] df['payout'] = np.where(df['inc']>=1, (df['inc'] / 100) * 1, 0)
But the above doesn’t make sure the month to date >= total for the month before attributing the 1%.
Advertisement
Answer
First You need to remove NaN rows from data.
Here You go:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name':['Nir','Nir','Nir','Nir','Xyc','Xyc','Xyc'],'PREDICTED':[100,200,100,200,100,200,300],
'ACTUAL':[400,400,400,400,500,500,500],
'YYYY_MM_DD':['2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-01','2020-01-02','2020-01-03']})
def calculate(item):
# select name
data = df[df['Name'] == item]
# calculate sum
sum = data['PREDICTED'].sum()
# remove NaN rows
data = data.dropna()
# calculate and insert month to date column values
month_to_date = []
value = 0
for index, row in data.iterrows():
value += row['ACTUAL']
month_to_date.append(value)
data.insert(3, "MONTH_TO_DATE", month_to_date, True)
# calculate and instert payout values
conditions = [
(data['MONTH_TO_DATE'] < sum),
(data['MONTH_TO_DATE'] >= sum)
]
choices = [0, ((data['ACTUAL'] - data['PREDICTED'])/100).astype(int)]
data.insert(5, "PAYOUT", np.select(conditions, choices), True)
return data
# collect results
results = pd.DataFrame(columns=['Name','PREDICTED','ACTUAL','MONTH_TO_DATE','YYYY_MM_DD','PAYOUT'])
for item in df['Name'].unique():
df2 = calculate(item)
results = results.append(df2)
Result:
Name PREDICTED ACTUAL MONTH_TO_DATE YYYY_MM_DD PAYOUT 0 Nir 100 400 400 2020-01-01 0 1 Nir 200 400 800 2020-01-02 2 2 Nir 100 400 1200 2020-01-03 3 3 Nir 200 400 1600 2020-01-04 2 4 Xyc 100 500 500 2020-01-01 0 5 Xyc 200 500 1000 2020-01-02 3 6 Xyc 300 500 1500 2020-01-03 2