I have a similar question to one I posed here, but subtly different as it includes an extra step to the process involving a probability:
Using a Python pandas dataframe column as input to a loop through another column
I’ve got two pandas dataframes: one has these variables
Year Count Probability
1 8 25%
2 26 19%
3 17 26%
4 9 10%
Another is a table with these variables:
ID Value
1 100
2 25
3 50
4 15
5 75
Essentially I need to use the Count x in the first dataframe to loop through the 2nd dataframe x times, but only pull a value from the 2nd dataframe y percent of the times (using random number generation) – and then create a new column in the first dataframe that represents the sum of the values in the loop.
So – just to demonstrate – in that first column, we’d loop through the 2nd table 8 times, but only pull a random value from that table 25% of the time – so we might get output of:
0 100 0 0 25 0 0 0
…which sums to 125 – so we our added column to the first table looks like
Year Count Probability Sum
1 8 25% 125
….and so on. Thanks in advance.
Advertisement
Answer
We’ll use numpy binomial and pandas sample to get this done.
import pandas as pd
import numpy as np
# Set up dataframes
vals = pd.DataFrame([[1,8,'25%'], [2,26,'19%'], [3,17,'26%'],[4,9,'10%']])
vals.columns = ['Year', 'Count', 'Probability']
temp = pd.DataFrame([[1,100], [2,25], [3,50], [4,15], [5,75]])
temp.columns = ['ID', 'Value']
# Get probability fraction from string
vals['Numeric_Probability'] = pd.to_numeric(vals['Probability'].str.replace('%', '')) / 100
# Total rows is binomial random variable with n=Count, p=Probability.
vals['Total_Rows'] = np.random.binomial(n=vals['Count'], p=vals['Numeric_Probability'])
# Sample "total rows" from other DataFrame and sum.
vals['Sum'] = vals['Total_Rows'].apply(lambda x: temp['Value'].sample(
n=x, replace=True).sum())
# Drop intermediate rows
vals.drop(columns=['Numeric_Probability', 'Total_Rows'], inplace=True)
print(vals)
Year Count Probability Sum
0 1 8 25% 15
1 2 26 19% 350
2 3 17 26% 190
3 4 9 10% 0