I have a similar question to one I posed here, but subtly different as it includes an extra step to the process involving a probability:
Using a Python pandas dataframe column as input to a loop through another column
I’ve got two pandas dataframes: one has these variables
Year Count Probability 1 8 25% 2 26 19% 3 17 26% 4 9 10%
Another is a table with these variables:
ID Value 1 100 2 25 3 50 4 15 5 75
Essentially I need to use the Count x in the first dataframe to loop through the 2nd dataframe x times, but only pull a value from the 2nd dataframe y percent of the times (using random number generation) – and then create a new column in the first dataframe that represents the sum of the values in the loop.
So – just to demonstrate – in that first column, we’d loop through the 2nd table 8 times, but only pull a random value from that table 25% of the time – so we might get output of:
0 100 0 0 25 0 0 0
…which sums to 125 – so we our added column to the first table looks like
Year Count Probability Sum 1 8 25% 125
….and so on. Thanks in advance.
Advertisement
Answer
We’ll use numpy binomial and pandas sample to get this done.
import pandas as pd import numpy as np # Set up dataframes vals = pd.DataFrame([[1,8,'25%'], [2,26,'19%'], [3,17,'26%'],[4,9,'10%']]) vals.columns = ['Year', 'Count', 'Probability'] temp = pd.DataFrame([[1,100], [2,25], [3,50], [4,15], [5,75]]) temp.columns = ['ID', 'Value'] # Get probability fraction from string vals['Numeric_Probability'] = pd.to_numeric(vals['Probability'].str.replace('%', '')) / 100 # Total rows is binomial random variable with n=Count, p=Probability. vals['Total_Rows'] = np.random.binomial(n=vals['Count'], p=vals['Numeric_Probability']) # Sample "total rows" from other DataFrame and sum. vals['Sum'] = vals['Total_Rows'].apply(lambda x: temp['Value'].sample( n=x, replace=True).sum()) # Drop intermediate rows vals.drop(columns=['Numeric_Probability', 'Total_Rows'], inplace=True) print(vals) Year Count Probability Sum 0 1 8 25% 15 1 2 26 19% 350 2 3 17 26% 190 3 4 9 10% 0