Skip to content
Advertisement

Looping through a second column using a probability input

I have a similar question to one I posed here, but subtly different as it includes an extra step to the process involving a probability:

Using a Python pandas dataframe column as input to a loop through another column

I’ve got two pandas dataframes: one has these variables

Year  Count   Probability
 1      8         25%
 2      26        19%
 3      17        26%
 4      9         10%

Another is a table with these variables:

ID   Value
1     100
2      25
3      50 
4      15
5      75

Essentially I need to use the Count x in the first dataframe to loop through the 2nd dataframe x times, but only pull a value from the 2nd dataframe y percent of the times (using random number generation) – and then create a new column in the first dataframe that represents the sum of the values in the loop.

So – just to demonstrate – in that first column, we’d loop through the 2nd table 8 times, but only pull a random value from that table 25% of the time – so we might get output of:

0 100 0 0 25 0 0 0

…which sums to 125 – so we our added column to the first table looks like

Year Count  Probability  Sum
 1     8        25%      125

….and so on. Thanks in advance.

Advertisement

Answer

We’ll use numpy binomial and pandas sample to get this done.

import pandas as pd
import numpy as np

# Set up dataframes
vals = pd.DataFrame([[1,8,'25%'], [2,26,'19%'], [3,17,'26%'],[4,9,'10%']])
vals.columns = ['Year', 'Count', 'Probability']
temp = pd.DataFrame([[1,100], [2,25], [3,50], [4,15], [5,75]])
temp.columns = ['ID', 'Value']

# Get probability fraction from string
vals['Numeric_Probability'] = pd.to_numeric(vals['Probability'].str.replace('%', '')) / 100

# Total rows is binomial random variable with n=Count, p=Probability.
vals['Total_Rows'] = np.random.binomial(n=vals['Count'], p=vals['Numeric_Probability'])

# Sample "total rows" from other DataFrame and sum.
vals['Sum'] = vals['Total_Rows'].apply(lambda x: temp['Value'].sample(
    n=x, replace=True).sum())

# Drop intermediate rows
vals.drop(columns=['Numeric_Probability', 'Total_Rows'], inplace=True)

print(vals)

   Year  Count Probability  Sum
0     1      8         25%   15
1     2     26         19%  350
2     3     17         26%  190
3     4      9         10%    0
User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement