Suppose I have the following dataframe:
Type   Name
S2019  John
S2019  Stephane
S2019  Mike
S2019  Hamid
S2021  Rahim
S2021  Ahamed
I want to group the dataset by “Type”, add a new column named “Sampled”, and randomly assign yes/no to each row, with yes and no distributed equally. The expected dataframe could be:
Type   Name      Sampled
S2019  John      no
S2019  Stephane  yes
S2019  Mike      yes
S2019  Hamid     no
S2021  Rahim     yes
S2021  Ahamed    no
Answer
You can use numpy.random.choice:
import numpy as np
df['Sampled'] = np.random.choice(['yes', 'no'], size=len(df))
output:
    Type      Name Sampled
0  S2019      John      no
1  S2019  Stephane      no
2  S2019      Mike     yes
3  S2019     Hamid      no
4  S2021     Rahim      no
5  S2021    Ahamed     yes
equal probability per group:
df['Sampled'] = (df.groupby('Type')['Type']
                   .transform(lambda g: np.random.choice(['yes', 'no'], size=len(g)))
                )
For each group, take an arbitrary column (here Type; which one doesn’t matter, it only serves to provide a one-dimensional Series with the group’s length) and apply np.random.choice with the length of the group as the size parameter. This gives as many yes or no values as there are rows in the group, each drawn with equal probability (note that you can define a specific probability per item with the p parameter if you want).
NB. Equal probability does not necessarily mean you will get a 50/50 split of yes/no; if that is what you want, please clarify.
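As a rough illustration of the per-item probability mentioned above, the sketch below passes a p argument to the draw. The 0.7/0.3 weights and the seed are assumptions chosen only for the example, and the sample dataframe is rebuilt from the question so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Type": ["S2019"] * 4 + ["S2021"] * 2,
    "Name": ["John", "Stephane", "Mike", "Hamid", "Rahim", "Ahamed"],
})

# Seeded only so the sketch is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(0)

# Each row is drawn independently: 'yes' with probability 0.7, 'no' with 0.3
# (illustrative weights, not from the original answer).
df["Sampled"] = (
    df.groupby("Type")["Type"]
      .transform(lambda g: rng.choice(["yes", "no"], size=len(g), p=[0.7, 0.3]))
)

# Tally the labels per group; nothing here forces a fixed split.
counts = pd.crosstab(df["Type"], df["Sampled"])
print(counts)
```

Because every row is an independent draw, the per-group tallies can differ from the weights, especially in groups this small.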
half yes/no per group
If you want half of each kind (yes/no) (±1 for odd group sizes), you can randomly select half of the indices.
idx = df.groupby('Type', group_keys=False).apply(lambda g: g.sample(n=len(g)//2)).index
df['Sampled'] = np.where(df.index.isin(idx), 'yes', 'no')
NB. For an odd group size, there will be one more of the second item passed to np.where, here “no”.
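To check the odd-size behaviour described above, here is a small self-contained sketch. The third group (S2022, with names P1 to P3) is hypothetical and added only to get an odd group size, and the seeded RandomState is an assumption to make the run repeatable. It samples from a single column (Name) rather than the whole frame, which yields the same index.

```python
import numpy as np
import pandas as pd

# Data from the question plus a hypothetical 3-row group (S2022) to show the odd case.
df = pd.DataFrame({
    "Type": ["S2019"] * 4 + ["S2021"] * 2 + ["S2022"] * 3,
    "Name": ["John", "Stephane", "Mike", "Hamid", "Rahim", "Ahamed", "P1", "P2", "P3"],
})

# One shared random state, so groups of equal size do not all pick the same positions.
rs = np.random.RandomState(0)

# Pick floor(n/2) row labels at random within each group.
idx = (
    df.groupby("Type", group_keys=False)["Name"]
      .apply(lambda g: g.sample(n=len(g) // 2, random_state=rs))
      .index
)

# Selected rows get 'yes'; everything else (including the extra row of odd groups) gets 'no'.
df["Sampled"] = np.where(df.index.isin(idx), "yes", "no")

# Number of 'yes' rows per group: always len(group) // 2.
yes_per_group = df["Sampled"].eq("yes").groupby(df["Type"]).sum()
print(yes_per_group)
```

Which rows are selected depends on the seed, but the number of yes rows in each group is fixed at the group size divided by two, rounded down.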
distribute several elements equally:
This distributes the elements equally, up to multiplicity. That is, for 3 elements and 4 places, there will be two a, one b, and one c, in random order. If you want the extra item(s) to be chosen randomly, shuffle the input first.
elem = ['a', 'b', 'c']
df['Sampled'] = (df
   .groupby('Type', group_keys=False)['Type']
   .transform(lambda g: np.random.choice(np.tile(elem, int(np.ceil(len(g)/len(elem))))[:len(g)],
                                         size=len(g), replace=False))
)
output:
    Type      Name Sampled
0  S2019      John       a
1  S2019  Stephane       a
2  S2019      Mike       b
3  S2019     Hamid       c
4  S2021     Rahim       a
5  S2021    Ahamed       b
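The suggestion to shuffle the input first could look like the sketch below. The helper name balanced_labels and the seeded generator are assumptions made for illustration. The idea is to permute elem once per group before tiling, so the surplus labels are not always the leading entries of the list.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Type": ["S2019"] * 4 + ["S2021"] * 2,
    "Name": ["John", "Stephane", "Mike", "Hamid", "Rahim", "Ahamed"],
})

elem = ["a", "b", "c"]
rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

def balanced_labels(g):
    """Hypothetical helper: return len(g) labels, as evenly spread over elem as possible."""
    # Shuffle the labels first, so which ones get the extra copy is random.
    shuffled = rng.permutation(elem)
    # Repeat the shuffled labels often enough to cover the group, then trim.
    reps = int(np.ceil(len(g) / len(elem)))
    pool = np.tile(shuffled, reps)[: len(g)]
    # Permute the pool so the labels land on rows in random order.
    return rng.permutation(pool)

df["Sampled"] = df.groupby("Type")["Type"].transform(balanced_labels)

# Per-group tallies; labels that never occur in a group are shown as 0.
counts = pd.crosstab(df["Type"], df["Sampled"]).reindex(columns=elem, fill_value=0)
print(counts)
```

Within each group the label counts differ by at most one. In a group of 4 one label appears twice, and in a group of 2 one label is left out, with the affected label now chosen at random.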