How to randomly add elements to a dataframe column (equally distributed within groups)

Suppose I have the following dataframe:

Type    Name
S2019   John
S2019   Stephane
S2019   Mike
S2019   Hamid
S2021   Rahim
S2021   Ahamed

I want to group the dataset by “Type”, add a new column named “Sampled”, and randomly assign yes/no to each row so that yes and no are equally distributed within each group. The expected dataframe could be:

Type    Name    Sampled
S2019   John    no
S2019   Stephane    yes
S2019   Mike    yes
S2019   Hamid   no
S2021   Rahim   yes
S2021   Ahamed  no

Answer

You can use numpy.random.choice:

import numpy as np
df['Sampled'] = np.random.choice(['yes', 'no'], size=len(df))

output:

    Type      Name Sampled
0  S2019      John      no
1  S2019  Stephane      no
2  S2019      Mike     yes
3  S2019     Hamid      no
4  S2021     Rahim      no
5  S2021    Ahamed     yes

equal probability per group:

df['Sampled'] = (df.groupby('Type')['Type']
                   .transform(lambda g: np.random.choice(['yes', 'no'],
                                                         size=len(g)))
                )

For each group, select an arbitrary column (here Type, but any column works; it only serves to give transform a one-dimensional Series) and apply np.random.choice with the length of the group as size. This yields as many yes/no values as there are rows in the group, each drawn with equal probability (note that you can set a specific probability per item with the p parameter of np.random.choice if you want).

NB. Equal probability does not mean you will necessarily get a 50/50 split of yes/no. If that is what you want, please clarify.
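
As a minimal sketch of the per-item probability mentioned above, np.random.choice accepts a p argument holding one weight per element. The 70/30 weights and the seed below are arbitrary values chosen for illustration; they do not come from the question.

```python
import numpy as np
import pandas as pd

# sample frame from the question
df = pd.DataFrame({
    'Type': ['S2019'] * 4 + ['S2021'] * 2,
    'Name': ['John', 'Stephane', 'Mike', 'Hamid', 'Rahim', 'Ahamed'],
})

np.random.seed(0)  # arbitrary seed, only to make the draw repeatable

# 'yes' is drawn with probability 0.7 and 'no' with probability 0.3,
# independently for every row of each group (weights are an assumption)
df['Sampled'] = (df.groupby('Type')['Type']
                   .transform(lambda g: np.random.choice(['yes', 'no'],
                                                         size=len(g),
                                                         p=[0.7, 0.3]))
                )

# count the labels per group; the realised split is random
print(df.groupby('Type')['Sampled'].value_counts())
```

With groups this small, the counts can differ noticeably from the weights, which is the same caveat as in the NB above.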

half yes/no per group

If you want half of each kind (yes/no) per group (±1 for odd group sizes), you can randomly select half of the indices in each group.

idx = df.groupby('Type', group_keys=False).apply(lambda g: g.sample(n=len(g)//2)).index

df['Sampled'] = np.where(df.index.isin(idx), 'yes', 'no')

NB. For a group with an odd number of rows, there will be one more of the second value passed to np.where, here “no”.
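
A variant that gives the same half/half split without apply: build the labels for each group explicitly and shuffle them. This is only a sketch, not the method from the answer above; half_labels is a hypothetical helper and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

# sample frame from the question
df = pd.DataFrame({
    'Type': ['S2019'] * 4 + ['S2021'] * 2,
    'Name': ['John', 'Stephane', 'Mike', 'Hamid', 'Rahim', 'Ahamed'],
})

rng = np.random.default_rng(42)  # arbitrary seed for repeatability

def half_labels(g):
    # hypothetical helper: len(g)//2 times 'yes', the rest 'no',
    # returned in random order
    n_yes = len(g) // 2
    labels = np.array(['yes'] * n_yes + ['no'] * (len(g) - n_yes))
    return rng.permutation(labels)

df['Sampled'] = df.groupby('Type')['Type'].transform(half_labels)

# number of yes/no per group
counts = df.groupby(['Type', 'Sampled']).size()
print(counts)
```

As with the np.where version, a group with an odd number of rows gets one extra “no”, because the number of “yes” labels is rounded down.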

distribute many elements equally:

This distributes the elements as equally as the group size allows. For example, with 3 elements and 4 rows, there will be two a, one b and one c, in random order. The extra item(s) are always taken from the start of the list; if you want them to be chosen randomly, shuffle the input first.

elem = ['a', 'b', 'c']
df['Sampled'] = (df.groupby('Type', group_keys=False)['Type']
                   .transform(lambda g: np.random.choice(
                       np.tile(elem, int(np.ceil(len(g)/len(elem))))[:len(g)],
                       size=len(g), replace=False))
                )

output:

    Type      Name Sampled
0  S2019      John       a
1  S2019  Stephane       a
2  S2019      Mike       b
3  S2019     Hamid       c
4  S2021     Rahim       a
5  S2021    Ahamed       b
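
To have the surplus item(s) chosen randomly, as suggested above, one option is to shuffle a copy of elem inside each group before tiling. The sketch below assumes this per-group shuffle is what is wanted; spread is a hypothetical helper and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

# sample frame from the question
df = pd.DataFrame({
    'Type': ['S2019'] * 4 + ['S2021'] * 2,
    'Name': ['John', 'Stephane', 'Mike', 'Hamid', 'Rahim', 'Ahamed'],
})

elem = ['a', 'b', 'c']
rng = np.random.default_rng(0)  # arbitrary seed for repeatability

def spread(g):
    # hypothetical helper: shuffle a copy of elem first, so the item(s)
    # that appear one extra time are picked at random for each group
    shuffled = rng.permutation(elem)
    repeats = int(np.ceil(len(g) / len(elem)))
    pool = np.tile(shuffled, repeats)[:len(g)]
    # shuffle again so the position of each item within the group is random
    return rng.permutation(pool)

df['Sampled'] = df.groupby('Type')['Type'].transform(spread)
print(df)
```

Within each group the element counts still differ by at most one; only the choice of which element gets the extra slot changes.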
User contributions licensed under: CC BY-SA