How to randomly add elements to a dataframe column (equally distributed within groups)

Question

Suppose I have the following dataframe:

I want to group the dataset by "Type", then add a new column named "Sampled" and randomly assign yes/no to each row. The yes/no values should be distributed equally. The expected dataframe can be:

Accepted Answer

You can use numpy.random.choice:

    import numpy as np
    df['Sampled'] = np.random.choice(['yes', 'no'], size=len(df))

output:

        Type      Name Sampled
    0  S2019      John      no
    1  S2019  Stephane      no
    2  S2019      Mike     yes
    3  S2019     Hamid      no
    4  S2021     Rahim      no
    5  S2021    Ahamed     yes

equal probability per group:

    df['Sampled'] = (df.groupby('Type')['Type']
                       .transform(lambda g: np.random.choice(['yes', 'no'],
                                                             size=len(g)))
                    )

For each group, get an arbitrary column (here Type, but it doesn't matter, this is just to work on a single column), and apply np.random.choice with the length of the group as parameter. This gives as many yes or no as the number of items in the group, each with equal probability (note that you can define a specific probability per item if you want).

NB. equal probability does not mean you will necessarily get 50/50 of yes/no, if this is what you want please clarify

half yes/no per group

If you want half of each kind (yes/no) (±1 in case of odd size), you can randomly select half of the indices.

    idx = df.groupby('Type', group_keys=False).apply(lambda g: g.sample(n=len(g)//2)).index
    df['Sampled'] = np.where(df.index.isin(idx), 'yes', 'no')

NB. in case of odd number, there will be one more of the second item defined in the np.where function, here "no".

distribute equally many elements:

This will distribute equally, in the limit of multiplicity. This means, for 3 elements and 4 places, there will be two a, one b, one c in random order.
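To make the "limit of multiplicity" concrete, here is a minimal sketch of the pool that a tile-and-truncate step builds for 3 elements and 4 places. The names n_places, repeats and pool are illustrative only and do not come from the answer.

```python
import numpy as np

elem = ['a', 'b', 'c']
n_places = 4  # size of a hypothetical group (assumption for illustration)

# Repeat the elements enough times to cover the group, then truncate.
repeats = int(np.ceil(n_places / len(elem)))
pool = np.tile(elem, repeats)[:n_places]
print(pool.tolist())  # ['a', 'b', 'c', 'a']

# Drawing every item without replacement is a random permutation of the pool,
# so the counts stay fixed and only the order changes.
shuffled = np.random.choice(pool, size=n_places, replace=False)
print(sorted(shuffled.tolist()))  # ['a', 'a', 'b', 'c']
```

Because the pool is truncated from the front, the first element(s) of elem always receive the extra slot unless elem is shuffled beforehand.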
If you want the extra item(s) to be chosen randomly, first shuffle the input.

    elem = ['a', 'b', 'c']
    df['Sampled'] = (df.groupby('Type', group_keys=False)['Type']
                       .transform(lambda g: np.random.choice(np.tile(elem, int(np.ceil(len(g)/len(elem))))[:len(g)],
                                                             size=len(g), replace=False))
                    )

output:

        Type      Name Sampled
    0  S2019      John       a
    1  S2019  Stephane       a
    2  S2019      Mike       b
    3  S2019     Hamid       c
    4  S2021     Rahim       a
    5  S2021    Ahamed       b
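The answer mentions shuffling the input first but does not show it. The following is one possible sketch, not the answerer's own code. The sample frame is reconstructed from the output tables above, and the helper name balanced_labels is hypothetical.

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the output shown in the answer (assumption).
df = pd.DataFrame({
    'Type': ['S2019'] * 4 + ['S2021'] * 2,
    'Name': ['John', 'Stephane', 'Mike', 'Hamid', 'Rahim', 'Ahamed'],
})

elem = ['a', 'b', 'c']

def balanced_labels(g):
    """Hypothetical helper: return len(g) labels, as evenly split as possible."""
    # Shuffle first so the element(s) that get the extra slot vary per group.
    shuffled = np.random.permutation(elem)
    repeats = int(np.ceil(len(g) / len(shuffled)))
    pool = np.tile(shuffled, repeats)[:len(g)]
    # Permute the pool so positions within the group are random too.
    return np.random.permutation(pool)

df['Sampled'] = df.groupby('Type')['Type'].transform(balanced_labels)

# Count labels per group; any two labels should differ by at most one.
counts = pd.crosstab(df['Type'], df['Sampled'])
print(counts)
```

The result is random, so the printed counts vary between runs. Within each group, though, no label should appear more than once more often than any other.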

