
Add Categorical Column with Specific Count

I’m trying to create a new categorical column of countries with specific percentage values. Take the following dataset, for instance:

import numpy as np
import seaborn as sns

df = sns.load_dataset("titanic")

I’m trying the following script to get the new column:

country = ['UK', 'Ireland', 'France']

df["country"] = np.random.choice(country, len(df))

df["country"].value_counts(normalize=True)

UK         0.344557
Ireland    0.328844
France     0.326599

However, all the countries come out with roughly equal proportions. I want a specific proportion for each country:

Desired Output

df["country"].value_counts(normalize=True)

UK         0.91
Ireland    0.06
France     0.03

What would be the ideal way of getting the desired output? Any suggestions would be appreciated. Thanks!


Answer

Do you want to change the probabilities used by numpy.random.choice? You can pass them with the p parameter:

df["country"] = np.random.choice(country, len(df), p=[0.91, 0.06, 0.03])
df["country"].value_counts(normalize=True)

Output:

UK         0.902357
Ireland    0.058361
France     0.039282
Name: country, dtype: float64
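
The proportions above are only approximate because each row is drawn independently, so they change on every run. As a minimal sketch, assuming the newer NumPy Generator API is acceptable in your setup, a seeded generator makes the draw reproducible. The seed value 0 and the row count 891 (the length of the Titanic dataset) are illustrative assumptions, not part of the original answer:

```python
import numpy as np

country = ["UK", "Ireland", "France"]
p = [0.91, 0.06, 0.03]   # target probabilities, must sum to 1
n_rows = 891             # assumed: len(df) for the Titanic dataset

# A seeded Generator returns the same sample on every run.
rng = np.random.default_rng(0)   # seed 0 is an arbitrary choice
sample = rng.choice(country, size=n_rows, p=p)

# Observed share of each country; close to p, but not exact.
labels, counts = np.unique(sample, return_counts=True)
shares = dict(zip(labels.tolist(), (counts / n_rows).tolist()))
print(shares)
```

The resulting array can be assigned with df["country"] = sample, provided n_rows equals len(df).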

If you want an exact number of values (within the limits of rounding):

p = [0.91, 0.06, 0.03]
r = (np.array(p)*len(df)).round().astype(int) # the sum MUST be equal to len(df)
# or
# r = [811,  53,  27]

a = np.repeat(country, r)
np.random.shuffle(a)

df['country'] = a

df["country"].value_counts(normalize=True)

Output:

UK         0.910213
Ireland    0.059484
France     0.030303
Name: country, dtype: float64
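
The rounding step above only works when the rounded counts happen to sum to len(df), as they do here. One possible way to guarantee that, sketched under the assumption that p sums to 1, is largest-remainder rounding: floor every count, then hand the leftover rows to the entries with the largest fractional parts. The helper name exact_counts is hypothetical and not part of NumPy or pandas:

```python
import numpy as np

def exact_counts(p, n):
    """Split n into integer counts proportional to p that sum to exactly n.

    Hypothetical helper (largest-remainder rounding); assumes p sums to 1.
    """
    raw = np.asarray(p, dtype=float) * n
    counts = np.floor(raw).astype(int)
    shortfall = n - counts.sum()
    # Give the leftover units to the entries with the largest fractional parts.
    order = np.argsort(raw - counts)[::-1]
    counts[order[:shortfall]] += 1
    return counts

country = ["UK", "Ireland", "France"]
counts = exact_counts([0.91, 0.06, 0.03], 891)   # 891 assumed to be len(df)

# Repeat each label by its count, then shuffle the row order.
rng = np.random.default_rng(0)   # seed 0 is an arbitrary choice
a = rng.permutation(np.repeat(country, counts))
print(counts)
```

For these inputs the counts match the r = [811, 53, 27] given in the answer; the shuffled array a can then be assigned to df['country'] as before.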
User contributions licensed under: CC BY-SA