I’m trying to create a new categorical column of countries with specific percentage values. Take the following dataset, for instance:
import seaborn as sns
df = sns.load_dataset("titanic")
I’m trying the following script to get the new column:
import numpy as np

country = ['UK', 'Ireland', 'France']
df["country"] = np.random.choice(country, len(df))
df["country"].value_counts(normalize=True)

UK         0.344557
Ireland    0.328844
France     0.326599
However, all the countries come out with roughly equal proportions. I want a specific proportion for each country:
Desired Output
df["country"].value_counts(normalize=True)

UK         0.91
Ireland    0.06
France     0.03
What would be the ideal way of getting the desired output? Any suggestions would be appreciated. Thanks!
Answer
Do you want to change the probabilities of numpy.random.choice? Pass them through its p parameter, in the same order as the values:
df["country"] = np.random.choice(country, len(df), p=[0.91, 0.06, 0.03])
df["country"].value_counts(normalize=True)
Output:
UK         0.902357
Ireland    0.058361
France     0.039282
Name: country, dtype: float64
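Because the values are drawn at random, the observed shares only approximate p and change from run to run. A minimal sketch of the same idea with a seeded generator, so the result is reproducible; the stand-in DataFrame (891 rows, like the titanic dataset), the `passenger` column, and the seed are assumptions for illustration, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same number of rows as the titanic dataset (891).
df = pd.DataFrame({"passenger": range(891)})

country = ["UK", "Ireland", "France"]
p = [0.91, 0.06, 0.03]  # one probability per country; must sum to 1

# Seeded generator (seed chosen arbitrarily) so repeated runs give the same draw.
rng = np.random.default_rng(seed=0)
df["country"] = rng.choice(country, size=len(df), p=p)

# Observed shares are close to p, but not exactly equal to it.
shares = df["country"].value_counts(normalize=True)
print(shares)
```

Changing or removing the seed gives a different draw with slightly different shares.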
If you want an exact number of values (within the limits of rounding):
p = [0.91, 0.06, 0.03]
r = (np.array(p)*len(df)).round().astype(int)
# the sum MUST be equal to len(df)
# or
# r = [811, 53, 27]
a = np.repeat(country, r)
np.random.shuffle(a)
df['country'] = a
df["country"].value_counts(normalize=True)
Output:
UK         0.910213
Ireland    0.059484
France     0.030303
Name: country, dtype: float64
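Plain rounding happens to sum to 891 here, but for other proportions or row counts the rounded counts can be off by one or more, and the column assignment then fails with a length mismatch. One possible way around this, sketched below, is a largest-remainder allocation: take the floor of each count and hand the leftover rows to the largest fractional parts. The helper name `exact_counts`, the stand-in DataFrame, and the seed are hypothetical and not part of the original answer; the sketch assumes p sums to 1.

```python
import numpy as np
import pandas as pd

def exact_counts(p, n):
    """Split n into integer counts proportional to p that sum to exactly n."""
    raw = np.asarray(p, dtype=float) * n
    counts = np.floor(raw).astype(int)
    shortfall = n - counts.sum()
    # Give the leftover rows to the entries with the largest fractional parts.
    order = np.argsort(raw - counts)[::-1]
    counts[order[:shortfall]] += 1
    return counts

# Stand-in frame with the same number of rows as the titanic dataset (891).
df = pd.DataFrame({"passenger": range(891)})

country = ["UK", "Ireland", "France"]
p = [0.91, 0.06, 0.03]

counts = exact_counts(p, len(df))

# Repeat each country by its count, then shuffle so the labels are not in blocks.
rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily
df["country"] = rng.permutation(np.repeat(country, counts))

print(df["country"].value_counts())
```

For these inputs the allocation matches the rounded counts in the answer (811, 53, 27); the difference only shows up when rounding would not sum to len(df).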