I am trying to sample data from a big dataset.
The dataset is like
id    label
1     A
2     B
3     C
4     A
.........
Code to generate a sample dataset
import random
import pandas as pd

labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N']
weights = [0.350019, 0.209966, 0.126553, 0.100983, 0.053767, 0.039378, 0.029529,
           0.019056, 0.016783, 0.014813, 0.014152, 0.013477, 0.009444, 0.002082]
N = 300000

df = pd.DataFrame()
df['id'] = list(range(1, N + 1))
df['label'] = list(random.choices(labels, weights=weights, k=N))

# collapse to one row per id (each id has exactly one label)
group_dict = df.groupby(['id']).apply(lambda x: list(set(x['label'].tolist()))[0]).to_dict()
df = pd.DataFrame(group_dict.items())
df.columns = ['id', 'label']
The distribution of labels in the dataset is
df['label'].value_counts(normalize=True)
A    0.350373
B    0.209707
C    0.126307
D    0.101353
E    0.053917
F    0.039487
G    0.029217
H    0.018780
I    0.016510
J    0.015083
K    0.014323
L    0.013467
M    0.009530
N    0.001947
I created a new column in the dataset
df['freq'] = df.groupby('label')['label'].transform('count')
When I try to sample, say, 5000 items
sampledf = df.sample(n=5000, weights=df.freq, random_state=42)
the distribution of the labels in sampledf is not the same as that in df:
A    0.6048
B    0.2198
C    0.0850
D    0.0544
E    0.0190
F    0.0082
G    0.0038
H    0.0020
I    0.0010
K    0.0008
L    0.0008
J    0.0004
I am not sure why the distribution is not the same as in the original data frame.
Can anybody help me with what I am missing here?
Thanks
Answer
Re-assigning the frequency back onto the original dataframe is probably the issue: every row already appears in proportion to its label's frequency, so weighting those rows by frequency again counts it twice. Make sure you don't feed duplicate labels and their weights into your sampling.
Using your summary data I can generate 5000 samples which do have (roughly) the same distribution as the original:
In [1]: import pandas as pd

In [2]: summary = pd.DataFrame(
   ...:     [
   ...:         ['A', 0.350019],
   ...:         ['B', 0.209966],
   ...:         ['C', 0.126553],
   ...:         ['D', 0.100983],
   ...:         ['E', 0.053767],
   ...:         ['F', 0.039378],
   ...:         ['G', 0.029529],
   ...:         ['H', 0.019056],
   ...:         ['I', 0.016783],
   ...:         ['J', 0.014813],
   ...:         ['K', 0.014152],
   ...:         ['L', 0.013477],
   ...:         ['M', 0.009444],
   ...:         ['N', 0.002082],
   ...:     ],
   ...:     columns=['label', 'freq']
   ...: )
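If you have the original dataframe at hand, you could also build the same summary table directly instead of typing it in (a sketch, assuming df is the dataframe from the question):

# one row per label, with the proportion of rows carrying that label
summary = (
    df['label']
    .value_counts(normalize=True)   # proportion of each label
    .rename_axis('label')           # name the index so reset_index yields a 'label' column
    .rename('freq')
    .reset_index()                  # -> columns ['label', 'freq']
)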
You can sample from the summary table, weighting each unique label with the frequency in the original dataset:
In [3]: summary.label.sample(
   ...:     n=5000,
   ...:     weights=summary.freq,
   ...:     replace=True,
   ...: ).value_counts(normalize=True)
Out[3]:
label
A    0.3448
B    0.2198
C    0.1356
D    0.0952
E    0.0488
F    0.0322
G    0.0284
H    0.0234
I    0.0168
J    0.0162
K    0.0146
L    0.0140
M    0.0090
N    0.0012
dtype: float64
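To eyeball how close the sample is to the original distribution, you can line the sampled proportions up against the original frequencies (a small sketch; sampled and comparison are just illustrative names):

sampled = summary.label.sample(n=5000, weights=summary.freq, replace=True)

# side-by-side view of the original weights and the sampled proportions
comparison = pd.DataFrame({
    'original': summary.set_index('label')['freq'],
    'sampled': sampled.value_counts(normalize=True),
})
print(comparison)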
Alternatively, you could simply skip the calculation of the frequencies altogether – pandas will do this for you:
In [6]: import numpy as np

In [7]: df = pd.DataFrame(
   ...:     np.random.choice(["A", "B", "C", "D"], size=20_000, p=[0.6, 0.3, 0.05, 0.05]),
   ...:     columns=["label"],
   ...: )

In [8]: df.label.sample(5000, replace=True).value_counts(normalize=True)
Out[8]:
A    0.5994
B    0.2930
C    0.0576
D    0.0500
Name: label, dtype: float64
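Applied to the dataframe from the question, that simply means dropping the weights argument; a plain row sample (a minimal sketch reusing the df, n and random_state from the question) should already come out close to the original label proportions:

# sample rows uniformly -- frequent labels are already over-represented in df,
# so no extra weighting is needed to preserve the distribution
sampledf = df.sample(n=5000, random_state=42)
sampledf['label'].value_counts(normalize=True)  # roughly matches df['label'].value_counts(normalize=True)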
The issue with the code in your question is that the frequencies end up being applied twice: each label already appears in the dataframe in proportion to its frequency, and the explicit weights you pass to sample account for that same frequency again:
In [1]: import numpy as np

In [2]: df = pd.DataFrame(
   ...:     np.random.choice(["A", "B", "C", "D"], size=20_000, p=[0.6, 0.3, 0.05, 0.05]),
   ...:     columns=["label"],
   ...: )

In [3]: df['frequency'] = df.groupby('label')['label'].transform('count')

In [4]: df
Out[4]:
      label  frequency
0         A      11908
1         A      11908
2         B       5994
3         B       5994
4         D       1033
...     ...        ...
19995     A      11908
19996     D       1033
19997     A      11908
19998     A      11908
19999     A      11908
The result is roughly equal to the normalized square of each frequency:
In [6]: freqs = np.array([0.6, 0.3, 0.05, 0.05])

In [7]: (freqs ** 2) / (freqs ** 2).sum()
Out[7]: array([0.79120879, 0.1978022 , 0.00549451, 0.00549451])
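You can check that empirically with the df and frequency column built above (a rough sketch; the exact numbers will vary with the random draw):

# weighted sampling over the duplicated rows -- the proportions should land near
# the normalized squared frequencies computed above (~0.79, 0.20, 0.005, 0.005)
df.label.sample(n=5000, weights=df.frequency, replace=True).value_counts(normalize=True)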