I am trying to sample data from a big dataset.
The dataset is like
id    label
1     A
2     B
3     C
4     A
.........
Code to generate a sample dataset
import random
import pandas as pd

labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N']
weights = [0.350019, 0.209966, 0.126553, 0.100983, 0.053767, 0.039378, 0.029529,
           0.019056, 0.016783, 0.014813, 0.014152, 0.013477, 0.009444, 0.002082]
N = 300000

df = pd.DataFrame()
df['id'] = list(range(1, N + 1))
df['label'] = list(random.choices(labels, weights=weights, k=N))

# collapse to one row per id (each id has exactly one label)
group_dict = df.groupby(['id']).apply(lambda x: list(set(x['label'].tolist()))[0]).to_dict()
df = pd.DataFrame(group_dict.items())
df.columns = ['id', 'label']
The distribution of labels in the dataset is
df['label'].value_counts(normalize=True)
A    0.350373
B    0.209707
C    0.126307
D    0.101353
E    0.053917
F    0.039487
G    0.029217
H    0.018780
I    0.016510
J    0.015083
K    0.014323
L    0.013467
M    0.009530
N    0.001947
I created a new column in the dataset
df['freq'] = df.groupby('label')['label'].transform('count')
When I try to sample, say, 5000 items
sampledf = df.sample(n=5000, weights=df.freq, random_state=42)
the distribution of the labels in sampledf is not the same as that in df:
A    0.6048
B    0.2198
C    0.0850
D    0.0544
E    0.0190
F    0.0082
G    0.0038
H    0.0020
I    0.0010
K    0.0008
L    0.0008
J    0.0004
I am not sure why the distribution is not the same as in the original data frame.
Can anybody help me with what I am missing here?
Thanks
Answer
Re-assigning the frequency back onto the original dataframe is probably the issue: every row already appears in proportion to its label's frequency, so weighting those rows by frequency again counts it twice. Make sure you don't feed duplicate labels and their weights into your sampling.
Using your summary data I can generate 5000 samples which do have (roughly) the same distribution as the original:
In [1]: import pandas as pd

In [2]: summary = pd.DataFrame(
   ...:     [
   ...:         ['A', 0.350019],
   ...:         ['B', 0.209966],
   ...:         ['C', 0.126553],
   ...:         ['D', 0.100983],
   ...:         ['E', 0.053767],
   ...:         ['F', 0.039378],
   ...:         ['G', 0.029529],
   ...:         ['H', 0.019056],
   ...:         ['I', 0.016783],
   ...:         ['J', 0.014813],
   ...:         ['K', 0.014152],
   ...:         ['L', 0.013477],
   ...:         ['M', 0.009444],
   ...:         ['N', 0.002082],
   ...:     ],
   ...:     columns=['label', 'freq']
   ...: )
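If you have the original dataframe at hand, you could also build the same summary table directly instead of typing it in (a sketch, assuming df is the dataframe from the question):

# one row per label, with the proportion of rows carrying that label
summary = (
    df['label']
    .value_counts(normalize=True)   # proportion of each label
    .rename_axis('label')           # name the index so reset_index yields a 'label' column
    .rename('freq')
    .reset_index()                  # -> columns ['label', 'freq']
)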
You can sample from the summary table, weighting each unique label with the frequency in the original dataset:
In [3]: summary.label.sample(
   ...:     n=5000,
   ...:     weights=summary.freq,
   ...:     replace=True,
   ...: ).value_counts(normalize=True)
Out[3]:
label
A    0.3448
B    0.2198
C    0.1356
D    0.0952
E    0.0488
F    0.0322
G    0.0284
H    0.0234
I    0.0168
J    0.0162
K    0.0146
L    0.0140
M    0.0090
N    0.0012
dtype: float64
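To eyeball how close the sample is to the original distribution, you can line the sampled proportions up against the original frequencies (a small sketch; sampled and comparison are just illustrative names):

sampled = summary.label.sample(n=5000, weights=summary.freq, replace=True)

# side-by-side view of the original weights and the sampled proportions
comparison = pd.DataFrame({
    'original': summary.set_index('label')['freq'],
    'sampled': sampled.value_counts(normalize=True),
})
print(comparison)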
Alternatively, you could simply skip the calculation of the frequencies altogether – pandas will do this for you:
In [6]: import numpy as np

In [7]: df = pd.DataFrame(
   ...:     np.random.choice(["A", "B", "C", "D"], size=20_000, p=[0.6, 0.3, 0.05, 0.05]),
   ...:     columns=["label"],
   ...: )

In [8]: df.label.sample(5000, replace=True).value_counts(normalize=True)
Out[8]:
A    0.5994
B    0.2930
C    0.0576
D    0.0500
Name: label, dtype: float64
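Applied to the dataframe from the question, that simply means dropping the weights argument; a plain row sample (a minimal sketch reusing the df, n and random_state from the question) should already come out close to the original label proportions:

# sample rows uniformly -- frequent labels are already over-represented in df,
# so no extra weighting is needed to preserve the distribution
sampledf = df.sample(n=5000, random_state=42)
sampledf['label'].value_counts(normalize=True)  # roughly matches df['label'].value_counts(normalize=True)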
The issue with the code in your question is that the frequencies end up being applied twice: each label already appears in the dataframe in proportion to its frequency, and the explicit weights you pass to sample account for that same frequency again:
In [1]: import numpy as np

In [2]: df = pd.DataFrame(
   ...:     np.random.choice(["A", "B", "C", "D"], size=20_000, p=[0.6, 0.3, 0.05, 0.05]),
   ...:     columns=["label"],
   ...: )

In [3]: df['frequency'] = df.groupby('label')['label'].transform('count')

In [4]: df
Out[4]:
      label  frequency
0         A      11908
1         A      11908
2         B       5994
3         B       5994
4         D       1033
...     ...        ...
19995     A      11908
19996     D       1033
19997     A      11908
19998     A      11908
19999     A      11908
The result is roughly equal to the normalized square of each frequency:
In [6]: freqs = np.array([0.6, 0.3, 0.05, 0.05])

In [7]: (freqs ** 2) / (freqs ** 2).sum()
Out[7]: array([0.79120879, 0.1978022 , 0.00549451, 0.00549451])
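You can check that empirically with the df and frequency column built above (a rough sketch; the exact numbers will vary with the random draw):

# weighted sampling over the duplicated rows -- the proportions should land near
# the normalized squared frequencies computed above (~0.79, 0.20, 0.005, 0.005)
df.label.sample(n=5000, weights=df.frequency, replace=True).value_counts(normalize=True)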