Sampling data from the pandas dataframe

I am trying to sample data from a big dataset.

The dataset looks like this:

JavaScript

Code to generate a sample dataset

JavaScript

The distribution of labels in the dataset is

JavaScript
JavaScript

I created a new column in the dataset

JavaScript

When I try to sample, say, 5000 items:

JavaScript

The distribution of the labels in sampledf is not the same as in df:

JavaScript

I am not sure why the distribution is not the same as the actual data frame.

Can anybody help me with what I am missing here?

Thanks

Answer

If you’re re-assigning frequency to the original dataframe, that’s probably the issue. Make sure you don’t have duplicate labels and weights going into your sampling.

Using your summary data I can generate 5000 samples which do have (roughly) the same distribution as the original:

JavaScript

You can sample from the summary table, weighting each unique label with the frequency in the original dataset:

JavaScript
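
As a minimal sketch of sampling from a summary table, assuming a hypothetical table with one row per unique label and a `freq` column holding its count (the labels and counts below are made up, not taken from the question):

```python
import pandas as pd

# Hypothetical summary table: one row per unique label with its count
# in the original data. Labels and counts are illustrative only.
summary = pd.DataFrame({
    "label": ["a", "b", "c", "d"],
    "freq": [5000, 3000, 1500, 500],
})

# Draw 5000 rows with replacement, weighting each unique label by its count.
# pandas normalizes the weights so that they sum to 1.
sampled = summary.sample(n=5000, replace=True, weights="freq", random_state=0)

# Compare the sampled proportions with the original proportions.
original_share = summary.set_index("label")["freq"] / summary["freq"].sum()
sampled_share = sampled["label"].value_counts(normalize=True)
comparison = pd.DataFrame({"original": original_share, "sampled": sampled_share})
print(comparison)
```

Because each label appears exactly once in the summary table, the explicit weights are the only place where frequency enters, so the sampled shares should track the original ones closely.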

Alternatively, you could simply skip the calculation of the frequencies altogether – pandas will do this for you:

JavaScript
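
A sketch of the unweighted approach, assuming a synthetic `df` with a single `label` column (the label names and probabilities are invented for illustration). Every row is equally likely to be drawn, so frequent labels are picked more often simply because they occupy more rows:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the original dataframe (assumed shape and labels).
rng = np.random.default_rng(0)
labels = rng.choice(["a", "b", "c", "d"], size=100_000, p=[0.5, 0.3, 0.15, 0.05])
df = pd.DataFrame({"label": labels})

# Uniform sampling of rows: no weights argument needed.
sampledf = df.sample(n=5000, random_state=0)

# Side-by-side label proportions in the full data and in the sample.
orig_dist = df["label"].value_counts(normalize=True)
samp_dist = sampledf["label"].value_counts(normalize=True)
print(pd.DataFrame({"df": orig_dist, "sampledf": samp_dist}))
```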

The issue with the code in your question is that each label ends up weighted twice: once implicitly, because frequent labels occupy more rows of the dataframe, and once through the explicit weights (which also account for frequency):

JavaScript

The result is roughly equal to the normalized square of each frequency:

JavaScript
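
The double weighting can be reproduced with a small sketch. It assumes the extra column in the question holds each row's label frequency (called `frequency` here, a guess at the original name) and uses synthetic labels. A label with relative frequency p fills a share p of the rows, and each of those rows carries weight p, so its chance of being drawn is proportional to p squared:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the original dataframe (assumed shape and labels).
rng = np.random.default_rng(0)
labels = rng.choice(["a", "b", "c", "d"], size=100_000, p=[0.5, 0.3, 0.15, 0.05])
df = pd.DataFrame({"label": labels})

# Per-row weight equal to the label's relative frequency
# (assumed to mirror the column added in the question).
freq = df["label"].value_counts(normalize=True)
df["frequency"] = df["label"].map(freq)

# Weighted sampling on the full dataframe: frequency is counted twice.
biased = df.sample(n=5000, weights="frequency", random_state=0)

# Compare the observed shares with the normalized squared frequencies.
expected = freq**2 / (freq**2).sum()
observed = biased["label"].value_counts(normalize=True)
print(pd.DataFrame({"freq": freq, "freq_squared_norm": expected, "observed": observed}))
```

The most common label should come out clearly over-represented relative to `freq`, and the rare labels under-represented.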