How to take a sample from a very unbalanced DataFrame without losing too many ‘1’s?

I have a pandas DataFrame like the one below, with an ID column and a Target variable (for a machine learning model).

  • My DataFrame is very large and unbalanced.
  • I need to sample from it because of its size.
  • The class balance looks like this:
    • 99.60% – 0
    • 0.40% – 1

      ID  Target
      111 1
      222 1
      333 0
      444 1

How can I sample the data without losing too many ones (target = 1), which are already very rare? In the next step I will, of course, add the remaining variables and perform oversampling, but first I need to take a sample of the data.

How can I do that in Python?



Assume you want a sample size of 1000.

Try the following line:

df.sample(frac=1000/len(df), replace=True, random_state=1)
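
Keep in mind that the line above draws rows uniformly at random, so with 0.40% ones a 1000-row sample would contain only about 4 ones on average. If the goal is to lose as few ones as possible, one option is to keep every row with target = 1 and sample only the zeros. Below is a minimal sketch of that idea, not a definitive solution. It assumes the columns are named `ID` and `Target` as in the question; the helper name `sample_keep_positives` and the toy data are made up for illustration.

```python
import pandas as pd


def sample_keep_positives(df, n, target_col="Target", random_state=1):
    """Return roughly n rows: every positive row plus a random subset of negatives.

    Hypothetical helper for illustration; `target_col` is assumed to hold 0/1.
    """
    positives = df[df[target_col] == 1]
    negatives = df[df[target_col] == 0]

    # Fill the rest of the sample with negatives, never asking for more than exist.
    n_negatives = min(max(n - len(positives), 0), len(negatives))
    sampled_negatives = negatives.sample(n=n_negatives, random_state=random_state)

    # Shuffle so the positives are not all grouped at the top.
    combined = pd.concat([positives, sampled_negatives])
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy data with the same 0.40% positive rate as in the question.
df = pd.DataFrame({"ID": range(10_000), "Target": [1] * 40 + [0] * 9_960})
sample = sample_keep_positives(df, n=1000)
print(sample["Target"].value_counts())
```

The resulting sample no longer has the original class ratio, which is usually fine if you plan to rebalance the data afterwards anyway. If there are more ones than `n`, this sketch returns all of them, so the result can be larger than `n`.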