
How to take sample of data from very unbalanced DataFrame so as to not lose too many ‘1’?

I have a Pandas DataFrame like the one below, with an ID and a TARGET variable (for a machine learning model).

My DataFrame is very large and heavily imbalanced, so I need to take a sample of it.
The class balance looks like this:
- 99.60% – 0
- 0.40% – 1
  

ID	TARGET
111	1
222	1
333	0
444	1
…	…

How can I sample the data without losing too many ones (TARGET = 1), which are already very rare? In the next step I will add the remaining variables and perform oversampling, but first I need to take a sample of the data.

How can I do that in Python?

Answer

Assume you want a sample of 1000 rows.

Try the following line:

df.sample(frac=1000/len(df), replace=True, random_state=1)
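Note that a plain random sample keeps roughly the original class ratio, so at a 0.40% rate a 1000-row sample would contain only about 4 ones on average. One possible alternative, sketched below, is to keep every row with TARGET = 1 and draw only the zeros at random. The helper name `sample_keep_minority`, its parameters, and the toy data are assumptions made for illustration; they are not part of the original question or answer.

```python
import pandas as pd

def sample_keep_minority(df, n_total, target_col="TARGET", minority=1, random_state=1):
    """Keep every minority row and fill the rest of the sample with majority rows.

    Hypothetical helper for illustration. If there are more minority rows
    than n_total, all of them are still kept, so the result can be larger
    than n_total.
    """
    minority_rows = df[df[target_col] == minority]
    majority_rows = df[df[target_col] != minority]
    # Majority rows needed to reach n_total: never negative and never
    # more than are actually available.
    n_majority = min(max(n_total - len(minority_rows), 0), len(majority_rows))
    # Sample without replacement so no row appears twice.
    majority_sample = majority_rows.sample(n=n_majority, random_state=random_state)
    # Concatenate, then shuffle so the classes are not grouped together.
    combined = pd.concat([minority_rows, majority_sample])
    return combined.sample(frac=1, random_state=random_state)

# Toy data with the imbalance from the question: 0.40% ones.
df = pd.DataFrame({"ID": range(10_000), "TARGET": [1] * 40 + [0] * 9_960})
sample = sample_keep_minority(df, n_total=1000)
```

With this approach the share of ones in the sample is higher than in the full DataFrame, so the sample no longer reflects the original class ratio. That may be acceptable if oversampling follows anyway, but it is worth keeping in mind when evaluating a model.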
