I have a pandas DataFrame like the one below, with an ID column and a TARGET column (the target variable for a machine learning model).
- My DataFrame is very large and imbalanced.
- I need to take a sample of it because it is so large.
- The class balance looks like this:
99.60% – 0
0.40% – 1
ID    TARGET
111   1
222   1
333   0
444   1
…     …
How can I sample the data without losing too many ones (TARGET = 1), which are very rare anyway? In the next step I will add the remaining variables and perform oversampling, but first I need to take a sample of the data.
How can I do that in Python?
Answer
Assume you want a sample of 1000 rows.
Try the following line:
df.sample(frac=1000/len(df), replace=True, random_state=1)
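Note that a plain random sample like the one above keeps roughly the original class ratio, so a 1000-row sample would contain only about 4 ones on average (0.40% of 1000). If the goal is to avoid losing the rare ones, one option is to keep every row with TARGET = 1 and downsample only the zeros. The following is a minimal sketch under that assumption, not necessarily what the answer above intended; the helper name `sample_keep_minority` and the synthetic DataFrame are hypothetical and made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 100,000 rows, roughly 0.4% ones.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ID": np.arange(100_000),
    "TARGET": (rng.random(100_000) < 0.004).astype(int),
})


def sample_keep_minority(df, n, target="TARGET", minority=1, random_state=1):
    """Keep all minority rows and fill the rest of the sample with majority rows.

    If there are already n or more minority rows, all of them are returned
    and no majority rows are added, so the result can exceed n rows.
    """
    ones = df[df[target] == minority]
    zeros = df[df[target] != minority]

    # Number of majority rows needed to reach n, capped at what is available.
    n_zeros = min(max(n - len(ones), 0), len(zeros))
    zeros_sample = zeros.sample(n=n_zeros, random_state=random_state)

    # Concatenate and shuffle so the ones are not grouped at the top.
    combined = pd.concat([ones, zeros_sample])
    return combined.sample(frac=1, random_state=random_state)


sample = sample_keep_minority(df, n=1000)
print(sample["TARGET"].value_counts())
```

The resulting sample no longer reflects the original class ratio, which is worth keeping in mind when evaluating a model trained on it.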