This is for a machine learning project.
I have a CSV file which I have read in as a Pandas dataframe. The CSV looks like this:
id,label
f38a6374c348f90b587e046aac6079959adf3835,0
c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
755db6279dae599ebb4d39a9123cce439965282d,0
bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
068aba587a4950175d04c680d38943fd488d6a9d,0
acfe80838488fae3c89bd21ade75be5c34e66be7,0
a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da,1
7f6ccae485af121e0b6ee733022e226ee6b0c65f,1
559e55a64c9ba828f700e948f6886f4cea919261,0
8eaaa7a400aa79d36c2440a4aa101cc14256cda4,0
...
[220025 rows x 2 columns]
I have reduced the sample size and balanced the data, so that I now have a dataframe with 60,000 rows: 30,000 rows with label 1 and 30,000 rows with label 0. I now want to split this dataframe into two, one with 50,000 rows and the other with 10,000, but I want each dataframe to have an equal number of rows with label 1 and label 0.
There are some longer solutions, such as splitting the dataframe by label, using .sample(frac=...) on each part to make two dataframes, and then concatenating the matching parts, but that is unnecessarily complex.
Is there a method to split the dataframe so that each part has an equal number of rows for each label, but the two parts have different total row counts?
Here is the code I have used:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import cv2
import random

df = pd.read_csv("../input/histopathologic-cancer-detection/train_labels.csv")

ones_subset = df.loc[df["label"] == 1, :]
num_ones = len(ones_subset)
zeros_subset = df.loc[df["label"] == 0, :]
sampled_zeros = zeros_subset.sample(num_ones)
print(num_ones)
print(sampled_zeros)

df = pd.concat([ones_subset, sampled_zeros], ignore_index=True)
df = df.groupby("label").sample(30000).sample(frac=1).reset_index(drop=True)
print(df)
Answer
Try scikit-learn's train_test_split with the stratify argument:
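from sklearn.model_selection import train_test_split

x = df["id"]     # features (here, the image ids)
y = df["label"]  # target, also used for stratification

# An integer test_size is an absolute row count; 0.16 would give 9,600 of 60,000 rows, not 10,000
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=10000, random_state=19, stratify=y)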
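If you want two dataframes rather than separate x and y pieces, the following is a minimal sketch of the same idea. It assumes a balanced 60,000-row dataframe shaped like the one in the question; the synthetic ids and the names train_df, test_df, train_pd and test_pd are made up for illustration. It shows two options: passing the whole dataframe to train_test_split, and a pandas-only variant using groupby(...).sample (available in pandas 1.1 and later).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced dataframe from the question (ids are made up).
df = pd.DataFrame({
    "id": [f"img_{i:05d}" for i in range(60000)],
    "label": np.repeat([0, 1], 30000),
})

# Option 1: stratified split of the whole dataframe with scikit-learn.
# An integer test_size is an absolute number of rows.
train_df, test_df = train_test_split(
    df, test_size=10000, stratify=df["label"], random_state=19
)

# Option 2: pandas only. Draw 5,000 rows per label for the small set,
# then keep every remaining row for the large set.
test_pd = df.groupby("label").sample(n=5000, random_state=19)
train_pd = df.drop(test_pd.index)

# Check the sizes and the label balance of each part.
print(len(train_df), len(test_df))
print(train_df["label"].value_counts().to_dict())
print(test_df["label"].value_counts().to_dict())
```

Because the input is already balanced, stratifying on the label keeps the 50/50 ratio in both parts. The pandas-only variant sets the per-label counts explicitly, which may be preferable if you would rather not depend on scikit-learn for this step.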