Split Pandas Dataframe With Equal Amount of Rows for each Column Value

This is for a machine learning project.
I have a CSV file which I have read in as a Pandas dataframe. The CSV looks like this:

id,label
f38a6374c348f90b587e046aac6079959adf3835,0
c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
755db6279dae599ebb4d39a9123cce439965282d,0
bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
068aba587a4950175d04c680d38943fd488d6a9d,0
acfe80838488fae3c89bd21ade75be5c34e66be7,0
a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da,1
7f6ccae485af121e0b6ee733022e226ee6b0c65f,1
559e55a64c9ba828f700e948f6886f4cea919261,0
8eaaa7a400aa79d36c2440a4aa101cc14256cda4,0
...
[220025 rows x 2 columns]

I have reduced the sample size and balanced the data, so that I now have a dataframe with 60,000 rows: 30,000 rows with label 1 and 30,000 rows with label 0. I want to split this dataframe in two, with one dataframe having 50,000 rows and the other having 10,000, but I want each dataframe to contain an equal number of rows with label 1 and label 0.

There are some longer solutions, such as splitting the dataframe by label, using .sample(frac=...) on each part to make two dataframes, and then merging the matching pieces, but that is unnecessarily complex.

Is there a method to split the dataframe so that each part has an equal number of rows per label, but the two parts have different total sizes?

Here is the code I have used:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import cv2
import random

df = pd.read_csv("../input/histopathologic-cancer-detection/train_labels.csv")

ones_subset = df.loc[df["label"] == 1, :]
num_ones = len(ones_subset)

zeros_subset = df.loc[df["label"] == 0, :]
sampled_zeros = zeros_subset.sample(num_ones)

print(num_ones)
print(sampled_zeros)

df = pd.concat([ones_subset, sampled_zeros], ignore_index=True)
df = df.groupby("label").sample(30000).sample(frac=1).reset_index(drop=True)
print(df)

Answer

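Try train_test_split from sklearn with the stratify argument. Passing the dataframe itself returns two dataframes, and an integer test_size is an absolute row count, so test_size=10000 gives exactly 50,000 and 10,000 rows (a fraction such as 0.16 would give 9,600).

from sklearn.model_selection import train_test_split

# stratify keeps the 50/50 label ratio in both parts: 25,000 per label and 5,000 per label
train_df, test_df = train_test_split(df, test_size=10000, random_state=19, stratify=df["label"])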
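
If you would rather stay within pandas, the same split can probably be done with groupby(...).sample, which the question already uses. The following is only a minimal sketch under some assumptions: the dataframe here is a synthetic stand-in for the balanced 60,000-row data (the ids are made up), the names small_df and large_df are hypothetical, and pandas 1.1 or later is needed for groupby sampling.

```python
import pandas as pd

# Synthetic stand-in for the balanced dataframe: 30,000 rows per label.
# The ids are placeholders, not real image hashes.
df = pd.DataFrame({
    "id": [f"img_{i}" for i in range(60000)],
    "label": [0, 1] * 30000,
})

# Draw 5,000 rows from each label group to form the 10,000-row dataframe.
small_df = df.groupby("label").sample(n=5000, random_state=19)

# Everything that was not drawn forms the 50,000-row dataframe.
# This relies on df having a unique index, which reset_index(drop=True) guarantees.
large_df = df.drop(small_df.index)

# Shuffle both parts so the labels are not grouped together.
small_df = small_df.sample(frac=1, random_state=19).reset_index(drop=True)
large_df = large_df.sample(frac=1, random_state=19).reset_index(drop=True)

print(small_df["label"].value_counts())
print(large_df["label"].value_counts())
```

Because the sample is taken per label group, both parts stay balanced by construction. The drop step only works correctly when the index values are unique, so reset the index first if the dataframe was built with concat without ignore_index=True.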
