I want to do a train/valid/test split on a pandas dataframe, but I do not want to generate new data frames. Rather, I want to add a new column called ‘Split’ whose values come from ['train', 'valid', 'test']. I want 'train', 'valid', and 'test' to be distributed randomly across 64%, 16%, and 20% of the rows, respectively.
I know of scikit-learn’s train_test_split, but again, I don’t want new frames. So I could try:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
but I just want a column ‘Split’ with values of train, valid, and test as labels. This is for machine learning purposes so I would like to make sure the splits are completely random.
Does anyone know how this may be possible?
Answer
Here’s one way, using the suggested numpy.random.choice:
import pandas as pd
import numpy as np

# Set up a little example
data = np.ones(shape=(100, 3))
df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])
df['split'] = pd.NA

# Split
split = ['train', 'valid', 'test']
df['split'] = df['split'].apply(lambda x: np.random.choice(split, p=[0.64, 0.16, 0.20]))

# Verify
df['split'].value_counts()
Each row's label is drawn independently, so the proportions only approximate 64/16/20. For one given run, this yielded
train    64
valid    19
test     17
Name: split, dtype: int64
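If the counts need to match 64/16/20 exactly rather than on average, one alternative is to build the labels in the right quantities first and then shuffle them. The following is only a minimal sketch of that idea, not part of the answer above: the helper name add_split_column, its parameters, and the fixed seed are all hypothetical choices made for illustration, and the last label simply takes whatever rows remain after rounding.

```python
import numpy as np
import pandas as pd


def add_split_column(df, fractions=(0.64, 0.16, 0.20),
                     labels=('train', 'valid', 'test'), seed=0):
    """Hypothetical helper: return a copy of df with a shuffled 'split' column."""
    n = len(df)
    # Round the first two counts; the last label absorbs the remainder
    n_train = int(round(fractions[0] * n))
    n_valid = int(round(fractions[1] * n))
    n_test = n - n_train - n_valid

    # Build the labels in order, then shuffle them with a seeded generator
    ordered = np.repeat(labels, [n_train, n_valid, n_test])
    rng = np.random.default_rng(seed)  # seed is an assumption, for reproducibility

    out = df.copy()
    out['split'] = rng.permutation(ordered)
    return out


# Same toy data as in the answer
df = pd.DataFrame(np.ones(shape=(100, 3)), columns=['x1', 'x2', 'y'])
df = add_split_column(df)
counts = df['split'].value_counts()
print(counts)
```

With 100 rows this gives exactly 64 'train', 16 'valid', and 20 'test' labels. Which rows get which label is still random, and changing or dropping the seed gives a different assignment.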