Adding a column to a pandas DataFrame, randomly filled with values by percentage split

Question

I want to do a train/valid/test split on a pandas DataFrame, but I do not want to generate new DataFrames. Rather, I want to add a new column called 'Split' whose values come from ['train', 'valid', 'test']. I want 'train', 'valid', and 'test' to be assigned at random to 64%, 16%, and 20% of the rows, respectively. I know of scikit-learn's train_test_split,

Accepted Answer

Here's one way, using the suggested numpy.random.choice:

    import pandas as pd
    import numpy as np

    # Set up a little example
    data = np.ones(shape=(100, 3))
    df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])
    df['split'] = pd.NA

    # Split
    split = ['train', 'valid', 'test']
    df['split'] = df['split'].apply(lambda x: np.random.choice(split, p=[0.64, 0.16, 0.20]))

    # Verify
    df['split'].value_counts()

For one given run, this yielded

    train    64
    valid    19
    test     17
    Name: split, dtype: int64
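The approach above draws each row's label independently, so the counts only approximate 64/16/20 (as the sample run shows). If exact counts matter, one possible sketch is to build the labels in the right proportions and shuffle them. The function name `add_split_column`, its parameters, and the choice to return a copy are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd


def add_split_column(df, fractions=(0.64, 0.16, 0.20),
                     labels=("train", "valid", "test"), seed=None):
    """Return a copy of df with a 'split' column in (near-)exact proportions.

    Hypothetical helper: the name and signature are illustrative only.
    """
    rng = np.random.default_rng(seed)
    n = len(df)

    # Turn cumulative fractions into row boundaries, so the counts always
    # sum to n and rounding error never produces a negative count.
    bounds = np.round(np.cumsum(fractions) * n).astype(int)
    bounds[-1] = n
    counts = np.diff(bounds, prepend=0)

    # Repeat each label the required number of times, then shuffle.
    assignments = np.repeat(labels, counts)
    out = df.copy()
    out["split"] = rng.permutation(assignments)
    return out


# Same toy frame as in the answer above
df = pd.DataFrame(np.ones(shape=(100, 3)), columns=["x1", "x2", "y"])
df_split = add_split_column(df, seed=0)
print(df_split["split"].value_counts())
```

With 100 rows this gives exactly 64 'train', 16 'valid', and 20 'test' rows on every run, and only the row order of the labels depends on the seed. For frames whose length does not divide evenly, the counts are rounded to the nearest row.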

Answer