Adding a column to a pandas DataFrame, randomly filled with values by percentage split

Question

I want to do a train/valid/test split on a pandas DataFrame, but I do not want to generate new DataFrames. Rather, I want to add a new column called 'Split' where Split = ['train', 'valid', 'test']. I want 'train', 'valid', 'test' …

Accepted Answer

Here's one way, using the suggested numpy.random.choice:

    import pandas as pd
    import numpy as np

    # Set up a little example
    data = np.ones(shape=(100, 3))
    df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])
    df['split'] = pd.NA

    # Split
    split = ['train', 'valid', 'test']
    df['split'] = df['split'].apply(lambda x: np.random.choice(split, p=[0.64, 0.16, 0.20]))

    # Verify
    df['split'].value_counts()

For one given run, this yielded

    train    64
    valid    19
    test     17
    Name: split, dtype: int64
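As the counts above show, drawing each row independently only approximates the 64/16/20 split. The sketch below is one possible variation, not part of the original answer: it draws all labels in a single vectorized call, and optionally fixes the counts exactly by building the label array first and shuffling it. The function name `add_split_column` and its `exact` and `seed` parameters are hypothetical, chosen for illustration only.

```python
import numpy as np
import pandas as pd


def add_split_column(df, labels=("train", "valid", "test"),
                     p=(0.64, 0.16, 0.20), exact=False, seed=None):
    """Return a copy of df with a 'split' column (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(df)
    out = df.copy()

    if exact:
        # Turn the proportions into row counts via cumulative rounding,
        # so the counts always add up to n.
        bounds = np.round(np.cumsum(p) * n).astype(int)
        bounds[-1] = n
        counts = np.diff(np.concatenate(([0], bounds)))
        # Build the labels in order, then shuffle them in place.
        values = np.repeat(labels, counts)
        rng.shuffle(values)
    else:
        # One vectorized draw: each row is sampled independently,
        # so the realized proportions only approximate p.
        values = rng.choice(labels, size=n, p=p)

    out["split"] = values
    return out


# Same toy data as in the answer above
df = pd.DataFrame(np.ones(shape=(100, 3)), columns=["x1", "x2", "y"])
df_exact = add_split_column(df, exact=True, seed=0)
print(df_exact["split"].value_counts())
```

With `exact=True` and 100 rows, the counts come out as exactly 64 train, 16 valid and 20 test; only the assignment of rows to labels is random. With `exact=False`, the behaviour matches the answer above, just without the per-row `apply`.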


Answer