
Adding a column to a pandas DataFrame, randomly filled with values by percentage split

I want to do a train/valid/test split on a pandas DataFrame, but I do not want to generate new DataFrames. Rather, I want to add a new column called ‘Split’ whose values are drawn from ['train', 'valid', 'test']. I want 'train', 'valid', and 'test' to be assigned at random to 64%, 16%, and 20% of the rows, respectively.

I know of scikit-learn’s train_test_split, but again, I don’t want new frames. So I could try:


but I just want a column ‘Split’ with values of train, valid, and test as labels. This is for machine learning purposes so I would like to make sure the splits are completely random.

Does anyone know how this may be possible?


Answer

Here’s one way, using the suggested numpy.random.choice:
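
A minimal sketch of this approach, assuming a hypothetical 1000-row frame `df` with a single column `x` (the frame, its size, and the seed are illustrative and not from the original post). Each row gets a label drawn independently with probabilities 0.64, 0.16, and 0.20:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; any existing DataFrame works the same way.
df = pd.DataFrame({"x": np.arange(1000)})

# Arbitrary seed, only so the assignment is reproducible between runs.
np.random.seed(0)

# Draw one label per row, independently, with the requested probabilities.
df["Split"] = np.random.choice(
    ["train", "valid", "test"],
    size=len(df),
    p=[0.64, 0.16, 0.20],
)

# Observed proportions are close to, but generally not exactly, 64/16/20.
print(df["Split"].value_counts(normalize=True))
```

Because the labels are sampled independently per row, the resulting proportions vary slightly from run to run unless a seed is fixed.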


For one given run, this yielded

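
Sampling with numpy.random.choice only approximates the 64/16/20 proportions in any single run. If exact counts matter, one possible variant (not part of the original answer) is to build the label array with the exact counts and shuffle it. The sketch below again assumes a hypothetical 1000-row frame and an arbitrary seed:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; any existing DataFrame works the same way.
df = pd.DataFrame({"x": np.arange(1000)})

n = len(df)
n_train = int(round(0.64 * n))
n_valid = int(round(0.16 * n))
n_test = n - n_train - n_valid  # remainder, so the counts always sum to n

# Build the labels with exact counts, then shuffle their order.
labels = np.array(["train"] * n_train + ["valid"] * n_valid + ["test"] * n_test)
np.random.seed(0)  # arbitrary seed, for reproducibility only
df["Split"] = np.random.permutation(labels)

counts = df["Split"].value_counts()
print(counts)
```

Here which rows land in each split is still random, but the number of rows per split is fixed in advance.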
User contributions licensed under: CC BY-SA