I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.
Thanks!
Answer
I would just use numpy’s rand to build a random boolean mask:
In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]
And just to see this has worked (the split is only approximately 80/20, because each row is assigned independently):
In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79
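If you need an exact 80/20 split rather than an approximate one, pandas' own DataFrame.sample is another option. Here is a minimal sketch, assuming a 100-row frame like the one above; the seed values (0 and 42) are arbitrary choices for illustration and not part of the original answer:

```python
import numpy as np
import pandas as pd

# Illustrative data: 100 rows, 2 columns (seed 0 is an arbitrary assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 2)))

# Sample exactly 80% of the rows without replacement.
# random_state=42 is an arbitrary seed that makes the split repeatable.
train = df.sample(frac=0.8, random_state=42)

# Every row that was not drawn for training goes to the test set.
test = df.drop(train.index)

print(len(train), len(test))  # 80 20
```

Passing random_state makes the split repeatable; leave it out if you want a different split on each run. Note that df.drop(train.index) relies on the index being unique, which holds for the default integer index used here.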