How do I get the original indices of the data when using train_test_split()?
What I have is the following:
from sklearn.model_selection import train_test_split
import numpy as np
data = np.reshape(np.random.randn(20), (10, 2))  # 10 training examples
labels = np.random.randint(2, size=10)  # 10 labels
x1, x2, y1, y2 = train_test_split(data, labels, test_size=0.2)
But this does not give the indices of the original data.
One workaround is to add the indices to the data (e.g. data = [(i, d) for i, d in enumerate(data)]), pass the result to train_test_split, and then unpack it again.
Are there any cleaner solutions?
Answer
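scikit-learn works well with pandas, so I suggest using it. A DataFrame or Series keeps its index through train_test_split, so the index of each split holds the original row indices. Here’s an example: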
In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
data = np.reshape(np.random.randn(20),(10,2)) # 10 training examples
labels = np.random.randint(2, size=10) # 10 labels

In [2]: # Giving columns in X a name
X = pd.DataFrame(data, columns=['Column_1', 'Column_2'])
y = pd.Series(labels)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)

In [4]: X_test
Out[4]:

   Column_1  Column_2
2     -1.39     -1.86
8      0.48     -0.81
4     -0.10     -1.83

In [5]: y_test
Out[5]:

2    1
8    1
4    1
dtype: int32
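To make the link back to the original array explicit, here is a minimal sketch (an illustration, not part of the session above) that reads the row labels from the index of each split and uses them to look up the same rows in data. It assumes the DataFrame keeps its default integer index, so labels coincide with row positions. The names rng, test_idx and train_idx are introduced here only for illustration, and the seeded generator is just there to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Seeded generator so the sketch is repeatable (an assumption, not in the original)
rng = np.random.RandomState(0)
data = rng.randn(10, 2)            # 10 training examples
labels = rng.randint(2, size=10)   # 10 labels

X = pd.DataFrame(data, columns=['Column_1', 'Column_2'])
y = pd.Series(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The index of each split holds the original row labels.
# With the default RangeIndex these are also the row positions in `data`.
test_idx = X_test.index.to_numpy()
train_idx = X_train.index.to_numpy()

# Looking the rows up in the original arrays gives back the split contents
assert np.array_equal(data[test_idx], X_test.to_numpy())
assert np.array_equal(labels[test_idx], y_test.to_numpy())

print(test_idx)
```

If the DataFrame has a custom (non-positional) index, the same labels would need X.loc[...] rather than direct positional indexing into the NumPy array.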
You can pass a DataFrame or Series directly to most scikit-learn functions and estimators.
Let’s say you wanted to fit a LogisticRegression. Here’s how you could retrieve the coefficients labeled by feature name:
In [6]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model = model.fit(X_train, y_train)

# Retrieve coefficients: index is the feature name (['Column_1', 'Column_2'] here)
df_coefs = pd.DataFrame(model.coef_[0], index=X.columns, columns = ['Coefficient'])
df_coefs
Out[6]:
          Coefficient
Column_1     0.076987
Column_2    -0.352463
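If you would rather stay with plain NumPy arrays, one possible alternative (a sketch, not the approach taken in the answer above) relies on train_test_split accepting any number of arrays of equal length and splitting them all the same way. Passing an array of row positions alongside the data returns the positions for each split. The names indices, idx1 and idx2 are hypothetical, chosen to match the question's x1, x2, y1, y2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Seeded generator so the sketch is repeatable (an assumption, not in the original)
rng = np.random.RandomState(0)
data = rng.randn(10, 2)            # 10 training examples
labels = rng.randint(2, size=10)   # 10 labels
indices = np.arange(len(data))     # row positions 0..9

# Every array passed in is shuffled and split consistently,
# so idx1/idx2 are the original positions of the train/test rows.
x1, x2, y1, y2, idx1, idx2 = train_test_split(
    data, labels, indices, test_size=0.2, random_state=0)

# Indexing the original arrays with idx2 reproduces the test split
assert np.array_equal(data[idx2], x2)
assert np.array_equal(labels[idx2], y2)
```

This avoids wrapping each row in a tuple and unpacking it afterwards, at the cost of carrying two extra return values.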