Scikit-learn train_test_split with indices

Question

How do I get the original indices of the data when using train_test_split()? What I have is the following, but this does not give the indices of the original data. One workaround is to add the indices to the data (e.g. data = [(i, d) for i, d in enumerate(data)]), pass the resulting pairs to train_test_split, and then unpack them again. Are

Accepted Answer

Scikit-learn works well with pandas, so I suggest using it. A DataFrame or Series keeps its index through train_test_split, so the index of each split gives the original row indices. Here's an example:

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

data = np.reshape(np.random.randn(20), (10, 2))  # 10 training examples
labels = np.random.randint(2, size=10)  # 10 labels

In [2]:
# Give the columns in X names
X = pd.DataFrame(data, columns=['Column_1', 'Column_2'])
y = pd.Series(labels)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)

In [4]: X_test
Out[4]:
   Column_1  Column_2
2     -1.39     -1.86
8      0.48     -0.81
4     -0.10     -1.83

In [5]: y_test
Out[5]:
2    1
8    1
4    1
dtype: int32

Most scikit-learn functions and estimators accept a DataFrame or Series directly.

Let's say you wanted to fit a LogisticRegression. Here's how you could retrieve the coefficients in a readable form:

In [6]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model = model.fit(X_train, y_train)

# Retrieve coefficients: index is the feature name (['Column_1', 'Column_2'] here)
df_coefs = pd.DataFrame(model.coef_[0], index=X.columns, columns=['Coefficient'])
df_coefs

Out[6]:
          Coefficient
Column_1     0.076987
Column_2    -0.352463
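To make the index retrieval itself explicit, here is a minimal sketch that reads the original row labels from each split's .index attribute. It also shows a NumPy-only alternative that is not part of the answer above: passing an index array as a third argument to train_test_split, which splits every array it receives in the same way. The names rng, train_idx, test_idx, indices, i_train and i_test are illustrative assumptions, and the data is random.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative random data: 10 examples, 2 features, binary labels
rng = np.random.default_rng(0)
data = rng.standard_normal((10, 2))
labels = rng.integers(0, 2, size=10)

X = pd.DataFrame(data, columns=["Column_1", "Column_2"])
y = pd.Series(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# With the default RangeIndex, the index of each split holds the
# original row positions.
train_idx = X_train.index.to_numpy()
test_idx = X_test.index.to_numpy()

# Rows looked up through those indices match the split exactly.
assert np.array_equal(data[test_idx], X_test.to_numpy())
assert y_test.index.equals(X_test.index)

# NumPy-only alternative (an assumption, not from the answer):
# pass an index array as one more array to split.
indices = np.arange(len(data))
d_train, d_test, l_train, l_test, i_train, i_test = train_test_split(
    data, labels, indices, test_size=0.2, random_state=0
)
assert np.array_equal(data[i_test], d_test)
```

If the DataFrame has a custom index rather than the default RangeIndex, .index returns those labels instead of positions, so use .loc rather than positional indexing to look the rows up again.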

Answer