Sklearn RFE, pipeline and cross validation

Question

I'm trying to figure out how to use RFE for regression problems, and I was reading some tutorials. I found an example of how to use RFECV to automatically select the ideal number of features, and it goes something like: which I find pretty straightforward. However, I was checking how to do the same thing using an RFE object, but

Accepted Answer

Well, first, let's point out that RFECV and RFE are doing two separate jobs in your script: the former selects the optimal number of features, while the latter selects the five most important features (or, the best combination of 5 features, given their importance for the DecisionTreeRegressor).

Back to your question: "When did the RFE pass the information about which features have been selected to the Decision Tree?" It is worth noting that the RFE does not explicitly tell the Decision Tree which features are selected. It simply takes a matrix as input (the training set) and transforms it into a matrix of N columns, based on the n_features_to_select=N parameter. That matrix (i.e., the transformed training set) is passed as input to the Decision Tree, along with the target variable, and the Decision Tree returns a fitted model that can be used to predict unseen instances.

Let's dive into an example for classification:

```python
""" Import dependencies and load data """
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.metrics import precision_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=2)
```

We have now loaded the breast_cancer dataset and instantiated an RFE object (I used a DecisionTreeClassifier, but other algorithms can be used as well).

To see how the training data is handled within a pipeline, let's start with a manual example that shows how a pipeline would work if decomposed into its "basic steps":

```python
from sklearn.model_selection import train_test_split

def test_and_train(X, y, random_state):
    # For simplicity, let's use 80%-20% splitting
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state)

    # Fit and transform the training data by applying Recursive Feature Elimination
    X_train_transformed = rfe.fit_transform(X_train, y_train)
    # Transform the testing data to select the same features
    X_test_transformed = rfe.transform(X_test)

    print(X_train[0:3])
    print(X_train_transformed[0:3])
    print(X_test_transformed[0:3])

    # Train on the transformed training data
    fitted_model = DecisionTreeClassifier().fit(X_train_transformed, y_train)
    # Predict on the transformed testing data
    y_pred = fitted_model.predict(X_test_transformed)

    print('True labels: ', y_test)
    print('Predicted labels:', y_pred)
    return y_test, y_pred

precisions = list()  # to store the precision scores (can be replaced by any other evaluation measure)

y_test, y_pred = test_and_train(X, y, 42)
precisions.append(precision_score(y_test, y_pred))

y_test, y_pred = test_and_train(X, y, 84)
precisions.append(precision_score(y_test, y_pred))

y_test, y_pred = test_and_train(X, y, 168)
precisions.append(precision_score(y_test, y_pred))

print('Average precision:', np.mean(precisions))
"""Average precision: 0.92"""
```

In the above script, we created a function that, given a dataset X and a target variable y:

1. Creates a training and a testing set following the 80%-20% splitting rule.
2. Transforms them using RFE (i.e., selects the best 2 features, as specified in the former code snippet). Calling fit_transform on the RFE runs the Recursive Feature Elimination and saves information about the selected features in the object's state. To find out which features were selected, inspect rfe.support_. Note: on the testing set only transform is executed, so the features flagged in rfe.support_ are the ones kept in the testing set, and all others are filtered out.
3. Fits a model and returns a tuple (y_test, y_pred).

The y_test and y_pred values can be used to analyze the performance of the model, e.g., its precision. The precision is saved in a list, and the procedure is repeated 3 times. Finally, we print the average precision.

In other words, we simulated a cross-validation procedure: we split the original data three times into training and testing sets, fitted a model each time, and computed and averaged its performance (i.e., precision) across the three folds.

This process can be simplified using a RepeatedKFold validation:

```python
from sklearn.model_selection import RepeatedKFold

precisions = list()
rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=1)

for train_index, test_index in rkf.split(X, y):
    # print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Select features on the training fold only, then apply the same selection to the test fold
    X_train_transformed = rfe.fit_transform(X_train, y_train)
    X_test_transformed = rfe.transform(X_test)

    fitted_model = DecisionTreeClassifier().fit(X_train_transformed, y_train)
    y_pred = fitted_model.predict(X_test_transformed)
    precisions.append(precision_score(y_test, y_pred))

print('Average precision:', np.mean(precisions))
"""Average precision: 0.93"""
```

and even further with Pipeline:

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=1)
pipeline = Pipeline(steps=[('s', rfe), ('m', DecisionTreeClassifier())])

precisions = cross_val_score(pipeline, X, y, scoring='precision', cv=rkf)
print('Average precision:', np.mean(precisions))
"""Average precision: 0.93"""
```

In summary, when the original data and the Pipeline are passed to cross_val_score, the cross-validation procedure:

1. splits the data into training and testing data;
2. calls RFE.fit_transform() on the training data;
3. applies RFE.transform() on the testing data so that it consists of the same features;
4. calls estimator.fit() on the transformed training data to fit (i.e., train) a model;
5. calls estimator.predict() on the transformed testing data to predict its labels;
6. compares the predictions with the actual values and saves the performance result (for the metric you passed to the scoring parameter) internally;
7. repeats steps 1-6 for every split in the cross-validation procedure.

At the end of the procedure, you can access the performance results and average them across the folds.
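
To check what each fold actually selected, one option is to keep the fitted pipelines and read the RFE step's support_ mask from each of them. The sketch below is not part of the original answer: it assumes cross_validate with return_estimator=True, a plain 3-fold KFold, and random_state=0 on the trees for repeatability. The variable names (selected_per_fold, tree_input_widths) are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

# Same step names ('s', 'm') as in the answer; random_state is an added assumption
pipeline = Pipeline(steps=[
    ('s', RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=2)),
    ('m', DecisionTreeClassifier(random_state=0)),
])

cv = KFold(n_splits=3, shuffle=True, random_state=1)
# return_estimator=True keeps one fitted copy of the pipeline per fold
results = cross_validate(pipeline, X, y, scoring='precision', cv=cv,
                         return_estimator=True)

selected_per_fold = []
tree_input_widths = []
for fold, fitted in enumerate(results['estimator']):
    mask = fitted.named_steps['s'].support_          # boolean mask over the original columns
    names = list(data.feature_names[mask])           # names of the columns RFE kept
    selected_per_fold.append(names)
    # The final tree was trained on the reduced matrix only
    tree_input_widths.append(fitted.named_steps['m'].n_features_in_)
    print(f'Fold {fold}: selected={names}')

print('Columns seen by the final tree in each fold:', tree_input_widths)
```

The selected features may differ from fold to fold, because RFE is refitted on each training split. The final tree never receives a list of feature names, only the already-reduced matrix.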

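Since the question is about regression, the same pipeline pattern carries over by swapping in a regressor and a regression metric. This is a minimal sketch under assumptions the original does not state: the diabetes dataset stands in for the asker's data, mean absolute error is an arbitrary choice of metric, and random_state=0 is added for repeatability. The n_features_to_select=5 value mirrors the five features mentioned above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset (assumption; replace with your own X, y)
X, y = load_diabetes(return_X_y=True)

rfe = RFE(estimator=DecisionTreeRegressor(random_state=0), n_features_to_select=5)
pipeline = Pipeline(steps=[('s', rfe), ('m', DecisionTreeRegressor(random_state=0))])

rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=1)
# scikit-learn scorers follow "higher is better", so MAE is reported negated
scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=rkf)

mae_per_fold = -scores  # flip the sign back to a regular MAE
print('Average MAE:', np.mean(mae_per_fold))
```

Because RFE sits inside the pipeline, feature selection is redone on every training split, so the test folds never influence which features are kept.
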
Answer