I’m trying to figure out how to use RFE for regression problems, and I was reading some tutorials.
I found an example on how to use RFECV to automatically select the ideal number of features, and it goes something like:
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold
    from sklearn.feature_selection import RFECV

    rfecv = RFECV(estimator=RandomForestClassifier(random_state=101),
                  step=1,
                  cv=StratifiedKFold(10),
                  scoring='accuracy')
    rfecv.fit(X, target)
    # indices of the features that were eliminated
    print(np.where(~rfecv.support_)[0])
which I find pretty straightforward.
However, I was checking how to do the same thing using an RFE object, and the only solutions I found that include cross-validation involve pipelines, like:
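    from numpy import mean
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import RepeatedKFold, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # create pipeline
    rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=5)
    model = DecisionTreeRegressor()
    pipeline = Pipeline(steps=[('s', rfe), ('m', model)])
    # evaluate model
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error',
                               cv=cv, n_jobs=-1, error_score='raise')
    # report performance (the scores are negated MAE values, so the mean is negative)
    print(f'MAE: {mean(n_scores):.3f}')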
I’m not sure about what precisely is happening here. The pipeline is used to queue the RFE algorithm and the second DecisionTreeRegressor (model). If I’m not wrong, the idea is that for every iteration in the cross-validation, the RFE is executed, the desired number of best features is selected, and then the second model is run using only those features. But how/when did the RFE pass the information about which features have been selected to the DecisionTreeRegressor? Did it even happen, or is the code missing this part?
Answer
Well, first, let’s point out that RFECV and RFE are doing two separate jobs in your scripts: the former selects the optimal number of features, while the latter selects the five most important features (or, the best combination of 5 features, given their importance for the DecisionTreeRegressor).
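To make that difference concrete for a regression problem, here is a minimal sketch. The synthetic dataset, the `KFold` settings, and the `random_state` values are assumptions chosen for illustration (`KFold` is used because `StratifiedKFold` only supports classification targets):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, RFECV
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Illustrative data: 10 features, 5 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=5, random_state=1)

# RFECV: the number of features to keep is chosen by cross-validation
rfecv = RFECV(
    estimator=DecisionTreeRegressor(random_state=1),
    step=1,
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="neg_mean_absolute_error",
)
rfecv.fit(X, y)
print("RFECV kept", rfecv.n_features_, "features:", rfecv.get_support(indices=True))

# RFE: the number of features to keep is fixed by the caller
rfe = RFE(estimator=DecisionTreeRegressor(random_state=1), n_features_to_select=5)
rfe.fit(X, y)
print("RFE kept", rfe.n_features_, "features:", rfe.get_support(indices=True))
```

In both cases the selected columns are available through `support_` or `get_support()` after fitting.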
Back to your question: “When did the RFE pass the information about which features have been selected to the Decision Tree?” It is worth noting that the RFE does not explicitly tell the Decision Tree which features are selected. It simply takes a matrix as input (the training set) and transforms it into a matrix of N columns, based on the n_features_to_select=N parameter.
That matrix (i.e., the transformed training set) is passed as input to the Decision Tree, along with the target variable, and the Decision Tree returns a fitted model that can be used to predict unseen instances.
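As a quick sketch of that hand-off (the synthetic regression data and the `random_state` values are assumptions for illustration), the downstream model never learns which columns were picked; it just receives a narrower matrix:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, n_informative=5, random_state=1)

# RFE acts as a transformer: 10 columns in, 5 columns out
rfe = RFE(estimator=DecisionTreeRegressor(random_state=1), n_features_to_select=5)
X_reduced = rfe.fit_transform(X, y)

print(X.shape)          # (200, 10)
print(X_reduced.shape)  # (200, 5)

# The second model is trained on the reduced matrix only
model = DecisionTreeRegressor(random_state=1).fit(X_reduced, y)
print(model.n_features_in_)  # 5
```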
Let’s dive into an example for classification:

    """ Import dependencies and load data """
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE
    from sklearn.metrics import precision_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=2)
We have now loaded the breast_cancer dataset and instantiated an RFE object (I used a DecisionTreeClassifier, but other algorithms can be used as well).
To see how the training data is handled within a pipeline, let’s start with a manual example that shows how a pipeline would work if decomposed into its “basic steps”:
    from sklearn.model_selection import train_test_split

    def test_and_train(X, y, random_state):
        # For simplicity, let's use 80%-20% splitting
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=random_state)

        # Fit and transform the training data by applying Recursive Feature Elimination
        X_train_transformed = rfe.fit_transform(X_train, y_train)
        # Transform the testing data to select the same features
        X_test_transformed = rfe.transform(X_test)

        print(X_train[0:3])
        print(X_train_transformed[0:3])
        print(X_test_transformed[0:3])

        # Train on the transformed training data
        fitted_model = DecisionTreeClassifier().fit(X_train_transformed, y_train)
        # Predict on the transformed testing data
        y_pred = fitted_model.predict(X_test_transformed)

        print('True labels: ', y_test)
        print('Predicted labels:', y_pred)
        return y_test, y_pred

    precisions = list()  # to store the precision scores (can be replaced by any other evaluation measure)

    y_test, y_pred = test_and_train(X, y, 42)
    precisions.append(precision_score(y_test, y_pred))

    y_test, y_pred = test_and_train(X, y, 84)
    precisions.append(precision_score(y_test, y_pred))

    y_test, y_pred = test_and_train(X, y, 168)
    precisions.append(precision_score(y_test, y_pred))

    print('Average precision:', np.mean(precisions))
    """ Average precision: 0.92 """
In the above script, we created a function that, given a dataset X and a target variable y:
- Creates a training and testing set following the 80%-20% splitting rule.
- Transforms them using RFE (i.e., selects the best 2 features, as specified in the former code snippet). Calling fit_transform on the RFE runs the Recursive Feature Elimination and saves information about the selected features in the object's state. To know which features were selected, inspect rfe.support_. Note: on the testing set only transform is executed, so that the features flagged in rfe.support_ are kept and all other features are filtered out of the testing set.
- Fits a model and returns a tuple (y_test, y_pred).
The y_test and y_pred can be used to analyze the performance of the model, e.g., its precision.
The precision is saved in a list, and the procedure is repeated 3 times.
Finally, we print the average precision.
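The role of rfe.support_ described above can be checked directly. The following is a small sketch (the `random_state` values are assumptions added for reproducibility) showing that calling transform on the testing set is the same as keeping only the columns flagged in rfe.support_:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Feature selection is learned on the training data only
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=2)
rfe.fit(X_train, y_train)

# Boolean mask and integer indices of the selected columns
print("Mask:", rfe.support_)
print("Selected column indices:", rfe.get_support(indices=True))

# transform() on unseen data just applies the stored mask
same_as_masking = np.array_equal(rfe.transform(X_test), X_test[:, rfe.support_])
print("transform == column masking:", same_as_masking)
```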
We simulated a cross-validation procedure by splitting the original data 3 times into training and testing sets, fitting a model on each split, and computing and averaging its performance (i.e., precision) across the three splits. This process can be simplified using a RepeatedKFold validation:
    from sklearn.model_selection import RepeatedKFold

    precisions = list()
    rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=1)

    for train_index, test_index in rkf.split(X, y):
        # print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        X_train_transformed = rfe.fit_transform(X_train, y_train)
        X_test_transformed = rfe.transform(X_test)

        fitted_model = DecisionTreeClassifier().fit(X_train_transformed, y_train)
        y_pred = fitted_model.predict(X_test_transformed)

        precisions.append(precision_score(y_test, y_pred))

    print('Average precision:', np.mean(precisions))
    """ Average precision: 0.93 """
and even further with Pipeline:
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import cross_val_score

    rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=1)
    pipeline = Pipeline(steps=[('s', rfe), ('m', DecisionTreeClassifier())])

    precisions = cross_val_score(pipeline, X, y, scoring='precision', cv=rkf)

    print('Average precision:', np.mean(precisions))
    """ Average precision: 0.93 """
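To check that the manual loop and the Pipeline version really do the same work, the sketch below runs both on identical splits and compares the per-split scores. Fixing `random_state=0` on every tree and the helper `make_rfe` are assumptions added here so the two runs are deterministic and comparable; they are not part of the scripts above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.metrics import precision_score
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=1)

def make_rfe():
    # Hypothetical helper: a fresh, seeded RFE for each use
    return RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=2)

# Manual version: select features, then train and predict, split by split
manual_scores = []
for train_index, test_index in rkf.split(X, y):
    rfe = make_rfe()
    X_train_t = rfe.fit_transform(X[train_index], y[train_index])
    X_test_t = rfe.transform(X[test_index])
    model = DecisionTreeClassifier(random_state=0).fit(X_train_t, y[train_index])
    manual_scores.append(precision_score(y[test_index], model.predict(X_test_t)))

# Pipeline version: the same two steps, driven by cross_val_score
pipeline = Pipeline(steps=[("s", make_rfe()), ("m", DecisionTreeClassifier(random_state=0))])
pipeline_scores = cross_val_score(pipeline, X, y, scoring="precision", cv=rkf)

print("Manual:  ", np.round(manual_scores, 3))
print("Pipeline:", np.round(pipeline_scores, 3))
print("Same scores:", np.allclose(manual_scores, pipeline_scores))
```

Without the fixed seeds the two versions would still follow the same procedure, but the scores could differ slightly because of the randomness inside the decision trees.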
In summary, when the pipeline and the original data are passed to cross_val_score, the following happens:
1. the data is split into training and testing data;
2. RFE.fit_transform() is called on the training data;
3. RFE.transform() is applied to the testing data so that it consists of the same features;
4. estimator.fit() is called on the transformed training data to fit (i.e., train) a model;
5. estimator.predict() is called on the transformed testing data to predict it;
6. the predictions are compared with the actual values and the performance result (the one you passed to the scoring parameter) is saved internally;
7. steps 1-6 are repeated for every split in the cross-validation procedure.
At the end of the procedure, you can access the performance results and average them across the folds.
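If you also want to see which features the RFE step picked in each split, one option is cross_validate with return_estimator=True, which keeps the fitted pipeline of every split. This is a minimal sketch; the `random_state` values are assumptions for reproducibility, and the step names 's' and 'm' follow the pipeline above:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline(steps=[
    ("s", RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=2)),
    ("m", DecisionTreeClassifier(random_state=0)),
])
rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=1)

# return_estimator=True keeps one fitted pipeline per split
results = cross_validate(pipeline, X, y, scoring="precision", cv=rkf,
                         return_estimator=True)

print("Precision per split:", results["test_score"])
print("Average precision:", np.mean(results["test_score"]))

# Each fitted pipeline exposes its own RFE step and the features it selected
for i, fitted in enumerate(results["estimator"]):
    selected = fitted.named_steps["s"].get_support(indices=True)
    print(f"split {i}: selected columns {selected}")
```

The selected columns can differ from split to split, since the RFE is refitted on each training set.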