I’m creating a model with scikit-learn. The pipeline that seems to be working best is:
- mutual_info_classif with a threshold – i.e. only include fields whose mutual information score is above a given threshold (roughly sketched just after this list).
- PCA
- LogisticRegression
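Standalone, the first step looks roughly like this (the random data and the 0.01 cut-off are just placeholders for my real inputs):

import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Stand-in data; my real X and y are larger.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = rng.randint(0, 2, size=100)

# Mutual information score for each column.
mi_scores = mutual_info_classif(X, y, random_state=0)

# Keep only the columns whose score is above the threshold.
threshold = 0.01
X_selected = X[:, mi_scores > threshold]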
I’d like to do them all using sklearn’s pipeline object, but I’m not sure how to get the mutual info classification in. For the second and third steps I do:
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)

pipe = Pipeline([
    ('dim_red', pca),
    ('pred', lr)
])
But I don’t see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?
Answer
You can implement your own estimator by subclassing BaseEstimator. Then, you can pass it as the estimator to a SelectFromModel instance, which can be used in a Pipeline:
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]

class MutualInfoEstimator(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state

    def fit(self, X, y):
        # SelectFromModel reads feature_importances_ to decide which columns to keep.
        self.feature_importances_ = mutual_info_classif(
            X, y,
            discrete_features=self.discrete_features,
            n_neighbors=self.n_neighbors,
            copy=self.copy,
            random_state=self.random_state)
        return self  # fit must return self for the estimator to work inside a Pipeline

feat_sel = SelectFromModel(estimator=MutualInfoEstimator(random_state=0))
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)

pipe = Pipeline([
    ('feat_sel', feat_sel),
    ('pca', pca),
    ('pred', lr)
])
print(pipe)
Pipeline(steps=[('feat_sel', SelectFromModel(estimator=MutualInfoEstimator(random_state=0))), ('pca', PCA(random_state=100)), ('pred', LogisticRegression(random_state=200))])
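Once fit returns the estimator itself, the pipeline can be fitted and used like any other. SelectFromModel also takes a threshold argument, which maps directly onto the cut-off from your question. A rough usage sketch, continuing from the snippet above (the synthetic data and the 0.01 cut-off are only illustrative):

from sklearn.datasets import make_classification

# Synthetic data just for demonstration; 3 of the 6 features carry signal.
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, random_state=0)

# An explicit cut-off on the mutual information scores, as in the question.
feat_sel = SelectFromModel(estimator=MutualInfoEstimator(random_state=0),
                           threshold=0.01)

pipe = Pipeline([
    ('feat_sel', feat_sel),
    ('pca', PCA(random_state=100)),
    ('pred', LogisticRegression(random_state=200))
])

pipe.fit(X_demo, y_demo)
print(pipe.named_steps['feat_sel'].get_support())  # which columns survived the cut-off
print(pipe.score(X_demo, y_demo))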
Note that of course the new estimator should expose the parameters you want to tweak during optimisation. Here I just exposed all of them.
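Because those parameters are set in __init__, they can be reached through the usual double-underscore naming during tuning. For instance, a minimal GridSearchCV sketch continuing the example above (the grid values are arbitrary):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'feat_sel__estimator__n_neighbors': [3, 5],   # forwarded to mutual_info_classif
    'feat_sel__threshold': [0.01, 'mean'],        # SelectFromModel's own cut-off
    'pca__n_components': [1, 2],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)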
Yeah, I do not think there is another way to do it, at least none that I know of!