I’m creating a model with scikit-learn. The pipeline that seems to be working best is:
- mutual_info_classif with a threshold – i.e. only include fields whose mutual information score is above a given threshold.
- PCA
- LogisticRegression
I’d like to do them all using sklearn’s Pipeline object, but I’m not sure how to get the mutual information step in. For the second and third steps I do:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('dim_red', pca),
        ('pred', lr)
    ]
)
But I don’t see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?
Answer
You can implement your own estimator by subclassing BaseEstimator and exposing the mutual information scores as a feature_importances_ attribute. You can then pass it as the estimator to a SelectFromModel instance, whose threshold parameter (the mean of the scores by default) gives you the cutoff you described, and use that inside a Pipeline:
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]
class MutualInfoEstimator(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state
    
    def fit(self, X, y):
        # Expose the scores as `feature_importances_`, which SelectFromModel looks for.
        self.feature_importances_ = mutual_info_classif(X, y, discrete_features=self.discrete_features, 
                                                        n_neighbors=self.n_neighbors, 
                                                        copy=self.copy, random_state=self.random_state)
        # fit must return self to follow the scikit-learn estimator convention.
        return self
    
feat_sel = SelectFromModel(estimator=MutualInfoEstimator(random_state=0))
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('feat_sel', feat_sel),
        ('pca', pca),
        ('pred', lr)
    ]
)
print(pipe)
Pipeline(steps=[('feat_sel',
                 SelectFromModel(estimator=MutualInfoEstimator(random_state=0))),
                ('pca', PCA(random_state=100)),
                ('pred', LogisticRegression(random_state=200))])
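With everything wired up, the pipeline can be fitted and used like any other estimator. A minimal sketch on the toy X and y from above:
# Fitting runs mutual information scoring, feature selection,
# PCA and logistic regression in sequence.
pipe.fit(X, y)
# predict applies the fitted transformations before the classifier.
print(pipe.predict(X))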
Note that, of course, the new estimator should expose the parameters you want to tweak during optimisation; here I simply exposed all of mutual_info_classif’s parameters.
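For example, since Pipeline exposes nested parameters with the step__param convention, you can tune SelectFromModel and the wrapped estimator in one search. A sketch, assuming a slightly larger synthetic dataset (the four-sample toy data above is too small for cross-validation):
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data; substitute your own.
X_big, y_big = make_classification(n_samples=100, n_features=10, random_state=0)

param_grid = {
    # Cutoff applied by SelectFromModel to the mutual information scores.
    'feat_sel__threshold': ['mean', 'median'],
    # n_neighbors of the wrapped MutualInfoEstimator.
    'feat_sel__estimator__n_neighbors': [3, 5],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_big, y_big)
print(search.best_params_)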
Yeah, I do not think there is another way to do it. At least none that I know of!