Is there a way to use mutual information as part of a pipeline in scikit learn?

I’m creating a model with scikit-learn. The pipeline that seems to be working best is:

  1. mutual_info_classif with a threshold – i.e. only include fields whose mutual information score is above a given threshold.
  2. PCA
  3. LogisticRegression

I’d like to do them all using sklearn’s pipeline object, but I’m not sure how to get the mutual info classification in. For the second and third steps I do:

pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('dim_red', pca),
        ('pred', lr)
    ]
)

But I don’t see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?

Answer

You can implement your own estimator by subclassing BaseEstimator and then pass it as the estimator to a SelectFromModel instance, which in turn can be used in a Pipeline:

from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]


class MutualInfoEstimator(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state

    def fit(self, X, y):
        # Compute the mutual information scores and expose them as
        # feature_importances_, which is what SelectFromModel reads.
        self.feature_importances_ = mutual_info_classif(
            X, y,
            discrete_features=self.discrete_features,
            n_neighbors=self.n_neighbors,
            copy=self.copy,
            random_state=self.random_state,
        )
        return self

feat_sel = SelectFromModel(estimator=MutualInfoEstimator(random_state=0))
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)

pipe = Pipeline(
    [
        ('feat_sel', feat_sel),
        ('pca', pca),
        ('pred', lr)
    ]
)

print(pipe)
Pipeline(steps=[('feat_sel',
                 SelectFromModel(estimator=MutualInfoEstimator(random_state=0))),
                ('pca', PCA(random_state=100)),
                ('pred', LogisticRegression(random_state=200))])
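To check that the selection step actually plugs in, the pipeline can be fitted and used end to end on the toy data. A minimal sketch, assuming the `MutualInfoEstimator` defined above; the `threshold=0.0` cutoff is a hypothetical stand-in for the fixed threshold from the question (by default, SelectFromModel keeps features whose score is at or above the mean of `feature_importances_`):

```python
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

# Same toy data as above.
X = [[ 0.87, -1.34,  0.31],
     [-2.79, -0.02, -0.85],
     [-1.34, -0.48, -2.55],
     [ 1.92,  1.48,  0.65]]
y = [0, 1, 0, 1]

class MutualInfoEstimator(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state

    def fit(self, X, y):
        # Expose the mutual information scores under the attribute name
        # that SelectFromModel looks for.
        self.feature_importances_ = mutual_info_classif(
            X, y, discrete_features=self.discrete_features,
            n_neighbors=self.n_neighbors, copy=self.copy,
            random_state=self.random_state)
        return self

# threshold=0.0 is a hypothetical cutoff for illustration only.
pipe = Pipeline([
    ('feat_sel', SelectFromModel(MutualInfoEstimator(random_state=0), threshold=0.0)),
    ('pca', PCA(random_state=100)),
    ('pred', LogisticRegression(random_state=200)),
])

pipe.fit(X, y)
preds = pipe.predict(X)
```

SelectFromModel clones and fits the inner estimator during `pipe.fit`, then keeps the columns whose score clears the threshold before handing the reduced matrix on to PCA.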

Note that the new estimator should expose any parameters you want to tune during optimisation; here I simply exposed all of them.
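Because the parameters are exposed, both the SelectFromModel threshold and the inner estimator's settings become tunable through the usual double-underscore naming. A sketch with made-up grid values, on a synthetic dataset (the four-row toy example is too small for cross-validation); it assumes the `MutualInfoEstimator` from the answer:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

class MutualInfoEstimator(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state

    def fit(self, X, y):
        self.feature_importances_ = mutual_info_classif(
            X, y, discrete_features=self.discrete_features,
            n_neighbors=self.n_neighbors, copy=self.copy,
            random_state=self.random_state)
        return self

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=100, n_features=6, random_state=0)

pipe = Pipeline([
    ('feat_sel', SelectFromModel(MutualInfoEstimator(random_state=0))),
    ('pca', PCA(random_state=100)),
    ('pred', LogisticRegression(random_state=200)),
])

# Hypothetical grid: 'feat_sel__threshold' reaches SelectFromModel, while
# 'feat_sel__estimator__n_neighbors' reaches the MutualInfoEstimator inside it.
param_grid = {
    'feat_sel__threshold': ['mean', '0.5*mean'],
    'feat_sel__estimator__n_neighbors': [3, 5],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
```

String thresholds such as `'mean'` and scaled variants like `'0.5*mean'` are interpreted by SelectFromModel relative to the fitted importance scores.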

Yeah, I do not think there is another way to do it within scikit-learn, at least not one that I know of!
