
Is there a way to use mutual information as part of a pipeline in scikit-learn?

I’m creating a model with scikit-learn. The pipeline that seems to be working best is:

  1. mutual_info_classif with a threshold – i.e. only include fields whose mutual information score is above a given threshold.
  2. PCA
  3. LogisticRegression

I’d like to do them all using sklearn’s Pipeline object, but I’m not sure how to get the mutual information selection step in. For the second and third steps I do:

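The snippet itself is missing here. A minimal sketch of such a two-step pipeline might look like the following; the step names `pca` and `clf`, the hyperparameters, and the iris data are assumptions for illustration, not the asker's actual code:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Example data only; the original question does not say which dataset was used.
X, y = load_iris(return_X_y=True)

# Steps 2 and 3 from the list above: PCA followed by logistic regression.
pipe = Pipeline([
    ("pca", PCA(n_components=2)),               # n_components is an assumed value
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
predictions = pipe.predict(X)
```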

But I don’t see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?


Answer

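You can implement your own estimator by subclassing BaseEstimator and storing the mutual information scores in a feature_importances_ attribute when it is fitted, since that is one of the attributes SelectFromModel reads. Then you can pass it as the estimator to a SelectFromModel instance, which applies the threshold and can be used in a Pipeline: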

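The original code blocks did not survive, so what follows is only a sketch of the approach described above. The class name `MutualInfoEstimator`, the step names, the threshold of 0.3, and the iris data are all hypothetical choices; only `BaseEstimator`, `SelectFromModel`, `mutual_info_classif`, `PCA`, `LogisticRegression`, and `Pipeline` are actual scikit-learn names:

```python
from sklearn.base import BaseEstimator
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class MutualInfoEstimator(BaseEstimator):
    """Hypothetical wrapper: exposes mutual information scores as feature importances."""

    def __init__(self, discrete_features="auto", n_neighbors=3, copy=True,
                 random_state=None):
        # Constructor arguments mirror those of mutual_info_classif so that
        # they can be tuned through get_params / set_params.
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state

    def fit(self, X, y):
        # SelectFromModel looks for coef_ or feature_importances_ after fitting.
        self.feature_importances_ = mutual_info_classif(
            X, y,
            discrete_features=self.discrete_features,
            n_neighbors=self.n_neighbors,
            copy=self.copy,
            random_state=self.random_state,
        )
        return self


X, y = load_iris(return_X_y=True)  # example data only

pipe = Pipeline([
    # Keep only features whose mutual information score is >= the threshold.
    ("select", SelectFromModel(MutualInfoEstimator(random_state=0), threshold=0.3)),
    # A float n_components keeps enough components to explain that share of variance,
    # so it works regardless of how many features survive the selection step.
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

n_selected = int(pipe.named_steps["select"].get_support().sum())
print("features kept by the mutual information step:", n_selected)
```

One caveat with this design: if the threshold is set so high that no feature passes it, the selection step returns an empty array and the later steps will fail, so the threshold needs to be chosen with the scale of the scores in mind.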

Note that the new estimator should expose, as constructor arguments, every parameter you want to tune during optimisation. Here I just exposed all of them.
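To illustrate why exposing the parameters matters, here is a sketch of tuning them with a grid search. It assumes a hypothetical wrapper class named `MutualInfoEstimator` and step names `select`, `pca`, and `clf`; the grid values are arbitrary. Nested parameters are addressed with scikit-learn's usual double-underscore convention:

```python
from sklearn.base import BaseEstimator
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class MutualInfoEstimator(BaseEstimator):
    """Hypothetical wrapper around mutual_info_classif (a subset of its arguments)."""

    def __init__(self, n_neighbors=3, random_state=None):
        self.n_neighbors = n_neighbors
        self.random_state = random_state

    def fit(self, X, y):
        self.feature_importances_ = mutual_info_classif(
            X, y, n_neighbors=self.n_neighbors, random_state=self.random_state
        )
        return self


X, y = load_iris(return_X_y=True)  # example data only

pipe = Pipeline([
    ("select", SelectFromModel(MutualInfoEstimator(random_state=0), threshold=0.3)),
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# <step>__<param> reaches a step's own parameter;
# <step>__estimator__<param> reaches the wrapped estimator's parameter.
param_grid = {
    "select__threshold": [0.1, 0.3],
    "select__estimator__n_neighbors": [3, 5],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Any argument that is not listed in the wrapper's `__init__` would not be reachable this way, which is the reason for exposing everything you intend to tune.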

I don't think there is another way to do it, at least not that I know of.
