I’m creating a model with scikit-learn. The pipeline that seems to be working best is:
- mutual_info_classif with a threshold – i.e. only include fields whose mutual information score is above a given threshold.
- PCA
- LogisticRegression
I’d like to do them all using sklearn’s Pipeline object, but I’m not sure how to get the mutual information step in. For the second and third steps I do:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('dim_red', pca),
        ('pred', lr)
    ]
)
But I don’t see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?
Answer
You can implement your own estimator by subclassing BaseEstimator and exposing the mutual information scores as a feature_importances_ attribute. You can then pass it as the estimator to a SelectFromModel instance, whose threshold parameter (the mean of the scores by default) gives you the cutoff you described, and use that inside a Pipeline:
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]
class MutualInfoEstimator(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state
    
    def fit(self, X, y):
        # Expose the scores as `feature_importances_`, which SelectFromModel looks for.
        self.feature_importances_ = mutual_info_classif(X, y, discrete_features=self.discrete_features, 
                                                        n_neighbors=self.n_neighbors, 
                                                        copy=self.copy, random_state=self.random_state)
        # fit must return self to follow the scikit-learn estimator convention.
        return self
    
feat_sel = SelectFromModel(estimator=MutualInfoEstimator(random_state=0))
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
    [
        ('feat_sel', feat_sel),
        ('pca', pca),
        ('pred', lr)
    ]
)
print(pipe)
Pipeline(steps=[('feat_sel',
                 SelectFromModel(estimator=MutualInfoEstimator(random_state=0))),
                ('pca', PCA(random_state=100)),
                ('pred', LogisticRegression(random_state=200))])
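With everything wired up, the pipeline can be fitted and used like any other estimator. A minimal sketch on the toy X and y from above:
# Fitting runs mutual information scoring, feature selection,
# PCA and logistic regression in sequence.
pipe.fit(X, y)
# predict applies the fitted transformations before the classifier.
print(pipe.predict(X))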
Note that, of course, the new estimator should expose the parameters you want to tweak during optimisation; here I simply exposed all of mutual_info_classif’s parameters.
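For example, since Pipeline exposes nested parameters with the step__param convention, you can tune SelectFromModel and the wrapped estimator in one search. A sketch, assuming a slightly larger synthetic dataset (the four-sample toy data above is too small for cross-validation):
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data; substitute your own.
X_big, y_big = make_classification(n_samples=100, n_features=10, random_state=0)

param_grid = {
    # Cutoff applied by SelectFromModel to the mutual information scores.
    'feat_sel__threshold': ['mean', 'median'],
    # n_neighbors of the wrapped MutualInfoEstimator.
    'feat_sel__estimator__n_neighbors': [3, 5],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_big, y_big)
print(search.best_params_)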
Yeah, I do not think there is another way to do it. At least none that I know of!