I’m creating a model with scikit-learn. The pipeline that seems to be working best is:
- mutual_info_classif with a threshold, i.e. only include features whose mutual information score is above a given threshold.
- PCA
- LogisticRegression
I’d like to do all three using sklearn’s Pipeline object, but I’m not sure how to get the mutual information step in. For the second and third steps I do:
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)

pipe = Pipeline(
    [
        ('dim_red', pca),
        ('pred', lr)
    ]
)
But I don’t see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?
Answer
You can implement your own estimator by subclassing BaseEstimator. Then, you can pass it as the estimator to a SelectFromModel instance, which can be used in a Pipeline:
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

X = [[ 0.87, -1.34,  0.31],
     [-2.79, -0.02, -0.85],
     [-1.34, -0.48, -2.55],
     [ 1.92,  1.48,  0.65]]
y = [0, 1, 0, 1]

class MutualInfoEstimator(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state

    def fit(self, X, y):
        # SelectFromModel looks for a fitted feature_importances_
        # (or coef_) attribute, so store the mutual information
        # scores under that name
        self.feature_importances_ = mutual_info_classif(
            X, y,
            discrete_features=self.discrete_features,
            n_neighbors=self.n_neighbors,
            copy=self.copy,
            random_state=self.random_state,
        )
        return self  # fit must return self, per the sklearn convention
feat_sel = SelectFromModel(estimator=MutualInfoEstimator(random_state=0))
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)

pipe = Pipeline(
    [
        ('feat_sel', feat_sel),
        ('pca', pca),
        ('pred', lr)
    ]
)
print(pipe)
Pipeline(steps=[('feat_sel',
                 SelectFromModel(estimator=MutualInfoEstimator(random_state=0))),
                ('pca', PCA(random_state=100)),
                ('pred', LogisticRegression(random_state=200))])
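One thing worth noting for the thresholding part of the question: SelectFromModel accepts a threshold argument, so you can keep only the features whose mutual information score clears a cut-off of your choosing (by default it uses the mean of feature_importances_). Here is a minimal usage sketch; the make_classification data and the 0.05 cut-off are illustrative assumptions, not part of the original question:

from sklearn.datasets import make_classification

# Illustrative synthetic data (the toy X above is too small for a
# meaningful k-NN based mutual information estimate)
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=4, random_state=0)

# Keep only features whose mutual information score is above 0.05;
# without an explicit threshold, SelectFromModel uses the mean score
feat_sel_demo = SelectFromModel(estimator=MutualInfoEstimator(random_state=0),
                                threshold=0.05)
pipe_demo = Pipeline(
    [
        ('feat_sel', feat_sel_demo),
        ('pca', PCA(random_state=100)),
        ('pred', LogisticRegression(random_state=200))
    ]
)

pipe_demo.fit(X_demo, y_demo)
print(pipe_demo.score(X_demo, y_demo))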
Note that, of course, the new estimator should expose the parameters you want to tweak during optimisation. Here I just exposed all of them, which makes them reachable through the usual nested pipeline parameter names.
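For example, here is a minimal grid-search sketch over the exposed parameters; the grid values are arbitrary illustrations, and pipe_demo, X_demo, and y_demo come from the sketch above:

from sklearn.model_selection import GridSearchCV

# Nested pipeline parameters are addressed as
# <step name>__<parameter name>, drilling into the wrapped estimator
param_grid = {
    'feat_sel__estimator__n_neighbors': [3, 5, 7],
    'feat_sel__threshold': ['mean', 0.05],
    'pca__n_components': [1, 2],
}

search = GridSearchCV(pipe_demo, param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)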
Yeah, I do not think there is another way to do it, at least not one that I know of!