I want to run a logistic regression using GridSearchCV, but I want to compare the performance with and without scaling and PCA, so I don't want to apply them in all cases.
I would basically like to include PCA and scaling as "parameters" of the GridSearchCV.
I am aware I can make a pipeline like this:
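    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    mnl = LogisticRegression(fit_intercept=True, multi_class="multinomial")
    pipe = Pipeline([
        ('scale', StandardScaler()),
        ('mnl', mnl)])
    params_mnl = {'mnl__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                  'mnl__max_iter': [500, 1000, 2000, 3000]}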
The thing is that, in this case, the scaling would be applied in all folds, right? Is there a way to make it so it’s “included” in the gridsearch?
EDIT:
I just read this answer and even though it’s similar to what I want, it’s not really it, because in that case the Scaler will be applied to the best estimator out of the GridSearch.
What I want to do is, for example, let’s say
params_mnl = {'mnl__solver': ['newton-cg', 'lbfgs']}
I want to run the regression with Scaler+newton-cg, No Scaler+newton-cg, Scaler+lbfgs, No Scaler+lbfgs.
Answer
You can set the parameters with_mean and with_std of StandardScaler() to False to represent no standardization. In GridSearchCV, the parameter param_grid can then be set up as
    param_grid = [
        {'scale__with_mean': [False],
         'scale__with_std': [False],
         'mnl__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
         'mnl__max_iter': [500, 1000, 2000, 3000]},
        {'mnl__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
         'mnl__max_iter': [500, 1000, 2000, 3000]}
    ]
The first dict in the list is then "No Scaler+mnl", and the second, which leaves with_mean and with_std at their default of True, is "Scaler+mnl".
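To make the idea concrete, here is a minimal end-to-end sketch of the four combinations from the question (Scaler/No Scaler crossed with newton-cg/lbfgs). The iris dataset, cv=3, and max_iter=2000 are assumptions chosen only so the example runs quickly; they are not part of the original question.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: iris stands in for the asker's data.
X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mnl", LogisticRegression(max_iter=2000)),
])

solvers = ["newton-cg", "lbfgs"]
param_grid = [
    # Scaler switched off: no centering and no variance scaling.
    {"scale__with_mean": [False], "scale__with_std": [False], "mnl__solver": solvers},
    # Scaler left at its defaults (centering and scaling both on).
    {"mnl__solver": solvers},
]

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

# One row per candidate: 2 scaler settings x 2 solvers.
results = zip(search.cv_results_["params"], search.cv_results_["mean_test_score"])
for params, score in results:
    scaler = "on" if params.get("scale__with_std", True) else "off"
    print(f"scaler={scaler}  solver={params['mnl__solver']}  accuracy={score:.3f}")
```

Because the scaler sits inside the pipeline, it is refit on the training part of each fold for the candidates where it is switched on.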
Ref:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
Edit: I think it gets complicated if you also consider turning PCA on and off. Maybe you need to define a customised PCA that derives from the original PCA, with an additional boolean argument that determines whether PCA should be executed or not:
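    from sklearn.decomposition import PCA

    class MYPCA(PCA):
        # PCA that can be switched off through a grid-searchable boolean.
        # Parameters are explicit keywords so that get_params/clone work.
        def __init__(self, PCA_turn_on=True, n_components=None):
            super().__init__(n_components=n_components)
            self.PCA_turn_on = PCA_turn_on

        def fit(self, X, y=None):
            # Fit only when enabled; always return self, as scikit-learn expects.
            return super().fit(X, y) if self.PCA_turn_on else self

        def transform(self, X):
            # Pass the data through unchanged when PCA is switched off.
            return super().transform(X) if self.PCA_turn_on else X

        def fit_transform(self, X, y=None):
            return self.fit(X, y).transform(X)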
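As an alternative to subclassing PCA, recent scikit-learn versions let a whole pipeline step be treated as a grid parameter, and the string 'passthrough' skips that step. This is a swapped-in technique, not the approach from the answer above; the sketch below assumes a scikit-learn version that supports 'passthrough', and the iris data, n_components=2, and cv=3 are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: iris stands in for the asker's data.
X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("mnl", LogisticRegression(max_iter=2000)),
])

# Each step name is itself a parameter: an estimator turns the step on,
# the string "passthrough" turns it off.
param_grid = {
    "scale": [StandardScaler(), "passthrough"],
    "pca": [PCA(n_components=2), "passthrough"],
    "mnl__solver": ["newton-cg", "lbfgs"],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

# 2 scaler settings x 2 PCA settings x 2 solvers = 8 candidates.
results = zip(search.cv_results_["params"], search.cv_results_["mean_test_score"])
for params, score in results:
    scale = "off" if params["scale"] == "passthrough" else "on"
    pca = "off" if params["pca"] == "passthrough" else "on"
    print(f"scale={scale}  pca={pca}  solver={params['mnl__solver']}  accuracy={score:.3f}")
```

With this layout, every on/off combination of scaling and PCA is scored under the same cross-validation splits, so the rows of cv_results_ can be compared directly.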