I am trying to build a pipeline in which some steps are optional. I would also like to tune the hyperparameters of those steps, so that the search picks the best option between skipping a step and using it with different configurations (in my case SelectFromModel, sfm).
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = RandomForestRegressor(random_state=1)
stdscl = StandardScaler()
sfm = SelectFromModel(RandomForestRegressor(random_state=1))

p_grid_lr = {
    "clf__max_depth": [10, 50, 100, None],
    "clf__n_estimators": [10, 50, 100, 200, 500, 800],
    "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
    "sfm": ['passthrough', sfm],
    "sfm__max_depth": [10, 50, 100, None],
    "sfm__n_estimators": [10, 50, 100, 200, 500, 800],
    "sfm__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
}

pipeline = Pipeline([
    ('scl', stdscl),
    ('sfm', sfm),
    ('clf', clf)
])

# X_train and y_train are the training data (not shown here)
gs_clf = GridSearchCV(estimator=pipeline, param_grid=p_grid_lr,
                      cv=KFold(shuffle=True, n_splits=5, random_state=1),
                      scoring='r2', n_jobs=-1)
gs_clf.fit(X_train, y_train)
clf = gs_clf.best_estimator_
The error I get is 'str' object has no attribute 'set_params', which is understandable. Is there a way to specify which combinations should be tried together? In my case that means 'passthrough' on its own, and sfm with its different hyperparameters.
Thanks!
Answer
As pointed out by @Robin, you can define p_grid_lr as a list of dictionaries. Here is what the GridSearchCV docs say about it:
param_grid: dict or list of dictionaries
Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
p_grid_lr = [
    {
        "clf__max_depth": [10, 50, 100, None],
        "clf__n_estimators": [10, 50, 100, 200, 500, 800],
        "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
        "sfm__estimator__max_depth": [10, 50, 100, None],
        "sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
        "sfm__estimator__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
    },
    {
        "clf__max_depth": [10, 50, 100, None],
        "clf__n_estimators": [10, 50, 100, 200, 500, 800],
        "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
        "sfm": ['passthrough'],
    }
]
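To see the list-of-dictionaries grid work end to end, here is a minimal sketch. The synthetic data from make_regression and the deliberately tiny grids are my own assumptions, chosen only so that it runs in a few seconds; they are not the values from the question.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption, not the asker's data)
X, y = make_regression(n_samples=80, n_features=8, n_informative=3,
                       noise=0.1, random_state=1)

pipeline = Pipeline([
    ("scl", StandardScaler()),
    ("sfm", SelectFromModel(RandomForestRegressor(n_estimators=10, random_state=1))),
    ("clf", RandomForestRegressor(n_estimators=10, random_state=1)),
])

# Each dictionary is expanded on its own, so the sfm__estimator__* keys
# are never combined with sfm='passthrough'
param_grid = [
    {"clf__max_depth": [3, None], "sfm__estimator__max_depth": [3, None]},
    {"clf__max_depth": [3, None], "sfm": ["passthrough"]},
]

search = GridSearchCV(pipeline, param_grid, scoring="r2",
                      cv=KFold(n_splits=3, shuffle=True, random_state=1))
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates)         # 2*2 from the first dict + 2*1 from the second = 6
print(search.best_params_)
```

The candidate count is the sum of the two sub-grids rather than their product, which is exactly why the 'passthrough' case no longer collides with the SelectFromModel hyperparameters.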
A less scalable alternative (for your case) might be the following:
p_grid_lr_ = {
    "clf__max_depth": [10, 50, 100, None],
    "clf__n_estimators": [10, 50, 100, 200, 500, 800],
    "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
    "sfm": [
        'passthrough',
        SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=10, max_features=0.1)),
        SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=50, max_features=0.1)),
        ...
    ]
}
which means spelling out every combination of the SelectFromModel parameters explicitly.
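If you do go for the single-dictionary variant, the candidates do not have to be typed by hand. A rough sketch, assuming you are happy to build them with itertools.product; the shortened value lists and the names depths, n_trees, max_feats, sfm_candidates and p_grid_single are mine, for illustration only:

```python
from itertools import product

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Shortened value lists (assumption) to keep the example small
depths = [10, None]
n_trees = [10, 50]
max_feats = [0.1, "sqrt"]

# 'passthrough' plus one pre-configured SelectFromModel per combination
sfm_candidates = ["passthrough"] + [
    SelectFromModel(
        RandomForestRegressor(random_state=1, max_depth=d,
                              n_estimators=n, max_features=m)
    )
    for d, n, m in product(depths, n_trees, max_feats)
]

p_grid_single = {
    "clf__max_depth": [10, None],
    "sfm": sfm_candidates,
}

print(len(sfm_candidates))  # 1 + 2*2*2 = 9
```

This still grows multiplicatively with every parameter you add, so the list of dictionaries remains the more convenient option.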
Moreover, be aware that to access the parameters max_depth, n_estimators and max_features of the RandomForestRegressor estimator wrapped in SelectFromModel, you should write the keys as
"sfm__estimator__max_depth": [10, 50, 100, None],
"sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
"sfm__estimator__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2']
rather than as
"sfm__max_depth": [10, 50, 100, None],
"sfm__n_estimators": [10, 50, 100, 200, 500, 800],
"sfm__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2']
because these parameters belong to the wrapped estimator itself (max_features could in principle also be a parameter of SelectFromModel, but in that case it may only take integer values, as per the docs).
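A quick way to convince yourself of the key naming is to call set_params on the pipeline directly. This is only a sketch with the same step names as in the question; the variables inner_depth and rejected are mine.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scl", StandardScaler()),
    ("sfm", SelectFromModel(RandomForestRegressor(random_state=1))),
    ("clf", RandomForestRegressor(random_state=1)),
])

# Routed to the RandomForestRegressor wrapped by SelectFromModel
pipeline.set_params(sfm__estimator__max_depth=10)
inner_depth = pipeline.named_steps["sfm"].estimator.max_depth
print(inner_depth)  # 10

# SelectFromModel itself has no max_depth parameter, so this key is rejected
rejected = False
try:
    pipeline.set_params(sfm__max_depth=10)
except ValueError as exc:
    rejected = True
    print("rejected:", exc)
```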
In general, you can list all the parameters that can be optimized via pipeline.get_params().keys() (or estimator.get_params().keys() for any estimator).
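For instance, filtering those keys down to the sfm step shows which names the grid may use. A small sketch, again assuming the step names from the question; sfm_keys is just an illustrative variable:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scl", StandardScaler()),
    ("sfm", SelectFromModel(RandomForestRegressor(random_state=1))),
    ("clf", RandomForestRegressor(random_state=1)),
])

# Keep only the keys that belong to the 'sfm' step
sfm_keys = sorted(k for k in pipeline.get_params() if k.startswith("sfm__"))
for key in sfm_keys:
    print(key)
```

The output contains sfm__estimator__max_depth and sfm__max_features, but no sfm__max_depth, which matches the naming rule above.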
Finally, here's a nice read from the user guide on Pipelines.