Is it possible to optimize hyperparameters for optional sklearn pipeline steps?

Question

I am trying to build a pipeline that has some optional steps. I would like to optimize the hyperparameters of those steps as well, so that the search picks the best option between skipping a step and using it with different configurations (in my case SelectFromModel, named sfm). The error I get is 'str' object has no attribute 'set_params', which is understandable.

Accepted Answer

As @Robin pointed out, you can define p_grid_lr as a list of dictionaries. Here is what the GridSearchCV docs say about it:

param_grid: dict or list of dictionaries

"Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings."

    p_grid_lr = [
        {
            # Grid 1: feature selection enabled, tune both the selector and the final estimator
            "clf__max_depth": [10, 50, 100, None],
            "clf__n_estimators": [10, 50, 100, 200, 500, 800],
            "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
            "sfm__estimator__max_depth": [10, 50, 100, None],
            "sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
            "sfm__estimator__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
        },
        {
            # Grid 2: feature selection skipped, tune only the final estimator
            "clf__max_depth": [10, 50, 100, None],
            "clf__n_estimators": [10, 50, 100, 200, 500, 800],
            "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
            "sfm": ['passthrough'],
        },
    ]

A less scalable alternative (for your case) is to list every configuration of the step explicitly, spelling out all of the possible combinations of its parameters:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import SelectFromModel

    p_grid_lr_ = {
        "clf__max_depth": [10, 50, 100, None],
        "clf__n_estimators": [10, 50, 100, 200, 500, 800],
        "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
        "sfm": ['passthrough',
                SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=10, max_features=0.1)),
                SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=50, max_features=0.1)),
                ...]
    }

Also, be aware that to reach the parameters max_depth, n_estimators and max_features of the RandomForestRegressor wrapped inside SelectFromModel, you should write the keys as

    "sfm__estimator__max_depth": [10, 50, 100, None],
    "sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
    "sfm__estimator__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2']

rather than as

    "sfm__max_depth": [10, 50, 100, None],
    "sfm__n_estimators": [10, 50, 100, 200, 500, 800],
    "sfm__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2']

because these parameters belong to the wrapped estimator, not to SelectFromModel itself. (max_features is also a parameter of SelectFromModel in its own right, but in that case, per the docs, it may only take integer values.)

You can list every parameter that is available for tuning with pipeline.get_params().keys() (or estimator.get_params().keys() for any estimator in general).

Finally, the Pipelines section of the user guide is a good read on this topic.
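The list-of-dictionaries approach can be tried end to end with a small sketch like the one below. It is only an illustration under assumptions: the data is synthetic (make_regression), the grids are shrunk so the search runs quickly, and the step names "sfm" and "clf" are taken from the parameter keys in the answer, since the original pipeline definition is not shown.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data, assumed here only to make the sketch runnable.
X, y = make_regression(n_samples=80, n_features=8, n_informative=3, random_state=0)

# Step names "sfm" and "clf" mirror the keys used in the parameter grids.
pipeline = Pipeline([
    ("sfm", SelectFromModel(RandomForestRegressor(n_estimators=10, random_state=1))),
    ("clf", RandomForestRegressor(n_estimators=10, random_state=1)),
])

# Parameters of the forest wrapped by the selector carry the sfm__estimator__ prefix.
sfm_keys = sorted(k for k in pipeline.get_params() if k.startswith("sfm__estimator__"))

param_grid = [
    {   # feature selection enabled: tune the selector's forest and the final model
        "sfm__estimator__max_depth": [2, None],
        "clf__max_depth": [2, None],
    },
    {   # feature selection skipped: the step is replaced by 'passthrough'
        "sfm": ["passthrough"],
        "clf__max_depth": [2, None],
    },
]

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)

# Each dictionary spans its own grid: 2 * 2 candidates plus 2 candidates.
candidates = search.cv_results_["params"]
print(len(candidates))
print(search.best_params_)
```

If best_params_ contains "sfm": "passthrough", the search preferred skipping the selection step; otherwise it reports the selector settings that scored best.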

Answer