I’m using scikit-learn in Python to run some basic machine learning models. Using the built-in GridSearchCV() function, I determined the “best” parameters for several techniques, yet many of these perform worse than the defaults. I include the default parameters as options in the grid, so I’m surprised this would happen.
For example:
from time import time
from sklearn import grid_search
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(verbose=1)
parameters = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
              'min_samples_split': [2, 5, 10, 20],
              'max_depth': [2, 3, 5, 10]}
clf = grid_search.GridSearchCV(gbc, parameters)

t0 = time()
clf.fit(X_crossval, labels)
print "Gridsearch time:", round(time() - t0, 3), "s"
print clf.best_params_
# The output is: {'min_samples_split': 2, 'learning_rate': 0.01, 'max_depth': 2}
These are the same as the defaults, except that max_depth defaults to 3. When I use the grid-search parameters, I get an accuracy of 72%, compared to 78% with the defaults.
One thing I did that I’ll admit is suspicious: I used my entire dataset for the cross-validation. Then, after obtaining the parameters, I reran the model on the same dataset, split 75/25 into training and testing.
Is there a reason my grid search overlooked the “superior” defaults?
Answer
Running cross-validation on your entire dataset for parameter and/or feature selection can definitely cause problems when you test on the same dataset. It looks like that’s at least part of the problem here. Running CV on a subset of your data for parameter optimization, and leaving a holdout set for testing, is good practice.
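As a side note, if you want an estimate of accuracy for the whole tune-then-fit procedure that isn’t biased by parameter selection, one standard pattern (a sketch on the iris data, not code from your post) is nested cross-validation: wrap the GridSearchCV in an outer cross_val_score, so each outer fold is scored on data the inner search never touched:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

iris = load_iris()

# Smaller grid than in the question, just to keep the nested search quick
parameters = {'learning_rate': [0.01, 0.1, 1],
              'max_depth': [2, 3, 5]}

# Inner loop: the grid search picks parameters within each training split
inner = GridSearchCV(GradientBoostingClassifier(), parameters, cv=3)

# Outer loop: scores the tuned model on folds the inner search never saw
scores = cross_val_score(inner, iris.data, iris.target, cv=3)
print(scores.mean())
```

The number printed is an honest estimate of generalization accuracy for the entire pipeline, not for any single parameter setting.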
Assuming you’re using the iris dataset (that’s the dataset used in the example in your comment link), here’s an example of how GridSearchCV parameter optimization is affected by first making a holdout set with train_test_split:
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

iris = datasets.load_iris()
gbc = GradientBoostingClassifier()
parameters = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
              'min_samples_split': [2, 5, 10, 20],
              'max_depth': [2, 3, 5, 10]}

clf = GridSearchCV(gbc, parameters)
clf.fit(iris.data, iris.target)
print(clf.best_params_)
# {'learning_rate': 1, 'max_depth': 2, 'min_samples_split': 2}
Now repeat the grid search using a random training subset:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42)

clf = GridSearchCV(gbc, parameters)
clf.fit(X_train, y_train)
print(clf.best_params_)
# {'learning_rate': 0.01, 'max_depth': 5, 'min_samples_split': 2}
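To finish the holdout workflow, the tuned model can then be scored on the test split the grid search never saw (a self-contained sketch repeating the same split; your exact number will differ on different data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import GradientBoostingClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42)

parameters = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
              'min_samples_split': [2, 5, 10, 20],
              'max_depth': [2, 3, 5, 10]}

clf = GridSearchCV(GradientBoostingClassifier(), parameters)
clf.fit(X_train, y_train)

# GridSearchCV refits the best parameters on all of X_train by default,
# so .score() evaluates that refit model on the untouched test set
print(clf.score(X_test, y_test))
```

This is the number to report, rather than the cross-validation score from the search itself.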
I’m seeing much higher classification accuracy with both of these approaches, which makes me think maybe you’re using different data – but the basic point about performing parameter selection while maintaining a holdout set is demonstrated here. Hope it helps.