I’m using scikit-learn in Python to run some basic machine learning models. Using the built-in GridSearchCV() function, I determined the “best” parameters for several techniques, yet many of these perform worse than the defaults. I include the default parameters as options in the grid, so I’m surprised this would happen.
For example:
from time import time
from sklearn import grid_search
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(verbose=1)
parameters = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
              'min_samples_split': [2, 5, 10, 20],
              'max_depth': [2, 3, 5, 10]}
clf = grid_search.GridSearchCV(gbc, parameters)

t0 = time()
clf.fit(X_crossval, labels)
print "Gridsearch time:", round(time() - t0, 3), "s"
print clf.best_params_
# The output is: {'min_samples_split': 2, 'learning_rate': 0.01, 'max_depth': 2}
These are the same as the defaults, except that max_depth defaults to 3. When I use the grid-search parameters, I get an accuracy of 72%, compared to 78% with the defaults.
One thing I did that I’ll admit is suspicious: I used my entire dataset for the cross-validation. Then, after obtaining the parameters, I reran the model on the same dataset, split 75/25 into training and testing.
Is there a reason my grid search overlooked the “superior” defaults?
Answer
Running cross-validation on your entire dataset for parameter and/or feature selection can definitely cause problems when you test on the same dataset. It looks like that’s at least part of the problem here. Running CV on a subset of your data for parameter optimization, and leaving a holdout set for testing, is good practice.
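As a side note, if you want an estimate of accuracy for the whole tune-then-fit procedure that isn’t biased by parameter selection, one standard pattern (a sketch on the iris data, not code from your post) is nested cross-validation: wrap the GridSearchCV in an outer cross_val_score, so each outer fold is scored on data the inner search never touched:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

iris = load_iris()

# Smaller grid than in the question, just to keep the nested search quick
parameters = {'learning_rate': [0.01, 0.1, 1],
              'max_depth': [2, 3, 5]}

# Inner loop: the grid search picks parameters within each training split
inner = GridSearchCV(GradientBoostingClassifier(), parameters, cv=3)

# Outer loop: scores the tuned model on folds the inner search never saw
scores = cross_val_score(inner, iris.data, iris.target, cv=3)
print(scores.mean())
```

The number printed is an honest estimate of generalization accuracy for the entire pipeline, not for any single parameter setting.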
Assuming you’re using the iris dataset (that’s the dataset used in the example in your comment link), here’s an example of how GridSearchCV parameter optimization is affected by first making a holdout set with train_test_split:
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

iris = datasets.load_iris()
gbc = GradientBoostingClassifier()
parameters = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
              'min_samples_split': [2, 5, 10, 20],
              'max_depth': [2, 3, 5, 10]}

clf = GridSearchCV(gbc, parameters)
clf.fit(iris.data, iris.target)
print(clf.best_params_)
# {'learning_rate': 1, 'max_depth': 2, 'min_samples_split': 2}
Now repeat the grid search using a random training subset:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42)

clf = GridSearchCV(gbc, parameters)
clf.fit(X_train, y_train)
print(clf.best_params_)
# {'learning_rate': 0.01, 'max_depth': 5, 'min_samples_split': 2}
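To finish the holdout workflow, the tuned model can then be scored on the test split the grid search never saw (a self-contained sketch repeating the same split; your exact number will differ on different data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import GradientBoostingClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42)

parameters = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
              'min_samples_split': [2, 5, 10, 20],
              'max_depth': [2, 3, 5, 10]}

clf = GridSearchCV(GradientBoostingClassifier(), parameters)
clf.fit(X_train, y_train)

# GridSearchCV refits the best parameters on all of X_train by default,
# so .score() evaluates that refit model on the untouched test set
print(clf.score(X_test, y_test))
```

This is the number to report, rather than the cross-validation score from the search itself.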
I’m seeing much higher classification accuracy with both of these approaches, which makes me think maybe you’re using different data – but the basic point about performing parameter selection while maintaining a holdout set is demonstrated here. Hope it helps.