
Cross validation with grid search returns worse results than default

I’m using scikit-learn in Python to run some basic machine learning models. Using the built-in GridSearchCV() function, I determined the “best” parameters for several techniques, yet many of them perform worse than the defaults. I included the default parameters as options in the grid, so I’m surprised this would happen.

For example:

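The original snippet isn’t shown; a hypothetical reconstruction of what the selected parameters might look like, assuming a DecisionTreeClassifier (the actual estimator isn’t stated):

```python
# Hypothetical illustration -- the asker's actual snippet is not shown.
# A "best" parameter set as GridSearchCV might report it, matching the
# DecisionTreeClassifier defaults except for max_depth:
from sklearn.tree import DecisionTreeClassifier

best_model = DecisionTreeClassifier(
    criterion="gini",      # default
    splitter="best",       # default
    max_depth=3,           # changed from the default of None
    min_samples_split=2,   # default
)
```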

This is the same as the defaults, except max_depth is 3. When I use these parameters, I get an accuracy of 72%, compared to 78% from the default.

One thing I did that I’ll admit is suspicious: I used my entire dataset for the cross-validation. Then, after obtaining the parameters, I ran the model on the same dataset, split 75/25 into training and testing sets.

Is there a reason my grid search overlooked the “superior” defaults?


Answer

Running cross-validation on your entire dataset for parameter and/or feature selection, and then testing on that same data, can definitely cause problems; it looks like that’s at least part of the issue here. Running CV on a subset of your data for parameter optimization, and keeping a separate holdout set for testing, is good practice.

Assuming you’re using the iris dataset (that’s the dataset used in the example in your comment link), here’s an example of how GridSearchCV parameter optimization is affected by first making a holdout set with train_test_split:

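The answer’s original code isn’t shown; here is a sketch of what that step might look like, assuming a DecisionTreeClassifier and a small hypothetical parameter grid (both are assumptions, not from the original):

```python
# Sketch (assumed estimator and grid; the answer's original code is not shown).
# Hold out a test set first, then run the grid search on the training portion only.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [None, 3, 5, 10]},
    cv=5,
)
grid.fit(X_train, y_train)          # parameter selection on training data only
print(grid.best_params_)
print(grid.score(X_test, y_test))   # evaluate on the untouched holdout set
```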

Now repeat the grid search using a random training subset:

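Again the original code isn’t shown; a sketch of the repeated run, using the same assumed estimator and grid but a different random split to show the result isn’t tied to one particular subset:

```python
# Same procedure, repeated with a different random training subset
# (only random_state for the split is changed; estimator and grid are assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [None, 3, 5, 10]},
    cv=5,
)
grid.fit(X_train, y_train)          # grid search on this training subset
print(grid.best_params_)
print(grid.score(X_test, y_test))   # score on the corresponding holdout
```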

I’m seeing much higher classification accuracy with both of these approaches, which makes me think maybe you’re using different data – but the basic point about performing parameter selection while maintaining a holdout set is demonstrated here. Hope it helps.

User contributions licensed under: CC BY-SA