
Cross-validation with time series data in sklearn

I have a question about cross-validation of time series data in general. The problem is macro forecasting, e.g. forecasting the 1-month-ahead price of the S&P 500 using different monthly macro variables. I have read about the following approach: one should/could use a rolling cross-validation, i.e. always drop an old monthly value, add a new one (= rolling), and then forecast the next month's value of the S&P 500. But there should always be a 1-month gap between the training data and the month being predicted, due to "data leakage" concerns. My problem is that I do not understand why one should always leave this gap between training and validation. I do not see the data leakage concern in this approach.

Thanks for your help!


Answer

Scikit-learn does not cover all the bases when it comes to cross-validation of time series models. Also, there are many models that exist only in the Statsmodels suite.

In any case, you are on the right track seeking a rolling window CV. This post illustrates some other options available.
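To make the rolling-window idea concrete, here is a minimal sketch using scikit-learn's `TimeSeriesSplit`, whose `max_train_size` parameter caps the training window so that old observations are dropped as the window advances. The data and the split sizes below are made up purely for illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 24 "months" of dummy data
X = np.arange(24).reshape(-1, 1)

# max_train_size caps the training window at 12 samples, so the oldest
# month is dropped each time a new one is added (a rolling window)
tscv = TimeSeriesSplit(n_splits=4, max_train_size=12, test_size=1)

for train_idx, test_idx in tscv.split(X):
    print(f"train {train_idx.min()}..{train_idx.max()} -> test {test_idx}")
```

Each fold trains on the most recent 12 samples and tests on the single sample that immediately follows, which is the rolling behaviour described above.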

Have a look at this tutorial if you want to create a bespoke function that performs walk-forward validation with a sliding window. It can be adapted to work as a rolling-window CV.
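As a rough idea of what such a bespoke function could look like, here is a sketch of sliding-window walk-forward validation with an explicit forecast horizon. The function name, window size, and horizon below are illustrative assumptions, not taken from the tutorial:

```python
import numpy as np

def walk_forward_cv(n_samples, window, horizon):
    """Yield (train_idx, test_idx) pairs for sliding-window walk-forward
    validation, where the test point is `horizon` steps after the last
    training point (so h-1 samples are skipped in between)."""
    for end in range(window, n_samples - horizon + 1):
        train_idx = np.arange(end - window, end)        # last `window` samples
        test_idx = np.array([end + horizon - 1])        # h steps ahead
        yield train_idx, test_idx

# Example: 10 samples, window of 5, forecasting 2 steps ahead
for tr, te in walk_forward_cv(10, window=5, horizon=2):
    print(tr, "->", te)
```

Because the window slides rather than expands, each fold trains on exactly `window` samples, and the horizon controls how far ahead of the training window the test point sits.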

As for the data leakage: you will need to leave a gap equivalent to the number of h steps ahead you are forecasting. This can be accomplished with walk-forward validation.
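If you stay within scikit-learn, `TimeSeriesSplit` also accepts a `gap` parameter (since version 0.24) that excludes samples between the end of each training fold and the start of its test fold. The gap of 2 below is just an illustrative choice; in practice you would set it to match your forecast horizon and sampling frequency:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(30).reshape(-1, 1)

# gap=2 drops two samples between each training fold and its test fold,
# so the model never trains on observations adjacent to the test point
tscv = TimeSeriesSplit(n_splits=3, test_size=1, gap=2)

for train_idx, test_idx in tscv.split(X):
    print(f"train ends at {train_idx[-1]}, test starts at {test_idx[0]}")
```

In each fold the test index starts `gap + 1` samples after the last training index, which is exactly the buffer against leakage described above.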

Imagine you are forecasting the price of a stock one month ahead and you have data up to the 1st of August 2020 in the training set. Your prediction for the 1st of September 2020 will not use leaked data. After making that prediction, you can add the 2nd of August 2020 to the training set and continue to walk forward. If you do not update the training set, you may end up forecasting values at the end of September using only information from the beginning of August, leaving a gap of more than one month between them.
