- Tags:: #📝CuratedNotes, [[Forecasting]]

## Model assessment

### Hold-out validation

When developing ML models, we need to be aware of the [[bias-variance tradeoff]]: the model error has three sources, bias (wrong assumptions about the underlying process, underfitting), variance (sensitivity to non-generalizable variations in the training data, overfitting), and irreducible error (noise).

We can assess underfitting simply by looking at the model's performance on the training data. To assess overfitting, that is, how well a model generalizes, there are two options:

a) Correct the metrics computed over the training set to take model complexity into account (e.g., with [[Akaike's Information Criterion (AIC)]] or other estimators).
b) Directly estimate the generalization error by making predictions on data not used in training (out-of-sample). I will focus on this one, since it requires fewer assumptions about the models (see [[📖 An Introduction to Statistical Learning]], p. 232).

The simplest form would be having a training and a test set:

![[Pasted image 20220521104950.png|500]]
([Michael Galarnyk](https://twitter.com/GalarnykMichael).)

### Cross-validation

However, if you don't have a lot of (representative) data, the error on the test set won't be a good estimate of the real performance of the model, and thus you would get different results with different splits. We can get a more accurate measure of the true performance of the model by using cross-validation: averaging the model scores over multiple training/test splits (this is a form of resampling, as in [[📖 Introductory Statistics and Analytics. A Resampling Perspective]]). There are several cross-validation techniques, such as LOOCV (leave-one-out cross-validation):

![[LOOCV.gif|300]]
([Wikimedia](https://commons.wikimedia.org/wiki/File:LOOCV.gif))

Cross-validation is almost always preferable to hold-out validation, but there are practical reasons why you may still use hold-out validation: [machine learning - Hold-out validation vs. cross-validation - Cross Validated](https://stats.stackexchange.com/questions/104713/hold-out-validation-vs-cross-validation/104750#104750)

## Model selection

It is a bit confusing in the literature, but [[Model selection]] is not only about choosing between types of algorithms (e.g., exponential smoothing vs. ARIMA): it also includes hyperparameter optimization, feature selection, time series transformations... so model selection is really about selecting the best performing pipeline (see [A Gentle Introduction to Model Selection for Machine Learning (machinelearningmastery.com)](https://machinelearningmastery.com/a-gentle-introduction-to-model-selection-for-machine-learning/#:~:text=Model%20selection%20is%20the%20process%20of%20selecting%20one%20final%20machine,SVM%2C%20KNN%2C%20etc.) and [A Gentle Introduction to Machine Learning Modeling Pipelines (machinelearningmastery.com)](https://machinelearningmastery.com/machine-learning-modeling-pipelines/)).

It may be tempting to use the estimates of the previous section over the test sets to perform model selection. However, the more we use the same test set, the more information from it leaks into our models. We would lose the ability to estimate how well the model generalizes; in other words, we would be overfitting to the test set. We cannot use the same data to simultaneously tune models (pipelines) and estimate their true out-of-sample performance.
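To make the hold-out and cross-validation estimates from the model assessment section above concrete, here is a minimal sketch with scikit-learn. The estimator and the synthetic dataset are placeholders I chose for illustration, not anything from the note:

```python
# Minimal sketch: hold-out vs. cross-validated estimates of generalization error.
# The model (Ridge) and the synthetic data are illustrative placeholders.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Hold-out validation: a single training/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_score = Ridge().fit(X_train, y_train).score(X_test, y_test)

# Cross-validation: average the score over several training/test splits.
cv_scores = cross_val_score(Ridge(), X, y, cv=5)

print(f"hold-out R^2: {holdout_score:.3f}")
print(f"5-fold CV R^2: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```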
### Validation set

The simplest way to solve this, if again we have enough data, would be to split the data into three sets: training data, validation data (which we use for model selection), and test data.

![[Pasted image 20220525073026.png|300]]
([File:ML dataset training validation test sets.png - Wikimedia Commons](https://commons.wikimedia.org/wiki/File:ML_dataset_training_validation_test_sets.png))

1. We use the training data to train multiple models (with different features, pre-processing...).
2. Select the combination that scores best on the validation set.
3. Re-train the best combination on training + validation data.
4. Test its score on the test set.

But again, unless the validation and test sets are large and representative, we could be overfitting (non-representative validation set) and/or getting a bad estimate of the performance of the model (non-representative test set). Again, cross-validation comes to the rescue, but in a double loop.

### Nested cross-validation

![[Pasted image 20220531071815.png]]
([Nested Resampling • mlr (mlr-org.com)](https://mlr.mlr-org.com/articles/tutorial/nested_resampling.html))

We have two cross-validation loops: an outer and an inner loop. The inner loop is used for model selection, and the outer loop for performance evaluation. It works as follows:

1. Make one outer training/test split (outer loop), and perform cross-validation on the training set (inner loop).
2. Select the best performing combination of that inner cross-validation.
3. Train the best performing combination on the outer training set and test it on the outer test set.
4. Repeat this for each training/test split of the outer loop. Following the picture, we would have 3 different models, one for each iteration of the outer loop.
5. Now is where it gets tricky. If you have 3 different models... which one do you select? None of them: you run the model selection inner loop one more time, over the whole dataset, and that gives the final model.
6. The estimate of the performance of that model is precisely the average of the performances of the outer-loop models.

This last step [is](https://stats.stackexchange.com/questions/52274/how-to-choose-a-predictive-model-after-k-fold-cross-validation) [not](https://stats.stackexchange.com/questions/456157/how-are-inner-loop-and-outer-loops-used-to-evaluate-and-build-a-machine-learning) [very](https://stats.stackexchange.com/questions/434154/nested-cross-validation-which-models-should-we-evaluate-in-the-outer-loop) [intuitive](https://stats.stackexchange.com/questions/244907/how-to-get-hyper-parameters-in-nested-cross-validation/245169#245169), and there are sources whose explanations don't match (see [machine learning - Cost of Nested Cross-Validation - Cross Validated (stackexchange.com)](https://stats.stackexchange.com/questions/531593/cost-of-nested-cross-validation/531602#531602)).

>The main confusion for beginners when using pipelines comes in understanding what the pipeline has learned or the specific configuration discovered by the pipeline (…) When evaluating a pipeline that uses an automatically-configured data transform, what configuration does it choose? or When fitting this pipeline as a final model for making predictions, what configuration did it choose? **The answer is, it doesn’t matter** (…) **we are evaluating the “_model_” or “_modeling pipeline_” as an atomic unit**.
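A minimal sketch of this procedure with scikit-learn, wrapping the inner model-selection loop in `GridSearchCV`, the outer evaluation loop in `cross_val_score`, and re-running the inner search on all the data to obtain the final model. The pipeline, parameter grid, and synthetic data are illustrative placeholders, not from the sources above:

```python
# Minimal sketch of nested cross-validation. The Ridge pipeline, parameter grid,
# and synthetic data are placeholders.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=20, noise=5.0, random_state=0)

pipeline = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # model selection
outer_cv = KFold(n_splits=3, shuffle=True, random_state=2)  # performance estimation

# Inner loop: hyperparameter search over the pipeline.
search = GridSearchCV(pipeline, param_grid, cv=inner_cv)

# Outer loop: each fold re-runs the inner search on its training part and
# scores the selected configuration on the held-out part.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"estimated generalization R^2: {outer_scores.mean():.3f}")

# Final model: run the model selection (inner loop) once more on all the data.
final_model = search.fit(X, y)
print(f"final configuration: {final_model.best_params_}")
```

Note that the three outer-fold models are never used as the final model; they only feed the performance estimate, which is exactly the "atomic unit" view quoted above.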
([A Gentle Introduction to Machine Learning Modeling Pipelines (machinelearningmastery.com)](https://machinelearningmastery.com/machine-learning-modeling-pipelines/). Also, a similar explanation can be found in [this SO answer: How to build the final model and tune probability threshold after nested cross-validation?](https://stats.stackexchange.com/a/233027/48652))

In any case, we can certainly analyze what happened on each iteration of the outer loop, and if the selected models differ wildly, we may have a problem:

>The outer cross validation estimates the performance of this model fitting approach. For that you use the usual assumptions:
>1. the k outer surrogate models are equivalent to the "real" model built by `model.fitting.procedure` with all data.
>2. Or, in case 1. breaks down (pessimistic bias of resampling validation), at least the k outer surrogate models are equivalent to each other.
>
>This allows you to pool (average) the test results. It also means that you do not need to choose among them as you assume that they are basically the same.

The breaking down of this second, weaker assumption is model instability. If the models are not stable, three considerations are important:

>1. you can actually measure whether and to which extent this is the case, by using _iterated/repeated_ k-fold cross validation. That allows you to compare cross validation results for _the same case_ that were predicted by different models built on slightly differing training data.
>2. If the models are not stable, the variance observed over the test results of the k-fold cross validation increases: you do not only have the variance due to the fact that only a finite number of cases is tested in total, but have additional variance due to the instability of the models (variance in the predictive abilities).
>3. If instability is a real problem, you cannot extrapolate well to the performance for the "real" model.

([Nested cross validation for model selection - Cross Validated (stackexchange.com)](https://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection/65156#65156))

## Model assessment and selection in time series

However, in time series you have temporal dependency (and non-stationarity), which makes data splits for assessment and selection much harder. They can no longer be random:

- Training, validation, and test datasets need to be in sequence. Otherwise we introduce **data leaks**.
- If I'm making a prediction for the next month, taking seasonality into account, where (or when, really) should my validation and test sets be?

### Training, validation and test datasets in sequence

The most important thing to remember is that, when simulating predictions on past data, you cannot use information that was not available at the moment of the prediction. A couple of resources that explain well one of the nested CV strategies for time series, rolling windows, are [Time Series Nested Cross-Validation | by Courtney Cochrane | Towards Data Science](https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9):

![[Pasted image 20220409133705.png]]

And [[📜 A flexible forecasting model for production systems (Greykite paper)]]:

![[greykit_cv.png]]

The mechanism of nested CV remains the same as for non-time-series problems, but:

- Making sure that training, validation, and test data are in sequence (see the sketch below).
- Exploiting the non-stationarity by cherry-picking the validation set (see next section).
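A minimal sketch of in-sequence splits with scikit-learn's `TimeSeriesSplit` (my own illustration; the data and window sizes are placeholders, not from the sources above). Each fold trains only on observations that precede the test window, which is what rolling/expanding-window CV enforces:

```python
# Minimal sketch: time-ordered CV splits where training data always precedes
# the test window. Data and window sizes are illustrative placeholders.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(24)  # e.g., 24 months of observations, already in time order

# Expanding-window splits; set max_train_size to get a rolling window instead.
tscv = TimeSeriesSplit(n_splits=4, test_size=3)

for fold, (train_idx, test_idx) in enumerate(tscv.split(y)):
    print(f"fold {fold}: train={train_idx.min()}..{train_idx.max()}, "
          f"test={test_idx.min()}..{test_idx.max()}")
```

Passed as the `cv` argument to `GridSearchCV` or `cross_val_score`, this plays the same role as `KFold` in the nested CV sketch above, just without shuffling.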
### Model assessment and selection under non-stationarity: where do validation sets go?

For non-time-series problems, we talked about picking validation and test sets so that we don't overfit to non-representative samples of the data. However, in non-stationary time series, we may use that to our advantage.

>To assess how often we would need to re-train the model, we compared 2 strategies. We generated forecasts on different months and measured the results:
>- Fine-tune the model hyperparameters every month (with a grid search + a temporal cross-validation for picking the best combination)
>- Use the same hyperparameters for all the forecasts

Re-training the model every month had better results. In conclusion, despite the temporal cross-validation, the hyperparameters were not stable across time. ([[🗞️Is Facebook Prophet suited for doing good predictions in a real-world project]])

Regarding the outer CV loop, we should be more concerned about the reported performance on those test sets closest to the prediction we will be making next, and on the previous periods of our seasonality (e.g., if we have strong yearly seasonality and we are making a prediction for March, we are probably more interested in the performance on this year's February and on previous years' March). And similarly for the inner CV loops: where do we place the validation sets? The strategy to follow is well explained^[although it doesn't show that you still need a test set to get an estimate of the performance of the model selection process] in [Sales forecasting in retail: what we learned from the M5 competition | by Maxime Lutel | Artefact Engineering and Data Science | Medium](https://medium.com/artefact-engineering-and-data-science/sales-forecasting-in-retail-what-we-learned-from-the-m5-competition-445c5911e2f6):

![[Pasted image 20220409132639.png]]

If you have enough data (big if), you want to maximize performance on recent periods, but also...

>The problem is that these 3 months might have different specificities than the upcoming period that we are willing to forecast (…) it would be risky to rely only on these recent periods to tune it because this is not representative of what is going to happen next.

So you want to also include the same period last year! (and even two years ago if you can).
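A minimal sketch of this kind of cherry-picked validation scheme, under my own assumptions (monthly data with yearly seasonality; the indices, window sizes, and estimator are placeholders, not the article's setup): scikit-learn accepts an explicit list of (train, test) index pairs as the `cv` argument, so we can build folds whose validation windows cover both the most recent months and the same months one year earlier, while always training only on data that precedes each validation window.

```python
# Minimal sketch: hand-picked validation windows for monthly data with yearly
# seasonality. Indices, window sizes, and the model are illustrative placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_months = 48                       # 4 years of monthly observations, in time order
X = rng.normal(size=(n_months, 5))  # placeholder features
y = rng.normal(size=n_months)       # placeholder target

def backtest_fold(val_start, val_len=3):
    """Train on everything before the validation window, validate on the window."""
    train_idx = np.arange(0, val_start)
    val_idx = np.arange(val_start, val_start + val_len)
    return train_idx, val_idx

# Validation windows: the most recent 3 months and the same 3 months last year.
folds = [
    backtest_fold(val_start=n_months - 3),       # recent period
    backtest_fold(val_start=n_months - 3 - 12),  # same period one year earlier
]

# Any pipeline under selection could be scored this way; a plain Ridge stands in.
scores = cross_val_score(Ridge(), X, y, cv=folds)
print(f"scores per hand-picked validation window: {scores}")
```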