Structure in your data can undermine your model validations: the problem and some solutions
Submitted by editor on 13 July 2017.
Statistical models in ecology are used not just to describe the present state of natural systems, but also to predict their change or development over time. Such models are fairly simple to create and have thus become ubiquitous in all areas of ecological research. To determine whether these statistical simplifications of ecological systems are useful, we need effective model validation procedures that produce reliable error estimates. Unfortunately, many popular evaluation and cross-validation approaches may result in erroneous and misleading assessments of model performance. We need to do better!
Some of these issues are well known and often discussed, such as the problem of non-independence of validation data that arises with random data splitting. Such effects are detectable because they persist as dependence structures in model residuals (i.e. residual autocorrelation). More sinister is the often overlooked and difficult-to-detect problem of model overfit to data structure via correlated predictors. When predictor variables are correlated, for example in space, the spatial structure of the data can be fit into the model through these variables (unbeknown to the modeller, as the structure would not persist in the residuals). Blocking in cross-validation, by systematically dividing data across the dependence structure into training and testing sets, can address these problems. However, blocking can also introduce a host of other challenges, not least forcing models to extrapolate by restricting the ranges or combinations of predictor variables available for model training.
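To make the idea of blocking concrete, here is a minimal sketch in Python of one simple way to build spatial blocks. It is not the implementation from the paper: the function names `spatial_block_folds` and `train_test_splits`, the square-grid blocking rule, the block size, and the simulated coordinates are all assumptions made for illustration. Every point falling in the same grid cell is forced into the same fold, so nearby (and probably dependent) observations never sit on both sides of a train/test split.

```python
import random
from collections import defaultdict


def spatial_block_folds(coords, block_size, k, seed=0):
    """Assign each point to one of k folds so that all points in the same
    square grid cell share a fold (hypothetical helper for illustration)."""
    # Map every point to the grid cell (block) that contains it.
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        blocks[(int(x // block_size), int(y // block_size))].append(i)

    # Shuffle the blocks, then deal them out to folds in round-robin order.
    block_ids = sorted(blocks)
    random.Random(seed).shuffle(block_ids)
    folds = [None] * len(coords)
    for j, block in enumerate(block_ids):
        for i in blocks[block]:
            folds[i] = j % k
    return folds


def train_test_splits(folds, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    for f in range(k):
        test = [i for i, g in enumerate(folds) if g == f]
        train = [i for i, g in enumerate(folds) if g != f]
        yield train, test


# Simulated sampling locations on a 100 x 100 area (made-up data).
rng = random.Random(1)
coords = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(200)]
folds = spatial_block_folds(coords, block_size=25, k=4)
```

A random split would instead assign each point to a fold independently, scattering neighbours across training and testing sets. The block size is the key choice in this sketch: blocks that are small relative to the range of the dependence structure leave much of the dependence in place, while very large blocks may force the extrapolation described above.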
Through a comprehensive review of the ecological cross-validation literature and through a series of simulations and case studies, we demonstrate (for all instances tested) that block cross-validation is nearly universally more appropriate than random cross-validation if the goal is predicting to new data or predictor space, or for selecting causal predictors. We therefore recommend that block cross-validation be used wherever dependence structures exist in a dataset (which seems to be everywhere in ecology!), even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.
Our simulations address spatial and phylogenetic structure in data. Our two data case studies address 1) blocking by individuals and groups in a resource selection function (RSF) example using animal telemetry data, and 2) assessing extrapolation in environmental predictor space via a typical species distribution modelling (SDM) example using a widespread tree species. We provide complete case study data as well as comprehensive R code that includes our various block cross-validation implementations.
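Blocking by individual, as in the telemetry case study, can be sketched in the same spirit. The snippet below is an illustrative Python sketch rather than the R code that accompanies the paper; the helper name `leave_one_group_out` and the animal labels are hypothetical. Each individual's records are held out together, so a model is always tested on an animal it has not seen during training.

```python
from collections import defaultdict


def leave_one_group_out(group_ids):
    """Yield (held_out_group, train_indices, test_indices), one split per group
    (hypothetical helper for illustration)."""
    # Collect the record indices that belong to each group.
    members = defaultdict(list)
    for i, g in enumerate(group_ids):
        members[g].append(i)

    # Hold out one whole group at a time; train on all remaining records.
    for g in sorted(members):
        test = members[g]
        train = [i for i, other in enumerate(group_ids) if other != g]
        yield g, train, test


# Made-up telemetry records, each tagged with the animal that produced it.
animal_ids = ["animal_1", "animal_1", "animal_2", "animal_3",
              "animal_2", "animal_3", "animal_1"]
splits = list(leave_one_group_out(animal_ids))
```

The same helper works for blocking by social group or any other categorical structure: pass the group label of each record instead of the individual's identifier.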