The challenge of modeling niches and distributions for data‐poor species

Submitted by editor on 22 May 2018. Get the paper!
Optimal and default Maxent models for the Malagasy rodent Eliurus majori (logistic output). Results correspond to the unfiltered dataset (top) and filtered dataset (bottom), and three ways of determining model settings: AICc (left), sequential criteria based on performance on withheld (via jackknife) data (middle), and default settings (right). Plotted localities (white dots) represent unfiltered or filtered occurrence records for the corresponding row. Greyscale map shows elevation for reference (Hijmans et al. 2005).

Editor's Choice May

By Peter Galante et al.

Find us on Twitter: @GalantePete  @RPAbiodiversity  @BobMuscarella  @sharonjansa


Most species are known from few occurrences

Worldwide, and especially in the tropics, most species are known from very few occurrence records. Species that seem rare because they are known from few occurrences could be truly rare, but they may also be elusive, hard to observe, or even cryptic species that taxonomists have not recognized as distinct. In our study, we examined the niche of a data-poor member of the Nesomyidae family of rodents in Madagascar, an extremely biodiverse island in the western Indian Ocean.

What is a species’ niche?

A species’ niche is characterized by the biotic and abiotic factors that influence the conditions and places that the species can occupy. Abiotic factors of a species’ niche include aspects of temperature or precipitation – climate – in which the species can live and thrive. Biotic factors include parasites, obligate pollinators, or even the artificial removal or addition of a forest.

Ecological Niche Modeling

Ecological niche modeling has increased in use in the past decade. In addition to earlier regression-based techniques, machine learning algorithms such as Maxent have become popular. All of these correlative approaches characterize correlations between species’ occurrences and the environmental conditions at these localities, generally by comparing these to the overall environment of a region. In general, models are built (trained) on a subset of the occurrence localities and then tested on the remaining ones to see how well the model can predict them. Maxent has the ability to fit very complex relationships between a species and its environment. However, these complex relationships often result in models that are overfit to the training data, and poorly predict the withheld data.
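To make the idea of testing on withheld localities concrete, here is a minimal sketch of an omission rate at a 10% calibration threshold. The function name, its arguments, and the suitability scores are illustrative assumptions, not output from Maxent or code from the paper.

```python
def omission_rate(train_scores, test_scores, percent=10):
    """Return (omission rate, threshold) for withheld localities.

    The threshold is the suitability value that excludes the lowest
    `percent`% of the training localities; a withheld locality is
    "omitted" when its predicted suitability falls below it.
    """
    ranked = sorted(train_scores)
    # Index of the first training score that is kept after dropping
    # the lowest `percent`% of training localities.
    k = (len(ranked) * percent) // 100
    threshold = ranked[k]
    omitted = sum(1 for s in test_scores if s < threshold)
    return omitted / len(test_scores), threshold


# Made-up suitability scores, for illustration only.
train = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
test = [0.05, 0.15, 0.5, 0.9]
rate, threshold = omission_rate(train, test)
```

An omission rate well above the expected 10% suggests that the model fits the training localities closely but predicts the withheld ones poorly, which is the signature of overfitting described above.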

To reduce the problem of overfitting, collaborators and I had previously developed an R package, ENMeval, that builds a suite of models ranging from low to high complexity. From these candidate models, several techniques are available to select an optimal model. In the present experiment, we compared two fairly commonly used techniques: 1) sequential criteria that favor a low omission rate and then a high AUC; and 2) AICc. We also compared both with Maxent’s default settings.
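The two selection techniques can be sketched in a few lines of Python. This is not ENMeval's code: the `Candidate` class, its field names, and all numbers below are hypothetical, and only the AICc formula itself (2k - 2 ln L + 2k(k+1)/(n - k - 1)) is standard.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    settings: str          # feature classes + regularization multiplier (label only)
    log_likelihood: float  # ln L of the model on the occurrence data
    n_params: int          # k, number of model parameters
    omission_rate: float   # omission rate on withheld data
    test_auc: float        # AUC on withheld data


def aicc(c, n_occ):
    """Small-sample corrected AIC; undefined (inf) when k >= n - 1."""
    k = c.n_params
    if n_occ - k - 1 <= 0:
        return math.inf
    return 2 * k - 2 * c.log_likelihood + (2 * k * (k + 1)) / (n_occ - k - 1)


def select_by_aicc(candidates, n_occ):
    """Pick the candidate with the lowest AICc."""
    return min(candidates, key=lambda c: aicc(c, n_occ))


def select_sequential(candidates):
    """Pick the lowest omission rate, breaking ties by the highest test AUC."""
    return min(candidates, key=lambda c: (c.omission_rate, -c.test_auc))


# Made-up candidates, for illustration only.
candidates = [
    Candidate("L_1.0", -120.0, 3, 0.20, 0.80),
    Candidate("LQ_2.0", -115.0, 5, 0.10, 0.82),
    Candidate("LQH_0.5", -105.0, 14, 0.10, 0.85),
]
best_aicc = select_by_aicc(candidates, n_occ=20)
best_sequential = select_sequential(candidates)
```

With these invented numbers, AICc penalizes the 14-parameter model heavily because there are only 20 occurrences, while the sequential criteria ignore parameter count and pick it for its test AUC. The two approaches can therefore disagree on the same set of candidates.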

Sampling bias paradigm

In the present paper we also compared these two techniques under two differing levels of sampling bias. Sampling bias occurs when species are not sampled evenly across the study region. For example, most species are sampled more frequently along roads or near established field sites. In such cases, we may end up modeling not the full suite of conditions suitable for the species, but only those found in the easily accessible areas where the species has been recorded. To combat this, we removed clusters of localities so that no two localities were less than 10 km apart, which we hoped would give a more representative characterization of the species’ climatic associations.
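One simple way to enforce a minimum distance between localities is a greedy pass over the records, sketched below. This is an assumption-laden illustration rather than the procedure used in the paper: the function names and coordinates are made up, the result depends on the order of the input, and a greedy pass does not necessarily retain the largest possible number of localities.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def thin_localities(points, min_km=10.0):
    """Keep a point only if it is at least `min_km` from every point already kept."""
    kept = []
    for p in points:
        if all(haversine_km(p, q) >= min_km for q in kept):
            kept.append(p)
    return kept


# Made-up (lat, lon) records: the first two are roughly 1.5 km apart.
records = [(-18.90, 47.50), (-18.91, 47.51), (-19.50, 47.50)]
filtered = thin_localities(records)
```

Here the second record is dropped because it lies within 10 km of the first, while the third, more than 60 km away, is kept.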


Before we controlled for sampling bias, the two model-selection techniques selected quite different models. Both selected models were overfit to the training data, showing relatively small suitable areas. In addition, the two corresponded to very different Maxent settings.

However, when we reduced the effects of sampling bias through spatial filtering, the two selected models were remarkably similar and corresponded to very similar Maxent settings.

Importantly, there were major differences between the models selected by either technique versus the one made using the default settings of Maxent. The default model was consistently overfit, with poorer evaluation statistics, especially when the occurrences were filtered.

Three evaluation statistics for unfiltered (top row) and filtered (bottom row) locality datasets for Eliurus majori resulting from optimization of Maxent models. Left panels (a, d) show the difference in AICc scores between each model and the model that received the lowest AICc score. Middle panels (b, e) show omission rates of withheld data at the 10% calibration threshold. Right panels (c, f) show the test AUC values for each model. In each panel, arrows point to the optimal and default models, showing how changes in model settings can affect selection criteria. Model statistics are shown by feature class (L = linear, LQ = linear + quadratic, H = hinge, LQH = linear + quadratic + hinge), with regularization multiplier increasing within each class.

Overall, our study showed that some best practices in the literature (reducing sampling bias, model tuning) led to higher performing, less overfit models. Furthermore, one of them led to more consistent results regardless of the level of correction for sampling bias, indicating possible robustness to this issue. We hope that other researchers will address these issues with other species, allowing general conclusions and recommendations for the field.