Towards a more reproducible ecology
Submitted by editor on 1 April 2016. By Michael Krabbe Borregaard (mkborregaard [at] snm [dot] ku [dot] dk)
and Edmund M. Hart (edmund [dot] m [dot] hart [at] gmail [dot] com)
The workflow of ecological scientists is currently undergoing a quiet revolution. Ecological and geographical field data are becoming available to the wider research community at an unprecedented rate, and recent years have witnessed a strong push towards more open sharing of research data. The wide availability of data has also pushed scientists to focus more intensively on the process of data analysis. Where computer programming ability was restricted to a very small subset of researchers just a few years ago, the new generation of ecologists are trained programmers, developing novel software for analyses and exploring new ways to share and visualize data. This development is moving the field away from click-and-calculate statistical packages. A new paradigm has emerged, where individual scientists download, curate and share large amounts of data and analyse them using reproducible software packages and scripts written in languages such as R, Python and Julia.
A powerful approach to enhancing reproducibility with software is to define standardized and documentable workflows. The figure shows an example from ’ENM Components: a new set of web service-based workflow components for ecological niche modelling’, which describes a standardized workflow for fitting Environmental Niche Models to species distributions.
These developments in ecology have exciting implications – the availability of large amounts of data and the explosion in the analytical capabilities of ecologists, together with the potential for rapid dissemination of ideas in today’s internet-based scientific community, mean that ecology is moving forward rapidly, with a steep growth in the number of research papers. But this development also poses important challenges. The large amount of project-specific software being generated for analytical studies means that analytical standards are harder to establish, potentially limiting the reproducibility of much recently published science.
Reproducibility is the very hallmark of the scientific method. However, there is increasing concern that many studies today might not be reproducible. The focus on novelty in ‘high-impact’ journals means that there is little incentive for researchers to directly replicate published studies. More worrying still, efforts to systematically replicate published studies have often failed. Ecology may also face particular challenges in reproducibility because data collection is often context-dependent, and because there are few established standards for storing metadata and facilitating study replication.
Video tutorial by Justin Kitzes and Mark Wilber for ’macroeco: reproducible ecological pattern analysis in Python’: https://youtu.be/mQ4LVw_MQg8
A crucial aspect of reproducibility is to minimize the number of errors in published data. The package ’Biogeo: an R package for assessing and improving data quality of occurrence record datasets’ offers facilities for correcting common errors and quality issues with occurrence records found in large databases. The figure shows a potential geographical location error highlighted by the package, along with candidate suggestions for the correct location. Records for the species that is found at the selected outlier (Species G) are indicated in black and records for all other species are in blue. Records that fall outside of country boundaries are shown in red.
Substantial progress in our understanding of ecology rests on trustworthy, reproducible and transparent data analysis. The keys to a greater level of reproducibility in ecology are to establish analytical protocols that are robust and transparent, to faithfully document the analytical process including any failed attempts, and to ensure that the storage and acquisition of data are documented and include the appropriate metadata. The special issue ‘Tools for Reproducibility in Ecology’ seeks to promote this quest for a reproducible ecological science and highlight recent developments, while presenting a collection of software notes that aim to explicitly further scientific reproducibility in ecological data analysis. The notes were chosen to highlight the breadth of topics and approaches that are required to ensure reproducibility. These tools exemplify different ways of establishing documentable and standardised workflows, where the process of data acquisition, analysis and graphical output is integrated and documented throughout, and collaborative work is integrated into the software itself.
In doing so, the tools contribute to a more reproducible ecology in which analyses rest on solid, error-checked software, without stymieing the free growth of creative analytical ideas; and where documentation and metadata support a solid foundation under today’s fast-moving integrative ecological research field. The sooner these considerations are integrated into our workflows and collaborations, the stronger the foundation of the ecology we build for the future.
Standardized computational formats for the deposition and acquisition of data and metadata encourage the sharing of data, reduce the prevalence of data errors and ensure that data acquisition is documented and reproducible. One such powerful standard for ecological networks is defined by ’mangal – making ecological network analysis simple’. The figure shows a spatial plot of an ecological network derived from data acquired with the mangal R package. The circles in the inset map show the location of the sites. Each dot in the main map represents a species, with symbiotic mutualisms drawn between them.
Full issue free to read here