No longer a black box: how did Maxent go open-source?

Submitted by editor on 3 July 2017. Get the paper!

By Mary E. Blair and Robert P. Anderson

 

Fifteen years ago, Steven Phillips, a computer science researcher from South Africa working at AT&T Labs, walked a few blocks from his apartment in New York City to the American Museum of Natural History (AMNH). He quickly learned about the Museum’s efforts in biodiversity conservation through its Center for Biodiversity and Conservation (CBC) and asked the CBC’s then-Director Eleanor Sterling how he could help. After a few months learning about various ongoing projects at the CBC, he realized that machine learning methods being researched at AT&T Labs might be well suited to an important problem being studied at the AMNH. Sterling had been organizing brainstorming meetings about an emerging field pioneered in Australia that held great potential in biodiversity science: modeling species’ distributions using georeferenced occurrence localities and a set of environmental grids. Research collections of natural history museums like the AMNH hold exactly that kind of locality information, and newly available climatic datasets supplied the environmental grids. Sterling’s group included GIS specialist Ned Horning, as well as three scientists who had brought distribution modeling expertise to the New York City region from natural history museums at two major hubs of biodiversity research, the University of Kansas and the University of California, Berkeley: AMNH curator Christopher Raxworthy, AMNH postdoc Robert Anderson, and Stony Brook University professor Catherine Graham.

 

Robert Schapire, a machine learning expert at AT&T Labs, suggested a long-established approach called maximum entropy, or Maxent, and he and Phillips developed an application of it for modeling species’ distributions from occurrence locality data. They applied it to Anderson’s data for small mammals from Ecuador, and the brainstorming group was so impressed with the results that Graham got the nascent Maxent team invited to a major working group on the topic. With funding from the U.S. National Science Foundation via the National Center for Ecological Analysis and Synthesis (NCEAS) in Santa Barbara, California, Craig Moritz and Townsend Peterson (of Berkeley and Kansas, respectively) had assembled scientists from eight countries to compare the various techniques being used to model species’ distributions. Miroslav Dudík, a Ph.D. student from Slovakia, began tackling technical issues of applying Maxent to this question, with Schapire, who had recently moved to Princeton University, as his dissertation advisor. Soon beginning a faculty position at the City College of New York/City University of New York, Anderson collaborated with Phillips, Schapire and Dudík to link the mathematics to ecological theory and the realities of biodiversity data. The computer scientists found that biodiversity questions (and data) challenged them, providing a new “problem reservoir” well suited to the multidisciplinary area of maximum entropy, which had existed at least since the 1950s. It was a great example of reciprocal illumination between fields. With these advances, Maxent performed well in the NCEAS working group. The principal paper reporting the results of the working group, published in Ecography (Elith et al. 2006), itself became a classic for the field.

 

The paper documenting the application of Maxent to modeling species’ distributions (Phillips et al. 2006) has been cited more than 7000 times, and Maxent has become the most widely used algorithm for modeling species’ distributions, with use in diverse fields including biogeography, evolution, ecology, global change, and of course, biodiversity conservation. Although the underlying mathematics and the application were documented in the original and subsequent publications (e.g. Phillips and Dudík 2008), the Maxent source code was owned by AT&T and was not publicly available, which often led users to refer to the application as a “black box”. Until now.

 

Over the years, the scientists’ careers took them to other jobs and cities, but things came full circle in 2015, after Dudík moved to New York City and reconnected with Anderson. When socializing with their spouses, the two friends tried to stay away from work conversation. However, in early 2016, during the revelry of his daughter’s first birthday party, Dudík made the passing comment that AT&T had recently approved release of the Maxent source code. Anderson replied that this was a really big deal, and that they should make sure that it was packaged and documented in a way that would allow a smooth transition and promote new advancements. A transition also made sense since Schapire had recently moved to Microsoft from Princeton, which had hosted the Maxent software for a decade. Soon, work began to polish what would become the open-source version of Maxent, and to bring it home to the AMNH’s CBC. The team collaborated with Mary Blair, Ned Horning, and Peter Ersts at the CBC to settle the details of licensing and hosting the software, and to document changes to the code and their implications for ecological interpretations of the output. The now-burgeoning field of distribution modeling had moved forward tremendously in 10 years (with critical contributions to the understanding of Maxent from Geert Aarts, Jane Elith, Will Fithian, Trevor Hastie, Cory Merow, Aleksandar Radosavljevic, Ian Renner, Mariya Shcheglovitova, Jeremy VanDerWal, David Warton, and many others), making it necessary to place the new Maxent in the current state of the field.

 

In this month’s issue of Ecography, we join Phillips, Dudík, and Schapire to present the new Maxent, now released under the MIT open-source license (Phillips et al. 2017). In addition to fully open-source code, updates to the software include a change to the default feature types (threshold features are now turned off by default) and a new default output format: the cloglog transform, which replaces the logistic transform used previously. Also, under particular assumptions about the input data and the biological sampling efforts that led to the occurrence localities, the output can be interpreted as predicted probability of presence (cloglog transform) or as predicted local relative abundance (raw exponential output). In addition to the Maxent application, the new Maxent download site contains an updated software tutorial and a link to the source code on GitHub.
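To make the difference between the two output formats concrete, here is a minimal Python sketch. It assumes the forms given in the cited papers as we read them: cloglog output is 1 − exp(−exp(H) · raw), and logistic output is τ · exp(H) · raw / (1 − τ + τ · exp(H) · raw) with τ = 0.5, where H is the entropy of the raw Maxent distribution. The function names and the toy raw values are ours, for illustration only; this is not code from the Maxent application.

```python
import math


def entropy(raw_values):
    """Shannon entropy H of a raw output distribution (values sum to 1)."""
    return -sum(p * math.log(p) for p in raw_values if p > 0)


def cloglog_output(raw, h):
    """Assumed cloglog transform: 1 - exp(-exp(H) * raw)."""
    return 1.0 - math.exp(-math.exp(h) * raw)


def logistic_output(raw, h, tau=0.5):
    """Assumed logistic transform; tau = 0.5 is taken as the default."""
    scaled = math.exp(h) * raw
    return tau * scaled / (1.0 - tau + tau * scaled)


# Toy raw output over four grid cells (hypothetical numbers).
raw_values = [0.1, 0.2, 0.3, 0.4]
h = entropy(raw_values)
for raw in raw_values:
    print(raw, round(logistic_output(raw, h), 3), round(cloglog_output(raw, h), 3))
```

Both transforms increase with the raw value and stay between 0 and 1. Under these formulas the cloglog value is always the larger of the two for any positive raw value, so maps produced with the new default will tend to look somewhat "warmer" than logistic maps of the same model.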

 

The new Maxent is also accompanied by an R package authored by Phillips, maxnet, which uses a standard regularized logistic regression package to fit models while still implementing all the derived feature classes of the Maxent Java application (especially hinge features). Although the R package does not include all of the ‘bells and whistles’ of the Maxent Java application, the authors strongly encourage collaboration and further development of the R package among the Maxent user community to envision new ways to better integrate Maxent modeling with the wide variety of visualization and analysis tools available in R.
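For readers unfamiliar with hinge features, the idea can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the maxnet or Maxent implementation: we take a forward hinge to be 0 below a knot and to rise linearly to 1 at the top of the variable's range, and a reverse hinge to be its mirror image. The function names, the evenly spaced knots, and the temperature range are hypothetical.

```python
def forward_hinge(x, knot, upper):
    """0 at or below the knot, rising linearly to 1 at `upper` (assumed form)."""
    return min(1.0, max(0.0, (x - knot) / (upper - knot)))


def reverse_hinge(x, lower, knot):
    """1 at or below `lower`, falling linearly to 0 at the knot (assumed form)."""
    return min(1.0, max(0.0, (knot - x) / (knot - lower)))


def hinge_features(x, lower, upper, nknots=5):
    """Expand one environmental value into forward and reverse hinge features.

    Knots are evenly spaced over [lower, upper]; the spacing and count are
    illustrative choices, not settings taken from either software package.
    """
    step = (upper - lower) / (nknots - 1)
    knots = [lower + i * step for i in range(nknots)]
    forward = [forward_hinge(x, k, upper) for k in knots[:-1]]
    reverse = [reverse_hinge(x, lower, k) for k in knots[1:]]
    return forward + reverse


# A temperature of 12 on a hypothetical 0-20 degree range.
print(hinge_features(12.0, 0.0, 20.0))
```

A model that is linear in such derived features is piecewise linear in the original variable, which is what lets a regularized regression fit flexible response curves.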

 

The geospatial community is a leader in open-source and free software development, and increased openness is not only an ethical imperative but also necessary to answer key pressing questions about global change. The story of Maxent illustrates how scientific progress flourishes through international collaboration and public-private partnerships, with important benefits for both applied and basic research. We celebrate this next step in Maxent’s journey, its advancement to open source, and invite developers and modelers to imagine new capabilities for Maxent, as well as to contribute to its continued open-source development.

 

 

References:

Elith, J., C. H. Graham, R. P. Anderson, M. Dudík, S. Ferrier, A. Guisan, R. J. Hijmans, F. Huettmann, J. R. Leathwick, A. Lehmann, J. Li, L. G. Lohmann, B. A. Loiselle, G. Manion, C. Moritz, M. Nakamura, Y. Nakazawa, J. M. Overton, A. T. Peterson, S. J. Phillips, K. Richardson, R. Scachetti-Pereira, R. E. Schapire, J. Soberón, S. Williams, M. S. Wisz, and N. E. Zimmerman. 2006. Novel methods improve prediction of species’ distributions from occurrence data. - Ecography 29: 129-151.

 

Phillips, S. J., R. P. Anderson, and R. E. Schapire. 2006. Maximum entropy modeling of species geographic distributions. - Ecological Modelling 190: 231-259.

 

Phillips, S. J. and M. Dudík. 2008. Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. - Ecography 31: 161-175.

 

Phillips, S. J., R. P. Anderson, M. Dudík, R. E. Schapire, and M. E. Blair. 2017. Opening the black box: an open-source release of Maxent. - Ecography 40: 887-893.
