From Plant Press, Vol. 21, No. 4, October 2018.
By Alex White
For many biologists, a core joy of our work is observing the natural world, formulating questions about our observations, and testing our ideas with clever experiments. Days are spent exploring museum cabinets or mountain valleys – few of us entertain romantic visions of deskwork at the computer. Computational research, however, is undoubtedly on the rise in biology, fueled by shrinking infrastructure costs, rapid computational advances, and a growing literacy among biologists in the relevant programming tools.
Two rapidly growing fields of ecology and evolution – (phylo)genomics and niche modeling – have already revealed the utility of computational tools when applied to biological questions, with far-reaching implications for both basic research and conservation planning, particularly for plants. Yet the well of computational resources has just barely been tapped. For one, genomics and niche modeling take advantage of the stereotypical data sources for studying ecology and evolution – sequence data and presence-absence data – but there are many non-traditional sources of data that computers are now poised to ingest and analyze with astonishing speed. These include complex data that machines are well suited to organize for analysis, such as continuous signals (light, audio, chemical, etc.), meshes from 3D scans, and images, whether microscopic, satellite, or traditional. And speed? Consider the recent collaboration between Berkeley National Laboratory, Oak Ridge National Laboratory, and NVIDIA (Kurth et al. 2018, arXiv:1810.01993), in which Kurth et al. showcased a machine learning model operating at 1.13 exaflops (10¹⁸ calculations per second). Their model was trained in under 3 hours to detect the pixel-level presence of tropical storms in exceptionally large (3.5 TB) atmospheric datasets. Such “deep learning” models hold promise for studying biological questions, but their development in the context of ecology and evolution is still in its early days.
Deep learning defines a class of computational models characterized by linking together a hierarchical chain of data transformations and calculations (i.e., fairly simple matrix algebra) to probabilistically “learn” features of a given dataset. Often, these models are used to supply predictions for a related dataset based on shared features (this is called supervised learning and is the most common deep learning approach). One can easily visualize the steps involved in a simple deep learning model using Excel – indeed, the most complex component is the learning process, in which “known” datasets are passed through the model and thousands of randomly initialized parameters are updated iteratively to improve the model’s predictions. Early layers of these models learn coarse-scale representations of the data, while deeper layers learn fine-scale differences among the features of interest.
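The chain of transformations described above can be sketched directly in code. The following is a minimal illustration in Python with NumPy – not the model from any study mentioned here – of a two-layer network with randomly initialized parameters, updated iteratively on the classic XOR toy problem; the layer sizes, learning rate, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

# Toy dataset: the XOR problem -- inputs and "known" labels.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)

# Thousands of parameters in a real model; a few dozen here.
W1 = rng.normal(scale=0.5, size=(2, 8))   # input  -> hidden layer
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))   # hidden -> output layer
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
lr = 0.5
for step in range(2000):
    # Forward pass: a chain of matrix products and nonlinearities.
    h = np.tanh(X @ W1 + b1)      # early layer: coarse representation
    p = sigmoid(h @ W2 + b2)      # deeper layer: final prediction
    losses.append(float(np.mean((p - y) ** 2)))

    # Backward pass: nudge every parameter to reduce the error.
    grad_z2 = 2 * (p - y) / len(X) * p * (1 - p)
    grad_W2 = h.T @ grad_z2
    grad_b2 = grad_z2.sum(axis=0)
    grad_h = grad_z2 @ W2.T * (1 - h ** 2)
    grad_W1 = X.T @ grad_h
    grad_b1 = grad_h.sum(axis=0)
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

print(f"loss before training: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

Every step here is ordinary matrix algebra – the only “magic” is repeating the update loop until the loss shrinks.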
One of the most common applications of deep learning is object detection in images, where a labeled dataset of images is used to train a model that is subsequently used to predict the labels of unknown images. In 2012, the state-of-the-art image model could distinguish images containing dogs from those containing cats with an accuracy of approximately 80%. With the proper know-how, one can now build and train a dogs-vs.-cats model with near-perfect accuracy in a matter of minutes.
With these rapid advances in mind, the goal of my postdoctoral research is to develop these methods for use in ecology and evolution and to apply them to better understand patterns of biodiversity hidden within complex and noisy datasets. For example, colleagues and I at the University of Chicago recently developed a machine-learning model to evaluate ecological assemblages and their (phylogenetic, functional, and taxonomic) structure. Using this model, we can integrate local community surveys, phylogenetic trees, and biogeographic-scale data to quantitatively characterize the regional distribution of biotas and their contributions to individual local assemblages. Unlike the examples above, this model is unsupervised (meaning there are no known labels) and treats biogeographic data as images to learn the spatial relationships between species and thereby generate characteristic biotas. We are currently using this approach to examine the relationship between bird and plant communities across forest patches in the Himalayas, and we are finding strong concordance between the structural patterns in the two groups. While we are now working to examine the biological basis of this association with more fieldwork, the model itself was the primary heuristic we used to detect the pattern (the methods are available as an R package called ecostructure).
Building on these advances, as part of a collaboration between the National Museum of Natural History’s Department of Botany and the Smithsonian Data Science Lab, I am currently building deep learning models to more efficiently access the trove of data contained in our digitized herbarium. With millions of images now available, after years of heroic effort by the herbarium digitization group, we can use herbarium labels and known metadata to train deep learning models to detect a number of different features of the specimens in our herbarium.
For example, we have already made quick headway on a model for taxonomic identification of ferns. Using 140,000 images of individual herbarium specimens (and validating the model with another 35,000 images), we trained a deep learning model to identify the genus of the specimen on a herbarium sheet with an average accuracy of 90.5% across the 86 genera in the dataset. Rare genera were excluded (we included only genera with more than 500 specimens in the herbarium), and for many genera accuracy was well above 90% (the top three were 100%, 99%, and 99%). Moreover, we are finding biologically compelling errors in the model – mismatches between the predicted genus and the true genus often involve closely related genera, many of which have been recently split based on microscopic characters. This gives us confidence that the model is keying in on the shapes of the specimens themselves.
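Computing per-genus accuracy for such a classifier is simple bookkeeping over the validation set. A sketch in Python – the genus labels and predictions here are made up for illustration, not drawn from our actual data:

```python
from collections import defaultdict

# Hypothetical (true genus, predicted genus) pairs from a validation
# set; the real study used ~35,000 held-out specimen images.
predictions = [
    ("Asplenium", "Asplenium"),
    ("Asplenium", "Asplenium"),
    ("Cyathea", "Cyathea"),
    ("Cyathea", "Alsophila"),   # a miss between closely related genera
    ("Pteris", "Pteris"),
]

correct = defaultdict(int)
total = defaultdict(int)
for true_genus, predicted_genus in predictions:
    total[true_genus] += 1
    correct[true_genus] += (true_genus == predicted_genus)

per_genus = {g: correct[g] / total[g] for g in total}
overall = sum(correct.values()) / sum(total.values())
print(per_genus)   # {'Asplenium': 1.0, 'Cyathea': 0.5, 'Pteris': 1.0}
print(f"overall accuracy: {overall:.0%}")   # 80%
```

Tabulating errors the same way – a confusion matrix keyed on (true, predicted) pairs – is what lets us see that the model’s misses tend to fall between close relatives.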
Moving forward, there are a number of avenues for further development of this model. I am particularly interested in how traits are distributed in space, and we hope to use this deep learning approach to extract quantitative traits, with a particular focus on leaf shape. Many have used leaf shape as a proxy for climate, and we would like to evaluate those hypotheses by examining leaf shape across geographic space within a large taxonomic subset of the herbarium. This will require developing a deep learning-based pipeline that can extract leaves from images and quantify their shape, while remaining general enough to be applicable to specimens from thousands of taxa.
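To give a flavor of the shape-quantification step in such a pipeline, here is a minimal Python sketch – the descriptors and the synthetic elliptical “leaf” mask are illustrative assumptions, not our actual pipeline. Given a leaf already segmented into a binary mask by an upstream model, simple quantitative descriptors fall out of a few array operations:

```python
import numpy as np

def shape_descriptors(mask):
    """Simple quantitative descriptors of a segmented leaf.

    `mask` is a 2-D boolean array where True marks leaf pixels
    (i.e., the output of an upstream segmentation step).
    """
    rows, cols = np.nonzero(mask)
    height = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1
    area = int(mask.sum())
    return {
        "area": area,                               # blade size in pixels
        "aspect_ratio": height / width,             # elongation of the blade
        "solidity_proxy": area / (height * width),  # fill of the bounding box
    }

# Toy example: a synthetic 20x10 elliptical "leaf".
yy, xx = np.mgrid[0:20, 0:10]
leaf = ((yy - 10) / 9.0) ** 2 + ((xx - 5) / 4.0) ** 2 <= 1.0

d = shape_descriptors(leaf)
print(d)
```

Real leaf-shape work would of course use richer descriptors (e.g., contour-based ones), but the principle is the same: once segmentation is automated, shape metrics can be computed for millions of specimens at essentially no cost.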
Machine learning is, in a sense, the antithesis of expert opinion – for this reason, many biologists question the ability of such models to generate useful output. Yet it is critical to regard these models as tools rather than statistical tests – as with any tool, each is suited to specific tasks, and it is up to the practitioner to choose the right one for the job. As computational methods grow more complex, their opacity carries real danger: without a proper understanding of their assumptions and tendencies, one can be led astray in applying them. However, such models are also powerful precisely because they can identify patterns without any preconception about the outcome. In this way, they balance our human tendency to see the data as we would like to see them, and help us discover patterns we would otherwise never observe. More advances in machine learning are surely on the horizon, and biologists should be poised to harness those tools to tackle our pressing concerns.