From Plant Press, Vol. 21, No. 2, April 2018.
By Robert Edwards
For those who have had Saturday afternoons in the field happily keying out wildflowers rudely stymied by yet another Damn Yellow Comp, the question “why are there so many daisies?!” may have sprung to mind (possibly not quite so politely). One answer, that the environment plays a primary role in driving the generation and distribution of plant diversity and community composition, will have occurred to many and is hardly a novel concept. Yet owing to a lack of data at sufficiently broad scales, our ability to identify continent-wide patterns and fundamental relationships between ecology and diversity has until recently been limited to mostly descriptive work or studies of small numbers of taxa. Within the last decade, however, advances in computing power and storage, coupled with museum digitization initiatives, have begun to open up huge repositories of data that researchers can use to start tackling questions on a grander scale.
Monoptilon bellidiforme (Astereae tribe) showing a preference for highly quartzitic soil and challenging extremes in climate in Death Valley, Nevada. (photo by R. Edwards)
Almost one in ten flowering plants on the North American continent is from the daisy family (Compositae), with unusually high diversity in the southwestern United States and northern Mexico; however, the exact origins of this diversity remain unclear. It has been proposed that a cooling and drying trend since the mid-Eocene has allowed smaller herbaceous plants to thrive at the expense of woodier species, with more recent shuffling and mosaicism of communities during glacial ups-and-downs over the last 25,000 years exposing a highly heterogeneous landscape with plenty of previously unoccupied niches. The Compositae are particularly adept at colonizing a vast array of niches, including those generally considered environmentally challenging and apparently inhospitable to many other plant groups, and during this time they underwent several large and relatively rapid radiations. As such, they make a perfect group for using Big Data to test whether extremes in particular environmental factors may be responsible for driving species diversity at large scales. My colleagues and I have chosen to study 14 tribes within the Compositae with predominantly North American distributions, allowing us to compare and contrast patterns across lineages.
The greatest challenge in harnessing large, agglomerated datasets is assessing and dealing with (often poor) data quality. The time consumed cleaning data is almost always underestimated. While there are off-the-shelf statistical analysis packages that aim to tie many common tasks together in a relatively accessible way (see the various packages for R, a programming language for statistical computing and graphics), these never cover all contingencies, because every dataset and question brings its own unique issues.
For our work we engineered a data acquisition, cleaning, and analysis pipeline using a variety of computer programs, including R, Excel (still one of the most useful tools for screening, merging, and wrangling data), OpenRefine, ArcMap, and Biodiverse. Collection records were collated from the three largest publicly available databases for North American biological specimen data: GBIF, iDigBio, and BISON. While much is shared between these repositories, they do not overlap entirely, and the quality of data can vary markedly between them. In fact, the decision to use all three despite high redundancy allowed us to: a) scrape all possible data available at the time; and b) cross-reference supposedly duplicate data from all three servers to identify issues or errors. Curation of the data behind the scenes can be rather opaque, and this approach allowed us to identify and compare different taxonomies between databases (or even within a database), including outing an over-zealous GBIF synonymization algorithm that was merging unrelated taxa sharing the same generic first letter, species name, and author initial (e.g. Helianthus atrorubens L. with Hebeclinium atrorubens Lemaire).
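To make that last kind of error concrete, here is a minimal Python sketch of a check that flags names from different genera that collide on the same first letter of the genus, species epithet, and author initial. It is an illustration only, not the code used in our pipeline: the function names `collision_key` and `risky_groups` are hypothetical, and it assumes each name is a simple "Genus epithet Author" string.

```python
from collections import defaultdict

def collision_key(name):
    """Hypothetical key: genus initial, species epithet, author initial.

    Assumes a simple 'Genus epithet Author' string; returns None otherwise.
    """
    parts = name.split()
    if len(parts) < 2:
        return None
    author = parts[2] if len(parts) > 2 else ""
    return (parts[0][0].upper(), parts[1].lower(), author[:1].upper())

def risky_groups(names):
    """Return keys shared by names belonging to more than one genus."""
    groups = defaultdict(set)
    for name in names:
        key = collision_key(name)
        if key is not None:
            groups[key].add(name)
    return {
        key: sorted(members)
        for key, members in groups.items()
        if len({m.split()[0] for m in members}) > 1
    }

names = [
    "Helianthus atrorubens L.",
    "Hebeclinium atrorubens Lemaire",
    "Helianthus annuus L.",
]
flagged = risky_groups(names)
for key, members in flagged.items():
    print(key, members)
```

Any group flagged this way would still need manual inspection, since two names can share a key and be perfectly legitimate, separate taxa.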
An initial data dump of close to 2 million records was reduced by three quarters before analysis. Many removed records were straight duplicates; however, a surprising number were para-duplicates: records with small, inconsequential differences such as differently rounded geo-coordinates, present or absent collector initials, etc. The remaining records were scrubbed for weeds, garden-grown collections, gross georeferencing errors, and guesstimated georeferences. If raw GBIF data is to be believed, the White House is home to upwards of 200 species of Compositae alone, because it is a commonly used centroid for records with no better locality information than “the district”. The removal of synonyms both between and within databases also took a lot of manual inspection and reference-hunting.
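A minimal Python sketch of how para-duplicates of this kind might be collapsed is given below. It is not the procedure we actually ran: the field names (`species`, `lat`, `lon`, `collector`, `date`), the two-decimal rounding, and the sample records are all assumptions made up for illustration.

```python
import re

# Matches tokens that look like initials: "J", "J.", or "J.R."
INITIALS = re.compile(r"(?:[A-Za-z]\.)+|[A-Za-z]")

def collector_key(collector):
    """Lower-case the collector name and drop tokens that are only initials."""
    tokens = re.split(r"[\s,]+", collector.strip())
    return " ".join(t.lower() for t in tokens if t and not INITIALS.fullmatch(t))

def para_key(record, decimals=2):
    """Build a comparison key that ignores rounding and initials differences."""
    return (
        record["species"].strip().lower(),
        round(record["lat"], decimals),
        round(record["lon"], decimals),
        collector_key(record["collector"]),
        record["date"],
    )

def drop_para_duplicates(records, decimals=2):
    """Keep the first record seen for each comparison key."""
    seen = set()
    unique = []
    for record in records:
        key = para_key(record, decimals)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Made-up records for illustration only.
records = [
    {"species": "Monoptilon bellidiforme", "lat": 36.4623, "lon": -116.8671,
     "collector": "R. Edwards", "date": "2017-03-12"},
    {"species": "Monoptilon bellidiforme", "lat": 36.46, "lon": -116.87,
     "collector": "Edwards", "date": "2017-03-12"},
    {"species": "Monoptilon bellidiforme", "lat": 36.90, "lon": -116.87,
     "collector": "Edwards", "date": "2017-03-12"},
]
unique = drop_para_duplicates(records)
print(len(unique))
```

Rounding is a blunt instrument: two coordinates that sit either side of a rounding boundary will still land in different bins, so a key like this catches many para-duplicates but not all of them.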
A final list of close to 500,000 records for over 3,000 species was curated. Values for 187 soil, geochemistry, topography, and climate variables were extracted for each point, and correlation analyses then reduced the set under consideration to 50. A metaphylogeny was constructed using a GenBank and an Open Tree of Life backbone, with unplaced taxa grafted on according to expert opinion.
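One common way to thin a set of correlated variables is a greedy filter: walk through the variables in order and keep each one only if its absolute Pearson correlation with every variable already kept stays below a threshold. The Python sketch below shows that idea under stated assumptions. It is not necessarily the method behind the reduction from 187 to 50; the 0.8 threshold, the variable names, and the values are all invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treated here as uncorrelated
    return sxy / sqrt(sxx * syy)

def filter_correlated(columns, threshold=0.8):
    """Greedily keep variables whose |r| with all kept variables is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Invented values for three hypothetical variables at five points.
columns = {
    "mean_annual_temp": [10.0, 12.0, 14.0, 16.0, 18.0],
    "max_summer_temp": [20.5, 22.0, 24.5, 26.0, 28.5],
    "soil_ph": [6.5, 5.0, 7.5, 6.0, 7.0],
}
kept = filter_correlated(columns)
print(kept)
```

The result of a greedy filter depends on the order in which variables are visited, so in practice one would order them by ecological relevance or data quality before filtering.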
Armed with this data we can finally address the questions: Where are centers of diversity for North American Compositae? Are these similar across lineages? What environmental variables are correlated with increased or decreased diversity, and how do these differ across lineages? Does diversity appear to have a predictable response to certain variables through space? Are particular variables associated with more diverse clades?