By Carolyn Sheffield, Field Book Project
As part of our Beyond the Field Book Project section of this blog, we have been conducting a series of interviews to learn more about the value of field books for research and some of the challenges in trying to access them. This week, I’m very pleased to share a recent interview with David Bloom (and his co-conspirators) on a very exciting, multi-institutional project that successfully addresses some of those common challenges. David Bloom is the VertNet Coordinator and one-fifth of the Henderson Field Notes Project. Read on for an inspiring (and fun!) look at what this group was able to achieve in the space of one year.
Here at the Field Book Project, we’ve been reading the So You Think You Can Digitize blog, and it has been exciting to follow the work done on the Henderson Field Notes Project. For those readers who may be hearing of it for the first time, how would you describe the Henderson Field Notes Project?
The Henderson Field Notes Project is an exploration of what is possible once a large volume of natural history-oriented material is digitized. In our case, this happens to be a set of fourteen volumes of field notes written between 1905 and 1919 by Junius Henderson, founder and first curator of the University of Colorado’s Museum of Natural History (CU Museum). Prior to the project we were exploring various questions about how, what, and why to digitize, but we realized that to answer these questions we needed some understanding of what we might want to do with a treasure trove of digitized data once we had it.
We had five initial goals:
1. To make Henderson’s notes accessible to the public, easily discoverable, and, preferably, bundled with appropriate descriptive, structural, and preservation metadata;
2. To accomplish Goal #1 using the least restrictive licensing available;
3. To use some of the automated data extraction tools we’ve discovered to do things such as linking names of taxa, places, people, and dates to other sources of biodiversity knowledge;
4. To produce at least one “Nifty Thing” as a result of this project — like a map on Google Earth showing Henderson’s travels or a fancy coffee mug adorned by a great picture of Henderson in the field;
5. To spend no more than five hours per person on this experiment — because we like the idea of discovering what substantial products can be produced on a budget of no money and close-to-no time.
We can celebrate, a year later, that we accomplished and documented (on our blog and in a special issue of ZooKeys) four of our five goals. We’ll let your readers decide which of our goals we failed to realize.
What is your background and how did you become involved? What is your role on the project?
As is the case with any epic adventure, the journey starts modestly only to balloon into a vast army of protagonists sharing a common goal. The Henderson Field Notes Project team includes an army of five.
Rob Guralnick and Andrea Thomer are principally to blame for the Henderson Project. They identified Henderson’s field books as an opportunity to move the conversation about digitization forward. In doing so, they added Gaurav Vaidya to their corps for his Wikipedia-oriented expertise and to seek solutions that could bring both sets of Henderson data into a useful format.
Because an epic adventure would be incomplete without some co-opted innocent bystanders, Rob, Andrea, and Gaurav captured David Bloom and Laura Russell: David for his interest in citizen science and uncanny ability to herd cats, and Laura for her unparalleled expertise with Darwin Core and her ability to solve the problems that other programmers stumble upon when seeking Wikipedia-oriented solutions.
What were some of the challenges you encountered?
1. Crowdsourcing: How were we going to reach and excite a group of people, who we couldn’t be certain even existed, so that they would visit our wiki pages and help us as virtual participants by editing transcriptions, tagging elements in the text, and generally keeping our project from taking years and years to complete? One big solution was our blog. Fortunately, our army of five includes no wallflowers, so we already had a following as a result of other projects, including multiple social media accounts. We put out calls for volunteers and posted some fairly detailed instructions for participation on the blog. The result was fantastic. We had lots of people jump in to help, most of whom chose to remain anonymous on Wikisource (and no, not knowing the identities of the majority of our volunteers was never considered to be a liability or a challenge).
2. Wikisource vs. other platforms: We had a lot of options before us and we could have used any number of them to meet our goals. We knew that any tool or tools we selected needed to allow us to make Henderson’s notes easily discoverable, publicly accessible, freely reusable, and preserved sustainably so we could extract taxonomic occurrences from them. After plenty of research and tinkering, by Gaurav in particular, we settled on Wikisource as our primary tool. We believe that Wikisource provided the best combination of ease of use, open access, an existing community of developers and users, and a set of existing templates from which we could launch our efforts (read more in our ZooKeys paper). We realized that Wikisource was not perfect, but it was about all that we could have hoped for given that we had no idea where we were headed at the start. Despite the imperfection, Wikisource continues to be a great foundation for discussion and discovery. We still have many new questions about Wikisource, as well as all other aspects of the project, which is, of course, just about the best set of results for which we could have asked.
3. Moving Taxonomic Records From Text to Spreadsheet: It is no trifle to take a written observation of a taxon from a field book, extract it and its associated metadata from the page, and insert it into a spreadsheet that is organized to meet a global standard of biodiversity data exchange, such as Darwin Core, in an efficient and accurate manner. After several long conference calls and some trial and many errors we did find a way to make this happen (see “Data extraction: Seeking efficiency and accuracy”). It may not be the best way to do it, and we’re still toying with other options, but it was effective enough for us to extract taxonomic records, dates, and localities from the first three volumes of Henderson’s field books and publish them as a Darwin Core Archive.
3.5. Related to the creation of a Darwin Core Archive with taxonomic records was a self-imposed challenge to link each observation back to the page(s) of the field notes from which it was pulled. The primary reason for this was to provide researchers and enthusiasts alike with the opportunity to read about each observation in context. This effort was influential in the evolution of our understanding of an endeavor such as this; in particular, it was the source of many new questions about why viewing data in context could be important to a line of research — absence versus presence, being one such instance.
4. Name Resolution and Proofing the Darwin Core Data Set: There isn’t a biologist alive today who can’t point to an issue of classification and naming that edges the study of their beloved taxon of choice toward chaos. During the course of our project we were confronted by many issues of idiosyncratic naming protocols, confused synonymies, conflicting naming services, and, in some cases, a lack of proper names altogether. To solve our problems, at least for the project, Laura did a lot of heavy lifting to compare the accuracy of several services, such as ITIS and EOL, and Gaurav followed up by checking each taxonomic reference in our data set that had conflicting options from these services and selecting the best match. Finally, Rob reviewed several of the volumes manually for errors in formatting, labeling, and annotations created by our volunteer corps. All the while Andrea and Dave sat around drinking mint juleps and discussing the merits of georeferencing all of the records in the data set (sadly, we haven’t managed to complete that part of our project...yet). Again, we wrote about this in detail in our paper.
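For readers who like to see the moving parts, here is a minimal Python sketch of the kind of workflow described in challenges 3, 3.5, and 4. It is an illustration under stated assumptions, not the project's actual code: the inline tag syntax (`{{taxon|...}}`), the example.org page address, the sample sentence, and the helper names `page_to_records`, `pick_name`, and `to_csv` are all invented for this example. Only the column names are real Darwin Core terms. The sketch assumes that the most recent date and place tagged on a page apply to each taxon that follows, and that a name is accepted automatically only when every naming service consulted returns the same answer.

```python
import csv
import io
import re

# Hypothetical inline tag syntax for annotated transcriptions,
# e.g. {{taxon|Junco hyemalis}}. Not the project's real templates.
TAG = re.compile(r"\{\{(taxon|place|date)\|([^}|]+)\}\}")

# Real Darwin Core terms used as spreadsheet columns.
DWC_FIELDS = ["occurrenceID", "scientificName", "eventDate",
              "locality", "associatedReferences"]


def page_to_records(page_url, text):
    """Turn one tagged page into Darwin Core-style rows, one per taxon tag.

    Assumption: the most recent date and place tags seen so far on the
    page apply to each taxon that follows them.
    """
    records = []
    date = ""
    locality = ""
    for match in TAG.finditer(text):
        kind, value = match.group(1), match.group(2).strip()
        if kind == "date":
            date = value
        elif kind == "place":
            locality = value
        else:
            records.append({
                "occurrenceID": f"{page_url}#taxon-{len(records) + 1}",
                "scientificName": value,
                "eventDate": date,
                "locality": locality,
                # Link back to the source page so the observation
                # can be read in context.
                "associatedReferences": page_url,
            })
    return records


def pick_name(verbatim, candidates):
    """Return (name, needs_review) given answers from several services.

    candidates maps a service label to the name it returned, or None.
    If the services disagree or return nothing, keep the verbatim name
    and flag the record for a human to check.
    """
    answers = {name for name in candidates.values() if name}
    if len(answers) == 1:
        return answers.pop(), False
    return verbatim, True


def to_csv(records):
    """Serialize rows as CSV text with Darwin Core column headers."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented sample text and placeholder page address.
sample = ("{{date|1905-07-14}} Camp near {{place|Boulder, Colorado}}. "
          "Saw {{taxon|Junco hyemalis}} and {{taxon|Cyanocitta stelleri}}.")
rows = page_to_records("https://example.org/henderson/vol1/page12", sample)
print(to_csv(rows))
```

Keeping the page address on every row is what lets a reader jump from a spreadsheet record back to the surrounding notes, and the review flag from `pick_name` mirrors the manual check of conflicting names described above.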
How do you envision the results of the project being used?
We do have aspirations to continue the project and to see the annotation of all fourteen books completed and the taxonomic records extracted into a single Darwin Core Archive, complete with georeferences and other data quality improvements. If we learned one thing from this project, it is that it takes just as much energy to rally the volunteers and keep them working as it does for us to find the time in our own schedules to keep a steady stream of transcriptions and scans available with which they can keep themselves amused.
Are there other projects you’d like to see happen, or like to be involved in, that involve improving access to the content of field books?
It is our nature to want to be involved with everything that everyone is doing all the time with every set of field notes at every institution ever. If we could, we’d love to be the founding staff at the National Archives of Biodiversity Writing, Illustration, Events, Ramblings, and Discovery (NABWIERD) and to have the opportunity to work with field books from far and wide every day. We’d have the best marketing campaign. EVER!
In the meantime, we’re very happy to continue to find ways to use the Henderson Project to expand our personal knowledge as well as that of the larger (often interdisciplinary) community interested in digitization, archives, and museum collections. We are always interested in a challenge, and we remain willing to help anyone who asks. That is, after all, the role of an epic army of digital adventurers, such as ours. All anyone needs to do is ask.
Anything else you would like to add?
Yeah, folks can “ask” us at:
@an_dre_a_ (http://twitter.com/an_dre_a_)
@dabblepop (http://twitter.com/dabblepop)
@mrvaidya (http://twitter.com/mrvaidya)
@robgural (http://twitter.com/robgural)
@pagodarose (http://twitter.com/pagodarose)