Natural history collections around the world contain over 1 billion specimens, and could reveal important changes in biological systems that have occurred over the past 100-200 years. In the internet age, you might expect that specimen data would exist in online databases, but this is not the case for most museums. The painstaking endeavor to make natural history collections digitally accessible requires huge data entry efforts and the coordination of interdisciplinary teams of scientists.
Biologists of diverse disciplines are increasingly collaborating with computer programmers to create efficient data management and dissemination systems. John Wieczorek is a programmer whose domestic partner, Dr. Eileen Lacey, happens to be curator of mammals in the Museum of Vertebrate Zoology (MVZ) at Berkeley. One day in 1997 Eileen came home from work and said to John, “Hey, they have this picture of a database on the wall in the museum, and I don’t think anybody there understands it. Why don’t you go in and see if you can help them out?” John was initially doubtful, until she came home two weeks later to say the same thing. He thought, “Well, she’s going to do this until I go look at that damn database picture on the wall,” so he agreed to go in for a meeting a few days later. To his surprise, that meeting turned out to be an interview where he accepted a position to develop a modern relational database that would handle all of the collection data in the museum. He has proceeded to become a leader in the global effort to make information from natural history museums accessible to anyone with an internet connection.
Wieczorek is but one important player in the ongoing, interdisciplinary efforts necessary to get collections data online and in a central location. The standardization and centralization of specimen databases—documenting everything from large mammals to tiny insects and plants—is essential as we attempt to understand historical biodiversity and how it has changed over the past century or more. Dr. Stan Blum, a biodiversity informatics specialist at the California Academy of Sciences, says that “We’ve been beating the drum on digitization for a long time and it seems to be steadily gathering momentum. But you have to keep beating the drum” to attract the investments needed for much larger collections.
Natural history collections in science, and the need to digitize
With the advent of increasingly sophisticated technologies, the use of natural history collections data has expanded from its traditional uses in taxonomy to studies in ecology, biogeography, pest management, disease transmission, conservation, and more. Scientists are using methods in chemical and molecular analysis to measure contaminant levels in preserved specimens, determine food items consumed, and clarify evolutionary relationships with genetic data from organisms. We can associate specimen occurrences and their attributes with geographic data on climate and land use. In this way, museums serve as repositories of valuable resources for understanding the biological effects of climate and habitat modifications.
Collections are increasingly responding to the demand for data by creating open-access databases, available for easy search and download. Vertebrate collections in particular have become highly accessible through online databases. The MVZ at Berkeley has long been a driving force behind this trend. Not only were they one of the first museums to completely digitize and geographically reference their entire collection of around 677,000 specimens, but they have also been part of several multi-institutional collaborations that continue to digitize collections from around the world. The MVZ has gone so far as to digitize historical field notes, photographs, annotated maps, gene sequences, and vocal recordings of frogs and birds. They are in the process of tagging and linking all of this auxiliary information to individual specimens.
On the other end of the data management spectrum, many collections have absolutely no means of searching specimen information from a central location. Most of these collections have simply been unable to muster the considerable resources required to digitize. Insects have particularly high diversity, and very large numbers of specimens exist in collections, so the task of digitizing has seemed impossible. Dr. Rosemary Gillespie, director of the Essig Museum of Entomology at Berkeley, says that “By and large there has been no concerted, coordinated effort to get that huge amount of data digitized. People in Entomology collections have just thrown up their hands, because it’s such a huge quantity of data.” Digitizing collections requires significant investments of time and money, and entomological collections have experienced dwindling personnel and funding.
The Essig Museum has recently launched a large effort to digitize 1.2 million arthropod specimens from eight institutions across California through a project called Calbug. Although it may seem late in the game compared to the MVZ, this is the largest effort so far attempted for insect collections in terms of specimen number, species number, and geographic area covered. If all goes well, the bees will catch up with the birds, and bring collections management and global change studies into the 21st century.
Some historical context
The MVZ first managed their data on paper when vertebrate collections were established at Berkeley in 1908. From that time until the late 1970s they used ledger folios and card catalogues to keep track of holdings, similar to the old catalogue system at public libraries. These cards and ledgers were used to locate specimens and find associated information such as date collected, locality, collector, and taxon name. Such lists provided a central place to look up information. The catalogues then became a valuable resource for efficiently digitizing the collections, eliminating the need to physically sort through specimen-drawers and cabinets across the museum.
Digitization began in 1979 at the MVZ, when Berkeley ran on mainframe computers. Faculty members and staff had access to keyboards and monitors, but no personal computers; hundreds of individuals accessed a single mainframe. As Stan Blum explains, “They just plugged you in over the wire to the mainframe computer on campus.” The MVZ started digitizing mammal, reptile, and amphibian records using the card catalogues, and then digitized bird collections directly from specimen labels. Concerted digitization continued until 1983, and included everything from the earliest specimen (a bird egg from 1843) to the most recent. Now, new specimens are digitized before they enter the collection.
Changes in computing technology became an impetus to revamp data management at the MVZ, when in the early 90s the MVZ learned that the university’s mainframe computer, where they had created and managed their database, would soon be decommissioned. This system had already exceeded its capacity, and they had 32 different databases for the collection. They needed a better way to manage a growing body of data—a database that could handle all the information typical of natural history collections. This database should be capable of relating a variety of data associated with specimens, including the basics (taxon name, collector, date, location), and extras like field notes and photographs.
Stan Blum originally entered the field of biodiversity informatics after receiving his PhD in Zoology. Blum says, “As a post-doc, I could see that the next curator position in Ichthyology wasn’t going to open up for another five years, and they come few and far between. In contrast, I knew the field of informatics was only going to grow.” For over twenty years he has been working full-time on projects that apply information technology to biodiversity science—helping museum scientists capture and manage data, and efficiently analyze it.
In 1995 Blum came to Berkeley and created the mysterious “database picture on the wall” of the MVZ. He worked very closely with MVZ’s Staff Curator of Mammals, Barbara Stein, to create a roadmap for what would become the collection database. He did this using a methodology known as object role modeling. Blum needed to understand the ins and outs of all collection information and workflows to design an effective database. This methodology explored the combinatorics of data in detail, i.e., how many entries can go into each field and how each field can and cannot be related to others. They spent a lot of time doing structured interviews, where he would ask about each different type of data at the museum, and what concepts should be included. For example, they’d discuss tissue samples by creating lists of all the possible tissue types, all the different vials tissues could be stored in, all the species with which they could be associated, specify the number of possible entries for each field, and map out the potential relationships among the fields.
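The kind of structure such modeling produces can be illustrated with a toy relational schema: specimens relate to any number of tissue samples, and each sample is constrained to a controlled list of vial types, much like the lists Blum and Stein compiled in their interviews. The table names, columns, and sample data below are invented for illustration; the actual Berkeley Model is vastly larger and more intricate.

```python
# A toy relational sketch (illustrative only) of the kind of structure
# object role modeling yields: controlled vocabularies plus one-to-many
# relationships between specimens and their associated data.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE specimen (
    specimen_id    INTEGER PRIMARY KEY,
    taxon          TEXT NOT NULL,
    collector      TEXT,
    date_collected TEXT,
    locality       TEXT
);
CREATE TABLE vial_type (
    vial_type TEXT PRIMARY KEY   -- controlled vocabulary: the "list of all
);                               -- possible vials" from the interviews
CREATE TABLE tissue_sample (
    tissue_id   INTEGER PRIMARY KEY,
    specimen_id INTEGER NOT NULL REFERENCES specimen(specimen_id),
    tissue_type TEXT NOT NULL,
    vial_type   TEXT REFERENCES vial_type(vial_type)
);
""")
# Invented example rows: one specimen carrying one tissue record.
db.execute("INSERT INTO vial_type VALUES ('cryovial')")
db.execute("INSERT INTO specimen VALUES "
           "(1, 'Thomomys bottae', 'J. Grinnell', '1911-05-02', 'Yosemite')")
db.execute("INSERT INTO tissue_sample VALUES (1, 1, 'liver', 'cryovial')")
```

One specimen can now carry many tissue records, each restricted to known vial types, which is exactly the kind of field-by-field relationship the structured interviews were designed to map out.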
The final conceptual model consisted of multiple figures and a companion document. A computer algorithm then converted the model to a logical data structure for the relational database. The result was a very complicated set of tables. “There is no way to parse something that big and complicated in a hierarchical way, you have to drill down in bits and chunks,” said Blum. He left the museum after his design work was finished, but his model remains the basis for databases that most natural history collections use today, and is loosely called the “Berkeley Model.”
When John Wieczorek came to the MVZ in 1997, the abstract work of model design was over and his job was to implement the data plan. He first had to extract the 32 MVZ databases from the mainframe, and standardize all the different fields and data types. He integrated the information from these files into a modern database system for all of the specimen data, along with auxiliary data such as sound files and field notes. Imagine organizing the house of a compulsive hoarder, sorting through junk to find anything of value, creating piles, and putting it all into a logical place. According to Joyce Gross, a programmer for Berkeley’s Natural History Museums consortium, “this kind of work gives a programmer nightmares, but is very satisfying in the end when the database is cleaned up.” Gross put the MVZ database online in 1999.
In 2005, Blum’s model was incorporated into the Arctos data-management system created by Dusty McDonald and Gordon Jarrell. With Arctos, users can search for and add information online through a web interface, which automatically populates the database. This is the open-source software that the MVZ uses to manage its database today, along with collaborators such as the Museum of Southwestern Biology in New Mexico, the Museum of Comparative Zoology at Harvard, and many others. The Berkeley Model was also the basis for Specify, another open-source program for biological collections data management, supported by the University of Kansas.
Increasing data sharing, accessibility, and value
The critical step after digitization is what John Wieczorek calls “data mobilization”: making the data readily available and increasing their value. Specimens of any particular species are generally dispersed among numerous museums, so it is important to make multiple institutions’ data available simultaneously. For the vertebrate collections, data would not come from a single warehouse, but directly from each institution using a Distributed Generic Information Retrieval (DiGIR) protocol, developed in a collaboration between Berkeley, the University of Kansas, and the California Academy of Sciences. The basic idea is that a user submits a query from a web portal, the portal sends the query out to all the participating institutions via DiGIR, and results are compiled and sent back in the form of a table, map, or downloadable file.
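The fan-out pattern behind DiGIR can be sketched in a few lines: a portal sends one query to every participating institution and merges the replies into a single result set. The institutions, records, and in-memory “databases” below are hypothetical stand-ins; real DiGIR providers exchange structured XML over HTTP from each museum’s own server.

```python
# A minimal sketch of a DiGIR-style federated search: fan one query out to
# all providers in parallel, then merge the replies. All data here is
# invented for illustration.
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for each institution's local specimen database.
INSTITUTIONS = {
    "MVZ": [{"taxon": "Peromyscus maniculatus", "locality": "Alameda Co., CA"}],
    "KU":  [{"taxon": "Peromyscus maniculatus", "locality": "Douglas Co., KS"}],
    "CAS": [],
}

def query_institution(name, taxon):
    """Simulate one provider answering a taxon query from its own holdings."""
    return [dict(record, institution=name)
            for record in INSTITUTIONS[name] if record["taxon"] == taxon]

def portal_search(taxon):
    """Fan the query out to every provider concurrently and merge replies."""
    with ThreadPoolExecutor() as pool:
        replies = pool.map(lambda name: query_institution(name, taxon),
                           INSTITUTIONS)
    return [record for reply in replies for record in reply]

results = portal_search("Peromyscus maniculatus")
# Two of the three providers hold matching records; the portal returns
# them as one merged list, ready to render as a table, map, or download.
```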
The MVZ helped lead the three multi-institutional collaborations that used DiGIR to provide simultaneous access to collections data online. The first was the Mammal Networked Information System (MaNIS), which originally consisted of 17 collections. Wieczorek says that, “Initially, there were a bunch of big name institutions that were skeptical. Some of them thought that digitization would never happen. Others said it sounded like a good idea but they weren’t ready. Before the end of funding for the project, however, all of the big institutions were on the waiting list to participate, and all have since become valuable contributors.”
MaNIS was wildly successful in creating new tools for data mobilization. They developed the point-radius method to standardize the process of geographically referencing text descriptions of specimen locations, while accounting for potential error. The MaNIS institutions then implemented a bulk georeferencing effort after all collections data were entered into a database. Each institution claimed specific regions to georeference, generally those for which they had good geographic knowledge – UC Berkeley claimed California, for example. Then the institutions georeferenced all localities from that region no matter which museum the specimens came from. Wieczorek’s eyes lit up when he explained how people liked this way of doing things. “They—loved—it. Collaborative georeferencing created a community, because people from different museums had to talk to each other when issues came up. The collaboration and the sense of community might have been the best thing that ever came out of MaNIS.”
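The point-radius idea can be shown with a toy calculation: a text locality such as “5 mi N of Berkeley” becomes a single coordinate plus a maximum uncertainty radius built from the individual sources of error. The figures and the simple summation below are illustrative only; the published method treats directional error geometrically and accounts for additional sources such as coordinate precision, datum, and map scale.

```python
# A deliberately simplified illustration of a point-radius uncertainty
# calculation. Real georeferencing combines more error sources and handles
# direction error geometrically; the numbers here are invented.

MI_TO_KM = 1.609344

def uncertainty_radius_km(extent_of_place_km, distance_error_mi):
    """Sum two error sources: the spatial extent of the named place and
    the imprecision of the stated offset distance."""
    return extent_of_place_km + distance_error_mi * MI_TO_KM

# "5 mi N of Berkeley": the point sits 5 mi north of Berkeley's center.
# Assume Berkeley's center is ~3 km from its edge, and a distance stated
# to the nearest mile implies roughly +/- 0.5 mi of imprecision.
radius = uncertainty_radius_km(extent_of_place_km=3.0, distance_error_mi=0.5)
# radius is ~3.8 km: any point within that circle is consistent with the
# original text description.
```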
Having georeferenced localities increases the value of museum data. Not only can users now get vertebrate data online, but they can also easily create distribution maps showing specimen localities. When the HerpNET collaboration later formed among 52 herpetology collections, they had a successful example to follow, and were able to use the same collaborative georeferencing model, tools, and procedures.
Georeferencing methods are now applicable to collections in general. After MaNIS, the Moore Foundation awarded Berkeley a collaborative grant of $1.6 million to further advance georeferencing and to establish best practices. Under this grant Wieczorek created BioGeomancer, a workbench and a web service to automatically georeference localities online. The last of the multi-institution digitization projects that the MVZ participated in was ORNIS, for bird specimens. ORNIS used BioGeomancer to double the speed of georeferencing. The MVZ now gives extended international workshops on georeferencing to train students, researchers, and collection staff, on campus and elsewhere, on the standard concepts and procedures.
The MVZ is now working to tackle the problem of sustainability. There are not currently enough resources to keep up with the demand of institutions that want to get involved in MaNIS, HerpNET, ORNIS, and their sister vertebrate network, FishNet. The MVZ, under Principal Investigator Carla Cicero, just received an NSF grant to create and coordinate VertNet, which would combine data from all vertebrate disciplines. The goal is to streamline the process of data publishing, remove the need for participating institutions to maintain their own servers, and increase the performance and capabilities of the networks under one cloud-based platform. Though early in the project, Wieczorek and his colleague Aaron Steele are confident that the 176 participating collections and the waiting list of 61 more can all be transitioned into the new VertNet within the first year and a half of the three-year project. The economic impact will be an estimated factor-of-20 savings in the cost of maintaining the network.
Potential new approaches for large Entomology collections
The question now is whether we can use the models provided by past and ongoing efforts on the smaller vertebrate collections to digitize and make accessible the enormous amount of data associated with invertebrate collections. Digitization projects for Entomology collections face incredible challenges, despite the helpful precedents from the MVZ and other vertebrate collections. The Essig Museum at Berkeley alone has more than 6 million specimens in its collection. Calbug has an NSF grant to digitize 1.2 million specimens from collections across California using methods similar to those from the MVZ’s digitization efforts. While this is one of the largest attempts to digitize arthropod collections, it is only a small fraction of the specimens in California collections.
Unique challenges associated with insect specimens compound the problem of huge collection sizes: specimens are small and delicate, and the labels are tiny and difficult to read. The labels and specimens also have pins sticking through them, obscuring information. For these collections, the Essig team must find a way to mass-process data from specimens and automate portions of the current workflow.
One approach that Calbug has been experimenting with is taking photographs of individual specimens in the collection, so that students and volunteers can enter data from images. It takes a significant amount of time to arrange specimens and labels such that all of the labels stacked on a pin beneath the specimen are legible. Dr. Peter Oboyski, a post-doc working with Calbug, envisions a Ford-style assembly line, with designated stations set up to run specimens through for efficient image capture.
Essig staff are experimenting with new imaging techniques and methods to simplify workflow during the image-capture process. One option is to capture high-resolution images of entire drawers. Drawer-scanning technology takes multiple images from various angles across the drawer and stitches them together to create a single high-resolution image. With the final mosaic image, one can then zoom from a large picture of the entire drawer of 200 insects into tiny hairs on the leg of a single specimen.
Drawer-scanning technology has not been applied to photographing insect labels, but rather to capturing high-resolution images of the specimens themselves. Obtaining data from labels is more difficult. Multiple stacked labels rest below the specimen, containing useful information on the time and place of collection. Essig staff still must arrange specimens so that all labels are visible, a process that is extremely time consuming when working with large numbers. One potential benefit of scanning drawers would be to simplify workflow by eliminating the necessity to shoot single photos of specimens and save individual files manually.
The British Museum of Natural History has partnered with Smart Drive Ltd. to create the SatScan tray scanner, and to develop software that can crop images of individual specimens and save files automatically using standard filenames. GigaPan is a similar technology, developed by the NASA Ames Research Center, Google, and the Carnegie Mellon Museum of Natural History, that uses a camera mount suitable for DSLR cameras to take photos automatically from different angles throughout a drawer, and software to stitch the images together.
The goal for all of these efforts is to increase the current production rate of 5,500 images every two weeks to at least 12,000 images. Essig staff are timing different methods to identify the most efficient one. Undergraduate students are essential to carrying out this work; as Oboyski put it, “Most projects on this campus would never get done without undergraduate help. They are the unsung heroes of research on campus.”
The Calbug team plans to use specimen images together with large-scale approaches, such as crowd sourcing and automatic text recognition, to put the data into a useful structure. They are planning to collaborate with the Citizen Science Alliance, an organization that has been very successful in developing web interfaces for citizen science projects. One of their projects involves digitizing weather logs from Royal Navy ships around the time of WWI; on launch day of the project they had 100,000 pages digitized. The crowd sourcing approach provides a mechanism for hundreds or thousands of people to do data entry work and to learn about collections and global change research at the same time.
It may also be possible to develop optical character recognition (OCR) software that could interpret information automatically from insect labels. Current OCR software does not recognize the type of information on insect labels, which includes handwriting, varied typefaces, and multiple abbreviations for the same word (e.g., California might be CA, Cal, or Calif). Calbug would have to build a dictionary of all possible abbreviations for all the words that may be found in the labels. They will be working on this dictionary to further explore the possibility of creating OCR software, and to create lookup tables for citizen science data entry.
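The kind of lookup table described above can be sketched simply: map the many forms a word takes on labels to one canonical spelling, and fall back to the original token when no entry matches. The entries below are invented examples, not Calbug’s actual dictionary.

```python
# An illustrative sketch of an abbreviation lookup table for label text.
# Entries are invented examples; a real dictionary would cover thousands
# of place names, collector names, and habitat terms.

ABBREVIATIONS = {
    "ca": "California", "cal": "California", "calif": "California",
    "co": "County", "cty": "County",
    "mts": "Mountains", "mtns": "Mountains",
}

def normalize_token(token):
    """Strip trailing punctuation, look the token up, keep it if unknown."""
    key = token.strip(".,").lower()
    return ABBREVIATIONS.get(key, token)

def normalize_label(text):
    """Normalize every whitespace-separated token on a label."""
    return " ".join(normalize_token(t) for t in text.split())

normalize_label("Alameda Co Calif")  # -> "Alameda County California"
```

A table like this could both standardize citizen-science data entry (via drop-down suggestions) and supply the vocabulary an OCR system would need to resolve ambiguous label text.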
Following in the footsteps of the MVZ, and benefiting from the technologies already developed, the Essig Museum may become a leader in the digitization of larger invertebrate collections. If successful, they will increase the scale of data capture by orders of magnitude. This would pave the way for collections throughout the world to digitize massive numbers of specimens. Each specimen, each drawer, each taxonomic group, and each of the natural history collections has the potential to add something important to our understanding of biodiversity and global change. When we begin to combine collections throughout a state, a country, or the world, we will undoubtedly see something profound.