If a tree falls in a forest, and a microphone picks it up and uploads the recording to an obscure archive where no one ever listens to it, does it make a sound? Philosophical matters aside, questions like this point to one of the central challenges facing the natural sciences in the information age. Thanks to improvements in data collection technologies over the years, scientific data in many fields is being generated at astronomically higher rates than in the past. Although this sounds like good news (and I doubt that the data-starved scientists of yesteryear would be complaining much), researchers often find themselves struggling to keep pace with the deluge of information. The question we must confront is how to design a system in which vast influxes of data can be efficiently accessed, vetted and analyzed by scientists around the globe.
One of the key figures influencing scientific data management in the 21st century is Edward Seidel, assistant director of the National Science Foundation (NSF). As an astrophysicist, Seidel tackled complex modeling problems, such as the time evolution of black holes, by assembling teams of physicists and computer simulation and visualization experts. Now a policymaker, he continues to advocate at the national level for closer collaboration between scientists and computing experts.
Seidel spoke at UC Berkeley in March (you can watch the talk online) about recent updates to NSF’s data management policies, which reflect the rising importance of cyberinfrastructure in scientific research. Last year, NSF began requiring every new project seeking agency funding to submit a comprehensive data management plan that is subject to peer review. The goal is to encourage the scientific community to adopt sophisticated data management techniques capable of keeping pace with the growing complexity of the data itself.
An important aspect of the new policy is its recognition that no single set of universal rules can cover every type of data. It would be impractical, if not impossible, for NSF to regulate data sets as diverse as satellite and telescope images, DNA sequences and neural maps under one scheme. Furthermore, different fields share data in different ways: some, like climatology and astronomy, openly distribute much of their data over the internet, while others, like ecology, traditionally keep their data private. NSF’s new policy gives researchers flexibility while creating a strong incentive for those seeking funding to be innovative about the way they manage data.
Although setting up scientific cyberinfrastructure on a global scale is a difficult task, it is certainly feasible. Companies like Google, Facebook, and Oracle have already succeeded at organizing huge amounts of data collected from around the globe. Ultimately, the challenge for scientists boils down to determining who will front the costs associated with data management. Large, government-funded archival facilities are one way to minimize those costs for individual labs (see ESO/ST-ECF Science Archive Facility for astronomers and the International Tree Ring Data Bank for climatologists). But with data being created at ever higher rates from a wide range of sources, more scalable solutions (likely passed down from the giants of the online industry) may be needed.
It is hard to overstate the importance of scientific data management in the coming years. A scientific finding is only as valid as the data that supports it, and if crucial pieces of information become lost or misinterpreted as a result of an overgrown system, all bets are off. The stakes are particularly high in politically charged fields like climate science, where the passage of vital legislation depends on the ability of researchers to defend the validity of their data. The good news is that solutions exist, at least as far as the computational technology is concerned. The cost, while substantial, is decreasing. And above all, scientists can take comfort in the fact that computers haven’t learned to lie to us. Or have they?