Science and the Data Revolution

Atul Butte

The scientific method is obsolete.

This is a bold statement in a room filled with scientists who have spent the better part of their lives striving for the gold standard of science set forth by 3rd grade science fair projects:

  1. Observe the world.
  2. Ask a question.
  3. Make a hypothesis.
  4. Devise a testable experiment with one variable and everything else controlled.
  5. Compile the data and analyze the results.
  6. Revise the experiment or hypothesis and repeat as necessary until they match.
  7. Form a conclusion.
  8. Justify how your manageable (relatively simple) model makes predictions on a much wider scale.

But the predominant focus on producing data to answer a question, which up until now has served our scientific community faithfully, may no longer be the best method for useful discovery, according to Stanford scientist Atul Butte, MD, PhD.  We are in the midst of a data revolution in which good science may be best performed backwards: instead of questions demanding data, we now have data demanding questions.

Dr. Butte brought this new paradigm in science to my attention when he gave his talk last month at the Northern California Science Writers Association spring meeting.  The data is already here.  It is everywhere.  It is collected in parallel, crowdsourced, presorted, overflowing, searchable, and primed for retroactive mining.  Our accumulated data is so expansive that we may never work our way through it; why spend valuable time, resources, and funding (not to mention hearts, souls, and careers) recreating data that may already exist in a form more complete than a lone scientist could ever produce?

For many, the mere mention of stockpiled information conjures up thoughts of Facebook hoarding our personal data and selling it for profit.  Or perhaps (if you’re my dad, anyway) of the government stealing our data so that it may better trample upon our constitutional and God-given freedoms.  But where can we find data that is useful to science?  There are already more searchable sources out there than might be expected, and that number just increased.  One such site is the functional genomics repository GEO (Gene Expression Omnibus), NCBI’s searchable archive of microarray, next-generation sequencing, and other high-throughput functional genomic data submitted by the scientific community.  ArrayExpress is a similar site.  And why might researchers share their hard-earned data with potential competition?  Collaboration, publicity, a backed-up archive, belief in a cause – and, often, because it’s mandated.  Just one Affymetrix gene chip can contain information for tens of thousands of genes – and scientists are required to make this information publicly available.

We have entered an era where any student with the ability to download music from the internet also has access to more genetic data than any one lab could produce.  No longer are our ideas chained to an institution filled with expensive equipment and expertise, and in a world where members of the younger generation are generally more programming-savvy than your average PI, we may not be so far away from DIY drug discovery.

Chris Anderson of Wired magazine sums up the potential of data mining nicely in his article “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”: “[Information en masse] forces us to view data mathematically first and establish a context for it later.”  He uses Google as an example of conquering the (advertisement) world by disregarding established methods and turning instead to applied mathematics – and lots and lots of data.  “[Google] didn’t pretend to know anything about the culture and conventions of advertising – it just assumed that better data, with better analytical tools, would win the day. And Google was right.”

Enter here the cries of any self-respecting scientist: “But correlation does NOT prove causation!”  True, but a simplified model that tries to address the complexities of nature does not prove causation either.  And the more we learn, the more we realize just how far these models are from encompassing the entirety of a system.  We may never devise an experiment that takes into account all of the genetic, epigenetic, and environmental factors of a disease, but with enough data, we may find undeniable links that biological experiments may never have found.

Dr. Butte’s research methods make a lot of sense, but at first I dismissed them as well suited to his particular field (drug discovery) and not to my own research.  But then I paused to wonder, “Am I so deeply embedded in the paradigm of performing my own experiments to answer my questions that the idea of compiling existing data in my field and re-asking more relevant questions is lost on me?”  I am intrigued to learn more – but with caution.  In Anderson’s article about the antiquated scientific method, he indicates that Google does not care why people do things, just that they do; the numbers drive.  For medicine, this is perhaps akin to trading in an understanding through mechanism for an understanding through what actually IS.  I believe that while we do need to focus more on correlating underused data, experiments to understand mechanism and how things work are still very important.  I am excited to see how far data mining will take us, but we must not lose sight of the fact that at its most basic level, science must still be fueled by primary observation and data creation, and, to my mind at least, experimentation and modeling will still hold an important role.  For now though, I believe that the glut of unanalyzed information contains a wealth of life-enhancing knowledge just waiting to be discovered; let us not waste it in the pursuit of making the stockpile redundantly larger.

In a recent TEDMED talk, Dr. Butte leaves us with one last thought: Data is power, data is revolution, data is frozen knowledge … and it is the responsibility of researchers to share this data, and it is the responsibility of the public to mandate that this be done.

