From the field

By Dat Mai

November 25, 2018

With a Starbucks coffee in hand, I enter the Valley Life Sciences Building and shuffle past the looming resident T-Rex skeleton. I make a right at a prehistoric, fishlike fossil and take the seldom-used freight elevator to the lab—office, really. I walk past some headphone-clad lab members as they tap away at their keyboards. When I reach my work station, I set up my laptop, sip on my coffee, and drown out all other noise with some lo-fi jazz through my headphones. I open up the terminal—the black and green screen we see every hacker play around with in movies—and log in to Savio, an off-site computer cluster offered by UC Berkeley for high-performance computing. While a regular laptop can deal with anything ranging from playing music to opening too many tabs on Firefox to gaming, the research that I do requires many times that processing power to get anything done reasonably quickly.

I need these intensive computing resources because I’m a computational biologist. That title might sound strange to many. The first thing that might come to mind when we hear “biology” is the study of life, which includes going out in the wild to study interesting organisms or doing some cutting-edge lab experiments, while the word “computational” immediately brings us back to an office setting devoid of life. But the field of biology is heading towards the world of big data. Companies like 23andMe have obtained genetic data from millions of individuals, generating obscene amounts of data—someone needs to sift through and make sense of it all. That’s where computational biologists like me come in: we use various statistical methods to analyze these large amounts of biological data and hope to glean patterns that answer questions in biology.

I aim to find patterns in DNA that prevent different species from mating with each other in nature by using different species of fruit flies in the Drosophila nasuta group. In nature, different species of these flies don’t mate with each other, but put them in close proximity in the lab and they’re likely to create hybrid flies—those that have DNA from different species. If that’s the case, why don’t we see hybrids in nature? Some potential reasons for this phenomenon could be that hybrid flies are unable to compete with non-hybrid flies for resources in the wild, or that female flies tend to reject male flies of a different species if other same-species males are around.

Our lab has set up what we call hybrid cages, each of which comprises a clear plastic cube around twice the size of a human head, a round opening covered by pantyhose allowing air to go through but not flies, and, inside, flies from two different species and the hybrid offspring they’ve created. The goal is to sequence the hybrids and look for overall patterns in their DNA. Specifically, I’m looking for parts of their genome that are correlated; that is, pairs of genes that usually both come from the same species and rarely from a combination of two different species. The rationale behind this is that, after many generations of mating and hybridizing, the hybrid flies should have all sorts of combinations of DNA from both species. Genomic regions that remain correlated despite this mixing might therefore be those that keep these different fly species from hybridizing in the wild.
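To make "correlated" a little more concrete, here is a minimal sketch of one way such a relationship could be measured. It assumes each hybrid's ancestry at a region has already been coded as the number of copies (0, 1, or 2) inherited from one of the two species, and it uses a plain Pearson correlation across flies. The region names and numbers are invented for illustration, and the lab's actual statistic may well be different.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a region never varies")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one entry per hybrid fly, counting copies of
# "species A" DNA at each region (0, 1, or 2). Not real measurements.
region_1 = [2, 2, 0, 0, 2, 0, 2, 0]
region_2 = [2, 2, 0, 0, 2, 0, 2, 0]  # always matches region_1
region_3 = [2, 0, 2, 0, 0, 2, 2, 0]  # shuffled independently of region_1

linked = pearson(region_1, region_2)
unlinked = pearson(region_1, region_3)
```

In this toy data, `region_1` and `region_2` always travel together and score a correlation of 1, while `region_3` has been mixed freely and scores 0. Pairs that look like the first case after many generations of hybridizing are the interesting ones.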

A few hours later, a lab technician sends me an email stating that QB3, a sequencing facility on campus, has finished sequencing the DNA of a few hundred hybrid flies. This is great news. I take another swig of coffee. Typing fewer than 50 characters into the terminal, I transfer the files—one for each sequenced hybrid—from QB3’s computer server to Savio. The names of these hundreds of files flash in front of me in less than a minute and the transfer is complete. I next begin the painstaking process of doing quality control on my data. As with all experiments, sequencing isn’t perfect. The results I get—millions of lines filled with the A’s, T’s, C’s, and G’s that represent the bases in the fruit flies’ DNA—have been pushed through a statistical model that tells me which parts of my results are likely to be accurate, e.g., whether an A in a particular position is really an A. I write a quick script using Vim, my text editor of choice, where I loop through each file and run a program that spits out key summary statistics about the quality of my data. From experience, the quality checking process will take around an hour or three. I take a few more sips of coffee and check the results every few minutes as they come in.
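For readers curious what a per-base accuracy score looks like, here is a small sketch. It assumes the sequencing output uses the common FASTQ convention, in which each base carries a Phred quality score Q, stored as a single character (Phred+33 encoding), with error probability 10^(-Q/10). The article does not say which format or quality-control program is actually used, so the function names, the cutoff of 20, and the example read below are all illustrative.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Turn a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def summarize_read(bases, quality_string, min_q=20):
    """Summary statistics for one read; min_q=20 is an assumed cutoff."""
    scores = decode_qualities(quality_string)
    if len(scores) != len(bases):
        raise ValueError("each base needs exactly one quality character")
    if not scores:
        raise ValueError("empty read")
    return {
        "length": len(bases),
        "mean_quality": sum(scores) / len(scores),
        "low_quality_bases": sum(1 for q in scores if q < min_q),
    }


# A made-up four-base read: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
summary = summarize_read("ACGT", "II5#")
```

A score of 20 corresponds to a 1-in-100 chance that the base is wrong, and a score of 40 to 1 in 10,000. Looping a summary like this over every read in every file is, in spirit, what a quality-control pass does.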

I lucked out today. This process finished in two hours, and I had to do minimal data cleaning—removing parts of the data that are unlikely to be accurate. Now I need to run the processed data through a pipeline that uses a machine learning algorithm called a Hidden Markov Model that will look at sections of the genome of a fly and determine if each region contains DNA from only one species or from both species. These are the first of many steps that I will need to take to see which regions in the fly genome are correlated with each other. I know for certain that this program will take at least two days to run. I drink a bit more of my coffee, grateful that I got a grande-sized cup.
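The idea behind a hidden Markov model can be sketched in a few dozen lines. The toy version below is not the pipeline described here. It assumes three hidden states per genome position (both copies from species A, one from each, or both from species B), a noisy genotype observation at each position, and invented values for the error and switching rates. It then finds the most likely sequence of states with the standard Viterbi algorithm.

```python
import math

# Hidden states: ancestry of the two chromosome copies at a position.
STATES = ("AA", "AB", "BB")


def viterbi(observations, error_rate=0.05, switch_rate=0.01):
    """Most likely ancestry state at each position (toy model).

    observations: list of noisy calls, each one of "AA", "AB", "BB".
    error_rate and switch_rate are illustrative assumptions.
    """
    if not observations:
        return []

    def log_emit(state, obs):
        # A call matches the true state unless a sequencing error occurred.
        return math.log(1 - error_rate if state == obs else error_rate / 2)

    def log_trans(prev, cur):
        # Neighboring positions usually share ancestry.
        return math.log(1 - switch_rate if prev == cur else switch_rate / 2)

    # Start with a uniform prior over states.
    scores = {s: math.log(1 / len(STATES)) + log_emit(s, observations[0])
              for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + log_trans(p, cur))
            new_scores[cur] = (scores[best_prev] + log_trans(best_prev, cur)
                               + log_emit(cur, obs))
            pointers[cur] = best_prev
        backpointers.append(pointers)
        scores = new_scores

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Made-up calls: a lone "AB" inside a run of "AA", followed by a run of "BB".
calls = ["AA"] * 5 + ["AB"] + ["AA"] * 5 + ["BB"] * 6
ancestry = viterbi(calls)
```

With these assumed rates, the lone "AB" call is treated as a probable error and smoothed over, while the sustained run of "BB" is recognized as a real change in ancestry. That smoothing is the reason to use a model of this kind rather than trusting each position on its own.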

It sounds like I have the next two days off, but science isn’t that kind. I need to use the apparent down time to read up on the pipeline’s output data format, what it means, and how these results are calculated by the pipeline. I take the next hour or so to read the pipeline’s documentation to get a better understanding of how it works and what type of results come out at the end. Once I’m somewhat comfortable with what to expect, I decide to write some code to process and transform the results so they can be used with other software that will estimate the correlation between genomic regions. However, doing this will require the code to go through multiple files, analyze them, and regurgitate a new one in the format I need—all without taking too long to run or using too many computational resources. I open up Vim again and start coding in a programming language called Python to do this. Before long, it’s already 10 p.m. and, in true programmer fashion, I have a large chunk of code that won’t work. I finish up the last sips of the now-cold coffee and give up for the night. I have another two days to work on this anyway.


Dat Mai is a graduate student in integrative biology.

This article is part of the Fall 2018 issue.