In 1972, geneticist Susumu Ohno coined the term “junk DNA” to describe every component of the human genome that was not a gene. Suspicious of the assumption that all three billion base pairs of human DNA were functionally important, Ohno wrote, “Triumphs as well as failures of nature’s past experiments appear to be contained in our genome.” Nearly a decade later, Francis Crick and Leslie Orgel published a review in Nature entitled “Selfish DNA: the ultimate parasite,” arguing that most DNA in higher organisms was, similarly, “little better than junk.”
For many years, the idea that the genome was divided cleanly into two categories (short stretches of genes interspersed among long spans of junk) was widely accepted. But by the early 1990s, the concept had begun to grow stale. Geneticists were gradually uncovering more and more functionally significant roles within the “junk” regions, and the very definition of a gene itself was beginning to change. Nevertheless, when the full sequence of the human genome was finally published in 2004, many people were shocked to discover just how few genes our DNA actually contains. Representing only two percent of the entire genome, genes were vastly outnumbered by mysterious noncoding regions. But if this “dark genome” really wasn’t junk, what could it all be doing?
“When you first think about genetics 15-20 years ago, the goal was simply to understand the code—the code as it related to genes, gene expression, and the production of proteins,” says Gary Karpen, a senior staff scientist in the Life Sciences Division of Lawrence Berkeley National Laboratory (LBL). “But then it became clear that the code was simply not enough.” Karpen and a team of over 150 other scientists have just completed an ambitious project whose aims were, according to Karpen, “the next level up” from straight code—at the level of mapping function in the dark genome. What is emerging is a far better idea of the importance of this largely unexplored genetic landscape, a picture of DNA as a dynamic template for life.
The birth of modENCODE
The project, called the model organism Encyclopedia of DNA Elements (modENCODE), was born out of a sister initiative launched in 2003 called ENCODE, which aimed to catalog the complete “parts list” of the entire human genome. The pilot phase of ENCODE centered on annotating only one percent of human DNA, but the complexity of the human genome and the limits of technology at the time necessitated a slight shift in focus.
Thus, in 2007 the National Human Genome Research Institute (NHGRI) launched modENCODE as a parallel effort involving two simpler subjects: the roundworm Caenorhabditis elegans and the fruit fly Drosophila melanogaster. The four-year, $57 million project hoped to identify, if possible, the functional role of every base in the worm and fruit fly genomes. These two model organisms represent far better understood genetic systems than the human genome and, at roughly 100 million and 180 million base pairs, respectively, far more feasible subjects for the genome-wide analysis NHGRI aimed to achieve. The hope was that ultimately modENCODE could serve as an extended pilot for the entire human ENCODE project, helping us better understand how it is that complex, three-dimensional organisms arise out of linear strands of DNA.
A manifold blueprint
DNA is made up of four different molecules called nucleotides, paired and bound together to form the two anti-parallel twisting threads of the double helix. Some segments of DNA are known as genes, meaning that their nucleotides will be transcribed into a slightly different chemical form called RNA. A specific type of this RNA—called messenger RNA, or mRNA—will then leave the nucleus to serve as a template for synthesis of the protein building blocks that carry out our cellular processes. Proteins not only make up the structural framework of our cells, they also catalyze most of the chemical reactions that make cells work.
Yet all cells, from kidney cells to neurons to muscle cells, possess exactly the same copy of DNA. In its entirety, DNA exists only as a template from which an immense number of readouts can occur; not all genes are expressed at all times in all cells, and it is precisely this capacity for different combinations of expression that allows for the astonishing diversity of our cellular processes. Geneticists are still unclear exactly how these highly ordered patterns of gene expression are achieved. The answer may lie in the dark genome.
From base to function
The architects of the modENCODE project sought to chip away at this question by first assembling a map. By annotating the function of every base of DNA in the two model organisms, they hoped to gain some insight into how transcription is regulated across cell types and throughout development.
They analyzed function along two broad sets of factors. The first set, referred to as “functional elements,” includes small proteins that regulate transcription, as well as non-coding RNAs (ncRNAs) that help to regulate gene expression after transcription but before protein synthesis. The second set, known as epigenetic elements, is not contained in the sequence of DNA itself; it includes chemical marks on DNA and its packaging proteins that physically influence which regions of the genome are silent or active. Over 50 participating labs around the world analyzed specific types of functional or epigenetic elements in one of the two model organisms to assemble a topographical map of function along the linear DNA sequence.
“We wanted to crack the code to discover the rules required to read a genome—any genome,” says Susan Celniker, head of the Department of Genome Dynamics at LBL who, along with Karpen, was one of the senior principal investigators for modENCODE. Her lab was on the Drosophila team and was responsible for mapping out the entire transcriptome—all of the sequences of DNA that are transcribed into RNA.
Counting both coding and non-coding RNAs, the transcriptome comprises about 60 percent of the fly genome. In order to screen such vast amounts of RNA with single-base resolution, Celniker’s group used a high-throughput technique known as RNA-seq. Investigators isolate the more than 25 million scattered fragments of RNA that have been transcribed from DNA. After making some chemical modifications that allow sequencing to occur, they convert the RNA back to DNA through a process called reverse transcription, giving them the complementary DNA, or cDNA, for the original set of RNA fragments. They then sequence the cDNA and align it with the original genome sequence to map the transcriptome.
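The steps just described can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the pipeline Celniker's group used: the genome string, the RNA fragments, and the helper names (reverse_transcribe, align_exact, coverage) are all hypothetical, and exact string matching stands in for a real read aligner.

```python
# Toy RNA-seq sketch: RNA fragments -> cDNA -> alignment -> per-base coverage.
# All sequences and function names are invented for illustration.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
RNA_TO_CDNA = {"A": "T", "C": "G", "G": "C", "U": "A"}


def reverse_transcribe(rna):
    """First-strand cDNA: the DNA reverse complement of an RNA fragment."""
    return "".join(RNA_TO_CDNA[base] for base in reversed(rna))


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def align_exact(read, genome):
    """Return (start, strand) for the first exact match on either strand, else None."""
    pos = genome.find(read)
    if pos != -1:
        return pos, "+"
    pos = genome.find(reverse_complement(read))
    if pos != -1:
        return pos, "-"
    return None


def coverage(genome, rna_fragments):
    """Count how many aligned fragments overlap each base of the genome."""
    depth = [0] * len(genome)
    for rna in rna_fragments:
        cdna = reverse_transcribe(rna)
        hit = align_exact(cdna, genome)
        if hit is None:
            continue  # fragment does not map to this toy genome
        start, _strand = hit
        for i in range(start, start + len(cdna)):
            depth[i] += 1
    return depth


genome = "TTGACCATGGCTAAGTCCGGATAA"
rna_fragments = ["AUGGCUAAG", "GCUAAGUCC", "GGGGGGGG"]

depth = coverage(genome, rna_fragments)
transcribed = sum(1 for d in depth if d > 0)
print(depth)
print(f"{transcribed / len(genome):.0%} of bases covered by at least one fragment")
```

Bases with nonzero depth make up the toy "transcriptome"; the depth at each base is the same quantity that, averaged over annotated transcripts, gives a fold-coverage figure.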
Celniker’s group generated almost six thousand-fold coverage of the previously annotated fly transcriptome. Combing through their RNA-seq data, they identified nearly two thousand new transcribed regions that had been missed in previous annotations. These new regions include sequences that encode small proteins, as well as small non-coding RNAs that participate in the regulatory machinery that helps control gene expression and protein production. In perhaps their highest-impact finding, Celniker’s group identified over 22,000 new splice junctions, areas where, after transcription, distinct chunks of transcripts can be cut out, allowing for different combinations of mRNA. Alternative splicing thus allows a single gene to code for several different proteins, based on the different possible patterns of cutting and pasting.
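To see how cutting and pasting multiplies the products of one gene, consider a minimal sketch. It assumes a hypothetical gene whose exons are either always kept or optionally skipped; the sequences and the isoforms helper are invented, and exon skipping is only one of several real splicing patterns.

```python
from itertools import product


def isoforms(exons):
    """List every mRNA formed by keeping all constitutive exons and
    either including or skipping each optional exon.

    exons: list of (sequence, is_optional) pairs in genomic order.
    """
    # An optional exon contributes its sequence or nothing; others always appear.
    choices = [(seq, "") if optional else (seq,) for seq, optional in exons]
    return sorted({"".join(parts) for parts in product(*choices)})


# Hypothetical gene: two constitutive exons flanking two optional ones.
gene = [
    ("AUGGCU", False),
    ("GAACUG", True),
    ("CCAUUU", True),
    ("GGCUAA", False),
]

for mrna in isoforms(gene):
    print(mrna)
print(len(isoforms(gene)), "distinct transcripts from one gene")
```

Each additional optional exon doubles the number of possible transcripts in this simplified model, which hints at why tens of thousands of newly found junctions matter.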
The discovery of the vast number of previously unidentified splice junctions and new transcripts gives us a far better idea of the sheer quantity of potential protein products in each cell. Insight into an additional layer of function, however, is provided by the identification of the new non-coding RNAs, many of which are involved in splicing events, promoting or repressing transcription, or silencing mRNAs to finely control levels of protein synthesis. The overlapping output of these two mechanisms (a variety of combinations within transcripts and an intricate regulatory machinery) is crucial to understanding our genome’s differential workings from cell to cell.
Illustrating this, Celniker’s group then carried out comparisons across 27 distinct developmental stages as well as between the sexes. Interestingly, they found that the number of expressed genes increases from around 7,000 in embryonic flies to around 12,000 in adults. They also analyzed changes in expression patterns of specific genes across development, finding genes that are highly upregulated in the larval developmental stages and then essentially shut off as the fly matures. Between the sexes, they noted that adult males express around 3,000 more genes than their female counterparts. The functions of all of these genes are not yet known, but they are all clearly implicated in development—both across time and between sexes. Celniker hopes that her group’s identification of the genes will spark more targeted research in the Drosophila community.
“For me,” says Celniker, “the project will not be over until I know exactly how a single cell with its single copy of DNA turns into a complex organism like the fly. We’re not there yet, but we’re certainly assembling the building blocks.”
The chromatin landscape
With 100 and 180 million base pairs even in organisms as “simple” as the worm and the fruit fly, each copy of DNA is simply too long to exist as a linear molecule in a tiny cell. Instead, it is condensed and packaged into chromosome pairs—the worm has six and the fruit fly has four, while humans have 23. Chromosomes are made up of chromatin, which consists of DNA wrapped around clusters of tiny proteins called histones, arranged along the DNA like beads on a string. These histone-DNA spools then supercoil around themselves in meandering loops and folds, finally forming the tightly-packed structure of chromatin.
Karpen and his lab at LBL study what is called the “chromatin landscape” of the fruit fly: the hundreds of chemical tags that can be added to histones to ultimately affect levels of transcription. These modifications are recognized by cellular machinery that responds to the chemical signals, resulting in silencing or activation of the DNA in the tagged region of chromatin. Histone modifications are one of several types of epigenetic mechanisms that influence gene expression. They are not encoded within the genome; rather, they impact the readout of DNA through changes to the protein components of chromatin. These epigenetic changes are also heritable, meaning the modifications are passed along through cell divisions and can lead to unique signatures amongst different cell types.
“Most of the time, people have studied these histone modifications in isolation,” says Karpen. “But what we were interested in is how they work in combination.” Using a method called chromatin immunoprecipitation (ChIP) and high-throughput sequencing, Karpen and his group were able to identify chromatin marks associated with various regions of the fly genome. By looking at different combinations of 18 specific chromatin marks, they delineated about 30 distinct chromatin states correlated with the position of genes and their levels of expression. These states included highly predictable associations with transcription start sites, gene length, silent or active regions, and even gene function. “There’s an issue here with cause and effect,” Karpen says. “It’s not just the type of modification that’s important, but where the modification is, which histone, which amino acid in that histone, what recognizes that modification, what other proteins are brought in—there’s a lot of complexity.”
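The idea of reading marks in combination can be illustrated with a small sketch. It is not the statistical method Karpen's group used, which is not described here. It assumes three well-known histone marks chosen purely for illustration, invented enrichment values for six genomic bins, and an arbitrary threshold; the helpers state_of and describe are hypothetical.

```python
from collections import Counter

# Three histone marks chosen for illustration only.
MARKS = ("H3K4me3", "H3K36me3", "H3K27me3")

# Invented ChIP enrichment values, one tuple per genomic bin, one value per mark.
bins = [
    (5.1, 0.3, 0.2),
    (4.7, 0.4, 0.1),
    (0.5, 3.9, 0.2),
    (0.4, 4.2, 0.3),
    (0.2, 0.1, 6.0),
    (0.3, 0.2, 0.1),
]


def state_of(signal, threshold=1.0):
    """Collapse a bin's enrichment values into a presence/absence pattern."""
    return tuple(value > threshold for value in signal)


def describe(state):
    """Name a pattern by the marks that are present in it."""
    present = [mark for mark, on in zip(MARKS, state) if on]
    return "+".join(present) if present else "no marks"


# Group bins that share the same combination of marks.
counts = Counter(describe(state_of(signal)) for signal in bins)
for name, n in counts.most_common():
    print(f"{name}: {n} bins")

# Number of on/off patterns that 18 marks could form in principle.
print(2 ** 18, "possible patterns for 18 marks")
```

With 18 marks the number of conceivable on/off patterns is far larger than the roughly 30 states the group reported, which suggests that the genome reuses a limited set of combinations.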
Karpen stresses that this is just the beginning of this type of broader analysis of chromatin marks; although they thoroughly characterized 18 histone modifications, hundreds remain. Regardless, Karpen’s work adds another topographical layer to the genomic landscape. While functional elements control gene expression at the level of DNA and RNA, transcription and protein synthesis, epigenetic elements allow for yet another route of cell divergence, one that occurs above the level of DNA sequence. “This is really the level of dynamic genomics,” Karpen says. “I have to say, I just find the fact that we know so little incredibly exciting.”
From map to model
Once the individual research groups had all assembled their final data, Drosophila modENCODE had over 700 datasets profiling transcripts, histone modifications, and replication programs. Karpen, Celniker, and the rest of the Drosophila team then submitted their finished datasets to Manolis Kellis, head of the Computational Biology Group at the Massachusetts Institute of Technology. Kellis headed the modENCODE Data Analysis Center, which took all of the finished data and integrated it into a coherent story, creating the predictive and comparative genomics models that the consortium hopes will eventually help shed light on parallels in the human genome.
“The biggest question we asked ourselves was, how do we go beyond simple annotation? How do we compare all these datasets together to reveal new insights?” says Kellis. To do so, Kellis and his group at MIT attempted to reconstruct the full regulatory network of the fly from the pooled datasets.
To assess the completeness of their reconstructed model, Kellis’s Data Analysis Center attempted to predict gene expression levels based solely on the expression levels of regulators. Looking across numerous developmental stages and cell lines, Kellis’s group was able to successfully predict over 60 percent of gene expression patterns in about a quarter of the cell lines studied.
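A heavily simplified sketch conveys the flavor of predicting a gene's expression from a regulator's. It is not the Data Analysis Center's model: it assumes a single hypothetical regulator, a straight-line relationship, invented expression values, and an arbitrary 10 percent tolerance for calling a prediction successful.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented expression levels across developmental stages.
regulator_train = [1.0, 2.0, 3.0, 4.0]
target_train = [2.1, 3.9, 6.2, 7.8]
regulator_test = [5.0, 6.0]   # held-out stages
target_test = [10.1, 11.8]    # measured target expression at those stages

# Learn the relationship on the training stages only.
slope, intercept = fit_line(regulator_train, target_train)
predicted = [slope * x + intercept for x in regulator_test]

# Count a held-out stage as predicted if within 10 percent of the measurement.
hits = sum(abs(p - y) <= 0.10 * y for p, y in zip(predicted, target_test))
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"{hits} of {len(target_test)} held-out stages predicted within tolerance")
```

Scoring predictions on stages the model never saw is what makes the fraction of successes a fair measure of how complete the reconstructed network is.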
These are only very preliminary models, Kellis says, and predicting the expression patterns of an entire genome remains an enormously complex problem. For modENCODE’s first round of predictive modeling, for example, the group was only able to incorporate a certain subset of pretranscriptional functional elements whose targets are already well-established. As more and more of the targets of the newly mapped regions are characterized, Kellis and others in the computational field will be able to cast a wider net to tease out the underlying logic of genomics. “We can only assume that the rules are there and keep looking,” says Kellis. “But the reproducibility of biology tells us that these rules must exist.”
The future of ENCODE
The original draft of the human ENCODE stated that the project would proceed in three stages: a pilot phase, a technology development phase, and a production phase. Now that modENCODE is complete and the methodologies are finally tested and refined, all that remains for ENCODE is the massive production phase. “There’s been a lot of thinking about how to go about systematically understanding the human genome, and it’s out of those conversations that modENCODE emerged,” says Kellis. The task is no less gargantuan, but with the technology and framework finally in place, a completed human ENCODE may only be a few years away.
With the modENCODE papers now published, more than 80 percent of the fruit fly genome is annotated and fully available to the public, up from about 25 percent before the project began. Yet though the consortium has assembled an impressively huge dataset, we are still unable to trace exactly how a single cell with a single copy of DNA becomes a complex living and breathing organism. The Drosophila and C. elegans genomes have been “mapped,” but it’s really only the faint outlines of function that have emerged; we do not yet know the intricate mechanisms by which each of the elements works, let alone their very specific targets. “The modENCODE project was really just interested in providing a starting map, the equivalent of the first explorers coming to the New World,” says Karpen. “We need large-scale projects like this to provide the kind of foundational knowledge that allows the more intricate mechanisms to be worked out from there.” A complete understanding of life’s genetic computations may be far off, but we now have the first maps to guide us. The dark genome is getting lighter and lighter.