

ENCODE: the Encyclopedia of DNA Elements
September 6, 2012 7:52 AM   Subscribe

In 2001, we learned the sequence of our genome; now, we have amassed a vast amount of knowledge about what those sequences actually do. Yesterday, the data from the ENCODE project went live.

ENCODE, the Encyclopedia of DNA Elements, is a project to identify and annotate all functional elements of the human genome, including transcription, transcription factor association, chromatin structure, and histone modification. It is a major achievement that was announced in a fanfare of publications. An elegant and accessible description of the project was posted by science writer Ed Yong on his Discover blog Not Exactly Rocket Science.

The data is publicly explorable through a very nice interface, including an iPad app and a virtual machine.

ENCODE constitutes a vast amount of data that will have a significant impact on research in genetics, bioinformatics, and medicine. Ewan Birney, the lead data analysis coordinator, discusses in Nature how the vast amount of data was wrangled, and has posted additional thoughts on his blog.
posted by Westringia F. (32 comments total) 41 users marked this as a favorite

 
Thanks for writing such a well-linked post! I plan on spending a lot of today exploring this data. It's an enormous achievement in biology, a significant breakthrough in understanding how the genome actually works. If I may add one more link, the NYTimes article is a good layman's explanation of what the project is and why it's so important.
posted by Nelson at 8:18 AM on September 6, 2012


Wow, thanks for this. I've been waiting for this project to publish, it's exciting to see the start that they've made and what conclusions they've begun to draw.

That post on Not Exactly Rocket Science* is superb; I'm bookmarking it to use as my go-to explanation of sequencing and genome annotation. I'd strongly suggest it as the best place for non-biologists to start on this post. This announcement is very much a small first installment in a long, slow series of experiments and analysis but, even given that there's much more to do, the release of these data should facilitate a huge jump in our understanding of the genome and what it does.

Is that interface on Nature's site something they do often for big releases of papers on related topics? At first glance, it's a really good way of presenting the information, and it'd be interesting to see the same principle applied to other sets. Regardless, I can see myself spending a few hours over the next few days digging through this lot.

*You have a typo in the post: the blogger's name is Ed Yong.
posted by metaBugs at 8:23 AM on September 6, 2012


Squee! I haven't gotten to read much except a few blog posts. Michael Eisen's post points out that the media (and PR) emphasis on "junk DNA isn't junk!" is pretty misguided. It's pretty awe-inspiring and it makes me happy that humans can do huge projects like this.
posted by R343L at 8:38 AM on September 6, 2012


All the associated publications are Open-Access, btw.

Aw, crikey -- thanks for the correction, metaBugs! I'll ping the mods.
posted by Westringia F. at 8:38 AM on September 6, 2012 [1 favorite]


[typo fixed, carry on!]
posted by jessamyn at 8:42 AM on September 6, 2012


The best parts?
1. We eliminate a misnomer. No more "junk DNA", it all has a function.
2. All the journals are open access, according to the last link.
posted by francesca too at 8:53 AM on September 6, 2012 [1 favorite]


The virtual machine strikes me as better than the "junk DNA" misnomer going away, but I won't quibble. To a software person like me, the idea of getting data "out there" in an easily re-usable way is awesome.
posted by R343L at 9:15 AM on September 6, 2012


R343L: "The virtual machine strikes me as better than the "junk DNA" misnomer going away, but I won't quibble. To a software person like me, the idea of getting data "out there" in an easily re-usable way is awesome."

Maybe I misunderstand the purpose of the VM, but I figure a vagrant/veewee deploy would be saner.
posted by pwnguin at 9:39 AM on September 6, 2012


Unfortunately I found this only now, but here is ENCODE on MeFi Projects! There are TONS of great links there -- very few of which are duplicated here -- as well as discussion by MeFi's own grouse and Blazecock Pileon, both of whom are contributors to the ENCODE project.
posted by Westringia F. at 9:41 AM on September 6, 2012 [3 favorites]


Hopefully he'll be along here at some point, but grouse also answered questions in an interesting Reddit thread.
posted by lalex at 9:48 AM on September 6, 2012


This is a fantastic achievement.

| 1. We eliminate a misnomer. No more "junk DNA", it all has a function.
I agree. I think that most people working in the field never really accepted this kind of tabloid label.

This quote from a commenter on The Guardian's article relating to the ENCODE publications pretty much sums up the damage it has done to public understanding of the genome:

"How can scientists look for "errors" in DNA at the same time as discarding the parts that they don't understand as JUNK? More like JUNK science!"

To me, it was always a way of attributing a lack of understanding to a biological phenomenon. If you work with genomes, you are probably no longer surprised by the ever-increasing complexity of the regulatory and gene-expression information revealed by the regular torrent of publications on the subject. Genomes are truly amazing things!
posted by SueDenim at 9:51 AM on September 6, 2012


pwnguin: I imagine people doing additional work with the data will probably use the tools and data from here directly (and obviously gather more of their own). But the VM provides a very simple way for someone to get at (theoretically) everything needed to understand how the various results in all those papers were reached.
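For anyone who'd rather poke at the downloads than the VM, here's a minimal Python sketch of parsing a BED-style peak file like those served from genome annotation projects. The four-column layout and file contents here are illustrative, not the format of any specific ENCODE release:

```python
# Sketch: parse a minimal BED-format interval file into tuples.
# Real ENCODE downloads have more columns; this handles the core four.
from io import StringIO

def parse_bed(handle):
    """Yield (chrom, start, end, name) from minimal BED lines."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and browser/track headers
        chrom, start, end, *rest = line.split("\t")
        name = rest[0] if rest else "."
        yield chrom, int(start), int(end), name

example = StringIO("chr1\t1000\t1500\tpeak1\nchr2\t2000\t2600\tpeak2\n")
peaks = list(parse_bed(example))
# peaks == [("chr1", 1000, 1500, "peak1"), ("chr2", 2000, 2600, "peak2")]
```

BED coordinates are 0-based and half-open, which is why the start/end values can be compared and subtracted directly without off-by-one fudging.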

Also, I posted a MeFi meetup to buy grouse a beer and I'm happy to learn about another project contributor as well. Congratulations to grouse and Blazecock Pileon!
posted by R343L at 9:52 AM on September 6, 2012


Cool, so can we fix my asthma, migraines, and allergies yet?
posted by Aizkolari at 10:19 AM on September 6, 2012


From one of the main links:

The main paper has nearly 450 authors, working from more than 30 institutions.

Huge &
Kudos!
posted by bukvich at 10:25 AM on September 6, 2012


Ewan Birney is my hero.

Large projects like this are so much better than postdoc-created academic abandonware that dies with the publication.
posted by benzenedream at 11:00 AM on September 6, 2012 [2 favorites]


This is a nice post! As mentioned, I have been heavily involved in this project for four years.

All the journals are open access, according to the last link.

Transparency and open access are an important principle for us. Raw and processed data were uploaded to our Data Coordination Center immediately after they were shown to be reproducible, where they were made available for anyone to study. People had permission to use any of this data in their own publications nine months after it was made available, without having to wait for the ENCODE researchers to publish their own analysis first. Our "User's Guide to the Encyclopedia of DNA Elements" from April 2011 already has more than 100 citations in the literature. This is important—in other projects people have started using publicly-available but embargoed data and have had to wait a long time for the primary researchers to publish their work first, which always takes longer than one thinks. With our data release policy, you always know exactly when you will be able to publish your own work.
posted by grouse at 11:27 AM on September 6, 2012 [6 favorites]


Don't miss Blazecock Pileon's description of a part of what his group contributed to the effort.

Do I need to say "MeFi's own Blazecock Pileon" or is that redundant?
posted by grouse at 11:41 AM on September 6, 2012 [1 favorite]


Hah:
Given these blurred boundaries, Gingeras thinks that it no longer makes sense to think of a gene as a specific point in the genome, or as its basic unit. Instead, that honour falls to the transcript, made of RNA rather than DNA. “The atom of the genome is the transcript,” says Gingeras. “They are the basic unit that’s affected by mutation and selection.” A “gene” then becomes a collection of transcripts, united by some common factor.
Suck it, Dawkins
posted by crayz at 11:54 AM on September 6, 2012


Any biologists like to explain exactly what the difference between "biochemically active" and "functional" is?

Where does the teleology of the latter come from? "Function" in what sense?
posted by stonepharisee at 1:07 PM on September 6, 2012


Any biologists like to explain exactly what the difference between "biochemically active" and "functional" is?

Obviously, people are interested in the "functional" parts of the genome rather than the parts that are not "functional." Unfortunately, there is no rigorous definition of "functional" that people will universally agree on. It's not teleological—not a question of whether someone designed it to act a certain way, but whether it actually does something measurable. The threshold of the measurable phenomenon is a matter of debate.

The leader of the analysis effort (and my former PhD supervisor), Ewan Birney has an extensive discussion of this on his blog. A small excerpt:
Does a “functional element” in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism? At their limits (considering all the biochemical activities being a phenotype), these two definitions merge. Having spent a long time thinking about and discussing this, not a single definition of “functional” works for all conversations. We have to be precise about the context. Pragmatically, in ENCODE we define our criteria as “specific biochemical activity” – for example, an assay that identifies a series of bases.
It's worth pointing out that some notable genome biologists strenuously disagree with this definition. There's a little on that in the update in Ed Yong's blog post.
posted by grouse at 1:36 PM on September 6, 2012 [2 favorites]


Thanks grouse. It does indeed seem to me that that could be controversial, and that quite a lot hangs on just what is meant there.
posted by stonepharisee at 2:39 PM on September 6, 2012


Nature news features editor Brendan Maher follows up on some of the immediate reaction to ENCODE. While there is a lot of positive reaction to the big release, a couple of scientists complained that public access to the papers released today was delayed in order to coordinate their simultaneous release. Also, as probably many expected, several scientists said the definition of "functional" went too far. On the other hand, some said it didn't go far enough:
John Mattick, director of the Garvan Institute of Medical Research in Sydney, Australia, who I spoke to in the run-up to the publication of these papers, argued that the ENCODE authors were being far too conservative in their claims about significance of all that transcription. “We have misunderstood the nature of genetic programming for the past 50 years,” he told me. Having long argued that non-coding RNA plays a critical role in cell regulatory functions, his gentle criticism is that “They’ve reported the elephant in the room then chosen to otherwise ignore it.”
posted by grouse at 3:26 PM on September 6, 2012 [2 favorites]


No more "junk DNA", it all has a function.

I'm pretty sure some of it is junk. Somewhere between 5% and 8% of the human genome is made up of fragments of ancient retroviruses. Granted, some of these fragments have probably been co-opted over time into functional genes, but I wouldn't be surprised if most of it is just genomic pollution.

Personally, I'm skeptical of the importance of this project. There's a whole dimension of complexity between the genome and the proteome, and a second, even larger dimension of complexity between the proteome and observable phenotypes. Imagine trying to predict the weather based on your knowledge of atomic theory. In principle, it should be possible. In practice, the number of variables involved and our limited understanding of the physical processes makes the project hopelessly infeasible.

That's not to say such projects aren't worth doing, just that we must be careful to temper our expectations. This isn't a paradigm shift; it's step 112 of a 10^29-step journey.
posted by dephlogisticated at 4:52 PM on September 6, 2012 [1 favorite]


I'm pretty sure some of it is junk.

IANAMB, but it seems like the very fact that it's still there implies much of it has utility. If it were truly junk, then you'd expect it to mutate away into nothingness. Whole chunks could be lost, the way species that live in total darkness lose sight, because the loss of it doesn't matter to the organism.
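That expectation can be illustrated with a toy Python simulation of a sequence decaying under neutral point mutation, with no selection at all. The length, mutation rate, and generation count below are arbitrary illustrative choices, not estimates for any real genome:

```python
import random

def neutral_decay(length=1000, rate=1e-3, generations=2000, seed=0):
    """Fraction of sites still matching the ancestral base after
    `generations` rounds of neutral point mutation (no selection).
    A mutation redraws the base uniformly, so it can revert by chance."""
    rng = random.Random(seed)
    bases = "ACGT"
    ancestor = [rng.choice(bases) for _ in range(length)]
    seq = ancestor[:]
    for _ in range(generations):
        for i in range(length):
            if rng.random() < rate:
                seq[i] = rng.choice(bases)
    matches = sum(a == b for a, b in zip(ancestor, seq))
    return matches / length

identity = neutral_decay()
# with these defaults, identity drifts down toward the ~25% random-match
# baseline -- unconstrained sequence loses its signal over time
```

Which is exactly the catch: sequence that is still recognizably conserved after long evolutionary times is usually taken as evidence that selection is holding on to it.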
posted by CheeseDigestsAll at 6:44 PM on September 6, 2012


“They’ve reported the elephant in the room then chosen to otherwise ignore it.”

Ah, I just love to watch scientists bickering, that is just the way it should work!
posted by sammyo at 7:03 PM on September 6, 2012


IANAMB, but it seems like the very fact that it's still there implies much of it has utility. If it were truly junk, then you'd expect it to mutate away into nothingness.

As you say, genes which no longer provide an evolutionary advantage tend to degenerate over time. That's exactly what's happened with the endogenous retroviruses. To my knowledge, no ERVs identified thus far have been found functionally intact (that is, capable of replication); they are fragments and pieces of retroviruses scattered throughout the genome showing various degrees of degeneration. In fact, biologists use the degree of degeneration (or more accurately, homology with known viruses) as a way of estimating the age of germline insertion. Some of the more recent ones have been estimated to be only 150,000 years old, which is quite recent in evolutionary terms.
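That dating approach can be sketched as a simple molecular-clock calculation: an ERV's two LTRs are identical at insertion, so their subsequent divergence divided by twice the substitution rate gives a rough age. The rate below (~2.2e-9 substitutions per site per year) is an illustrative figure for a neutral human rate, not a value taken from any of these papers:

```python
def erv_age_years(divergence, subs_per_site_per_year=2.2e-9):
    """Rough molecular-clock age for an ERV insertion, from the
    divergence between its two LTRs (identical at insertion time).
    t = d / (2 * mu): each LTR accumulates mutations independently."""
    return divergence / (2 * subs_per_site_per_year)

# e.g., 0.1% LTR-to-LTR divergence at the assumed rate:
age = erv_age_years(0.001)
# roughly 230,000 years -- recent, in evolutionary terms
```

The factor of 2 is the easy thing to forget: both copies drift apart simultaneously, so the observed divergence accrues at twice the per-copy rate.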
posted by dephlogisticated at 7:59 PM on September 6, 2012


Imagine trying to predict the weather based on your knowledge of atomic theory. In principle, it should be possible. In practice, the number of variables involved and our limited understanding of the physical processes makes the project hopelessly infeasible.

I have to disagree with the aptness of this analogy. In practice, we already have myriad examples where we can trace organism-level phenotype back to very simple changes in DNA or gene regulation. Butterflies and hurricanes it is not.
posted by grouse at 9:47 AM on September 7, 2012


Well, some of it -- and I admit I'm protecting my research turf a bit here ;) -- is butterflies and hurricanes. But that's exactly why the ENCODE project is so valuable: the better we understand the components and connections of a system, the better we can model its behavior.

I'd like to say something about this, because it makes a very interesting point:

Imagine trying to predict the weather based on your knowledge of atomic theory. In principle, it should be possible. In practice, the number of variables involved and our limited understanding of the physical processes makes the project hopelessly infeasible.

I'm a physicist whose interest in complex systems led me into computational biology, and I often feel (in some joint-smoking, far-out-man way) that the state of knowledge in biology today resembles that of physics 150 or so years ago. We had, at the time, a clear grasp of Newtonian mechanics and could describe collisions between particles; we also had a description of the thermodynamics of gases, and could describe how temperature related to expansion, &c. What we lacked was a way to connect the microscopic to the macroscopic: a way to connect the atomic-level detail to the weather [in equilibrium!!!]. That's where statistical mechanics came in -- instead of accounting for every molecule in the gas, we're able to step back, average over them, and say something about the bulk behavior of the system; and, furthermore, we could deduce how those bulk properties would depend on the characteristics of its microscopic constituents. We don't predict the weather by accounting for every atom, but we can say how the atomic composition of the air will affect it.

I think there's an analogy of sorts to be made there. Due to huge advances in high-throughput assays, we have increasingly detailed genomic, transcriptomic, and proteomic descriptions of our samples; we also have very good descriptions of their bulk characteristics, i.e., their phenotypes. What we need now is, in some sense, a stat mech for biology: a way to connect the microscopic to the macroscopic in a way that captures and summarizes the essential molecular properties without getting bogged down in detail. And that's where something like the ENCODE project is very valuable, because it provides clues to model-builders like myself about what can be ignored, what can be averaged over, and what must be tracked in detail.
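As a toy version of that micro-to-macro averaging, here's a Python sketch that recovers a bulk property (temperature) from simulated per-particle speeds via (3/2)kT = ⟨(1/2)mv²⟩. The particle count and gas mass are arbitrary illustrative choices:

```python
import math
import random

def temperature_from_speeds(speeds, mass):
    """Recover a bulk property by averaging over microscopic states:
    (3/2) k T = <(1/2) m v^2>, so the mean kinetic energy gives T."""
    k_B = 1.380649e-23  # Boltzmann constant, J/K
    mean_ke = sum(0.5 * mass * v * v for v in speeds) / len(speeds)
    return (2.0 / 3.0) * mean_ke / k_B

# Draw 20,000 particles from a Maxwell-Boltzmann distribution at 300 K
# (each velocity component is Gaussian with sigma = sqrt(kT/m)), then
# recover the temperature with no per-particle bookkeeping at all.
rng = random.Random(0)
m_N2 = 4.65e-26  # approximate mass of an N2 molecule, kg
sigma = math.sqrt(1.380649e-23 * 300.0 / m_N2)
speeds = [
    math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(3)))
    for _ in range(20_000)
]
T_recovered = temperature_from_speeds(speeds, m_N2)
# T_recovered lands within a few kelvin of 300 for this sample size
```

The point of the toy: none of the 20,000 individual trajectories matters; only the distribution does. Finding the biological analogue of "the distribution" is the hard part.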
posted by Westringia F. at 10:27 AM on September 7, 2012


In practice, we already have myriad examples where we can trace organism-level phenotype back to very simple changes in DNA or gene regulation.

There are a number of diseases which are indeed tied to specific alleles or point mutations. Similarly, there are some phenotypic traits (e.g., eye color) which depend on only one or a few very well-characterized genes. But the vast majority of diseases do not have a straightforward genetic cause.

To give an example, there are a number of genes that are associated with Alzheimer's disease—that either cause or predispose someone to develop it. But the majority of people who develop AD do not actually have these genes, and some of the people who do have the genes don't develop AD. So either the cause is not fully genetic (which is likely), or the disease involves complex interactions between a large number of genes whose individual influence is too low to statistically detect.

Pick any major health condition (heart disease, breast cancer, diabetes, etc.) and you tend to find a similar situation; there are genetic influences, but they are neither necessary nor sufficient to explain the disease. The genome is simply a list of parts and some basic rules for when to make them. The real complexity lies in the interactions between these parts and their collective response to the environment. Given the sheer number of components and the elaborate ways they interact, the difficulty of accurately modeling it is staggering.
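One standard way to handle many weak, distributed genetic effects is a polygenic score: a weighted sum of risk-allele counts across variants. A toy Python sketch, where the variant IDs and effect sizes are entirely hypothetical:

```python
def polygenic_score(genotype, weights):
    """Toy polygenic score: weighted sum of risk-allele counts (0, 1, or 2)
    across variants, each with a small individual effect size."""
    return sum(weights[v] * count for v, count in genotype.items())

# Hypothetical variants and per-allele effect sizes (not real data):
weights = {"rsA": 0.02, "rsB": -0.01, "rsC": 0.05}
person = {"rsA": 2, "rsB": 1, "rsC": 0}  # allele counts for one genotype
score = polygenic_score(person, weights)
# 0.02*2 - 0.01*1 + 0.05*0 = 0.03
```

Even this linear model only captures additive effects; gene-gene and gene-environment interactions, which the comment above is pointing at, are exactly what it leaves out.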

What we need now is, in some sense, a stat mech for biology

I agree that we’re going to need new methods and tools, and I expect bioinformatics to become an increasingly essential part of molecular bio research in the coming years. The traditional reductive approach just doesn’t hold up when you’re dealing with complex, nonlinear systems. What worries me about bioinformatics, though, is the Garbage In-Garbage Out principle; your conclusions can only be as good as your data. And if your data comes from a multitude of different labs, it can only be as good as your enforcement of standardization. I fear it's inevitable that open databases will simply become huge repositories of statistical noise.
posted by dephlogisticated at 1:28 PM on September 7, 2012


The real complexity lies in the interactions between these parts and their collective response to the environment.

This I know. I am mainly saying that the case in biology is far from the case of not being able to predict the weather (top-level consequences) from atomic theory (low-level data and modeling). We could already predict top-level consequences in the form of phenotype from low-level data and modeling in a number of notable cases. Our ability to do this in slightly more complex cases has expanded as our understanding of biology at all levels has increased. I am confident that this trend will continue, and having a more complete understanding of gene regulation is going to make that possible.

And if your data comes from a multitude of different labs, it can only be as good as your enforcement of standardization.

First, this is one of the points of coordinated resources like ENCODE. (This is something to keep in mind when we are sometimes attacked by people on the other side who say that these big projects are unnecessary because these data could now be produced by small labs funded by research project grants.) Previously existing data has a lot of variation in experimental conditions and processing, so I think this is a big step forward.

Second, as part of creating a well-coordinated data resource ourselves, we had to develop the standards to do this internally. And we can share what we've learned with the rest of the world. One of the papers that came out this week is our standards for doing the ChIP-seq experiments central to ENCODE (I am a co-author on this paper). I hope that if people stick to these sorts of standards, then as the experimental and analytical techniques continue to improve and signal-to-noise ratios get higher, it will make more sense to do lab-to-lab comparisons.
posted by grouse at 1:51 PM on September 7, 2012


Quite a few years ago I identified some human enzyme sequences using cDNA libraries. I haven't Done Science in a long time and I was hoping this would let me look up where my little friends were in the genome. Maybe I'm not using the ENCODE explorer correctly, but it doesn't seem as easy a task as I'd hoped.
posted by exogenous at 7:04 AM on September 14, 2012


Exogenous: BLAT on the UCSC genome browser should locate arbitrary sequences in a genome (NCBI blast also has a genome option these days).
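Under the hood, those tools perform a heavily indexed, mismatch-tolerant version of substring search against the genome. For illustration only, a naive exact-match sketch in Python (real aligners like BLAT and BLAST use k-mer indexes and allow gaps and mismatches):

```python
def find_exact(genome, query):
    """Report every 0-based position where `query` occurs in `genome`.
    Naive exact matching only -- no mismatches, no reverse strand."""
    hits, i = [], genome.find(query)
    while i != -1:
        hits.append(i)
        i = genome.find(query, i + 1)  # continue past the last hit
    return hits

hits = find_exact("ACGTACGTTACGT", "ACGT")
# hits == [0, 4, 9]
```

At genome scale (3 billion bases), this linear scan is why the indexed tools exist: they pre-build a lookup of short k-mers so queries touch only a tiny fraction of the sequence.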
posted by benzenedream at 10:10 AM on September 16, 2012 [1 favorite]




This thread has been archived and is closed to new comments