Monday, December 07, 2009

Chimera! (Part 1)

NOTE: I've been sitting on this for quite a while, and while I wanted to add to it, I figure I may as well post it now and follow up at a later date. I think it holds up on its own for the purposes of discussing the problem, so I'll call this post "Part 1" for now.

To the left is the mythical creature known as the chimera. Though, to be honest, do you know how hard it is to find an actual drawing of what the mythical chimera was described as? It was a fire-breathing creature with the body of a lion and a tail ending in a snake's head. On top of that, a goat's head sprouts up from the middle of its back. Weird creature, eh? And you wouldn't think that's too terribly hard to draw, but that's not typically how artists do it. I dunno, artistic license and all that jazz, eh? But anyways, that's a bit of an aside, as I don't really intend to hearken back to my geeky AD&D-playing teenage years to talk about mythical creatures. Rather, I want to talk about the phenomenon that gene jockeys face when they analyze 16S rDNA gene sequences for microbial community analysis. Yes, I'm talking about that chimera.

It's a problem, and depending on who you read, it's a very BIG problem at that. I'm going to cite a few papers to illustrate. The first is by Kevin Ashelford et al., who titled a 2005 AEM article: At Least 1 in 20 16S rRNA Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies. Quoting from their abstract (this article should be freely accessible by now, so I'll link to it down below):
A new method for detecting chimeras and other anomalies within 16S rRNA sequence records is presented. Using this method, we screened 1,399 sequences from 19 phyla, as defined by the Ribosomal Database Project, release 9, update 22, and found 5.0% to harbor substantial errors. Of these, 64.3% were obvious chimeras, 14.3% were unidentified sequencing errors, and 21.4% were highly degenerate.
Translation: Our sequence databases (which start with GenBank) are a mess.
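To put those percentages in terms of actual sequences, here is a quick back-of-the-envelope sketch. The counts it produces are my own back-calculation from the rounded percentages in the abstract, not numbers quoted from the paper, and the variable names are just for illustration.

```python
# Back-calculate approximate sequence counts from the percentages quoted
# in the Ashelford et al. (2005) abstract. These counts are inferred from
# rounded percentages, so treat them as estimates, not reported values.

screened = 1399           # sequences screened (from the abstract)
anomalous_fraction = 0.050  # 5.0% found to harbor substantial errors

anomalous = round(screened * anomalous_fraction)

# Breakdown of the anomalous sequences, as fractions (from the abstract)
breakdown = {
    "obvious chimeras": 0.643,
    "unidentified sequencing errors": 0.143,
    "highly degenerate": 0.214,
}

# Convert each fraction into an approximate count of sequences
counts = {label: round(anomalous * frac) for label, frac in breakdown.items()}

print(f"Anomalous sequences: about {anomalous} of {screened}")
for label, n in counts.items():
    print(f"  {label}: about {n}")
```

If the rounding holds, the three categories add back up to the anomalous total, which is a decent sanity check that the percentages in the abstract are internally consistent.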

Then there is this study by Hugenholtz and Huber from 2003 in IJSEM entitled: Chimeric 16S rDNA sequences of diverse origin are accumulating in the public databases. Quoting their abstract (I believe this is also a freely accessible article) they state:
A significant number of chimeric 16S rDNA sequences of diverse origin were identified in the public databases by partial treeing analysis. This suggests that chimeric sequences, representing phylogenetically novel non-existent organisms, are routinely being overlooked in molecular phylogenetic surveys despite a general awareness of PCR-generated artefacts amongst researchers.
Now, this article was written a little over 6 years ago, but the problem persists. These sequences are still in the databases and will no doubt remain there in perpetuity. So yah, we're stuck with the mistakes and total messes that people have submitted in the past. What needs to happen is that we stop adding to the problem. That doesn't seem to be happening, though. In 2006, Ashelford et al. reported, again in AEM, that recent large library submissions contained high percentages of chimeric sequences:
Defining a large library as one containing 100 or more sequences of 1,200 bases or greater, we screened 25 of the 28 libraries and found that all but three contained substantial anomalies. Overall, 543 anomalous sequences were found. The average anomaly content per clone library was 9.0%, 4% higher than that previously estimated for the public repository overall. In addition, 90.8% of anomalies had characteristic chimeric patterns, a rise of 25.4% over that found previously. One library alone was found to contain 54 chimeras, representing 45.8% of its content. These figures far exceed previous estimates of artifacts within public repositories and further highlight the urgent need for all researchers to adequately screen their libraries prior to submission.
In this article they talk about a program called Mallard which they use for chimera detection. As a matter of fact, there are a number of programs that can be used to identify 16S rDNA gene chimeras. Two of the most well-known programs are Chimera_Check and Bellerophon. Bellerophon is, of course, the mythical Greek who slew the chimera. All these programs work well, and they do what they are intended to do, under the proper conditions. I add that caveat because as most people who use these know, the conditions are important.

For instance, Chimera_Check and Bellerophon are champs at chimera detection when looking at full-length 16S rDNA gene sequences. That means your sequences should be roughly in the 1,500 base pair ballpark. So when you want to analyze sequences in the 300 to 600 base pair range, those sequences will likely get thrown out by those programs. Chimera_Check goes so far as to flat out state that sequences less than 400 base pairs in size may not be reliably analyzed, so it's definitely a case of "buyer beware". So, if you are someone who likes to look at a couple of variable regions (and there is good reason to narrow down your focus, which I will hopefully get to in another blog entry), you're going to have to find another way to check for chimeric sequences.
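As a rough illustration of that length caveat, here is a minimal Python sketch that sorts a set of FASTA records into "long enough to trust" and "too short" before they ever go to a chimera checker. The 400 base pair cutoff comes from the caveat mentioned above; the function names, the toy clone names, and the made-up sequences are all assumptions for the sake of the example, and this is not part of any of the programs named here.

```python
# Triage sequences by length before chimera checking.
# Assumption: 400 bp is used as the cutoff, following the caveat in the text.
MIN_RELIABLE_LENGTH = 400


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record id
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def split_by_length(records, min_length=MIN_RELIABLE_LENGTH):
    """Return (reliable, too_short) dicts based on sequence length."""
    reliable = {k: s for k, s in records.items() if len(s) >= min_length}
    too_short = {k: s for k, s in records.items() if len(s) < min_length}
    return reliable, too_short


# Toy input: hypothetical clones of 1,500, 400, and 300 bases
example = (
    ">clone_A near full length\n" + "ACGT" * 375 + "\n"
    ">clone_B\n" + "ACGT" * 100 + "\n"
    ">clone_C\n" + "ACGT" * 75 + "\n"
)

records = parse_fasta(example)
reliable, too_short = split_by_length(records)
print("Long enough to check:", sorted(reliable))
print("Too short to check reliably:", sorted(too_short))
```

The point is only that short reads need to be set aside and screened some other way, rather than fed to a tool that was never designed for them.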

Ashelford, K., Chuzhanova, N., Fry, J., Jones, A., & Weightman, A. (2005). At Least 1 in 20 16S rRNA Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies. Applied and Environmental Microbiology, 71 (12), 7724-7736. DOI: 10.1128/AEM.71.12.7724-7736.2005 (PDF, 13 pages).

Hugenholtz and Huber (2003). Chimeric 16S rDNA sequences of diverse origin are accumulating in the public databases. IJSEM, 53, 289-293 (PDF, 5 pages).

Ashelford et al. (2006). New Screening Software Shows that Most Recent Large 16S rRNA Gene Clone Libraries Contain Chimeras. AEM, 72 (9), 5734-5741 (PDF, 8 pages).


microbiologist xx said...

So how are these sequences listed in the database?

Thomas Joseph said...

Like every other sequence. That's another part of the problem.

microbiologist xx said...

Sorry, I should have been more specific. What I meant to ask was: what species are these sequences listed under, since a chimeric 16S would falsely indicate that a novel template was used?
Also, once something is identified as chimeric, does it remain in the database?

Thomas Joseph said...

Well, it depends on the chimeric sequence and the people doing the classification. Sometimes it's listed only under the genus (e.g., Bacillus sp.) or just under a much higher rank (e.g., Alphaproteobacteria). The latter is much more prevalent. Sometimes you'll even see something like "Unclassified bacterium", like that helps. I think there are a few in there which are tagged as a novel species, but 16S sequence alone isn't usually enough to get something classified as a new species.

And yes, even when something has been identified as chimeric, it remains in the database. Worse yet, the record isn't even tagged as chimeric. I know of no instances where sequences identified as chimeric were reclassified or renamed to indicate their confirmed chimeric status. That's another problem with GenBank.

soil mama said...

I find that "uncultured soil clone" or "uncultured soil bacteria/fungi" are common. They could be novel, or chimeric, but you just don't know.

Maybe Tom will talk about them in the next part of chimera, but Fungi are often identified by their ITS region which is too divergent to tree across genera. Chimera detection methods that rely on alignments (Mallard and Bellerophon) don't work on these sequences.

Thomas Joseph said...

Yes, it should be noted that essentially any gene, not just 16S, 23S, or the ITS, can yield chimeras. If there is enough homology between the products of the species found in the sample, you can get re-annealing between different products: an incompletely extended product from one template anneals to a different template and gets extended in a later cycle. The result is a chimera.

Heck, there are reports of chimeric sequences being generated between paralogs from the same organism!
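To make the re-annealing idea a bit more concrete, here is a toy Python sketch. It builds a chimera by joining the 5' part of one parent to the 3' part of another at a breakpoint inside a shared (conserved) stretch, then compares each half of the chimera to both parents. The sequences are made up, the parents are assumed to be the same length and already aligned, and the function names are hypothetical. It is only meant to show why the two ends of a chimera point to different parents, not to reproduce how any of the programs mentioned in the post work.

```python
def make_chimera(parent_a, parent_b, breakpoint):
    """Join the 5' part of parent_a to the 3' part of parent_b.

    Assumes both parents are aligned and of equal length.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must be the same length (pre-aligned)")
    if not 0 < breakpoint < len(parent_a):
        raise ValueError("breakpoint must fall inside the sequence")
    return parent_a[:breakpoint] + parent_b[breakpoint:]


def identity(seq_1, seq_2):
    """Fraction of positions that match between two equal-length sequences."""
    if len(seq_1) != len(seq_2) or not seq_1:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(seq_1, seq_2))
    return matches / len(seq_1)


# Made-up parents: different flanks, identical conserved core in the middle
conserved = "GGCCGGCC"
parent_a = "A" * 10 + conserved + "A" * 10
parent_b = "T" * 10 + conserved + "T" * 10

# The breakpoint sits inside the conserved core, where re-annealing
# between products from different templates is plausible
breakpoint = 14
chimera = make_chimera(parent_a, parent_b, breakpoint)

# Compare each half of the chimera against the matching half of each parent
left_vs_a = identity(chimera[:breakpoint], parent_a[:breakpoint])
left_vs_b = identity(chimera[:breakpoint], parent_b[:breakpoint])
right_vs_a = identity(chimera[breakpoint:], parent_a[breakpoint:])
right_vs_b = identity(chimera[breakpoint:], parent_b[breakpoint:])

print("chimera:", chimera)
print(f"left half:  {left_vs_a:.2f} vs parent A, {left_vs_b:.2f} vs parent B")
print(f"right half: {right_vs_a:.2f} vs parent A, {right_vs_b:.2f} vs parent B")
```

In this toy case the left half matches parent A perfectly and the right half matches parent B perfectly, which is the kind of split signal you would hope a chimera check could pick up. Real sequences are far messier, especially when the parents are close relatives.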