To the left is the mythical creature known as the chimera. Though, to be honest, do you know how hard it is to find an actual drawing that matches how the mythical chimera was described? It was a fire-breathing creature with the body of a lion and a tail ending in a snake's head. On top of that, a goat's head sprouts up from the middle of its back. Weird creature, eh? You wouldn't think that's too terribly hard to draw, but that's not typically how artists do it. I dunno, artistic license and all that jazz, eh? But anyways, that's a bit of an aside, as I don't really intend to hearken back to my geeky AD&D-playing teenage years to talk about mythical creatures. Rather, I want to talk about the phenomenon that gene jockeys who do microbial community analysis face when doing 16S rDNA gene sequence analysis. Yes, I'm talking about that chimera.
It's a problem, and depending on who you read, it's a very BIG problem at that. I'm going to cite a couple of papers to illustrate. The first is by Kevin Ashelford et al. who titled a 2005 AEM article: At Least 1 in 20 16S rRNA Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies. Quoting from their abstract (this article should be in the public domain by now, so I'll link to it down below):
A new method for detecting chimeras and other anomalies within 16S rRNA sequence records is presented. Using this method, we screened 1,399 sequences from 19 phyla, as defined by the Ribosomal Database Project, release 9, update 22, and found 5.0% to harbor substantial errors. Of these, 64.3% were obvious chimeras, 14.3% were unidentified sequencing errors, and 21.4% were highly degenerate.
Translation: Our sequence databases (which start with GenBank) are a mess.
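If you want to see what those percentages mean in actual sequence counts, here is a quick back-of-the-envelope sketch. To be clear, the counts below are my own inference from the percentages in the abstract, not numbers quoted from the paper, and the variable names are just for illustration.

```python
# Rough arithmetic on the figures quoted in the abstract above.
# The per-category counts are inferred by rounding, not reported values.
screened = 1399          # sequences screened
anomaly_rate = 0.05      # "5.0% to harbor substantial errors"

# 1,399 * 0.05 = 69.95, so roughly 70 anomalous sequences.
anomalous = round(screened * anomaly_rate)

# Shares of the anomalous sequences, as quoted in the abstract.
shares = {
    "obvious chimeras": 0.643,
    "unidentified sequencing errors": 0.143,
    "highly degenerate": 0.214,
}

# Convert each share into an approximate count of sequences.
counts = {label: round(anomalous * share) for label, share in shares.items()}

for label, count in counts.items():
    print(f"{label}: about {count} of {anomalous} anomalous sequences")
```

Under that rounding assumption, the three categories add back up to the total number of anomalous sequences, which is a decent sanity check that the quoted percentages hang together.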
Then there is this study by Hugenholtz and Huber from 2003 in IJSEM entitled: Chimeric 16S rDNA sequences of diverse origin are accumulating in the public databases. Quoting their abstract (I believe this is also a freely accessible article) they state:
A significant number of chimeric 16S rDNA sequences of diverse origin were identified in the public databases by partial treeing analysis. This suggests that chimeric sequences, representing phylogenetically novel non-existent organisms, are routinely being overlooked in molecular phylogenetic surveys despite a general awareness of PCR-generated artefacts amongst researchers.
Now, this article was written a little over 6 years ago, but the problem hasn't gone away. These sequences are still in the databases and will no doubt remain there in perpetuity. So yah, we're stuck with the mistakes and total messes that people have submitted in the past. What we can do is stop adding to the problem from here on out. That doesn't seem to be happening, though. In 2006 Ashelford et al. reported, again in AEM, that recent large library submissions contained high percentages of chimeric sequences:
Defining a large library as one containing 100 or more sequences of 1,200 bases or greater, we screened 25 of the 28 libraries and found that all but three contained substantial anomalies. Overall, 543 anomalous sequences were found. The average anomaly content per clone library was 9.0%, 4% higher than that previously estimated for the public repository overall. In addition, 90.8% of anomalies had characteristic chimeric patterns, a rise of 25.4% over that found previously. One library alone was found to contain 54 chimeras, representing 45.8% of its content. These figures far exceed previous estimates of artifacts within public repositories and further highlight the urgent need for all researchers to adequately screen their libraries prior to submission.
In this article they talk about a program called Mallard, which they use for chimera detection. As a matter of fact, there are a number of programs that can be used to identify 16S rDNA gene chimeras. Two of the most well-known are Chimera_Check and Bellerophon. Bellerophon is, of course, the mythical Greek who slew the chimera. All of these programs do what they are intended to do, under the proper conditions. I add that caveat because, as most people who use them know, the conditions are important.
For instance, Chimera_Check and Bellerophon are champs at chimera detection when looking at full-length 16S rDNA gene sequences. That means that your sequences should be roughly in the 1,500 base pair ballpark. So when you want to analyze sequences in the 300 to 600 base pair size range, those sequences will likely get thrown out by those programs. Chimera_Check goes so far as to flat out state that sequences less than 400 base pairs in size may not be reliably analyzed, so it's definitely a case of "buyer beware". So, if you are someone who likes to look at a couple of variable regions (and there is good reason to narrow down your focus, which I hopefully will get to in another blog entry), you're going to have to find another way to check for chimeric sequences.
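To make that length caveat concrete, here is a minimal sketch of how you might sort a library by sequence length before handing anything to a chimera checker. This is not part of Chimera_Check or Bellerophon. The function names, the toy clone names, and the toy sequences are all made up for illustration, and the 400 base pair cutoff is simply the figure mentioned above.

```python
from typing import Dict, Tuple

# Cutoff taken from the Chimera_Check caveat above; treat it as an assumption
# and adjust it for whatever tool you actually use.
MIN_RELIABLE_LENGTH = 400


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {sequence_id: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current_id = parts[0] if parts else f"unnamed_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so keep appending to the record.
            records[current_id] += line
    return records


def split_by_length(
    records: Dict[str, str], min_length: int = MIN_RELIABLE_LENGTH
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split records into (long enough to screen, too short to trust)."""
    long_enough = {k: v for k, v in records.items() if len(v) >= min_length}
    too_short = {k: v for k, v in records.items() if len(v) < min_length}
    return long_enough, too_short


# Toy library: one roughly full-length clone and one short variable-region read.
toy_fasta = (
    ">clone_full hypothetical near-full-length clone\n"
    + "ACGT" * 375  # 1,500 bases
    + "\n>clone_short hypothetical variable-region read\n"
    + "ACGT" * 80   # 320 bases
    + "\n"
)

screenable, too_short = split_by_length(parse_fasta(toy_fasta))
print("OK to screen:", sorted(screenable))
print("Too short to screen reliably:", sorted(too_short))
```

The point of splitting rather than silently dropping the short reads is that you know exactly which sequences still need some other kind of chimera check.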
Ashelford, K., Chuzhanova, N., Fry, J., Jones, A., & Weightman, A. (2005). At Least 1 in 20 16S rRNA Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies. Applied and Environmental Microbiology, 71(12), 7724-7736. DOI: 10.1128/AEM.71.12.7724-7736.2005 (PDF, 13 pages).
Hugenholtz and Huber. Chimeric 16S rDNA sequences of diverse origin are accumulating in the public databases. 2003. IJSEM. 53: 289-93. (PDF, 5 pages).
Ashelford et al. New Screening Software Shows that Most Recent Large 16S rRNA Gene Clone Libraries Contain Chimeras. 2006. AEM. 72(9): 5734-41. (PDF, 8 pages)