Showing posts with label chimeric sequences. Show all posts

Monday, December 07, 2009

Chimera! (Part 1)

NOTE: I've been sitting on this for quite a while, and while I wanted to add to it, I figure I may as well post it now and follow up at a later date. I think it can hold up on its own for the purposes of discussing the problem, so I'll call this post "Part 1" for now.

To the left is the mythical creature known as the chimera. Though, to be honest, do you know how hard it is to find an actual drawing that matches how the mythical chimera was described? It was a fire-breathing creature with the body of a lion and a tail ending in a snake's head. On top of that, a goat head sprouts up from the middle of the back. Weird creature, eh? You wouldn't think that's too terribly hard to draw, but that's not typically how artists do it. I dunno, artistic license and all that jazz, eh? But anyways, that's a bit of an aside, as I don't really intend to hearken back to my geeky AD&D-playing teenage years to talk about mythical creatures. Rather, I want to talk about the phenomenon that gene jockeys who do microbial community analysis face when doing 16S rDNA gene sequence analysis. Yes, I'm talking about that chimera.

It's a problem, and depending on who you read, it's a very BIG problem at that. I'm going to cite a couple of papers to illustrate. The first is by Kevin Ashelford et al., who titled a 2005 AEM article "At Least 1 in 20 16S rRNA Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies". Quoting from their abstract (this article should be freely accessible by now, so I'll link to it down below):
A new method for detecting chimeras and other anomalies within 16S rRNA sequence records is presented. Using this method, we screened 1,399 sequences from 19 phyla, as defined by the Ribosomal Database Project, release 9, update 22, and found 5.0% to harbor substantial errors. Of these, 64.3% were obvious chimeras, 14.3% were unidentified sequencing errors, and 21.4% were highly degenerate.
Translation: Our sequence databases (which start with GenBank) are a mess.
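
To get a feel for what those percentages mean in actual records, here is a quick back-of-the-envelope sketch in Python. The percentages and the 1,399 total come from the abstract quoted above; the record counts it prints are my own inference from rounding, not figures quoted from the paper.

```python
# Figures quoted in the Ashelford et al. (2005) abstract.
screened = 1399
anomalous_rate = 0.050
breakdown = {
    "obvious chimeras": 0.643,
    "unidentified sequencing errors": 0.143,
    "highly degenerate": 0.214,
}

# Inferred by rounding, not quoted: approximate number of anomalous records.
anomalous = round(screened * anomalous_rate)

# Inferred by rounding, not quoted: approximate records in each category.
counts = {label: round(anomalous * share) for label, share in breakdown.items()}

print(f"anomalous records: ~{anomalous} of {screened}")
for label, n in counts.items():
    print(f"  {label}: ~{n}")
```

So the headline "1 in 20" works out to roughly 70 bad records in that sample, and about two thirds of those are chimeras.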

Then there is this study by Hugenholtz and Huber from 2003 in IJSEM entitled "Chimeric 16S rDNA sequences of diverse origin are accumulating in the public databases". Quoting their abstract (I believe this is also a freely accessible article), they state:
A significant number of chimeric 16S rDNA sequences of diverse origin were identified in the public databases by partial treeing analysis. This suggests that chimeric sequences, representing phylogenetically novel non-existent organisms, are routinely being overlooked in molecular phylogenetic surveys despite a general awareness of PCR-generated artefacts amongst researchers.
Now, this article was written a little over 6 years ago, but chimeras continue to be a problem. These sequences are still in the databases and will no doubt remain there in perpetuity. So yah, we're stuck with the mistakes and total messes that people have submitted in the past. What needs to stop is people adding to the problem, and that doesn't seem to be happening. In 2006 Ashelford et al. reported, again in AEM, that recent large clone library submissions contained high percentages of chimeric sequences:
Defining a large library as one containing 100 or more sequences of 1,200 bases or greater, we screened 25 of the 28 libraries and found that all but three contained substantial anomalies. Overall, 543 anomalous sequences were found. The average anomaly content per clone library was 9.0%, 4% higher than that previously estimated for the public repository overall. In addition, 90.8% of anomalies had characteristic chimeric patterns, a rise of 25.4% over that found previously. One library alone was found to contain 54 chimeras, representing 45.8% of its content. These figures far exceed previous estimates of artifacts within public repositories and further highlight the urgent need for all researchers to adequately screen their libraries prior to submission.
In this article they talk about a program called Mallard, which they use for chimera detection. As a matter of fact, there are a number of programs that can be used to identify 16S rDNA gene chimeras. Two of the best-known are Chimera_Check and Bellerophon. Bellerophon is, of course, the mythical Greek who slew the chimera. All these programs work well and do what they are intended to do, under the proper conditions. I add that caveat because, as most people who use them know, the conditions are important.

For instance, Chimera_Check and Bellerophon are champs at chimera detection when looking at full-length 16S rDNA gene sequences. That means your sequences should be roughly in the 1,500 base pair ballpark. So when you want to analyze sequences in the 300 to 600 base pair size range, those sequences will likely get thrown out by those programs. Chimera_Check goes so far as to flat out state that sequences less than 400 base pairs in size may not be reliably analyzed, so it's definitely a case of "buyer beware". So, if you are someone who likes to look at a couple of variable regions (and there is good reason to narrow down your focus, which I hopefully will get to in another blog entry), you're going to have to find another way to check for chimeric sequences.
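
To make the idea concrete, here is a toy Python sketch of the general principle behind fragment-based chimera checks: split a query in two and see whether each half is most similar to a different reference. This is NOT the actual algorithm used by Chimera_Check, Bellerophon or Mallard. The function names, the fixed midpoint split, and the assumption that every sequence is already aligned to the same length are all mine, for illustration only. The 400 base pair cutoff is the one mentioned above.

```python
from typing import Mapping

MIN_RELIABLE_LENGTH = 400  # cutoff quoted above for Chimera_Check


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned strings."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_match(fragment: str, references: Mapping[str, str], start: int) -> str:
    """Name of the reference whose corresponding region is most similar to the fragment."""
    end = start + len(fragment)
    return max(references, key=lambda name: identity(fragment, references[name][start:end]))


def looks_chimeric(query: str, references: Mapping[str, str]) -> bool:
    """Flag the query if its left and right halves match different references best."""
    if len(query) < MIN_RELIABLE_LENGTH:
        raise ValueError(f"query is under {MIN_RELIABLE_LENGTH} bp; result would not be reliable")
    mid = len(query) // 2  # hypothetical fixed breakpoint; real tools search for it
    left_parent = best_match(query[:mid], references, 0)
    right_parent = best_match(query[mid:], references, mid)
    return left_parent != right_parent


# Made-up 400-base "references" and a chimera stitched together from both.
refs = {"parent_a": "ACGT" * 100, "parent_b": "TTGCA" * 80}
chimera = refs["parent_a"][:200] + refs["parent_b"][200:]

print(looks_chimeric(chimera, refs))           # True
print(looks_chimeric(refs["parent_a"], refs))  # False
```

Notice that a 300 base pair read never even gets scored here. Each half would be too short to say anything trustworthy about its parent, which is the same reason the real programs struggle with short reads.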

References
Ashelford, K., Chuzhanova, N., Fry, J., Jones, A., & Weightman, A. (2005). At Least 1 in 20 16S rRNA Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies. Applied and Environmental Microbiology, 71 (12), 7724-7736. DOI: 10.1128/AEM.71.12.7724-7736.2005 (PDF, 13 pages).

Hugenholtz and Huber (2003). Chimeric 16S rDNA sequences of diverse origin are accumulating in the public databases. IJSEM, 53, 289-293 (PDF, 5 pages).

Ashelford et al. (2006). New Screening Software Shows that Most Recent Large 16S rRNA Gene Clone Libraries Contain Chimeras. Applied and Environmental Microbiology, 72 (9), 5734-5741 (PDF, 8 pages).

Monday, November 30, 2009

What I learned when I was in Pittsburgh ...

Note: I'm a bit late with this one. Had it 80% of the way written and then forgot about it.

Other than learning that GPS systems are stupid and suck and should never be used, I did learn a few interesting things. I'll relate them here in this post. Some will be rants, some will hopefully be useful bits of information for other people as well (or at the very least drum up a bit of discussion).

1. The American Society of Agronomy is losing members. Hemorrhaging might be the more appropriate word. The society has lost approximately a third of the membership over the last decade, and the trend still points downwards. This is part of the reason that ASA is planning a restructuring which will be coming to a vote probably some time in November.

2. The Cherry Quadzilla at Church Brew Works was awesome. It also only comes in 750ml bottles, so you'll want to take a cab (which we did). Also, bring a camera ... it's a very interesting pub. The chicken pot pie was also really good. As were the perogies.

3. When you are arranging the poster sessions, and you map out where the posters are going to be, discuss the lighting situation with the people at the Convention Center. There was at least one section that was left in the dark once the sun went down and the natural light was gone. It was dark enough to hinder reading posters from afar. The rest of the hall was just fine, but at least 40 poster boards were placed underneath an overhang that had absolutely no lighting. That means that approximately 120 people (40 boards over 3 days) had to put up with those cruddy conditions. Unacceptable. I find it hard to believe that, after realizing people were sitting in the dark on Monday evening, nothing could be done to relocate the posters to a better lit area ... which would have required moving some poster boards maybe 40 to 50 feet away. A sign pointing in the direction would have sufficed. The meeting was filled with Ph.D.'s ... if they couldn't handle a redirection of poster boards, they had no business being there anyways.

4. If you plan a business meeting for the 30 minutes before a poster session you are hosting, make sure the business meeting doesn't start with a 45 minute presentation on restructuring. Some of us really had to get to the poster session because we were, you know ... presenting.

5. Metagenomics is fraught with pitfalls. I know metagenomics, especially deep 16S sequencing, has taken the microbial ecology world by storm. A whole lot of people have been swept up in the frenzy, including yours truly. Now, I can be a worrywart, and after attending a session where people talked about various metagenomic projects, I just had to go up and ask a question. What about the chimeric sequences? I have a long-ish blog entry on chimeric sequences (which I really need to get out onto the site, it's still a draft), so I won't go into too great a depth on them now, but they're PCR artifacts which can artificially inflate your microbial diversity. As pyrosequencing reads get longer, the chances of getting artifacts are probably increasing as well. Any amplification-based system is going to have this problem.

So how do you avoid it? Well, I had hoped the people at this session, who have done metagenomic sequencing now for a few years, would have the answers.

None of them did. They figured (rightly, I might add) that the rates would be about the same for these approaches as they are with regular sequencing schemes, but that doesn't help with how to deal with them. Let's do the math.

If you have a 6% chimeric sequence rate (a reasonable value, I believe), then you'll have to throw out 6 sequences for every 100 you do. If you have a 200,000 read metagenomic project, that's 12,000 sequences you have to throw out. Problem is, how do you find them? Chimera_Check and Bellerophon, the two major programs on the net that do chimera checks, are really hands-on programs. You have to really check the results. That's impossible (I won't say near impossible, because who is going to read 200,000 chimera check reports, other than no one). So what we're seeing is ... no viable (and reliable) way (that I can see) to check metagenomic projects for chimeras. That sucks.
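
For anyone who wants to plug in their own numbers, the math above is easy to sketch in Python. The 6% rate and the 200,000 read project size are the figures used above. The 30 seconds per report is a purely hypothetical number I picked to show the scale of the manual review, not something I measured.

```python
def expected_chimeras(total_reads: int, chimera_rate: float) -> int:
    """Expected number of chimeric reads for a given project size and chimera rate."""
    return round(total_reads * chimera_rate)


def review_hours(reports: int, seconds_per_report: float) -> float:
    """Hours needed to read every report by hand."""
    return reports * seconds_per_report / 3600


reads = 200_000
rate = 0.06                # the 6% rate assumed above
seconds_per_report = 30.0  # hypothetical guess, not a measured value

print(expected_chimeras(100, rate))    # 6
print(expected_chimeras(reads, rate))  # 12000

# You don't know in advance which reads are chimeric, so every read needs a look.
print(round(review_hours(reads, seconds_per_report)))  # 1667
```

Even at half a minute per report, that is well over a thousand hours of someone staring at output to catch the 12,000 bad reads.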