Monday, April 27, 2009

GenBank Lesson for the Day - Ok, it's a rant

GenBank can sometimes be very useless. Ok, maybe not useless*, but a general pain in the toosh. Case in point ...

Our lab is doing some real-time PCR to examine the gene densities of denitrification genes in a number of environments. To do so, you need to make plasmid standards, which allow you to generate a standard curve from serial dilutions of that control plasmid. Then you can take your environmental sample, run the real-time PCR on it with that particular primer set and match up the Ct value with your standard curve. From that, you can figure out how many copies you have in your sample (per mL, per g, per ng DNA ... whatever you choose).
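For anyone who hasn't done the arithmetic before, here is a minimal sketch of the standard curve step in Python. It assumes the usual approach of fitting Ct against log10 of the plasmid copy number and then inverting that line for an unknown sample. The function names and the dilution numbers in the example are made up for illustration; they are not our data.

```python
import math


def fit_standard_curve(copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate copies per reaction."""
    return 10 ** ((ct - intercept) / slope)


def amplification_efficiency(slope):
    """Efficiency implied by the slope (1.0 means perfect doubling per cycle)."""
    return 10 ** (-1.0 / slope) - 1.0


if __name__ == "__main__":
    # Hypothetical ten-fold dilution series of a control plasmid.
    standard_copies = [1e7, 1e6, 1e5, 1e4, 1e3]
    standard_ct = [14.0, 17.3, 20.6, 23.9, 27.2]

    slope, intercept = fit_standard_curve(standard_copies, standard_ct)
    print("slope:", slope, "intercept:", intercept)
    print("efficiency:", amplification_efficiency(slope))
    print("copies at Ct 22.25:", copies_from_ct(22.25, slope, intercept))
```

The number that comes out is copies per reaction; converting to per mL, per g or per ng DNA is just a matter of scaling by how much sample went into the reaction. None of this helps, of course, if the plasmid isn't the gene you think it is.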

What is important is that your plasmid be what you think it is, and not just some amplified product of the correct size which works with the primers but is actually an unrelated gene. How can this happen? Well, take the primer set for nirK ... the original manuscript published on that primer set uses amplification conditions starting at 43°C and stepping down to 40°C. That's not exactly a stringent annealing condition, and so you get a whole mess of crap amplifying. The first time we tried it, we had a product of roughly the correct size, but upon sequencing it hit some gene encoding an enzyme in Nitrosomonas eutropha, and it wasn't nirK. As a matter of fact, the reverse primer was a 100% match to this particular non-nirK gene. Yah, that sucked.

At any rate, we started over again and got a better band. [If you read the original manuscript, they don't even visualize their PCR products on a gel ... they did Southern blots ... and claim that if you concentrate your sample down far enough you'll see a really faint band along with a huge smear. Ok, great ... thanks for that]. So, we did that, got a product, cloned it and shipped it off for sequencing.

The sequences came back today and I began the analysis. The first thing I did was blastn (BLAST against the nucleotide database) our data against GenBank. Fortunately, the top hits were "nirK sequences" ... but they were all from uncultured organisms. Knowing the mess we had previously gone through, I wasn't satisfied with that. There are a lot of things in the GenBank database which are not what they say they are (it's a non-curated database, after all), so I wasn't taking that for an answer. So I did a blastx (translate our DNA sequence to protein sequence and BLAST against the protein database) and came up with NirK, once again all from uncultured organisms.

If it's uncultured, how did they get protein from it? Well, 99.9999% of the time they didn't ... it's just a translation of the nucleotide data and it hasn't been verified. Therefore there is no guarantee that the protein is what the record says it is, since the database is not curated. For example, if you submit a sequence that you claim is nirK, then the protein translation is labeled NirK, even if the gene you accidentally cloned is fucK (for the sake of argument**, and an eventual pun). And from that point on, everyone who accidentally clones fucK thinks they have nirK, but they don't. Leaving them fucK'd. Add to this the fact that they'll submit their fucK'd sequences as nirK and you see how this can eventually snowball out of control.

And don't think that can't happen, because it has happened before ... the annotation of the first complete microbial genome ever sequenced, H. influenzae, claims it has two lactoferrin binding protein operons. Problem is, it actually doesn't ... one of those operons encodes a pair of transferrin binding proteins. It was just mis-annotated ... and will be forever. How many other sequences were subsequently mislabeled based on that GenBank comparison?

So just because GenBank told me at this point that I had nirK, I was not satisfied. What you need to do at this point is run blastx again, but under "Choose Search Set" select "SWISSPROT protein sequences (swissprot)" instead of "Non-redundant protein sequences (nr)". Most of the entries in nr, as we now know, are not really protein sequences but rather unverified translations of nucleotide sequences (and there are distinct differences between the two). Why SWISSPROT? According to their website:
UniProtKB/Swiss-Prot; a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.)
It's a manually curated database of actual protein sequences. So while you won't get a large number of results, the results you get will have been verified many times over.
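As an aside, it helps to remember what blastx actually does to your query before it searches any protein database, curated or not: it translates the DNA in all six reading frames (three on each strand). Here is a rough Python sketch of just that translation step, using the standard genetic code. The function names are my own, and this is not BLAST's code, just the idea behind it.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate all three frames on both strands, keyed like '+1' or '-3'."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGCCTAA").items():
        print(name, protein)
```

The translation is purely mechanical. Whether the protein hits it pulls up are labeled correctly depends entirely on the database on the other end.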

Fortunately for us, SWISSPROT informed us that our sequence did indeed match a curated NirK protein (from A. faecalis). Yay us!

So, a final word of caution. GenBank can prove to be very useful, but only if you take your results with a grain of salt ... and you're conscious of what the data is telling you. Knowing how to best frame your question so you can get solid results is also a very important skill when utilizing GenBank.

*It's a highly useful repository, but it's got some major flaws which can cause some big headaches if people aren't aware of them.

**As far as I know, fucK, which encodes fuculose kinase, and nirK, which encodes a nitrite reductase, do not share any sequence similarity, so the chance of this actually occurring is infinitesimally slight.
