Wednesday, August 11, 2010

Another issue with GenBank

So I received an email today from GenBank that started with the following:
Complete feature annotation has not been included for some or all of the sequence(s) you have submitted. We prefer to accept sequences that have been annotated by the submitting authors.
Of course they prefer to accept sequences that have been annotated by the submitting authors. It's one less thing they need to do, and I completely get that ... no problem with that from me.

So what annotation was missing? Protein translations for the sequences from my phylogenetic study of environmental samples. It's not a very large sample set (more than fifty, under a hundred) but it was (I believe) enough to get the job done for what we were looking at. We looked at a real-time PCR amplification product, which was slightly on the long side (a tad over 400 base pairs) as far as qRTPCR goes. It's a partial sequence of the gene we are looking at, so I have no idea how things look 5' or 3' of the region in question in these clones. As a matter of fact, the gene has about 1,000 base pairs of sequence upstream of the site we are looking at, and extends an additional 250 to 300 base pairs or so downstream of our site as well. It's a fairly big gene which encodes a pretty large protein product.

So GenBank wants to know the protein translation of these clones from this environmental study. Me? I think it's a waste of time. Since they're partial sequences, we have no way of knowing if this sequence is even going to be used to make a protein. Since there is over 1,000 bp of sequence upstream, with any number of potential stop codons or frameshifts before we even get to the sequence in question ... it seems like a fruitless exercise. Do we really need these environmental sequencing projects, similar to mine and many, many others, cluttering up BLAST reports? GenBank calls them "conceptual translations" and they DO show up in blastp reports. Since there are so many of them, they oftentimes dominate those reports. Now, I know GenBank has an "exclude uncultured/environmental sample sequences" option (which can be found under "Exclude" oddly enough) but it seems silly IMNSHO to automatically incorporate the thousands of these sequences in these reports. I suppose there are pros and cons to each approach, and I know some people prefer to use protein sequences; though I've always read that we want to be able to find possible silent substitutions, which only show up at the DNA level, so you can't go wrong in choosing/preferring to use DNA sequences.

I don't know. Maybe my inner curmudgeon is getting the better of me today. It just seems that, especially when it pertains to environmental samples, there can be such a thing as "too much information", especially when that information is based on conceptual models. Do I think that these sequences are wrong simply because they're rooted in bioinformatics? No, I do not. I just don't see the utility in providing them in each and every case. Especially since it'd be simple enough to work them up when you download the DNA sequences. Of course, one would argue that it's simple enough to just submit them.

Educate me folks.

3 comments:

Ivan Privaci said...

Perhaps the blanket demand for putative protein sequences is just to pump up their "numbers", so they can go back to Congress saying "look how successful we are, GenBank's added umpty-brazillion sequences in the last year!"

I see something similar at PubMed - I watch a feed of new articles for a subset of topics that I'm interested in, and I've been noticing more things like "Time" magazine editorials and other fluffy opinion-pieces, "advice for patients" articles from what look like college health department newsletters, and a lot of entries with no abstracts available. "Look at all the articles we're adding! We obviously need more funding!"

(I wouldn't actually object to seeing NCBI's projects get more funding - even substantially more funding - but clogging their databases with crap in order to do it is just plain wrong...)

Tom said...

First, I should state that I did capitulate, added the protein seqs and received my GenBank numbers. Yay me! *rolleyes*

By and large, I have stopped using PubMed. I do use it, in part, when I'm lit searching through EndNote, but by then I know the author and paper I'm looking for and I just want to bring it into the citation manager. For the most part, I use Scopus now for my lit searches. It gives many more criteria one can EASILY search through to find relevant manuscripts, and gives really good post-search sorting options as well. Unfortunately, for most people the cost for the service places it out of their reach.

But, like you said ... NCBI's tendency to go for bulk has really hampered its effectiveness (IMNSHO).

soil mama said...

I think that GenBank really needs to figure out a better way to deal with environmental samples in general.
For bacterial 16S and fungal ITS environmental samples I go around and around with them about how much info I need to go along with each sequence. All of my sequences are different and it's a PITA to annotate every single one of them, especially when I am submitting 500+ sequences a pop. I think it should be good enough to just put which primers I used to sequence and the regions the primers span (this is what others have done) but for some reason they keep coming back and asking for more annotations. I just don't get why you would need fully annotated sequences for "uncultured soil clones." If someone was REALLY interested in looking at my samples, they would look at which primers I used and just BLAST them or compare them to their samples.