"Complete feature annotation has not been included for some or all of the sequence(s) you have submitted. We prefer to accept sequences that have been annotated by the submitting authors."
Of course they prefer to accept sequences that have been annotated by the submitting authors. It's one less thing they need to do, and I completely get that ... no problem with that from me.
So what annotation was missing? Protein translations for the sequences from my phylogenetic study of environmental samples. It's not a very large sample set (more than fifty, under a hundred) but it was (I believe) enough to get the job done for what we were looking at. We looked at a real-time PCR amplification product, which was slightly on the long side (a tad over 400 base pairs) as far as qRT-PCR goes. It's a partial sequence of the gene we are looking at, so I have no idea how things look 5' or 3' of the region in question. As a matter of fact, the gene has about 1,000 base pairs of sequence upstream of the site we are looking at, and extends about an additional 250 to 300 base pairs downstream of our site as well. It's a fairly big gene which encodes a pretty large protein product.
So GenBank wants to know the protein translation of these clones from this environmental study. Me? I think it's a waste of time. Since they're partial sequences, we have no way of knowing if this sequence is even going to be used to make a protein. Since there are over 1,000 bp of sequence upstream, with any number of potential stop codons or frameshifts before we even get to the sequence in question ... it seems like a fruitless exercise. Do we really need the translations from these environmental sequencing projects, mine and many, many others, cluttering up BLAST reports? GenBank calls them "conceptual translations" and they DO show up in blastp reports. Since there are so many of them, they oftentimes dominate those reports. Now, I know BLAST has an "exclude uncultured/environmental sample sequences" option (which can be found under "Exclude", oddly enough) but it seems silly IMNSHO to automatically incorporate thousands of these sequences in these reports. I suppose there are pros and cons to each approach, and I know some people prefer to use protein sequences; though I've always read that, because we want to detect possible silent substitutions, you can't go wrong in preferring DNA sequences.
I don't know. Maybe my inner curmudgeon is getting the better of me today. It just seems that, especially when it pertains to environmental samples, there can be such a thing as "too much information", especially when that information is based on conceptual models. Do I think that these sequences are wrong simply because they're rooted in bioinformatics? No, I do not. I just don't see the utility in providing them in each and every case, especially since it'd be simple enough to work them up when you download the DNA sequences. Of course, one could argue that it's simple enough to just submit them.
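For what it's worth, here is a minimal sketch of what "working them up" from downloaded DNA might look like. It assumes the standard genetic code (NCBI translation table 1) and only looks at the three forward reading frames; the function names and the example fragment are made up for illustration, and this is not how GenBank generates its conceptual translations.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with amino acids listed
# in TCAG order for the first, second, and third codon positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Conceptually translate dna starting at offset frame (0, 1, or 2).

    Stop codons are written as '*'; codons containing ambiguous bases
    (anything other than A, C, G, T) are written as 'X'. A trailing
    partial codon is ignored.
    """
    dna = dna.upper()
    residues = []
    for i in range(frame, len(dna) - 2, 3):
        residues.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(residues)


def forward_frames(dna):
    """Return {frame: translation} for the three forward reading frames."""
    return {frame: translate(dna, frame) for frame in range(3)}


def open_frames(dna):
    """Return the forward frames whose translation contains no stop codon."""
    return [f for f, protein in forward_frames(dna).items() if "*" not in protein]


if __name__ == "__main__":
    # Hypothetical fragment, not a real clone from any study.
    fragment = "GCCATGGCTGAAACCTTGAGC"
    for frame, protein in forward_frames(fragment).items():
        print(frame, protein)
    print("frames without a stop codon:", open_frames(fragment))
```

The catch is the one I've been complaining about: with a partial sequence, a frame that stays open across 400-odd bases is only a candidate reading frame. Nothing in the fragment itself can rule out a stop codon or frameshift somewhere in the 1,000 bp upstream.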
Educate me, folks.