I have had a few issues with genome annotation in the past couple of weeks that highlight how dependent NGS methods are on these reference files. In a mouse RNA-seq project, we found that RNA reads from some olfactory receptor genes were getting low expression counts (using TopHat/Cufflinks) because RNA reads were not mapping within the known exons of the genes. When we looked at the actual alignment BAM files in IGV, it was clear that the exon boundaries in the reference (RefSeq) genes were wrong. Both the 3' and 5' ends of some genes were missing regions with lots of RNA alignments. It would be nice to make a new reference genome using our RNA-seq reads to create our own annotation, but it is actually quite difficult to get one RNA sample with all genes expressed at useful levels.
In a different RNA-seq project on pig (a model system for cardiology), we ran the standard TopHat/Cufflinks pipeline using the RefSeq gene annotation from the Illumina IGENOMES project (a convenient way to get FASTA, GTF, and Bowtie index files for many genomes:http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn). When we looked at the alignment QC using the CollectRnaSeqMetrics from the Picard toolkit, we found that only about 20% of the reads were mapped to exons. A closer look at the annotation file showed that the RefSeq genome for pig (Sus Scrofa) only has about 4,000 genes. That just does not work for RNA-seq. ENSEMBL has a much more complete gene annotation for pig, but the ENSEMBL GTF file is not read nicely by TopHat/Cufflinks. Thanks to my very helpful friend Igor Dolgalev, I learned that the UCSC Genome Browser has a reformatted ENSEMBL GTF annotation for pig (and most other genomes), but it is missing gene names. The GTF from UCSC Table Browser looks like this:
This GTF file has over 30,000 genes and 500,000 lines for individual exons.
UCSC also has a conversion table for susScr3.ensemblToGeneName with names for about 20,000 of the ENSEMBL genes. I have not figured out a simple way to write these names back into the GTF file without breaking it, so for now I am matching the names into the final gene expression results.
My only conclusion from these issues with reference genomes is that we depend on these things to be right, and when they are not, NGS data analysis can be badly messed up. There is no simple solution for providing a perfect reference for every genome for every purpose, so bioinformaticians need to stay on their toes and dig deeper when results from standard methods don't make sense.