Oct 26, 2013

A bit more on genome annotation

One bit of follow-up on the pig genome annotation story. We usually visualize RNA-seq results in the IGV browser. It allows direct inspection of read alignments to your favorite genes and can also help to spot sequence variations and splicing issues. However, IGV has a set of pre-loaded default genomes whose annotations also seem to be derived from RefSeq. So once again, working with data from the pig, there was no annotation for most of our genes of interest. This is fairly annoying, since it means that the only way to look at the annotation of a gene is to first look up the gene in UCSC and then copy the exact chromosome coordinates, including intron-exon borders, over to IGV.

It is possible to fix this by downloading the ENSEMBL gene annotations from the UCSC Table Browser to the local computer as a BED file (not too large), and then loading the BED file into IGV as another data track. This works nicely in terms of showing the genes and exons, but the gene labels still carry the ugly ENSEMBL names. Once again, the ensemblToGeneName table comes in handy, providing the ENSEMBL name and the official gene symbol for about 20,000 genes. We were able to add the gene symbol to the BED file, but this has to be done carefully with a script (in Perl or Awk), since making file edits in Excel seems to break the BED file (at least for me). After loading the edited BED file into IGV, I was able to jump to genes by name and get screenshots of interesting regions that included a gene structure track with nice gene names.
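
For anyone who would rather not write the Perl or Awk themselves, here is a minimal sketch of the relabeling step in Python. It is not the script we actually used. It assumes the ensemblToGeneName export is a tab-separated file with the ENSEMBL name in the first column and the gene symbol in the second, and that the BED file carries the ENSEMBL name in column 4; the function names and the command-line layout are made up for illustration.

```python
import sys


def load_symbols(map_lines):
    """Build {ensembl_name: gene_symbol} from tab-separated two-column lines.

    Assumption: column 1 is the ENSEMBL name, column 2 is the gene symbol.
    """
    symbols = {}
    for line in map_lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and the '#'-prefixed header line, if present.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 2:
            symbols[fields[0]] = fields[1]
    return symbols


def relabel_bed(bed_lines, symbols):
    """Yield BED lines with column 4 replaced by the gene symbol when known.

    Assumption: the BED name field (column 4) holds the ENSEMBL name.
    Names without a symbol in the table are left unchanged.
    """
    for line in bed_lines:
        line = line.rstrip("\r\n")
        fields = line.split("\t")
        # Pass through track/browser/comment lines and rows with no name column.
        if line.startswith(("track", "browser", "#")) or len(fields) < 4:
            yield line
            continue
        fields[3] = symbols.get(fields[3], fields[3])
        yield "\t".join(fields)


# Hypothetical usage: python relabel_bed.py ensemblToGeneName.txt genes.bed out.bed
if __name__ == "__main__" and len(sys.argv) == 4:
    with open(sys.argv[1]) as map_file:
        symbol_table = load_symbols(map_file)
    with open(sys.argv[2]) as bed_in, open(sys.argv[3], "w") as bed_out:
        for row in relabel_bed(bed_in, symbol_table):
            bed_out.write(row + "\n")
```

Only the name column is touched and every other field is written back exactly as it was read, so the tab-separated layout of the BED file stays intact. One thing to keep in mind is that several rows can end up with the same label if more than one ENSEMBL name maps to the same gene symbol.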