Jun 28, 2011

The False Discovery of Mutations by Sequencing

     I am amazed by the success reported in recent papers finding mutations by Next-Gen Sequencing in rare genetic diseases and cancer. In our lab, the sequence data for SNPs is messy and difficult to interpret. The basic problem is that NGS data, particularly Illumina data in our case, contains a moderate level of sequencing errors. We get somewhere between 0.5% and 1% errors in our sequence reads from the GAII and HiSeq machines. This is not bad for many practical purposes (ChIPseq and RNAseq experiments have no trouble with this data), and this error level "is within specified operating parameters" according to Illumina Tech support. The errors are not random; they occur much more frequently at the ends of long (100 bp) reads. Some types of errors are systematic in all Illumina sequencing (A>T miscalls are most common), and other types are common to a particular sample, run, or lane of sequence data. Also, when you are screening billions of bases looking for mutations, rare overlaps of errors will occur.
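
     To get a feel for how often those rare overlaps of errors turn up at genome scale, here is a back-of-the-envelope binomial sketch. The coverage, error rate, and read-count threshold below are illustrative assumptions, not measurements from our runs, and the model is deliberately generous: it ignores base qualities and does not even require the erroneous calls to agree on the same base.

    from math import comb

    coverage = 30          # reads covering one genomic position (assumed)
    error_rate = 0.01      # per-base error rate, the high end of our 0.5-1% range
    min_reads = 3          # reads that must show the variant before it is called (assumed)
    sites = 3000000000     # roughly the number of positions screened in a human genome

    # P(at least min_reads errors piling up at one site), simple binomial model
    p_site = sum(comb(coverage, k) * error_rate**k * (1 - error_rate)**(coverage - k)
                 for k in range(min_reads, coverage + 1))
    print("P(>= %d errors at one site): %.2e" % (min_reads, p_site))
    print("expected error-pileup sites genome-wide: %.0f" % (p_site * sites))
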
     So if sequence data contains errors, and the point of your experiment is to find mutations, then when you find a difference between your data and the reference genome (a variant), you had better make doubly sure that the difference is real. There is a lot of software designed to separate real mutations (SNPs) from random sequence errors. The basic idea is to first filter out bad, low-quality bases using the built-in quality scores produced by the sequencer. Second, require that multiple reads show the same variant, and that the fraction of reads showing the variant makes sense in your experiment: 40-60% might be good for a heterozygous allele in a human germline sample; 10% or less might make sense if you are screening for a rare variant in a sample from a mixed population of cells. Also, it is usually wise to filter out all common SNPs in the dbSNP database - we assume that these are not cancer-causing, and they have a high likelihood of being present in healthy germline cells as well as tumor cells.
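
     Restated as code, that filtering logic amounts to something like the sketch below. This is only an illustration of the rules, not our actual pipeline; the field names and exact thresholds are hypothetical stand-ins.

    MIN_BASE_QUAL = 20           # drop calls whose supporting bases are below this Phred quality
    MIN_VARIANT_READS = 4        # require several reads showing the same variant
    GERMLINE_VAF = (0.40, 0.60)  # plausible fraction for a heterozygous germline allele

    def passes_filters(site, dbsnp_positions, het_sample=True):
        """site: dict with 'pos', 'depth', 'var_reads', 'mean_qual' (hypothetical format)."""
        if site["mean_qual"] < MIN_BASE_QUAL:
            return False
        if site["var_reads"] < MIN_VARIANT_READS:
            return False
        vaf = site["var_reads"] / float(site["depth"])
        if het_sample and not (GERMLINE_VAF[0] <= vaf <= GERMLINE_VAF[1]):
            return False
        if site["pos"] in dbsnp_positions:   # common SNP - assume it is not cancer-causing
            return False
        return True
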
     We have used the SNP calling tools in the Illumina CASAVA software, the MAQ software package, similar tools in SAMtools, and recently the GATK toolkit. In all cases, it is possible to tweak parameters to get a stringent set of predicted mutations, filtering out low-quality bases, low-frequency mutations, and SNPs that are near other types of genomic problems such as insertion/deletion sites, repetitive sequence, etc. Using their own tools, Illumina has published data showing a false positive detection rate of 2.89% (Illumina FP Rate). Under many experimental designs, validating 97% of your predicted mutations would be excellent.
     Unfortunately, our medical scientists don't want predicted SNPs vs. an arbitrary reference genome. They want to find mutations in cancer cells vs. the normal cells (germline or wild type) of the same patient. This is where all the tools seem to fall apart. When we run the same SNP detection tools on two NGS samples, and then look for the mutations that are unique to the tumor vs. the wild type (WT), we get a list of garbage thousands of lines long. We get stupid positions with a 21% variant allele detected in the tumor and 19% variant in the WT. Or we get positions where an 80% variant allele frequency is not called as a SNP in the WT because 2 out of 80 reads have a one-base deletion near that base. So the stringent settings on our SNP discovery software create FALSE NEGATIVES, where we miss real SNPs in the WT genome, which then show up as tumor-specific mutations in our SNP discovery pipeline.
     Zuojian Tang is creating a post-SNP data filter that imposes a sanity check on the data based on allele frequencies. We are trying out various parameters, but something like a minimum of 40% variant in the tumor and less than 5% variant in the WT narrows the list of tumor-specific mutations down to a manageable number that could be validated by PCR or Sequenom.
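
     Expressed as code, the sanity check is roughly the sketch below. The 40% and 5% cutoffs are the trial parameters mentioned above; the input representation (one variant allele fraction per sample per position) is a hypothetical simplification of what the filter actually consumes.

    MIN_TUMOR_VAF = 0.40   # the variant must be well supported in the tumor
    MAX_WT_VAF = 0.05      # and essentially absent from the matched normal

    def tumor_specific(tumor_vaf, wt_vaf):
        """Return True if a site looks like a genuine tumor-specific mutation."""
        return tumor_vaf >= MIN_TUMOR_VAF and wt_vaf <= MAX_WT_VAF

    # The 21% tumor / 19% WT junk position from above is rejected,
    # while a 48% tumor / 1% WT position is kept for PCR or Sequenom validation.
    print(tumor_specific(0.21, 0.19))   # False
    print(tumor_specific(0.48, 0.01))   # True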

Jun 6, 2011

Involve Bioinformatics in the design of every experiment... please

Two interesting projects came through our informatics group last week, both in the 'data drop' mode where the investigator asks for help to analyze data as it comes off the sequencers. I have noted many times before that our informatics effort is much greater on the poorly designed and failed experiments.

Experiment #1 was a seemingly standard SNP detection using exome sequences with 100 bp paired-end reads on Illumina HiSeq (Agilent Sure Select capture) - the entire thing done by a private sequencing contractor. The contractor also supplied SNP calls made with the Illumina CASAVA software. Our job was simply to find overlaps between the SNP calls for various samples and controls, and to annotate the SNPs with genomic information (coding or non-coding, conservative mutations, biological pathways, etc.). However, we have an obsession with QC data, which the vendor was very reluctant to supply. It turns out that these sequencing reads have a 1.5% error rate, while our internal sequencing lab generates 0.5% error. We also see 10K novel SNPs in each sample with only minimal overlap across samples (a red flag for me). More QC data is extracted from the vendor, and now we see a steep increase in error at the ends of reads. So we wish to trim all reads down by 10-25% and recall SNPs - and extract more files from the vendor, 3x (Illumina requires a LOT of runtime and intermediate files in order to run CASAVA for SNP calling).
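
The trimming step itself is simple; below is a minimal sketch, assuming plain four-line FASTQ files and hypothetical file names. In practice read 1 and read 2 would both be trimmed to the same length before re-calling SNPs.

    def trim_fastq(in_path, out_path, keep=75):
        """Clip every read and its quality string to the first `keep` bases."""
        with open(in_path) as fin, open(out_path, "w") as fout:
            for i, line in enumerate(fin):
                line = line.rstrip("\n")
                if i % 4 in (1, 3):        # sequence line and quality line
                    line = line[:keep]
                fout.write(line + "\n")

    # 100 bp reads trimmed by 25% -> keep the first 75 bases
    trim_fastq("sample_R1.fastq", "sample_R1.trim75.fastq", keep=75)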

Meanwhile, Experiment #2 is an RNAseq project where the investigator is interested in alternative splicing. We analyzed one earlier data set with 50 bp reads with only moderate success. It seems that very deep coverage is needed to get valid data for alt-splicing, especially when levels of a poorly expressed isoform are suspected to change by a small amount due to biological treatment. The investigator saw some published results suggesting that paired-end RNAseq data would provide more information about splicing isoforms. So, WITHOUT a bioinformatics consult, they sent an existing sample (created for 50 bp single-end sequencing) to the lab for 100 bp paired-end sequencing. This data came out of our pipeline with more than 20% error and a strange mix of incorrectly oriented read pairs (facing outward rather than inward). After a few days of head scratching and escalating levels of Illumina bioinformatics tech support, we have an explanation. A 225 bp library fragment contains 130 bp of primers and adapters, so the insert has an average size of about 95 bp. Some are shorter! Thus, our 100 cycle reads go off the far end of most inserts, adding 5 or more bases of adapter sequence where the alignment software is expecting genomic sequence. In addition, the paired ends overlap more than 100% - the start of one read is inside the end of the other - so they map in the opposite orientation, with an insert size of 5-10 bp. Our best effort to analyze this data will involve chopping all reads back to 36 bp and repeating the paired-end analysis. So that was 3 days of bioinformatics analysis time not so well spent on forensic QC.
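
The arithmetic behind that explanation, as a few lines of Python (the numbers are the rough figures quoted above):

    fragment_len = 225     # library fragment, adapters and primers included
    adapter_len = 130      # primers + adapters per fragment
    read_len = 100         # sequencing cycles per read

    insert_len = fragment_len - adapter_len              # ~95 bp of actual sample sequence
    adapter_readthrough = max(0, read_len - insert_len)  # adapter bases at the end of each read

    print("insert size: %d bp" % insert_len)
    print("each %d-cycle read runs %d bp into the adapter" % (read_len, adapter_readthrough))
    # Both mates are longer than the insert, so read 1 and read 2 cover the same
    # ~95 bp from opposite ends - which is why the aligner reports overlapping,
    # outward-facing pairs with tiny apparent insert sizes.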

Now we are looking back at Experiment #1 and wondering about insert sizes in that library. What if that library's insert size were about 110 or 120 bp (perhaps with a sizeable tail of much smaller fragments), and a fraction of the reads also ran off into the adapter, adding mismatched bases at the ends of alignments and thus jacking up the overall error rate?
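
One quick forensic check we can run without going back to the vendor is the insert-size distribution taken straight from the alignments. A rough sketch - the SAM file name is hypothetical, and this ignores the pairing flags a more careful version would check:

    from collections import Counter

    sizes = Counter()
    with open("experiment1.sam") as sam:
        for line in sam:
            if line.startswith("@"):               # skip header lines
                continue
            tlen = abs(int(line.split("\t")[8]))   # SAM column 9: observed template length
            if tlen:                               # 0 means the mate was not mapped as a pair
                sizes[tlen] += 1

    for size, count in sorted(sizes.items())[:40]:   # peek at the short end of the distribution
        print(size, count)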

Two conclusions: 1) talk to bioinformatics BEFORE you build your sequencing libraries
2) if you want something done right, do it yourself.