May 27, 2016

Functional Metagenomics from shotgun sequences

A number of groups at our research center have recently become interested in metagenomic shotgun sequencing (MGS), which is simply taking samples that are presumed to contain some microbes, extracting DNA, and sequencing all of it, shotgun style. This is seen as an improvement over metagenomics methods that amplify 16S ribosomal RNA genes using bacteria-specific PCR primers and then sequence those PCR products. The 16S approach has had a lot of success, in that essentially all of the important "microbiome" publications over the past 5 years have been based on it. The 16S approach has a number of advantages – the amount of sequencing effort per sample is quite small (1,000 to 10,000 sequences is usually considered adequate), and fairly robust computational methods have been developed to process the sequence data into abundance counts for taxonomic groups of bacteria.

There are a number of drawbacks to the 16S method. The data is highly biased by DNA extraction methods, PCR primers and conditions, DNA sequencing technology, and the computational methods used to clean, trim, cluster, and identify the sequences. What I mean by biased is that if you change any of these factors, you get a different set of taxa and abundances from the samples. Even when these biases are carefully addressed, the accuracy of the taxonomic calls is not very good or reliable. It is simply not possible to identify with high precision and accuracy all bacterial species (or strains) present in a DNA sample with just ~400 bp of 16S sequence data. Many 16S microbiome studies report differences in bacterial abundance at the genus or even the phylum level.

Two samples may be discovered to have reproducible differences in 16S sequence content, but the actual bacterial species or strains that differ are not confidently identified. Even more important, the low resolution of 16S studies may be missing a lot of important biology. Huge changes in environment can perhaps favor anaerobes over aerobes, but smaller changes in pH, nutrient abundance, or immune cell populations and function may cause a shift from one species in a genus to another. And this difference in species or strain may bring important changes in metabolite flux, immune system interaction, etc.

It has been proposed that bulk metagenomic shotgun sequencing (MGS) of all of the DNA in a biosample, rather than just PCR-amplified 16S sequences, would allow for more precise species and strain identification, and quantification of the actual microbial genes present. Some MGS methods also attempt to count genes in specific functional groups such as Gene Ontology terms or KEGG pathways. Other people would like to discover completely novel genes that innovative bacteria have developed to do interesting metabolic things. This leads to a computationally hard problem. MGS data sets tend to be very large (200 million to 1 billion reads per sample), and databases of bacterial genes are incomplete. In fact, we probably have complete genome sequences for far less than 0.1 percent of all bacteria in the world. We might also like to identify DNA from archaea, viruses, and small eukaryotes in our samples.

Each fragment of DNA in an MGS data file has to be identified based on some type of inexact matching to some set of reference genes or genomes (a sequence alignment problem), which is computationally very demanding. In my experience, BLAST is the most sensitive tool for matching diverged DNA sequences, but depending on the size of the database, it takes from 0.1 to 10 CPU seconds to align each sequence. 100,000 sequences takes at least overnight (on 32 CPUs, 128 GB RAM), if not all week; I have never tried a billion. Multiply that by a billion reads per sample and you can see that we have a serious compute challenge. We have at least 200 samples with FASTQ files queued up for analysis, more being sequenced, and more investigators preparing to start new studies. So we need a scalable solution, and this is not a problem that can be solved just by brute-force BLAST searching on ever bigger collections of computers.
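To put rough numbers on that, here is a back-of-the-envelope sketch of the BLAST cost for a single sample. The per-read times are just the 0.1 to 10 CPU-second range quoted above, and the 1,000-core cluster is hypothetical:

    # Back-of-the-envelope BLAST cost for one MGS sample.
    # Per-read times are the 0.1-10 CPU-second range quoted above;
    # the 1,000-core cluster is a hypothetical, generously sized resource.

    reads_per_sample = 1_000_000_000   # ~1 billion reads
    cores = 1000

    for sec_per_read in (0.1, 10.0):
        cpu_hours = reads_per_sample * sec_per_read / 3600
        wall_days = cpu_hours / cores / 24
        print(f"{sec_per_read:>4} s/read: {cpu_hours:,.0f} CPU-hours, "
              f"~{wall_days:,.1f} days wall time on {cores} cores")

Even at the optimistic end of that range (a bit over a day per sample on 1,000 cores), 200 samples would tie up that entire hypothetical cluster for the better part of a year.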

As more investigators have become interested in MGS, computing methods for processing this data have popped up like weeds. It's really difficult to review and benchmark all of these tools, since there are no clear boundaries for what analysis results they should deliver or what methods they should use. The omictools.com website lists 12 tools for "metagenomics gene prediction" and 5 for "metagenomics functional annotation", however those categories might be defined. The best benchmark paper I could find (Gardner et al. 2015) looks at about 14.

Should the data be primer and quality trimmed (YES!), human and other contaminants removed (YES!), duplicates removed (maybe not???), clustered (????), assembled into contigs or complete genomes? Then what sort of database should the fragments be aligned to – the PFAM library of protein motifs, a set of complete bacterial genomes, some set of candidate genes?

So far, the most successful tools focus on the taxonomy/abundance problem – choosing some subset of the sequence data and comparing it to some set of reference sequences. I have chosen MetaPhlAn <http://www.ncbi.nlm.nih.gov/pubmed/22688413> to process our data because it does well in benchmark studies, runs quickly on our data, and Curtis Huttenhower has a superb track record of producing excellent bioinformatics software. Before profiling, we removed primers and quality-trimmed with Trimmomatic, then removed human sequences by using Bowtie2 to align against the human reference genome (human reads make up as much as 90% of the data in some samples). Using MetaPhlAn on the cleaned data files, we got species/abundance counts for our ~200 MGS samples in less than a week. [Note: when I say 'we', I actually mean that all the work was done by Hao Chen, my excellent bioinformatics programmer.]
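For the record, the clean-and-profile steps boil down to something like the sketch below. File names, index paths, thread counts, adapter files, and trimming parameters are all placeholders, and the MetaPhlAn invocation is for MetaPhlAn2 from memory – check each tool's documentation for the version you have installed:

    import subprocess

    sample = "sample01"  # hypothetical sample name; in practice loop over all ~200

    # 1. Primer/adapter removal and quality trimming (Trimmomatic, paired-end).
    subprocess.run([
        "java", "-jar", "trimmomatic.jar", "PE", "-threads", "8",
        f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz",
        f"{sample}_R1.trim.fastq.gz", f"{sample}_R1.unpaired.fastq.gz",
        f"{sample}_R2.trim.fastq.gz", f"{sample}_R2.unpaired.fastq.gz",
        "ILLUMINACLIP:adapters.fa:2:30:10", "SLIDINGWINDOW:4:20", "MINLEN:50",
    ], check=True)

    # 2. Human read removal: align the trimmed pairs to the human reference with
    #    Bowtie2 and keep only the pairs that fail to align (--un-conc-gz).
    #    Bowtie2 derives the two mate filenames from the path given here.
    subprocess.run([
        "bowtie2", "-p", "8", "-x", "human_reference_index",
        "-1", f"{sample}_R1.trim.fastq.gz", "-2", f"{sample}_R2.trim.fastq.gz",
        "--un-conc-gz", f"{sample}_nonhuman.fastq.gz",
        "-S", "/dev/null",
    ], check=True)

    # 3. Taxonomic profiling of the cleaned reads; MetaPhlAn writes its
    #    clade/abundance table to stdout. (Adjust the input filename to whatever
    #    mate files Bowtie2 actually produced; mates can be comma-separated.)
    with open(f"{sample}_metaphlan_profile.txt", "w") as out:
        subprocess.run([
            "metaphlan2.py", f"{sample}_nonhuman.1.fastq.gz",
            "--input_type", "fastq", "--nproc", "8",
        ], stdout=out, check=True)

In practice this is wrapped in a loop (or a cluster job array) over all of the samples.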


Figure 1. An intentionally fuzzy top-species abundance heatmap of pre-publication data for some MGS samples processed with MetaPhlAn. If we added info about which samples came from which patients, this might be considerably more interesting.


Moving on, our next objective is to identify microbial protein-coding genes and metabolic pathways. Here, the methods become much more challenging, and difficult to benchmark. I basically don't like the idea of assembling reads into contigs. It adds a lot of compute time, creates huge inconsistency among different samples (some will assemble better than others), and introduces all kinds of bias for various species (high vs. low GC genomes, repeats, etc.) and genes. Also, I'm having a hard time coming up with a solid data analysis plan that meets all of our objectives. Bacteria are diverse. A specific enzyme that fulfills a metabolic function (for example in a MetaCyc pathway) could differ by 80% of its DNA sequence from one type of bacteria to another, but still do the job. Alternatively, there are plenty of multi-gene families of enzymes in bacteria with paralogs that differ in DNA sequence by 20% or less within the genome of a single organism, but perform different metabolic functions. And of course there are sequence variants in individual strains that inactivate an enzyme, or just modify one of its functions. There is no way we can compute such subtle stuff on billions of raw Illumina reads from a mixture of DNA fragments from unknown organisms (many of which have no reference data). So how do we compute a report that realistically describes the functional differences in gene/pathway metabolic capacity between different sets of MGS samples?
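Whatever alignment or domain-matching method ends up producing them, the report I have in mind starts from a per-sample table of gene-family or pathway counts that can be normalized and compared between groups of samples. A minimal sketch of that final comparison step, in which the input file, column layout, sample names, and group assignments are all hypothetical:

    import pandas as pd
    from scipy.stats import mannwhitneyu

    # Hypothetical input: rows = pathways (or PFAM families), columns = samples,
    # values = read counts assigned to each pathway in each sample.
    counts = pd.read_csv("pathway_counts.tsv", sep="\t", index_col=0)

    # Normalize to relative abundance so samples with different sequencing
    # depths are comparable.
    rel = counts / counts.sum(axis=0)

    # Hypothetical grouping of samples into the two sets we want to compare.
    group_a = ["sampleA1", "sampleA2", "sampleA3"]
    group_b = ["sampleB1", "sampleB2", "sampleB3"]

    results = []
    for pathway, row in rel.iterrows():
        stat, p = mannwhitneyu(row[group_a], row[group_b], alternative="two-sided")
        results.append((pathway, row[group_a].mean(), row[group_b].mean(), p))

    report = pd.DataFrame(results, columns=["pathway", "mean_A", "mean_B", "p_value"])
    report.sort_values("p_value").to_csv("pathway_differences.tsv", sep="\t", index=False)

The hard part, of course, is generating trustworthy pathway counts in the first place, which is exactly the step I am unsure about.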

If I have to design the gene content identifier myself, I will probably translate all reads in 6 frames, then use hmmscan vs. the PFAM library of known protein functional domains. The upside is that I know how to do this, and it is reasonably sensitive for most types of proteins. The downside is that many proteins will receive only a very general function (i.e. "kinase" or "7-TM domain"), which does not reveal exactly what metabolic functions they are involved in. And of course some 30-40% of proteins will come up with no known function – just like in every new genome that we sequence. Another way to do this would be to use BLASTx against the set of bacterial proteins from UniProt. I can speed this up a bit by taking the UniRef 90% identity clusters (which reduces the database size by about 25%). An alternate proposal, which is much more clumsy, slower, and ad hoc, would be to BLASTn against a large set of bacterial genomes: either all the complete microbial genomes in GenBank, or all the ones collected by the Human Microbiome Project, or some taxonomically filtered set (one genome per genus???). The point of searching against many complete genomes is that we have some chance of catching rare or unusual genes that are not well annotated as a COG or KEGG pathway member.
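A minimal sketch of the 6-frame-translate-then-hmmscan idea, using Biopython for the translation. The file names, CPU count, and E-value cutoff are placeholders, Pfam-A.hmm has to be downloaded and hmmpress'ed first, and a production version would batch the reads and properly parse the domain table:

    import subprocess
    from Bio import SeqIO

    # 6-frame translate each read and write the peptides to FASTA.
    with open("reads_6frames.faa", "w") as out:
        for rec in SeqIO.parse("sample_nonhuman.fastq", "fastq"):
            for strand, seq in zip("+-", (rec.seq, rec.seq.reverse_complement())):
                for frame in range(3):
                    # Trim to a multiple of 3 so translate() sees whole codons.
                    sub = seq[frame:]
                    sub = sub[: len(sub) - len(sub) % 3]
                    pep = sub.translate()  # stop codons become '*'
                    out.write(f">{rec.id}_{strand}{frame}\n{pep}\n")

    # Search the translated reads against the Pfam-A profile library;
    # --domtblout produces a parseable per-domain hit table.
    subprocess.run([
        "hmmscan", "--cpu", "8", "-E", "1e-5",
        "--domtblout", "reads_vs_pfam.domtbl",
        "Pfam-A.hmm", "reads_6frames.faa",
    ], check=True)

This is still a big job on billions of reads, but one attraction is that Pfam is a fixed-size library of profiles rather than an ever-growing sequence database.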

According to Gardner et al. (Sci. Rep. 2015), the best tools for identifying gene content in MGS samples are the free public servers MG-RAST and the EBI Metagenomics portal, so we have submitted some test samples to these. However, the queue for processing by MG-RAST is at least 10 days, so this is probably not going to satisfy my backlog of metagenomics investigators, who are rapidly cranking out more MGS samples. We probably need a local tool that is under our own control in terms of compute time and more amenable to tweaking toward the goals of each project.

At this point I'm asking blog readers for suggestions for a decent method or tool to identify gene content in big MGS FASTQ files – one that will be reasonably accurate in terms of protein/pathway function and will not crush my servers. Some support from a review or benchmark paper (other than one by the tool's own authors...) would be nice as well.