Dec 4, 2012

Juggling Elephants

We got hit by a hurricane.

Unfortunately, that is not some cute metaphor about large amounts of NGS data, but a very real and literal hurricane. NYU Med Center is located on 1st Ave in Manhattan (between 32nd and roughly 26th streets, depending on which buildings you are counting). It overlooks the scenic FDR Drive, the East River, and some nice factories and warehouses in Long Island City, Queens. During Hurricane Sandy (a couple of days before Halloween), the East River hit us with a 15 foot (4.5 meters) storm surge. This was considerably higher than had ever before been recorded here in NYC. It flooded the subway system and the train and highway tunnels, and the associated winds knocked out electrical power for many millions of people.

The damage to NYU Med buildings was truly overwhelming. All of our hospitals had to be evacuated (Tisch, Rusk, Bellevue), and power, backup generators, and all associated building services were knocked out. Our Informatics Center kept our computing cluster in a data center located in the sub-basement of the main Medical Sciences Building. This room was flooded to the ceiling (and the water half-way filled the floor above) for a couple of days with water from the East River mixed with fuel oil and the contents of the adjacent mouse breeding lab and Gamma Knife medical suite. Needless to say, those computers (700 nodes) were a total loss - bagged and hauled out as toxic waste. Thanks to foresightful and heroic efforts by our Computing Directors, we managed to save the data backup (200 Terabytes of sequencing data on an Isilon cluster). Our Next-Gen Sequencing lab was not directly damaged by the storm, but 4 weeks later it is in a building that still has no power, AC, or running water, and the building is still being assessed for asbestos cleanup and for whatever structural repairs will be needed.

So that little vignette is a preface to a discussion of outsourcing our sequencing and computing in order to maintain Next-Gen sequencing services for NYU research scientists. A number of labs stepped up to offer us help. We are doing HiSeq and MiSeq runs at the New York Genome Center and Memorial Sloan Kettering (thanks guys). We moved our data storage to a data center in New Jersey, and we are borrowing some computing power at the NYU Center for Genomics and Systems Biology (thanks guys) and renting some power on the Amazon cloud. Our NGS lab director is moving DNA/RNA samples and prepared sequencing libraries around town by taxi. My challenge is trying to organize the flow of data from the labs back to the investigators (and to keep an archive in our data storage). This represents something like 500 GB of data per HiSeq run, and we are getting two or more of these per week. The time for data transfer is a significant obstacle - either by FTP over the open Internet (and through one or two firewalls) or by copying onto a USB drive (and then copying from USB to our local computers, then copying again to our archive and to remote data processing machines, which may have their own FTP servers). This is why it is starting to feel like we are juggling elephants.
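To put rough numbers on that, here is a back-of-the-envelope sketch of how long one 500 GB run takes to move. The link speeds are purely illustrative assumptions, not measurements from our network or drives, and the function name is made up for this example.

```python
def transfer_hours(size_gb: float, rate_mbit_s: float, copies: int = 1) -> float:
    """Hours needed to move size_gb gigabytes at a sustained rate (megabits/s).

    copies is the number of sequential hops (e.g. USB -> local -> archive).
    """
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB = 8000 megabits
    return copies * size_megabits / rate_mbit_s / 3600


RUN_SIZE_GB = 500  # approximate size of one HiSeq run, from the text above

# Hypothetical sustained throughputs; real rates depend on firewalls,
# disk speed, and how busy the link is.
ASSUMED_RATES_MBIT_S = {
    "slow shared link": 50,
    "100 Mbit/s link": 100,
    "1 Gbit/s link": 1000,
}

for label, rate in ASSUMED_RATES_MBIT_S.items():
    one_hop = transfer_hours(RUN_SIZE_GB, rate)
    three_hops = transfer_hours(RUN_SIZE_GB, rate, copies=3)
    print(f"{label}: {one_hop:.1f} h for one copy, {three_hops:.1f} h for three")
```

Under these assumptions a single copy at a sustained 100 Mbit/s takes about 11 hours, and every extra hop in the chain adds the same again. With two or more runs arriving per week, the copies start to overlap.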

Having run our own NGS lab for over 3 years, we are extremely aware of all of the different types of errors that can occur in the sequencing process (sample prep, pooling of libraries, recording incorrect barcodes in sample sheets, machine mechanics and fluidics, computational glitches). Therefore we are rather obsessive about QC checking of the data. Maybe we are just working the kinks out of the system, but it seems like almost every run has some problem that requires us to re-process the primary data (from the Run folder with its .BCL and other files, not just the final FASTQ). I hope we are not wearing out the patience of our generous collaborators.

Someone may be able to run a fully outsourced NGS lab and bioinformatics computing support service, but for us, it has not been easy.