When I started my current position at the Center for Dark Energy Biosphere Investigations, I was quite happy to apply the bioinformatic skills I had developed through graduate school to a ‘large’ metagenomic dataset from a subseafloor aquifer. But as things go, when you have just enough ‘skill’ to be dangerous, you become disenchanted with the tools other people have created and you say to yourself, “I can certainly make a better tool than these”. And so started a process that led to me handing off a half-developed, semi-functional Python script to a graduate student in the lab, Elaina Graham, and that resulted, a year later, in a new binning tool we called BinSanity. I used this new tool on my 21 subseafloor metagenomes and sat there quite happy and content about the new microbial genomes that I had created – for about a week. With this new hammer, I started looking for nails. But within the deep biosphere community, coming across large numbers of metagenomes from related samples is rare (and those that exist are generally being worked on by friends of mine). So through sheer serendipity, during a conversation with Dr. Bonnie Hurwitz at the University of Arizona, she mentioned the large metagenomic dataset from ‘Tara’ and suggested that maybe BinSanity could be applied to that.
It was a yearlong process of downloading and assembling the Tara Oceans metagenomic data. We started with a proof of concept using just the data generated for the Mediterranean Sea and then spent some time refining the BinSanity methodology. But the size of the Tara Oceans metagenomic dataset (7.3 TB of compressed data, 102 billion paired reads) can make it unwieldy, especially when working on a shared computing resource. We broke up the dataset into smaller parts and, working province by province, assembled and binned the Bacteria and Archaea of the surface oceans. It took about a month per province, including several LONG assembly times – like the 41 calendar days (not computing days) needed for the secondary assembly of the North Atlantic.
As the Tara Oceans data was open access, we made it our goal to make all of our output open access too – including the methods and the raw data (Interested in the 120 GB of our primary assembly? It’s all yours). Before becoming available on NCBI, our dataset was released on figshare, along with a bioRxiv preprint. Elements of the dataset have already been downloaded 400+ times. And that has been the goal of this project: to make sure the dataset can be used by the scientific community for the “next step”. Many big questions about the oceans are limited by a lack of microbial genomes. Projects involving large-scale metatranscriptomics and metaproteomics can answer questions about which metabolic processes are actually occurring in the oceans, and they are strengthened by comprehensive reference databases. Global oceanic models need microbial genomes to complete the picture of which processes co-occur within the same organism, in order to turn metabolic black boxes into discrete model components.
But ultimately this dataset is only one piece of the puzzle. Metagenome-assembled genomes are not perfect, for a number of reasons: (1) promiscuous data can end up associated with the wrong organism, (2) the genomes do not represent a single real organism, but rather an amalgamation of multiple closely related organisms, (3) in many instances, only part of the genome is recovered, and (4) they currently capture only a fraction of the microbial community (Fig. 1). The application of new or improved binning techniques, further re-analysis of Tara, and the addition of more samples will be required to complete the picture for the marine Bacteria and Archaea.
The corresponding published paper in Scientific Data is here: http://go.nature.com/2FNUNpk