Shotgun metagenomics, the direct sequencing of all DNA found in an environmental sample, grants access to a tremendous amount of data from all domains of life. These data are often awfully large and complex, and intimidate most of those who dare to gaze into their depths. The perpetual arms race between data generation technologies and data analysis strategies has historically been led by the former, but we are now at such a point in microbiology that the latter truly crumbles under the weight of the incoming metagenomes. So, it really is OK if you are a microbiologist and feel overwhelmed. You are not alone.
The Tara Oceans Project’s gift to microbiology is one of the most remarkable metagenomic resources yet for studying naturally occurring microbes in marine ecosystems. Our study in this issue of Nature Microbiology is one of many that dove into this public treasure. Over the last few years we have benefited from Tara Oceans in many ways as a community, but there is certainly much more to be learned from it. For instance, just from the perspective of genome-resolved insights, the non-redundant microbial population genomes we could reconstruct in our study represent only about 7% of all sequences we worked with. Ben Tully and his colleagues also reconstructed genomic bins from Tara, and their collection made sense of about 15%. These numbers are both exciting and depressing, because we are barely scratching the surface of the Tara Oceans Project data the way the Tara Oceans Project data are barely scratching the surface of microbial life in the world’s oceans.
One could ignore this revolution and choose a simpler life, but mining metagenomes truly pays back. Let me try to substantiate this a little in the context of our study. Nitrogen is vital to life, as it is one of the main components of the building blocks of the cell, including DNA and proteins. Bioavailability of this element in the surface ocean directly impacts microbial primary productivity, which is responsible for the production of nearly half of the oxygen in the atmosphere. One may say “life made an excellent choice by flourishing here on Earth then, because the vast majority of our planet’s atmosphere is made of nitrogen”. But to life’s inconvenience, nitrogen in the atmosphere occurs as rigid molecules in which two atoms are held together by an extremely strong triple bond, and which are unavailable to most biological processes. So, the way microbes that crave nitrogen look at the atmosphere from the surface ocean is akin to the way a thirsty person looks at the ocean surface from the shores of the Namib Desert. That’s why microbes that are able to take the atmospheric dinitrogen gas and convert it into more accessible forms are so important for a properly functioning Earth. In our study we discovered very abundant, yet previously unrecognized microbial populations in the surface ocean that have the capacity to ‘fix’ nitrogen. All things considered, it is intriguing that these particular microbes waited this long for an opportunity to introduce themselves to us. Is the new breed of computer scientists with their geeky t-shirts and fancy computational toys coming to get you? Maybe. But maybe there is a message here about the importance of unleashing the wrath of microbiologists on ever-growing datasets by giving them the power to mine these public resources themselves. Because it wasn’t a computer scientist who got this story out of Tara Oceans. It was a microbiologist holding a powerful tool.
We sorely need microbiologists to be able to pose their creative questions directly to complex data. And not by making them do the same analyses over and over again to get the same Excel spreadsheets with different numbers, but by empowering them to tailor and implement analysis strategies inspired by their spontaneous questions on the fly, and to communicate their findings in a reproducible and reusable form without any direct help from the computational folk. You may say "yeah, it is easier said than done, geeky t-shirts guy who can code in snake language". While I understand why you would say that, I respectfully disagree. Because in addition to nitrogen fixation in the surface ocean, our study taught me what microbiologists can do when they are given software platforms that can support their complex thinking.
In our study we were able to reconstruct nearly a thousand population genomes from surface ocean metagenomes... and by 'we', I mean Tom Delmont. He single-handedly reconstructed all of them using a hybrid approach that combines automatic binning and manual refinement. He then screened them to ensure their level of purity, so that when we needed to, we could say with confidence which populations those nitrogen fixation pathways belonged to. Here, Tom is refining population genome bins interactively using anvi’o, and explaining to me enthusiastically how much he enjoys being photographed while working:
This work was a significant undertaking, and it would have been impossible to complete without the valuable contributions of Christopher Quince (University of Warwick), Alon Shaiber (University of Chicago), Sonny T. M. Lee (University of Chicago), Michael S. Rappé (University of Hawaii at Mānoa), Sandra L. MacLellan (University of Wisconsin-Milwaukee), and Sebastian Lücker (Radboud University). Getting this story out was very important, but it was equally important for us to develop anvi’o in the process in such a way that it would be possible for others to use it for similar purposes without our help. This required Özcan Esen, Tom, and me to work very closely. The long hours we spent together taught Özcan and me a lot about what microbiologists needed to be able to do their own complex analyses, and forced us to explore various design principles to investigate what we, software experts, could do to facilitate that.
In my opinion, developing software platforms with microbiologists is much more challenging, yet much more needed, than developing software platforms for microbiologists. And so we took the longer route. For instance, Özcan and I did not help Tom by analyzing the data for him, but we helped him by making anvi’o more powerful and able, so he could use his training in microbiology and microbial ecology to investigate the data himself. Binning and refining population genomes from Tara Oceans metagenomes, of course, was only the beginning of the actual research that followed. Tom then investigated the distribution of refined population genomes across metagenomes, characterized their functional potential, studied the organization of genes in interesting pathways, put them in evolutionary contexts, investigated single-nucleotide variants in metagenomes to make sense of the heterogeneity of environmental populations, and did another few dozen things. While we made anvi’o able to deliver these things, we did it openly, so we are not the owners of our source code: it is as much yours as it is ours. We shared our analyses and workflows openly and in reproducible forms so others could scrutinize our methods and guide us. We wrote extensive tutorials and blog posts so microbiologists elsewhere could deal with their metagenomes and pangenomes without needing us. In fact, we did it so openly that a lovely paper that used the recovery of single-amino acid variants from metagenomes, one of the analytical concepts we have been implementing in anvi’o, got published before ours appeared in conventional outlets. Some of those who cared about our careers thought we were naïve. It felt right. I hope we can remain that way, and be able to say “everything was beautiful, and nothing hurt” at the end, if it comes to that.
Our journey of almost two years through the ocean of Tara Oceans data resulted in a myriad of interconnected little tools that can talk to each other through powerful data structures in an open-source software habitat that is supported by over 40,000 lines of code. The rapidly growing functionality crammed into anvi’o with every new release is not only driven by the needs of our own science in the lab, but is also inspired by the needs of our early adopters, who make anvi’o sound pleasant and easy. But in reality anvi’o is not easy, and it is much less pleasant compared to how not easy it is. In fact, I think it comes with quite a miserable combination of things for those who are in a rush: a steep learning curve, many tutorials to read, and no boilerplate data analysis strategies hidden behind ‘next, next, finish’ buttons or copy-paste commands. There are certainly much easier ways to work with metagenomic data. But in anvi’o’s defense, working with metagenomes the way microbiologists should be able to work with them is unlikely to be fully addressed by cutting corners of biology for the sake of computational conveniences, or by catering ‘easy’ to the curious. Easy beginnings often suffocate curiosity in the end (and I would like to thank all video game developers for preparing me for this moment).
Many groups work towards developing more effective and more open tools for current data analysis needs. Many of these groups also believe in the power of reproducible workflows and open-science practices to get the most out of the data generated with our scarce public resources. We follow in their footsteps, and do our best to develop anvi’o with these principles in mind. In addition, perhaps because we wander in the gray zone where computation and microbiology truly meet, we try to put in extra effort to let the microbiologists run the show. If you are a microbiologist today, it is extremely likely that publicly available metagenomes can tremendously influence whatever you are studying. Binning genomes from metagenomes is only a fraction of the many ways metagenomes can substantiate observations emerging in the lab with naturally occurring microbial life in marine environments, terrestrial habitats, or mammalian guts. If you are having a hard time seeing how shotgun metagenomes could ever impact your research, invite your local Assistant Professor who is known for her love for metagenomes for a coffee to discuss your research. Assistant Professors love procrastinating over coffee. So you are probably golden.