The STATegra multi-omics dataset was created to provide a complete, high-quality resource that could support method development by the bioinformatics community.
We chose a highly reproducible biological system so that the availability of biological material to supply multiple omics assays would not be a limitation. The Ikaros inducible system in the mouse B3 cell line meets these requirements. Cells can be grown as much as needed to obtain sufficient material to split among several omics experimental protocols. Moreover, well-studied biomarkers were available to monitor the differentiation process by qPCR and so verify that different batches were synchronized. Still, the STATegra data collection cannot be considered a fully complete dataset, because of the different requirements of the omics library preparations and the very nature of the different platforms. For example, chromatin accessibility data requires living material as the starting point for library preparation, while frozen material can be used for gene expression, methylation and proteomics experiments. In STATegra, library preparations were performed in labs specialized in each omics method. While the same culture batches could be used for technologies compatible with frozen material, new differentiation experiments had to be run at the facilities that used living cells. This created a disparity of batches across omics. Additionally, some samples failed during the experimental procedures and had to be profiled again to maintain the designed number of replicates. Finally, some omics were added to the collection at a later stage of the project, when new technologies, such as single-cell methods, became available. As a result, multiple and mixed experimental batches were involved in generating the entire collection. Batches matter in omics data analysis because they may be a source of technical bias and because they constrain the analysis methods that can be applied to the data. We have addressed this issue by providing full documentation of the batch relationships among samples of different omics types.
The STATegra project included the STATegraEM annotation system to manage this complexity in sample relationships. We believe this is an important aspect of creating large multi-omics datasets, which require, first, good sample metadata annotation practices and, second, data analysis strategies that are compatible with the sample blocking factors that are unavoidable in such large-scale experiments.
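To make the idea of documenting batch relationships concrete, the following is a minimal sketch, not the STATegraEM implementation. It assumes each sample is annotated with an omics type and a batch label, and reports which pairs of omics types share at least one batch. The sample records, batch labels, and function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical annotations: (sample_id, omics_type, batch_label).
# These are illustrative values, not records from the STATegra collection.
SAMPLES = [
    ("s1", "gene_expression", "batch_A"),
    ("s2", "proteomics", "batch_A"),
    ("s3", "chromatin_accessibility", "batch_B"),
    ("s4", "gene_expression", "batch_C"),
    ("s5", "methylation", "batch_C"),
]


def batches_by_omics(samples):
    """Map each omics type to the set of batches it was profiled in."""
    grouped = defaultdict(set)
    for _sample_id, omics, batch in samples:
        grouped[omics].add(batch)
    return dict(grouped)


def shared_batches(samples):
    """For every pair of omics types, return the batches they have in common.

    An empty set means the two omics types are fully confounded with batch,
    so batch acts as a blocking factor when they are analyzed jointly.
    """
    per_omics = batches_by_omics(samples)
    return {
        (a, b): per_omics[a] & per_omics[b]
        for a, b in combinations(sorted(per_omics), 2)
    }


if __name__ == "__main__":
    for (a, b), shared in shared_batches(SAMPLES).items():
        label = ", ".join(sorted(shared)) if shared else "no shared batch"
        print(f"{a} vs {b}: {label}")
```

A table of this kind lets an analyst see at a glance which omics combinations can be compared within a batch and which require methods that account for batch as a blocking factor.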
Gomez-Cabrero et al. STATegra, a comprehensive multi-omics dataset of B-cell differentiation in mouse. Sci Data. 2019 Oct 31;6(1):256. doi: 10.1038/s41597-019-0202-7.