UINMF performs mosaic integration of single-cell multi-omic datasets using nonnegative matrix factorization

Single-cell genomic technologies provide an unprecedented opportunity to define molecular cell types in a data-driven fashion, but present unique data integration challenges. Here we present UINMF, a computational tool designed to improve dataset integrations by leveraging unshared features.

In a moment of exceptional irony, Robert Hooke named the smallest unit of life “cells” because they reminded him of the rectangular, dormitory-style rooms of monasteries. These modest rooms would have been relatively indistinguishable from each other. Similarly, when viewed through a low-powered microscope, cells do not appear to be individually extraordinary or diverse. Yet there is an incredible depth and breadth to the number and types of cells that compose the human body, and single-cell genomics is playing a pivotal role in helping scientists untangle these cellular identities.

Several kinds of information can be used to characterize cell types, including a cell’s transcriptomic, epigenomic, and spatial information. A cell’s transcriptomic and epigenomic information is typically provided by sequencing modalities such as single-cell RNA-seq and single-cell ATAC-seq, respectively. Spatial transcriptomics, named Method of the Year 2020 by Nature Methods [1], is helping researchers untangle how a cell’s position in a tissue influences its development and function.

The information captured by each modality independently gives us incredible insight into not only a cell’s type, but also which characteristics it does and does not share with other cells. Ultimately, a cell should be jointly categorized by its transcriptomic, epigenomic, and spatial characteristics. Although researchers are working to develop assays that combine measurements from multiple modalities, such as the new 10x Genomics Multiome kit [2] that delivers both epigenomic and transcriptomic information from the same cell, no single technology simultaneously measures all three.

Defining cell profiles from multiple modalities that may have been measured at different times, or in different samples, requires computational methods. Current tools for such integration include Seurat [3], Harmony [4], and LIGER (Linked Inference of Genomic Relationships) [5]. Each of these methods combines datasets using features, usually genes, that are shared between datasets.

However, in many cases, datasets have a large number of features that are unshared, that is, not present in all datasets (Figure 1). As a result, a large amount of information is discarded prior to dataset integration. Intuitively, these unshared features might contain information that could dramatically improve the results of integration. This motivated us to develop UINMF.

Figure 1. There are a number of instances in which the lack of shared features between datasets results in a large amount of pertinent information (shown in yellow) being discarded as unusable by algorithms only capable of integrating across shared features. 

The original LIGER framework uses integrative nonnegative matrix factorization (iNMF) to identify the key factor loadings that underlie the data. UINMF extends the LIGER algorithm by incorporating an extra matrix, U, that accounts for the features unshared between datasets. Consequently, UINMF integrates datasets using more information. Perhaps the most exciting feature of UINMF is that it makes no assumptions about the type of data that can be included as unshared features, allowing UINMF to be applied to a broad variety of problems. We demonstrate the utility of UINMF in three distinct scenarios.
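To make the role of the extra matrix U more concrete, here is a minimal sketch in Python (NumPy) that evaluates a UINMF-style objective for a set of already-factorized datasets. It is an illustration under stated assumptions, not the LIGER implementation: the function name `uinmf_objective`, the cells-by-features orientation, and the single penalty weight `lam` are our choices for this example, and the form of the objective (shared loadings W padded with zeros over the unshared features, plus dataset-specific loadings V and U) reflects our reading of the published formulation.

```python
import numpy as np

def uinmf_objective(E, P, H, W, V, U, lam):
    """Evaluate a UINMF-style objective (illustrative sketch, not the LIGER code).

    Assumed shapes for dataset i (cells x features orientation is an assumption):
      E[i]: cells x shared features      P[i]: cells x unshared features (may have 0 columns)
      H[i]: cells x k factors            W:    k x shared features (common to all datasets)
      V[i]: k x shared features          U[i]: k x unshared features
    """
    total = 0.0
    for E_i, P_i, H_i, V_i, U_i in zip(E, P, H, V, U):
        # Stack shared and unshared features side by side.
        data = np.hstack([E_i, P_i])
        # Shared loadings contribute nothing to the unshared features.
        shared = np.hstack([W, np.zeros_like(U_i)])
        # Dataset-specific loadings cover both shared and unshared features.
        specific = np.hstack([V_i, U_i])
        # Reconstruction error for this dataset.
        total += np.linalg.norm(data - H_i @ (shared + specific), "fro") ** 2
        # Penalty on the dataset-specific part of the reconstruction.
        total += lam * np.linalg.norm(H_i @ specific, "fro") ** 2
    return total
```

If a dataset has no unshared features, its P and U can be given zero columns, and its terms reduce to the iNMF-style objective over shared features only.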

When integrating scRNA-seq and snATAC-seq data, we are integrating transcriptomic information with epigenomic information. The transcriptome is constrained to measurements of genes, or gene-centric regions. Epigenomic assays, however, measure both gene-centric regions and the intergenic regions that lie between genes. Using UINMF, we were able to illustrate the benefit of including the intergenic information as unshared features when integrating across modalities.

While spatial transcriptomics is currently one of the most quickly developing areas of single-cell analysis, many spatial transcriptomic datasets still lack the sequencing depth and broad gene panel available in scRNA-seq assays. As a result, it is common to integrate spatial transcriptomic and scRNA-seq data to better characterize cellular profiles within a spatial context. However, many spatial transcriptomic datasets, such as STARmap from the mouse frontal cortex [6], measure far fewer genes than scRNA-seq. Using iNMF to integrate STARmap and scRNA-seq data, we would be constrained to the 28 genes shared between datasets, discarding the other 28,338 genes measured in the scRNA-seq data [7]. UINMF allows us to use the higher dimensionality of the scRNA-seq data to provide additional structure to the dataset integration, resulting in a better overall result (Figure 2).

Figure 2. The inclusion of unshared features greatly increases the quality of dataset integration for spatial transcriptomic and transcriptomic data.
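As a small illustration of how features divide into shared and unshared sets before an integration like this one, the sketch below splits a reference gene list against a smaller panel. The helper `split_features` and the toy gene lists are hypothetical and chosen only for this example; they are not the actual STARmap or scRNA-seq panels.

```python
def split_features(reference_genes, query_genes):
    """Split reference features into those shared with the query and those unshared.

    Order follows the reference list so downstream matrices stay aligned.
    """
    query = set(query_genes)
    shared = [g for g in reference_genes if g in query]
    unshared = [g for g in reference_genes if g not in query]
    return shared, unshared

# Hypothetical toy gene lists (not the real panels).
scrna_genes = ["Gad1", "Slc17a7", "Pvalb", "Sst", "Vip", "Cux2"]
spatial_genes = ["Gad1", "Slc17a7", "Pvalb"]

shared, unshared = split_features(scrna_genes, spatial_genes)
print(len(shared), len(unshared))  # prints: 3 3
```

A method restricted to shared features would keep only the `shared` list, whereas an approach that accepts unshared features could also use the `unshared` list from the richer dataset.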

UINMF can also be used to improve cross-species analysis. Model organisms play an integral role in developing our understanding of biological systems, but identifying the correspondence between the cell types of two species can be difficult. This lack of correspondence is biologically driven, as a large number of genes are not shared between organisms in a one-to-one manner, frustrating efforts to develop a cohesive understanding of biological processes as a whole. One strategy to overcome this gene-level lack of correspondence is to integrate the cells of two species using whatever shared genes are available. In theory, the large, overarching similarities between cell types should yield a decent matching between the same cell types gathered from different organisms. However, using only orthologous (i.e., shared) genes discards any unshared genes that may provide key distinctions between cell types within a specific organism. Using UINMF, we can include both orthologous and non-orthologous genes when integrating cross-species data, capturing more nuanced cell types while constructing cell type mappings between the species.

We originally developed UINMF to incorporate intergenic peaks into dataset integrations, and throughout the initial debugging process, we were focused intently on proving the merit of UINMF within this paradigm. Once we were confident that UINMF was improving these types of integrations, we began to wonder where else we could use it. It was incredibly exciting to brainstorm further applications of UINMF and decide which to prioritize investigating. The development of UINMF illustrates how important the scientific process of exploration is, and how a tool originally developed for a particular problem, when applied creatively, can have a much wider impact than previously envisioned.

UINMF is motivated by resourcefulness. Since current sequencing technologies cannot provide complete cellular profiles jointly characterized by transcriptomic, epigenomic, and spatial data, computational methods must be used instead. Unlike previously developed approaches, UINMF is able to leverage all of the information available, without having to subset or discard any relevant material. While we present several distinct scenarios where UINMF can be especially powerful, we are excited to continue exploring other avenues for its application.

You can explore UINMF in greater detail at https://rdcu.be/cGFpq.

References

1.    Marx, V. Method of the Year: spatially resolved transcriptomics. Nat. Methods 18, 9–14 (2021).
2.    Nuclei Isolation from Complex Tissues for Single Cell Multiome ATAC + Gene Expression Sequencing, Document Number CG000375 Rev B. 10x Genomics (2021).
3.    Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
4.    Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
5.    Welch, J. D. et al. Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity. Cell 177, 1873–1887.e17 (2019).
6.    Wang, X. et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 361, (2018).
7.    Saunders, A. et al. Molecular Diversity and Specializations among the Cells of the Adult Mouse Brain. Cell 174, 1015–1030.e16 (2018).

April Kriebel

Graduate Student, University of Michigan