Title: Hybrid Clustering of Single-Cell Gene Expression and Spatial Information via Integrated NMF and K-Means
Advances in single-cell transcriptomics have made it possible to study the identity of individual cells. This has led to the discovery of new cell types and to high-resolution tissue maps of them. Technologies that measure multiple modalities of such data add more detail, but they also complicate data integration. We offer an integrated analysis of the spatial location and gene expression profiles of cells to determine their identity. We propose scHybridNMF (single-cell Hybrid Nonnegative Matrix Factorization), which performs cell type identification by combining sparse nonnegative matrix factorization (sparse NMF) with k-means clustering to cluster high-dimensional gene expression and low-dimensional location data. We show that, under multiple scenarios, including cases where only a small number of genes are profiled and the location data are noisy, scHybridNMF outperforms sparse NMF, k-means, and an existing method that uses a hidden Markov random field to encode cell location and gene expression data for cell type identification.
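The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the scHybridNMF algorithm: the abstract does not give the joint objective, so the sketch runs plain (non-sparse) NMF with multiplicative updates on an expression matrix and Lloyd's k-means on cell coordinates as two separate steps. The function names `nmf_labels` and `kmeans_labels`, the random toy data, and all parameter values are assumptions made for illustration only.

```python
import numpy as np


def nmf_labels(X, k, n_iter=200, seed=0):
    """Cluster the columns of a nonnegative matrix X (genes x cells).

    Plain NMF, X ~ W @ H, fitted with multiplicative updates for the
    Frobenius loss. Each cell is assigned to the factor with the largest
    coefficient in its column of H. (Assumption: no sparsity penalty.)
    """
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + 1e-3
    H = rng.random((k, X.shape[1])) + 1e-3
    eps = 1e-9  # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return H.argmax(axis=0), H


def kmeans_labels(P, k, n_iter=50, seed=0):
    """Cluster the rows of P (cells x coordinates) with Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    centers = P[rng.choice(len(P), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center as is
                centers[j] = P[labels == j].mean(axis=0)
    return labels


# Hypothetical toy data: 30 genes, 40 cells, 2-D cell positions.
rng = np.random.default_rng(0)
expr = rng.random((30, 40))    # nonnegative expression values
coords = rng.random((40, 2))   # x/y location of each cell

expr_labels, H = nmf_labels(expr, k=3)
loc_labels = kmeans_labels(coords, k=3)
```

A hybrid method in the sense of the abstract would couple the two label sets inside one optimization rather than computing them independently as done here.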
Award ID(s):
2019771
NSF-PAR ID:
10319835
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Frontiers in Genetics
Volume:
12
ISSN:
1664-8021
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Large, comprehensive collections of single-cell RNA sequencing (scRNA-seq) datasets have been generated that allow for the full transcriptional characterization of cell types across a wide variety of biological and clinical conditions. As new methods arise to measure distinct cellular modalities, a key analytical challenge is to integrate these datasets or transfer knowledge from one to the other to better understand cellular identity and functions. Here, we present a simple yet surprisingly effective method named common factor integration and transfer learning (cFIT) for capturing various batch effects across experiments, technologies, subjects, and even species. The proposed method models the shared information between various datasets by a common factor space while allowing for unique distortions and shifts in genewise expression in each batch. The model parameters are learned under an iterative nonnegative matrix factorization (NMF) framework and then used for synchronized integration from across-domain assays. In addition, the model enables transferring via low-rank matrix from more informative data to allow for precise identification in data of lower quality. Compared with existing approaches, our method imposes weaker assumptions on the cell composition of each individual dataset; however, it is shown to be more reliable in preserving biological variations. We apply cFIT to multiple scRNA-seq datasets of developing brain from human and mouse, varying by technologies and developmental stages. The successful integration and transfer uncover the transcriptional resemblance across systems. The study helps establish a comprehensive landscape of brain cell-type diversity and provides insights into brain development.
  2. Nonnegative matrix factorization (NMF) is widely used to analyze high-dimensional count data because, in contrast to real-valued alternatives such as factor analysis, it produces an interpretable parts-based representation. However, in applications such as spatial transcriptomics, NMF fails to incorporate known structure between observations. Here, we present nonnegative spatial factorization (NSF), a spatially-aware probabilistic dimension reduction model based on transformed Gaussian processes that naturally encourages sparsity and scales to tens of thousands of observations. NSF recovers ground truth factors more accurately than real-valued alternatives such as MEFISTO in simulations, and has lower out-of-sample prediction error than probabilistic NMF on three spatial transcriptomics datasets from mouse brain and liver. Since not all patterns of gene expression have spatial correlations, we also propose a hybrid extension of NSF that combines spatial and nonspatial components, enabling quantification of spatial importance for both observations and features. A TensorFlow implementation of NSF is available from https://github.com/willtownes/nsf-paper.
  3. We present a machine learning workflow to discover signatures in acoustic measurements that can be utilized to create a low-dimensional model to accurately predict the location of keyhole pores formed during additive manufacturing processes. Acoustic measurements were sampled at 100 kHz during single-layer laser powder bed fusion (LPBF) experiments, and spatio-temporal registration of pore locations was obtained from post-build radiography. Power spectral density (PSD) estimates of the acoustic data were then decomposed using non-negative matrix factorization with custom k-means clustering (NMFk) to learn the underlying spectral patterns associated with pore formation. NMFk returned a library of basis signals and matching coefficients to blindly construct a feature space based on the PSD estimates in an optimized fashion. Moreover, the NMFk decomposition led to the development of computationally inexpensive machine learning models which are capable of quickly and accurately identifying pore formation with classification accuracy of supervised and unsupervised label learning greater than 95% and 90%, respectively. The intrinsic data compression of NMFk, the relatively light computational cost of the machine learning workflow, and the high classification accuracy make the proposed workflow an attractive candidate for edge computing toward in-situ keyhole pore prediction in LPBF.
  4. End-member mixing analysis (EMMA) is a method of interpreting stream water chemistry variations and is widely used for chemical hydrograph separation. It is based on the assumption that stream water is a conservative mixture of varying contributions from well-characterized source solutions (end-members). These end-members are typically identified by collecting samples of potential end-member source waters from within the watershed and comparing these to the observations. Here we introduce a complementary data-driven method (convex hull end-member mixing analysis – CHEMMA) to infer the end-member compositions and their associated uncertainties from the stream water observations alone. The method involves two steps. The first uses convex hull nonnegative matrix factorization (CH-NMF) to infer possible end-member compositions by searching for a simplex that optimally encloses the stream water observations. The second step uses constrained K-means clustering (COP-KMEANS) to classify the results from repeated applications of CH-NMF and analyzes the uncertainty associated with the algorithm. In an example application utilizing the 1986 to 1988 Panola Mountain Research Watershed dataset, CHEMMA is able to robustly reproduce the three field-measured end-members found in previous research using only the stream water chemical observations. CHEMMA also suggests that a fourth and a fifth end-member can be (less robustly) identified. We examine uncertainties in end-member identification arising from non-uniqueness, which is related to the data structure, of the CH-NMF solutions, and from the number of samples using both real and synthetic data. The results suggest that the mixing space can be identified robustly when the dataset includes samples that contain extremely small contributions of one end-member, i.e., samples containing extremely large contributions from one end-member are not necessary but do reduce uncertainty about the end-member composition.
  5. Transcranial magnetic stimulation (TMS) is gaining increasing attention for therapeutic treatment of mental illnesses. However, a clear understanding of its impact on the underlying brain mechanisms is critical for its effective application. For this, we analyze the electroencephalography (EEG) response to a subthreshold TMS pulse at the left motor cortex from 6 healthy controls and 6 schizophrenia patients. We use principal component analysis (PCA) along with sparse nonnegative matrix factorization (NMF), an unsupervised machine learning technique, on brain connectivity data established by sliding window coherence of EEG-based source-localized data. The source localization was achieved by using the sLORETA algorithm on our EEG data after artifact removal. This provides high temporal and spatial resolution in the connectivity analysis results, giving an advantage over other neuroimaging modalities. PCA aids in establishing the number of common underlying connected subnetworks (say k) across subjects, whereas NMF is employed to derive the spatial and temporal signatures of these k subnetworks' responses to the stimulus. Within these signatures, we studied motor cortical connectivity and found that schizophrenia patients exhibited sensory gating deficits as compared to controls. These findings can act as potential biomarkers to monitor TMS for clinical therapeutic techniques in the future.