

Search results for all records where Award ID contains 2013905


  1. Abstract

    One key challenge in single-cell data clustering is combining clustering results of data sets acquired from multiple sources. We propose to represent the clustering result of each data set by a Gaussian mixture model (GMM) and produce an integrated result based on the notion of the Wasserstein barycenter. However, the exact barycenter of GMMs, a distribution on the same sample space, is computationally infeasible to solve. Importantly, the barycenter of GMMs may not itself be a GMM with a reasonable number of components. We thus propose to approximate the Wasserstein metric by the minimized aggregated Wasserstein (MAW) distance and develop a new algorithm for computing the barycenter of GMMs under MAW. Recent theoretical advances further justify using the MAW distance as an approximation of the Wasserstein metric between GMMs. We also prove that the MAW barycenter of GMMs has the same expectation as the Wasserstein barycenter. Our proposed algorithm for clustering integration scales well with the data dimension and the number of mixture components, with complexity independent of data size. We demonstrate that the new method achieves better clustering results on several single-cell RNA-seq data sets than some other popular methods. (A minimal code sketch of the MAW distance appears after this list.)

  2. Free, publicly-accessible full text available August 1, 2024
  3. Free, publicly-accessible full text available June 1, 2024
  4. Alber, Mark (Ed.)
    Multi-view data can be generated from diverse sources, by different technologies, and in multiple modalities. In various fields, integrating information from multi-view data has pushed the frontier of discovery. In this paper, we develop a new approach for multi-view clustering that overcomes the limitations of existing methods, such as the need to pool data across views, restrictions on the clustering algorithms allowed within each view, and the disregard of complementary information between views. Our new method, called CPS-merge analysis, merges clusters formed by the Cartesian product of single-view cluster labels, guided by the principle of maximizing clustering stability as evaluated by CPS analysis. In addition, we introduce measures to quantify the contribution of each view to the formation of any cluster. CPS-merge analysis can be easily incorporated into an existing clustering pipeline because it requires only single-view cluster labels rather than the original data, so advanced single-view clustering algorithms can be applied readily. Importantly, our approach accounts for both consensus and complementary effects between views, whereas existing ensemble methods focus on finding a consensus among multiple clustering results, implicitly treating results from different views as variations of one clustering structure. Through experiments on single-cell data sets, we demonstrate that our approach frequently outperforms other state-of-the-art methods. (A greedy sketch of the product-and-merge step appears after this list.)
  5. The architectures of many neural networks rely heavily on the underlying grid associated with the variables, for instance, the lattice of pixels in an image. For general biomedical data without a grid structure, the multi-layer perceptron (MLP) and deep belief network (DBN) are often used. However, these networks treat variables homogeneously in the network structure, making it difficult to assess their individual importance. In this paper, we propose a novel neural network called Variable-block tree Net (VtNet), whose architecture is determined by an underlying tree in which each node corresponds to a subset of variables. The tree is learned from the data to best capture the causal relationships among the variables. VtNet contains a long short-term memory (LSTM)-like cell for every tree node. The input and forget gates of each cell control the information flow through the node and are used to define a significance score for the variables. To validate this score, VtNet is trained using smaller trees with low-score variables removed. Hypothesis tests show that variables with higher scores influence classification more strongly. We also compare with the variable importance score of Random Forest from the standpoint of variable selection. Our experiments demonstrate that VtNet is highly competitive in classification accuracy and can often improve accuracy by removing variables with low significance scores. (A rough sketch of the node cell appears after this list.)
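The MAW distance referenced in the first abstract has a simple two-level structure: the ground cost is the closed-form 2-Wasserstein distance between Gaussian components, and the mixing weights are matched by a small discrete optimal-transport problem. The sketch below illustrates one common convention (squared W2 as the ground cost); it is not the authors' implementation, and the names gaussian_w2_sq and maw_distance are placeholders. It uses the POT library's ot.emd solver.

    import numpy as np
    from scipy.linalg import sqrtm
    import ot  # POT: Python Optimal Transport

    def gaussian_w2_sq(m1, S1, m2, S2):
        # Closed-form squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2):
        # ||m1 - m2||^2 + tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})
        r = sqrtm(S2)
        cross = np.real(sqrtm(r @ S1 @ r))  # real part guards against tiny imaginary noise
        return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross))

    def maw_distance(w1, means1, covs1, w2, means2, covs2):
        # Pairwise ground costs between the components of the two mixtures
        C = np.array([[gaussian_w2_sq(means1[i], covs1[i], means2[j], covs2[j])
                       for j in range(len(w2))] for i in range(len(w1))])
        # Optimal coupling of the mixing weights: a small linear program
        P = ot.emd(np.asarray(w1, float), np.asarray(w2, float), C)
        return float(np.sum(P * C))

Note that the problem size depends only on the numbers of components and the data dimension, never on the number of cells, which is consistent with the abstract's claim that the complexity is independent of data size.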
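For CPS-merge analysis (item 4), the first step, forming clusters from the Cartesian product of single-view labels, is mechanical; the merging step is guided by the clustering-stability score of CPS analysis, which is abstracted here as a user-supplied function. The sketch below is an assumed greedy variant, not the published algorithm; product_labels, greedy_merge, and stability are hypothetical names.

    import numpy as np
    from itertools import combinations

    def product_labels(view_labels):
        # Composite label per observation: the tuple of its single-view cluster ids
        tuples = list(zip(*view_labels))
        ids = {t: k for k, t in enumerate(sorted(set(tuples)))}
        return np.array([ids[t] for t in tuples])

    def greedy_merge(labels, stability, n_clusters):
        # Repeatedly merge the pair of clusters whose union scores best under the
        # user-supplied stability criterion (a stand-in for CPS analysis)
        labels = labels.copy()
        while len(np.unique(labels)) > n_clusters:
            best_labels, best_score = None, -np.inf
            for a, b in combinations(np.unique(labels), 2):
                trial = np.where(labels == b, a, labels)
                score = stability(trial)
                if score > best_score:
                    best_labels, best_score = trial, score
            labels = best_labels
        return labels

Because only label vectors are consumed, any single-view method (k-means, Louvain, model-based clustering) can feed into this step, which matches the abstract's point that the original data are not needed.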
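The VtNet abstract (item 5) describes an LSTM-like cell at each tree node, with input and forget gates controlling information flow and supplying variable significance scores. The PyTorch cell below is one rough reading of that description, not the paper's architecture; VtNodeCell and its gating details are assumptions.

    import torch
    import torch.nn as nn

    class VtNodeCell(nn.Module):
        # One node of the variable-block tree: receives its own block of
        # variables plus the hidden states of its children. The input gate
        # controls the node's own variables and the forget gate controls the
        # children's summary; mean gate activation serves as a crude score.
        def __init__(self, block_dim, hidden_dim, n_children):
            super().__init__()
            self.hidden_dim = hidden_dim
            in_dim = block_dim + max(n_children, 1) * hidden_dim
            self.input_gate = nn.Linear(in_dim, hidden_dim)
            self.forget_gate = nn.Linear(in_dim, hidden_dim)
            self.candidate = nn.Linear(in_dim, hidden_dim)

        def forward(self, x_block, child_states=None):
            if not child_states:  # leaf node: pad with a zero child state
                child_states = [x_block.new_zeros(x_block.shape[0], self.hidden_dim)]
            z = torch.cat([x_block] + child_states, dim=-1)
            i = torch.sigmoid(self.input_gate(z))   # gates the node's own variables
            f = torch.sigmoid(self.forget_gate(z))  # gates the children's signal
            kids = torch.stack(child_states).mean(dim=0)
            h = i * torch.tanh(self.candidate(z)) + f * kids
            return h, i.mean().item()               # state and a per-node score

In this reading, the root's hidden state would feed a classifier head, and the paper's validation procedure, retraining on smaller trees with low-score variables removed, would amount to pruning nodes whose mean input-gate activation is small.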