One key challenge encountered in singlecell data clustering is to combine clustering results of data sets acquired from multiple sources. We propose to represent the clustering result of each data set by a Gaussian mixture model (GMM) and produce an integrated result based on the notion of Wasserstein barycenter. However, the precise barycenter of GMMs, a distribution on the same sample space, is computationally infeasible to solve. Importantly, the barycenter of GMMs may not be a GMM containing a reasonable number of components. We thus propose to use the minimized aggregated Wasserstein (MAW) distance to approximate the Wasserstein metric and develop a new algorithm for computing the barycenter of GMMs under MAW. Recent theoretical advances further justify using the MAW distance as an approximation for the Wasserstein metric between GMMs. We also prove that the MAW barycenter of GMMs has the same expectation as the Wasserstein barycenter. Our proposed algorithm for clustering integration scales well with the data dimension and the number of mixture components, with complexity independent of data size. We demonstrate that the new method achieves better clustering results on several singlecell RNAseq data sets than some other popular methods.
more » « less Award ID(s):
 2013905
 NSFPAR ID:
 10485797
 Publisher / Repository:
 Oxford University Press
 Date Published:
 Journal Name:
 Biometrics
 Volume:
 79
 Issue:
 2
 ISSN:
 0006341X
 Format(s):
 Medium: X Size: p. 866877
 Size(s):
 p. 866877
 Sponsoring Org:
 National Science Foundation
More Like this

We develop a measure for evaluating the performance of generative networks given two sets of images. A popular performance measure currently used to do this is the Fréchet Inception Distance (FID). FID assumes that images featurized using the penultimate layer of Inceptionv3 follow a Gaussian distribution, an assumption which cannot be violated if we wish to use FID as a metric. However, we show that Inceptionv3 features of the ImageNet dataset are not Gaussian; in particular, every single marginal is not Gaussian. To remedy this problem, we model the featurized images using Gaussian mixture models (GMMs) and compute the 2Wasserstein distance restricted to GMMs. We define a performance measure, which we call WaM, on two sets of images by using Inceptionv3 (or another classifier) to featurize the images, estimate two GMMs, and use the restricted 2Wasserstein distance to compare the GMMs. We experimentally show the advantages of WaM over FID, including how FID is more sensitive than WaM to imperceptible image perturbations. By modelling the nonGaussian features obtained from Inceptionv3 as GMMs and using a GMM metric, we can more accurately evaluate generative network performance.more » « less

Motivated by problems in data clustering, we establish general conditions under which families of nonparametric mixture models are identifiable, by introducing a novel framework involving clustering overfitted parametric (i.e. misspecified) mixture models. These identifiability conditions generalize existing conditions in the literature, and are flexible enough to include for example mixtures of Gaussian mixtures. In contrast to the recent literature on estimating nonparametric mixtures, we allow for general nonparametric mixture components, and instead impose regularity assumptions on the underlying mixing measure. As our primary application, we apply these results to partitionbased clustering, generalizing the notion of a Bayes optimal partition from classical parametric modelbased clustering to nonparametric settings. Furthermore, this framework is constructive so that it yields a practical algorithm for learning identified mixtures, which is illustrated through several examples on real data. The key conceptual device in the analysis is the convex, metric geometry of probability measures on metric spaces and its connection to the Wasserstein convergence of mixing measures. The result is a flexible framework for nonparametric clustering with formal consistency guarantees.more » « less

The information bottleneck (IB) approach to clustering takes a joint distribution [Formula: see text] and maps the data [Formula: see text] to cluster labels [Formula: see text], which retain maximal information about [Formula: see text] (Tishby, Pereira, & Bialek, 1999 ). This objective results in an algorithm that clusters data points based on the similarity of their conditional distributions [Formula: see text]. This is in contrast to classic geometric clustering algorithms such as [Formula: see text]means and gaussian mixture models (GMMs), which take a set of observed data points [Formula: see text] and cluster them based on their geometric (typically Euclidean) distance from one another. Here, we show how to use the deterministic information bottleneck (DIB) (Strouse & Schwab, 2017 ), a variant of IB, to perform geometric clustering by choosing cluster labels that preserve information about data point location on a smoothed data set. We also introduce a novel intuitive method to choose the number of clusters via kinks in the information curve. We apply this approach to a variety of simple clustering problems, showing that DIB with our model selection procedure recovers the generative cluster labels. We also show that, in particular limits of our model parameters, clustering with DIB and IB is equivalent to [Formula: see text]means and EM fitting of a GMM with hard and soft assignments, respectively. Thus, clustering with (D)IB generalizes and provides an informationtheoretic perspective on these classic algorithms.more » « less

Wasserstein Barycenter is a principled approach to represent the weighted mean of a given set of probability distributions, utilizing the geometry induced by optimal transport. In this work, we present a novel scalable algorithm to approximate the Wasserstein Barycenters aiming at highdimensional applications in machine learning. Our proposed algorithm is based on the Kantorovich dual formulation of the Wasserstein2 distance as well as a recent neural network architecture, input convex neural network, that is known to parametrize convex functions. The distinguishing features of our method are: i) it only requires samples from the marginal distributions; ii) unlike the existing approaches, it represents the Barycenter with a generative model and can thus generate infinite samples from the barycenter without querying the marginal distributions; iii) it works similar to Generative Adversarial Model in one marginal case. We demonstrate the efficacy of our algorithm by comparing it with the stateofart methods in multiple experiments.more » « less

We consider the problem of clustering with the longestleg path distance (LLPD) metric, which is informative for elongated and irregularly shaped clusters. We prove finitesample guarantees on the performance of clustering with respect to this metric when random samples are drawn from multiple intrinsically lowdimensional clusters in highdimensional space, in the presence of a large number of highdimensional outliers. By combining these results with spectral clustering with respect to LLPD, we provide conditions under which the Laplacian eigengap statistic correctly determines the number of clusters for a large class of data sets, and prove guarantees on the labeling accuracy of the proposed algorithm. Our methods are quite general and provide performance guarantees for spectral clustering with any ultrametric. We also introduce an efficient, easy to implement approximation algorithm for the LLPD based on a multiscale analysis of adjacency graphs, which allows for the runtime of LLPD spectral clustering to be quasilinear in the number of data points.more » « less