We develop a measure for evaluating the performance of generative networks given two sets of images. A popular performance measure currently used for this purpose is the Fréchet Inception Distance (FID). FID assumes that images featurized using the penultimate layer of Inception-v3 follow a Gaussian distribution, an assumption that must hold if FID is to be a valid metric. However, we show that Inception-v3 features of the ImageNet dataset are not Gaussian; in particular, every marginal distribution is non-Gaussian. To remedy this problem, we model the featurized images using Gaussian mixture models (GMMs) and compute the 2-Wasserstein distance restricted to GMMs. We define a performance measure, which we call WaM, on two sets of images by using Inception-v3 (or another classifier) to featurize the images, estimating a GMM for each set, and comparing the two GMMs with the restricted 2-Wasserstein distance. We show experimentally that WaM has advantages over FID, including that FID is more sensitive than WaM to imperceptible image perturbations. By modeling the non-Gaussian features obtained from Inception-v3 as GMMs and using a GMM metric, we can more accurately evaluate generative network performance.
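For concreteness, the sketch below computes both quantities in NumPy/SciPy: the closed-form Fréchet (2-Wasserstein) distance between two Gaussians that FID uses, and a GMM-restricted 2-Wasserstein distance, which reduces to a small optimal-transport problem whose costs are pairwise squared W2 distances between mixture components. This is an illustrative reimplementation, not the authors' WaM code; the GMM parameters are assumed to have been fit beforehand (e.g., with sklearn.mixture.GaussianMixture).

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import linprog


def gaussian_w2_sq(m1, S1, m2, S2):
    """Squared 2-Wasserstein (Frechet) distance between two Gaussians;
    FID is this quantity evaluated on Inception-v3 feature statistics."""
    rS1 = np.real(sqrtm(S1))
    cross = np.real(sqrtm(rS1 @ S2 @ rS1))  # discard tiny imaginary noise
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross))


def gmm_restricted_w2_sq(w1, mu1, cov1, w2, mu2, cov2):
    """Squared 2-Wasserstein distance restricted to GMMs: an optimal
    transport problem over components with Gaussian W2^2 as the cost."""
    K, L = len(w1), len(w2)
    C = np.array([[gaussian_w2_sq(mu1[i], cov1[i], mu2[j], cov2[j])
                   for j in range(L)] for i in range(K)])
    # Transport LP: min <P, C>  s.t.  P @ 1 = w1,  P.T @ 1 = w2,  P >= 0.
    A_eq = np.zeros((K + L, K * L))
    for i in range(K):
        A_eq[i, i * L:(i + 1) * L] = 1.0   # row-marginal constraints
    for j in range(L):
        A_eq[K + j, j::L] = 1.0            # column-marginal constraints
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([w1, w2]),
                  bounds=(0, None), method="highs")
    return res.fun
```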
Multisource Single-Cell Data Integration by MAW Barycenter for Gaussian Mixture Models
Abstract
One key challenge encountered in single-cell data clustering is to combine clustering results of data sets acquired from multiple sources. We propose to represent the clustering result of each data set by a Gaussian mixture model (GMM) and produce an integrated result based on the notion of the Wasserstein barycenter. However, the precise barycenter of GMMs, a distribution on the same sample space, is computationally infeasible to obtain. Moreover, the barycenter of GMMs may not be a GMM with a reasonable number of components. We thus propose to use the minimized aggregated Wasserstein (MAW) distance to approximate the Wasserstein metric and develop a new algorithm for computing the barycenter of GMMs under MAW. Recent theoretical advances further justify using the MAW distance as an approximation of the Wasserstein metric between GMMs. We also prove that the MAW barycenter of GMMs has the same expectation as the Wasserstein barycenter. Our proposed algorithm for clustering integration scales well with the data dimension and the number of mixture components, with complexity independent of the data size. We demonstrate that the new method achieves better clustering results on several single-cell RNA-seq data sets than some other popular methods.
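A core subroutine in this kind of barycenter computation is the 2-Wasserstein barycenter of the Gaussian components themselves, whose covariance solves a fixed-point equation (Álvarez-Esteban et al., 2016). The sketch below implements only that Gaussian subproblem; the paper's MAW barycenter algorithm, which also optimizes the couplings between mixtures, is more involved.

```python
import numpy as np
from scipy.linalg import sqrtm


def gaussian_barycenter(weights, means, covs, iters=100, tol=1e-10):
    """W2 barycenter of Gaussians N(m_i, S_i) with the given weights.
    The mean is the weighted average of the m_i; the covariance S solves
    S = sum_i w_i (S^{1/2} S_i S^{1/2})^{1/2}, via fixed-point iteration."""
    m_bar = sum(w * m for w, m in zip(weights, means))
    S = np.mean(covs, axis=0)  # any positive-definite initializer works
    for _ in range(iters):
        rS = np.real(sqrtm(S))
        rS_inv = np.linalg.inv(rS)
        T = sum(w * np.real(sqrtm(rS @ C @ rS)) for w, C in zip(weights, covs))
        S_next = rS_inv @ T @ T @ rS_inv
        if np.linalg.norm(S_next - S) < tol:
            return m_bar, S_next
        S = S_next
    return m_bar, S
```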
- Award ID(s): 2013905
- PAR ID: 10485797
- Publisher / Repository: Oxford University Press
- Date Published:
- Journal Name: Biometrics
- Volume: 79
- Issue: 2
- ISSN: 0006-341X
- Format(s): Medium: X
- Size(s): p. 866-877
- Sponsoring Org: National Science Foundation
More Like this
-
Gaussian mixture models (GMMs) are an effective representation of resource uncertainty in power systems planning, as they can be tractably incorporated within stochastic optimization models. However, the skewness, multimodality, and bounded physical support of long-term wind power forecasts can require a large number of mixture components to achieve a good fit, leading to complex optimization problems. We propose a probabilistic model for wind generation uncertainty to address this challenge, termed the Discrete-Gaussian Mixture Model (DGMM), which combines continuous Gaussian components with discrete masses. The model generalizes the classical GMMs that have been widely used to estimate wind power outputs. We employ a modified expectation-maximization algorithm (called FixedEM) to estimate the parameters of the DGMM. We provide empirical results on the ACTIVSg2000 synthetic wind generation dataset, where we demonstrate that the fitted DGMM captures the high frequencies of time windows in which wind generating units are either producing at maximum capacity or not producing any power at all. Furthermore, we find that the Bayesian information criterion of the DGMM is significantly lower than that of existing GMMs using the same number of Gaussian components. This improvement is particularly advantageous when the allowed number of Gaussian components is limited, facilitating the efficient solution of optimization problems for long-term planning.
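The abstract names a FixedEM algorithm without spelling it out; the sketch below shows one plausible shape for a single EM step on a mixture of two fixed point masses (at zero output and at maximum capacity) plus K Gaussian components, with boundary samples attributed to the point masses and interior samples handled by the usual Gaussian responsibilities. The paper's actual FixedEM may differ in its details.

```python
import numpy as np
from scipy.stats import norm


def fixed_em_step(x, cap, w, mu, sigma):
    """One EM step for a hedged DGMM sketch: point masses at 0 and at the
    capacity `cap` plus K Gaussians with weights w, means mu, stds sigma.
    Hypothetical illustration; not the paper's FixedEM implementation."""
    at0, atcap = (x == 0.0), (x == cap)
    interior = ~(at0 | atcap)

    # E-step: boundary samples are explained only by the point masses;
    # interior samples get the usual Gaussian responsibilities.
    dens = w * norm.pdf(x[interior, None], mu, sigma)   # shape (n_int, K)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: point-mass weights are the empirical boundary frequencies;
    # Gaussian parameters are refit on the interior samples.
    p0, pcap = at0.mean(), atcap.mean()
    Nk = resp.sum(axis=0)
    w = (1.0 - p0 - pcap) * Nk / Nk.sum()
    mu = (resp * x[interior, None]).sum(axis=0) / Nk
    var = (resp * (x[interior, None] - mu) ** 2).sum(axis=0) / Nk
    return p0, pcap, w, mu, np.sqrt(var)
```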
-
Motivated by problems in data clustering, we establish general conditions under which families of nonparametric mixture models are identifiable, by introducing a novel framework involving the clustering of overfitted parametric (i.e., misspecified) mixture models. These identifiability conditions generalize existing conditions in the literature and are flexible enough to include, for example, mixtures of Gaussian mixtures. In contrast to the recent literature on estimating nonparametric mixtures, we allow for general nonparametric mixture components and instead impose regularity assumptions on the underlying mixing measure. As our primary application, we apply these results to partition-based clustering, generalizing the notion of a Bayes optimal partition from classical parametric model-based clustering to nonparametric settings. Furthermore, this framework is constructive, yielding a practical algorithm for learning identified mixtures, which is illustrated through several examples on real data. The key conceptual device in the analysis is the convex, metric geometry of probability measures on metric spaces and its connection to the Wasserstein convergence of mixing measures. The result is a flexible framework for nonparametric clustering with formal consistency guarantees.
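As a toy illustration of the overfit-then-cluster idea (not the paper's algorithm, which is formulated through the Wasserstein geometry of mixing measures), one can fit a deliberately overfitted GMM and then merge its components into a small number of groups; all names below are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.mixture import GaussianMixture


def overfit_then_merge(X, n_over=20, n_clusters=3):
    """Fit an intentionally overfitted GMM, then merge its components into
    n_clusters groups by hierarchical clustering of the component means."""
    gmm = GaussianMixture(n_components=n_over, random_state=0).fit(X)
    Z = linkage(pdist(gmm.means_), method="ward")
    comp_to_cluster = fcluster(Z, t=n_clusters, criterion="maxclust")
    # Assign each point to its most responsible component, then to a group.
    return comp_to_cluster[gmm.predict(X)]
```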
-
This paper contributes a two-step approach to monitoring clusters of thermal targets on the ground using unmanned aerial vehicles (UAVs) and Gaussian mixture models (GMMs) in a distributed manner. The approach is tailored to networks of UAVs that form a flying ad hoc network (FANET) and operate without central command. The first step is a monitoring algorithm that determines whether the GMM still corresponds to the current spatial distribution of clusters of thermal targets on the ground. UAVs make this determination using local data and a sequence of data exchanges with UAVs that are one-hop neighbors in the FANET. The second step is the calculation of a new GMM when the current one is found to be unfit, i.e., when it no longer corresponds to the new distribution of clusters on the ground due to the movement of thermal targets. A distributed expectation-maximization algorithm is developed for this purpose; it operates on local data and on data exchanged with one-hop neighbors only. Simulation results evaluate the performance of both algorithms in terms of the number of communication exchanges. This evaluation is carried out for an increasing number of clusters of thermal targets and an increasing number of UAVs. The performance is compared with well-known solutions to the monitoring and GMM-calculation problems, demonstrating convergence with a lower number of communication exchanges.
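One generic way to realize such a distributed EM, sketched below under the assumption of a connected FANET with known one-hop adjacency, is for each UAV to average its local EM sufficient statistics with its neighbors using Metropolis consensus weights before running a standard M-step locally; the paper's algorithm may organize the exchanges differently.

```python
import numpy as np


def consensus_average(stats, adjacency, n_rounds=10):
    """Drive per-UAV EM sufficient statistics toward their network-wide
    average using only one-hop exchanges. `stats` has shape (n_uavs, ...);
    `adjacency` is the 0/1 FANET adjacency matrix with a zero diagonal."""
    n = adjacency.shape[0]
    deg = adjacency.sum(axis=1)
    # Metropolis weights: a standard doubly stochastic choice for consensus.
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adjacency[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    for _ in range(n_rounds):
        stats = np.tensordot(W, stats, axes=1)  # each UAV mixes its neighbors
    return stats
```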
-
The information bottleneck (IB) approach to clustering takes a joint distribution p(X, Y) and maps the data X to cluster labels T, which retain maximal information about Y (Tishby, Pereira, & Bialek, 1999). This objective results in an algorithm that clusters data points based on the similarity of their conditional distributions p(y|x). This is in contrast to classic geometric clustering algorithms such as k-means and Gaussian mixture models (GMMs), which take a set of observed data points {x_i} and cluster them based on their geometric (typically Euclidean) distance from one another. Here, we show how to use the deterministic information bottleneck (DIB) (Strouse & Schwab, 2017), a variant of IB, to perform geometric clustering by choosing cluster labels that preserve information about data point location on a smoothed data set. We also introduce a novel, intuitive method to choose the number of clusters via kinks in the information curve. We apply this approach to a variety of simple clustering problems, showing that DIB with our model selection procedure recovers the generative cluster labels. We also show that, in particular limits of our model parameters, clustering with DIB and IB is equivalent to k-means and EM fitting of a GMM with hard and soft assignments, respectively. Thus, clustering with (D)IB generalizes and provides an information-theoretic perspective on these classic algorithms.
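The DIB update underlying this procedure has a simple form: each point x is reassigned to the label t maximizing log q(t) - beta * KL(p(y|x) || q(y|t)), where q(t) and q(y|t) are recomputed from the current hard assignment. Below is a compact sketch of one such update, assuming p(y|x) has already been built by smoothing the data (e.g., with a Gaussian kernel over data point locations).

```python
import numpy as np

EPS = 1e-12  # numerical floor to keep the logs finite


def dib_step(p_y_given_x, px, f, beta, n_clusters):
    """One deterministic information bottleneck update. p_y_given_x has
    shape (n, m), px shape (n,), and f holds the current hard labels."""
    # Decoder statistics q(t) and q(y|t) from the current assignment.
    qt = np.array([px[f == t].sum() for t in range(n_clusters)])
    qt = np.maximum(qt, EPS)
    q_y_given_t = np.stack([
        (px[f == t, None] * p_y_given_x[f == t]).sum(axis=0) / qt[t]
        for t in range(n_clusters)])
    q_y_given_t = np.maximum(q_y_given_t, EPS)
    # KL(p(y|x) || q(y|t)) for every (x, t) pair, shape (n, n_clusters).
    p = p_y_given_x[:, None, :]
    kl = (p * (np.log(np.maximum(p, EPS)) - np.log(q_y_given_t[None]))).sum(-1)
    # Deterministic reassignment: argmax_t log q(t) - beta * KL.
    return np.argmax(np.log(qt)[None] - beta * kl, axis=1)
```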