Abstract One key challenge encountered in single-cell data clustering is to combine clustering results of data sets acquired from multiple sources. We propose to represent the clustering result of each data set by a Gaussian mixture model (GMM) and produce an integrated result based on the notion of Wasserstein barycenter. However, the precise barycenter of GMMs, a distribution on the same sample space, is computationally infeasible to solve. Importantly, the barycenter of GMMs may not be a GMM containing a reasonable number of components. We thus propose to use the minimized aggregated Wasserstein (MAW) distance to approximate the Wasserstein metric and develop a new algorithm for computing the barycenter of GMMs under MAW. Recent theoretical advances further justify using the MAW distance as an approximation for the Wasserstein metric between GMMs. We also prove that the MAW barycenter of GMMs has the same expectation as the Wasserstein barycenter. Our proposed algorithm for clustering integration scales well with the data dimension and the number of mixture components, with complexity independent of data size. We demonstrate that the new method achieves better clustering results on several single-cell RNA-seq data sets than some other popular methods.
more »
« less
Gaussian Mixture Models for Stochastic Block Models with Non-Vanishing Noise
Community detection tasks have received a lot of attention across statistics, machine learning, and information theory with work concentrating on providing theoretical guarantees for different methodological approaches to the stochastic block model. Recent work on community detection has focused on modeling the spectral embedding of a network using Gaussian mixture models (GMMs) in scaling regimes where the ability to detect community memberships improves with the size of the network. However, these regimes are not very realistic. This paper provides tractable methodology motivated by new theoretical results for networks with non-vanishing noise. We present a procedure for community detection using novel GMMs that incorporate truncation and shrinkage effects. We provide empirical validation of this new representation as well as experimental results using a large email dataset.
more »
« less
- Award ID(s):
- 1750362
- PAR ID:
- 10139969
- Date Published:
- Journal Name:
- 2019 IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP)
- Page Range / eLocation ID:
- 699 to 703
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Communities are a common and widely studied structure in networks, typically assum- ing that the network is fully and correctly observed. In practice, network data are often collected by querying nodes about their connections. In some settings, all edges of a sam- pled node will be recorded, and in others, a node may be asked to name its connections. These sampling mechanisms introduce noise and bias, which can obscure the community structure and invalidate assumptions underlying standard community detection methods. We propose a general model for a class of network sampling mechanisms based on recording edges via querying nodes, designed to improve community detection for network data col- lected in this fashion. We model edge sampling probabilities as a function of both individual preferences and community parameters, and show community detection can be performed by spectral clustering under this general class of models. We also propose, as a special case of the general framework, a parametric model for directed networks we call the nomination stochastic block model, which allows for meaningful parameter interpretations and can be fitted by the method of moments. In this case, spectral clustering and the method of mo- ments are computationally ecient and come with theoretical guarantees of consistency. We evaluate the proposed model in simulation studies on unweighted and weighted net- works and under misspecified models. The method is applied to a faculty hiring dataset, discovering a meaningful hierarchy of communities among US business schools.more » « less
-
We develop a measure for evaluating the performance of generative networks given two sets of images. A popular performance measure currently used to do this is the Fréchet Inception Distance (FID). FID assumes that images featurized using the penultimate layer of Inception-v3 follow a Gaussian distribution, an assumption which cannot be violated if we wish to use FID as a metric. However, we show that Inception-v3 features of the ImageNet dataset are not Gaussian; in particular, every single marginal is not Gaussian. To remedy this problem, we model the featurized images using Gaussian mixture models (GMMs) and compute the 2-Wasserstein distance restricted to GMMs. We define a performance measure, which we call WaM, on two sets of images by using Inception-v3 (or another classifier) to featurize the images, estimate two GMMs, and use the restricted 2-Wasserstein distance to compare the GMMs. We experimentally show the advantages of WaM over FID, including how FID is more sensitive than WaM to imperceptible image perturbations. By modelling the non-Gaussian features obtained from Inception-v3 as GMMs and using a GMM metric, we can more accurately evaluate generative network performance.more » « less
-
This paper contributes a two-step approach to monitor clusters of thermal targets on the ground using unmanned aerial vehicles (UAVs) and Gaussian mixture models (GMMs) in a distributed manner. The approach is tailored to networks of UAVs that establish a flying ad hoc network (FANET) and operate without central command. The first step is a monitoring algorithm that determines if the GMM corresponds to the current spatial distribution of clusters of thermal targets on the ground. UAVs make this determination using local data and a sequence of data exchanges with UAVs that are one-hop neighbors in the FANET. The second step is the calculation of a new GMM when the current GMM is found to be unfit, i.e., the GMM no longer corresponds to the new distribution of clusters on the ground due to the movement of thermal targets. A distributed expectation-maximization algorithm is developed for this purpose, and it operates on local data and data exchanged with one-hop neighbors only. Simulation results evaluate the performance of both algorithms in terms of the number of communication exchanges. This evaluation is completed for an increasing number of clusters of thermal targets and an increasing number of UAVs. The performance is compared with well-known solutions to the monitoring and GMM calculation problems, demonstrating convergence with a lower number of communication exchanges.more » « less
-
SCORE was introduced as a spectral approach to network community detection. Since many networks have severe degree heterogeneity, the ordinary spectral clustering (OSC) approach to community detection may perform unsatisfactorily. SCORE alleviates the effect of degree heterogeneity by introducing a new normalization idea in the spectral domain and makes OSC more effective. SCORE is easy to use and computationally fast. It adapts easily to new directions and sees an increasing interest in practice. In this paper, we review the basics of SCORE, the adaption of SCORE to network mixed membership estimation and topic modeling, and the application of SCORE in real data, including two datasets on the publications of statisticians. We also review the theoretical “ideology” underlying SCORE. We show that in the spectral domain, SCORE converts a simplicial cone to a simplex and provides a simple and direct link between the simplex and network memberships. SCORE attains an exponential rate and a sharp phase transition in community detection, and achieves optimal rates in mixed membership estimation and topic modeling.more » « less
An official website of the United States government

