

Title: Gaussian Mixture Models for Stochastic Block Models with Non-Vanishing Noise
Community detection tasks have received a great deal of attention across statistics, machine learning, and information theory, with much of the work providing theoretical guarantees for different methodological approaches to the stochastic block model. Recent work on community detection has modeled the spectral embedding of a network using Gaussian mixture models (GMMs) in scaling regimes where the ability to detect community memberships improves with the size of the network. However, these regimes are unrealistic for many observed networks, where the noise does not vanish as the network grows. This paper provides tractable methodology, motivated by new theoretical results, for networks with non-vanishing noise. We present a procedure for community detection using novel GMMs that incorporate truncation and shrinkage effects. We provide empirical validation of this new representation as well as experimental results using a large email dataset.
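
To make the pipeline concrete, here is a minimal sketch of the standard approach this line of work builds on: an adjacency spectral embedding followed by GMM clustering. The function name and the simulated two-block SBM are illustrative, and the paper's truncated, shrunken GMM variant is not reproduced here.

```python
# Hedged sketch: adjacency spectral embedding followed by GMM clustering.
# The truncation and shrinkage effects studied in the paper are NOT implemented.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_community_detection(A, n_communities, embed_dim=None):
    """Cluster nodes by fitting a GMM to the adjacency spectral embedding."""
    d = embed_dim or n_communities
    # Adjacency spectral embedding: top-d eigenpairs scaled by sqrt(|eigenvalue|).
    vals, vecs = np.linalg.eigh(A.astype(float))
    idx = np.argsort(np.abs(vals))[::-1][:d]
    X = vecs[:, idx] * np.sqrt(np.abs(vals[idx]))
    gmm = GaussianMixture(n_components=n_communities, covariance_type="full")
    return gmm.fit_predict(X)

# Toy example: a two-block stochastic block model.
rng = np.random.default_rng(0)
z = np.repeat([0, 1], 100)                          # planted memberships
P = np.where(z[:, None] == z[None, :], 0.15, 0.05)  # within/between-block rates
A = rng.binomial(1, P)
A = np.triu(A, 1); A = A + A.T                      # symmetric, no self-loops
labels = gmm_community_detection(A, 2)
```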
Award ID(s):
1750362
NSF-PAR ID:
10139969
Author(s) / Creator(s):
Date Published:
Journal Name:
2019 IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP)
Page Range / eLocation ID:
699 to 703
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. One key challenge encountered in single-cell data clustering is to combine clustering results of data sets acquired from multiple sources. We propose to represent the clustering result of each data set by a Gaussian mixture model (GMM) and produce an integrated result based on the notion of the Wasserstein barycenter. However, the exact barycenter of GMMs, a distribution on the same sample space, is computationally infeasible to find; moreover, it may not be a GMM with a reasonable number of components. We thus propose to use the minimized aggregated Wasserstein (MAW) distance to approximate the Wasserstein metric and develop a new algorithm for computing the barycenter of GMMs under MAW. Recent theoretical advances further justify using the MAW distance as an approximation for the Wasserstein metric between GMMs. We also prove that the MAW barycenter of GMMs has the same expectation as the Wasserstein barycenter. Our proposed algorithm for clustering integration scales well with the data dimension and the number of mixture components, with complexity independent of data size. We demonstrate that the new method achieves better clustering results on several single-cell RNA-seq data sets than some other popular methods.
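
    As an illustration of the key computational primitive, the sketch below evaluates the MAW distance between two GMMs using the closed-form 2-Wasserstein distance between Gaussian components and a small linear program over component weights. This follows the standard definition of MAW; the function names are ours, and the authors' barycenter algorithm itself is not reproduced.

```python
# Hedged sketch: the MAW distance between two GMMs, per its standard
# definition (optimal transport over component weights with closed-form
# Gaussian W2 costs). Function names are ours; this is not the authors' code.
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import linprog

def gaussian_w2_sq(m1, S1, m2, S2):
    """Squared 2-Wasserstein distance between two Gaussians (closed form)."""
    rS2 = np.real(sqrtm(S2))
    cross = np.real(sqrtm(rS2 @ S1 @ rS2))
    return float(np.sum((np.asarray(m1) - np.asarray(m2)) ** 2)
                 + np.trace(S1 + S2 - 2.0 * cross))

def maw_sq(w1, means1, covs1, w2, means2, covs2):
    """Squared MAW distance: a small LP over transport plans between
    component weights, with pairwise Gaussian W2^2 as the ground cost."""
    K, L = len(w1), len(w2)
    C = np.array([[gaussian_w2_sq(means1[i], covs1[i], means2[j], covs2[j])
                   for j in range(L)] for i in range(K)])
    A_eq = np.zeros((K + L, K * L))
    for i in range(K):
        A_eq[i, i * L:(i + 1) * L] = 1.0   # row sums match w1
    for j in range(L):
        A_eq[K + j, j::L] = 1.0            # column sums match w2
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([w1, w2]),
                  bounds=(0, None), method="highs")
    return res.fun

# Tiny usage example with two 2-component GMMs in R^2.
w1, w2 = np.array([0.5, 0.5]), np.array([0.3, 0.7])
means1, covs1 = [np.zeros(2), np.ones(2)], [np.eye(2)] * 2
means2, covs2 = [np.zeros(2), 2.0 * np.ones(2)], [0.5 * np.eye(2)] * 2
print(np.sqrt(maw_sq(w1, means1, covs1, w2, means2, covs2)))
```

    The LP has K·L variables and K+L marginal constraints (one redundant), so it stays cheap even for moderately large mixtures, consistent with the complexity being independent of data size.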

     
  2. Recent advances in computing algorithms and hardware have rekindled interest in developing high-accuracy, low-cost surrogate models for simulating physical systems. The idea is to replace expensive numerical integration of complex coupled partial differential equations at fine time scales performed on supercomputers, with machine-learned surrogates that efficiently and accurately forecast future system states using data sampled from the underlying system. One particularly popular technique being explored within the weather and climate modelling community is the echo state network (ESN), an attractive alternative to other well-known deep learning architectures. Using the classical Lorenz 63 system and the three-tier multi-scale Lorenz 96 system (Thornes T, Duben P, Palmer T. 2017 Q. J. R. Meteorol. Soc. 143, 897–908. (doi:10.1002/qj.2974)) as benchmarks, we find that previously studied state-of-the-art ESNs operate in two distinct regimes, corresponding to low and high spectral radius (LSR/HSR) for the sparse, randomly generated, reservoir recurrence matrix. Using knowledge of the mathematical structure of the Lorenz systems along with systematic ablation and hyperparameter sensitivity analyses, we show that state-of-the-art LSR-ESNs reduce to a polynomial regression model which we call Domain-Driven Regularized Regression (D2R2). Interestingly, D2R2 is a generalization of the well-known SINDy algorithm (Brunton SL, Proctor JL, Kutz JN. 2016 Proc. Natl Acad. Sci. USA 113, 3932–3937. (doi:10.1073/pnas.1517384113)). We also show experimentally that LSR-ESNs (Chattopadhyay A, Hassanzadeh P, Subramanian D. 2019 (http://arxiv.org/abs/1906.08829)) outperform HSR-ESNs (Pathak J, Hunt B, Girvan M, Lu Z, Ott E. 2018 Phys. Rev. Lett. 120, 024102. (doi:10.1103/PhysRevLett.120.024102)), while D2R2 dominates both approaches. A significant goal in constructing surrogates is to cope with barriers to scaling in weather prediction and simulation of dynamical systems imposed by the time and energy consumption of supercomputers. Inexact computing has emerged as a novel approach to helping with scaling. In this paper, we evaluate the performance of three models (LSR-ESN, HSR-ESN and D2R2) by varying the precision or word size of the computation as our inexactness-controlling parameter. For precisions of 64, 32 and 16 bits, we show that, surprisingly, the least expensive D2R2 method yields the most robust results and the greatest savings compared to ESNs. Specifically, D2R2 achieves 68× computational savings, with an additional 2× if precision reductions are also employed, outperforming ESN variants by a large margin. This article is part of the theme issue ‘Machine learning for weather and climate modelling’.
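
    As a hedged illustration of the reduction, the sketch below fits a degree-2 polynomial ridge regression as a one-step surrogate for Lorenz 63. It is a generic stand-in: the actual D2R2 uses domain-driven regularization that is not reproduced here.

```python
# Hedged sketch: a degree-2 polynomial ridge regression as a one-step
# surrogate for Lorenz 63, in the spirit of D2R2. The paper's
# domain-driven regularization structure is NOT reproduced here.
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

def lorenz63(t, u, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = u
    return [sigma * (y - x), x * (rho - z), x * y - beta * z]

# Generate training data by integrating the true system.
dt = 0.01
sol = solve_ivp(lorenz63, (0.0, 100.0), [1.0, 1.0, 1.0],
                t_eval=np.arange(0.0, 100.0, dt))
U = sol.y.T                                  # shape (n_steps, 3)

# Degree-2 features suffice because the Lorenz 63 right-hand side is quadratic.
poly = PolynomialFeatures(degree=2, include_bias=True)
X = poly.fit_transform(U[:-1])               # features of the current state
model = Ridge(alpha=1e-6).fit(X, U[1:])      # regress onto the next state

# Roll the surrogate forward autonomously from the last training state.
u = U[-1]
forecast = []
for _ in range(500):
    u = model.predict(poly.transform(u[None, :]))[0]
    forecast.append(u)
```

    Because the Lorenz 63 right-hand side is quadratic, degree-2 features already span the true dynamics, which is why such a small regression can compete with far larger reservoir models.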
  3. A network may have weak signals and severe degree heterogeneity, and may be very sparse in one occurrence but very dense in another. SCORE (Ann. Statist. 43, 57–89, 2015) is a recent approach to network community detection. It accommodates severe degree heterogeneity and is adaptive to different levels of sparsity, but its performance for networks with weak signals is unclear. In this paper, we show that in a broad class of network settings that allow for weak signals, severe degree heterogeneity, and a wide range of network sparsity, SCORE achieves perfect clustering and has the so-called “exponential rate” in Hamming clustering errors. The proof uses the most recent advances in entry-wise bounds for the leading eigenvectors of the network adjacency matrix. The theoretical analysis assures us that SCORE continues to work well in weak-signal settings, but it does not rule out the possibility that SCORE may be further improved for better performance in real applications, especially for networks with weak signals. As a second contribution of the paper, we propose SCORE+ as an improved version of SCORE. We investigate SCORE+ on 8 network data sets and find that it outperforms several representative approaches. In particular, for the 6 data sets with relatively strong signals, SCORE+ performs similarly to SCORE, but for the 2 data sets (Simmons, Caltech) with possibly weak signals, SCORE+ has much lower error rates. SCORE+ proposes several changes to SCORE; we carefully explain the rationale underlying each of these changes, using a mixture of theoretical and numerical study.
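
    For reference, the core SCORE procedure (Jin 2015) is short enough to sketch directly: take the leading K eigenvectors of the adjacency matrix, form entry-wise ratios against the first eigenvector to cancel degree heterogeneity, truncate extreme ratios, and run k-means. SCORE+'s refinements are not included in this sketch.

```python
# Hedged sketch of the core SCORE procedure (Jin, Ann. Statist. 2015);
# SCORE+'s refinements are NOT included.
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def score(A, K, threshold=None):
    """Community detection via SCORE on a symmetric adjacency matrix A."""
    vals, vecs = eigsh(A.astype(float), k=K, which="LA")  # leading K eigenpairs
    order = np.argsort(vals)[::-1]
    vecs = vecs[:, order]
    lead = vecs[:, 0]              # Perron-type leading eigenvector
    # Entry-wise ratios cancel node-specific degree parameters.
    R = vecs[:, 1:] / lead[:, None]
    t = threshold if threshold is not None else np.log(A.shape[0])
    R = np.clip(R, -t, t)          # truncate extreme ratios, as in SCORE
    return KMeans(n_clusters=K, n_init=10).fit_predict(R)
```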
  4. Human mobility models typically produce mobility data that capture human mobility patterns, individually or collectively, based on real-world observations or assumptions. Such data are essential for many use cases in research and practice, e.g., mobile networking, autonomous driving, urban planning, and epidemic control. However, most existing mobility models suffer from practical issues such as unknown accuracy and uncertain parameters in new use cases, because they are normally designed and verified for a particular use case (e.g., mobile phones, taxis, or mobile payments). This causes significant challenges for researchers when they try to select a representative human mobility model with appropriate parameters for a new use case. In this paper, we introduce a MObility VERification framework called MOVER to systematically measure the performance of a set of representative mobility models, including both theoretical and empirical models, across a diverse set of use cases with various measures. Based on a taxonomy built upon spatial granularity and temporal continuity, we selected four representative mobility use cases (namely, the vehicle tracking system, the camera-based system, the mobile payment system, and the cellular network system) to verify the generalizability of state-of-the-art human mobility models. MOVER methodically characterizes the accuracy of five different mobility models in these four use cases based on a comprehensive set of mobility measures and provides two key lessons learned: (i) for the collective-level measures, the finer the spatial granularity of the use case, the better the theoretical models generalize; (ii) for the individual-level measures, the lower the periodic temporal continuity of the use case, the better the theoretical models generalize relative to the empirical models. The verification results can help the research community select appropriate mobility models and parameters for different use cases.
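
    The abstract does not spell out MOVER's measure suite, so as a purely illustrative stand-in, the sketch below computes one individual-level measure that such verification studies commonly use, the per-user radius of gyration, together with a crude gap statistic between real and model-generated traces.

```python
# Purely illustrative: one individual-level mobility measure often used in
# verification studies (radius of gyration), plus a crude gap statistic.
# This is NOT MOVER's actual measure suite, which the abstract does not detail.
import numpy as np

def radius_of_gyration(points):
    """Radius of gyration of one user's visited locations.
    points: (n, 2) array of planar coordinates (e.g., projected lon/lat)."""
    center = points.mean(axis=0)
    return float(np.sqrt(np.mean(np.sum((points - center) ** 2, axis=1))))

def mean_rg_gap(real_traces, model_traces):
    """Gap between mean radii of gyration of real vs. model-generated traces;
    a smaller gap suggests the model generalizes better on this measure."""
    rg_real = np.mean([radius_of_gyration(t) for t in real_traces])
    rg_model = np.mean([radius_of_gyration(t) for t in model_traces])
    return abs(rg_real - rg_model)
```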

     
  5. Recent studies have shown that climate change and global warming considerably increase the risks of hurricane winds, floods, and storm surges in coastal communities. Turbulent processes in Hurricane Boundary Layers (HBLs) play a major role in hurricane dynamics and intensification. Most of the existing turbulence parameterizations in current numerical weather prediction (NWP) models rely on Planetary Boundary Layer (PBL) schemes. Previous studies (Zhang 2010; Momen et al. 2021) showed a significant distinction between turbulence characteristics in HBLs and in regular atmospheric boundary layers (ABLs), due to the strong rotational effects of hurricane flows. Nevertheless, current NWP schemes do not account for these differences; they are primarily designed and tested for regular ABLs. In this talk, we aim to bridge this knowledge gap by conducting new hurricane simulations using the Weather Research and Forecasting (WRF) model as well as large-eddy simulations. We investigate the role of the PBL parameterizations and the momentum roughness length in multiple hurricanes by probing the parameter space of the problem. Our simulations show that the most widely used WRF PBL schemes do not properly capture hurricane intensification and underestimate storm intensity. We will show that decreasing the roughness length toward observational estimates and theoretical hurricane intensity models in high-wind regimes (≳ 45 m s⁻¹) led to significant improvements in the intensity forecasts of strong hurricanes. Furthermore, decreasing the existing vertical diffusion values yielded, on average, more than 20% improvement in hurricane intensity forecasts compared to the default runs. Our results provide new insights into the role of turbulence parameterizations in hurricane dynamics and can be employed to improve the accuracy of real hurricane forecasts. The implications of these results and improvements for coastal resiliency and fluid-structure interactions will also be discussed.
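
    As a qualitative illustration of the roughness-length modification described above (not the authors' calibrated scheme), the sketch below follows a Charnock-type relation at moderate winds and clamps z0 to a small value beyond a high-wind threshold; all constants are illustrative assumptions.

```python
# Qualitative illustration only: a momentum roughness length z0 that follows
# a Charnock-type relation at moderate winds and is clamped to a small value
# beyond a high-wind threshold (~45 m/s). Constants are illustrative
# assumptions, not the authors' calibrated values.
import numpy as np

def roughness_length(u10, alpha=0.011, g=9.81, u_cap=45.0, z0_high=2.0e-4):
    """Momentum roughness length (m) vs. 10-m wind speed u10 (m/s)."""
    u_star = 0.035 * u10                    # crude drag-law estimate of friction velocity
    z0_charnock = alpha * u_star ** 2 / g   # Charnock (1955) relation
    # At hurricane-force winds, observations suggest z0 levels off or drops;
    # here we simply clamp it beyond u_cap.
    return np.where(u10 < u_cap, z0_charnock, np.minimum(z0_charnock, z0_high))
```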