skip to main content

Title: Multiway spherical clustering via degree-corrected tensor block models.
We consider the problem of multiway clus- tering in the presence of unknown degree heterogeneity. Such data problems arise commonly in applications such as recom- mendation system, neuroimaging, commu- nity detection, and hypergraph partitions in social networks. The allowance of de- gree heterogeneity provides great flexibility in clustering models, but the extra com- plexity poses significant challenges in both statistics and computation. Here, we de- velop a degree-corrected tensor block model with estimation accuracy guarantees. We present the phase transition of clustering performance based on the notion of an- gle separability, and we characterize three signal-to-noise regimes corresponding to dif- ferent statistical-computational behaviors. In particular, we demonstrate that an intrin- sic statistical-to-computational gap emerges only for tensors of order three or greater. Further, we develop an efficient polynomial- time algorithm that provably achieves exact clustering under mild signal conditions. The efficacy of our procedure is demonstrated through both simulations and analyses of Peru Legislation dataset.  more » « less
Award ID(s):
Author(s) / Creator(s):
Date Published:
Journal Name:
The 25th International Conference on Artificial Intelligence and Statistics
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the problem of multiway clustering in the presence of unknown degree​ ​heterogeneity. Such data problems arise commonly in applications such as recommendation systems, neuroimaging, community detection, and hypergraph partitions​ ​in social networks. The allowance of degree heterogeneity provides great flexibility​ ​in clustering models, but the extra complexity poses significant challenges in both​ ​statistics and computation. Here, we develop a degree-corrected tensor block model​ ​with estimation accuracy guarantees. We present the phase transition of clustering​ ​performance based on the notion of angle separability, and we characterize three​ ​signal-to-noise regimes corresponding to different statistical-computational behaviors.​ ​In particular, we demonstrate that an intrinsic statistical-to-computational gap emerges​ ​only for tensors of order three or greater.​ ​Further, we develop an efficient polynomial time algorithm that provably achieves exact​ ​clustering under mild signal conditions. The​ ​efficacy of our procedure is demonstrated​ ​through both simulations and analyses of​ ​Peru Legislation dataset. 
    more » « less
  2. Clustering is a fundamental primitive in unsupervised learning which gives rise to a rich class of computationally-challenging inference tasks. In this work, we focus on the canonical task of clustering d-dimensional Gaussian mixtures with unknown (and possibly degenerate) covariance. Recent works (Ghosh et al. ’20; Mao, Wein ’21; Davis, Diaz, Wang ’21) have established lower bounds against the class of low-degree polynomial methods and the sum-of-squares (SoS) hierarchy for recovering certain hidden structures planted in Gaussian clustering instances. Prior work on many similar inference tasks portends that such lower bounds strongly suggest the presence of an inherent statistical-to-computational gap for clustering, that is, a parameter regime where the clustering task is statistically possible but no polynomial-time algorithm succeeds. One special case of the clustering task we consider is equivalent to the problem of finding a planted hypercube vector in an otherwise random subspace. We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap, despite the aforementioned low-degree and SoS lower bounds. To achieve this, we give an algorithm based on Lenstra–Lenstra–Lovász lattice basis reduction which achieves the statistically-optimal sample complexity of d + 1 samples. This result extends the class of problems whose conjectured statistical-to-computational gaps can be “closed” by “brittle” polynomial-time algorithms, highlighting the crucial but subtle role of noise in the onset of statistical-to-computational gaps. 
    more » « less
  3. Abstract High-order clustering aims to identify heterogeneous substructures in multiway datasets that arise commonly in neuroimaging, genomics, social network studies, etc. The non-convex and discontinuous nature of this problem pose significant challenges in both statistics and computation. In this paper, we propose a tensor block model and the computationally efficient methods, high-order Lloyd algorithm (HLloyd), and high-order spectral clustering (HSC), for high-order clustering. The convergence guarantees and statistical optimality are established for the proposed procedure under a mild sub-Gaussian noise assumption. Under the Gaussian tensor block model, we completely characterise the statistical-computational trade-off for achieving high-order exact clustering based on three different signal-to-noise ratio regimes. The analysis relies on new techniques of high-order spectral perturbation analysis and a ‘singular-value-gap-free’ error bound in tensor estimation, which are substantially different from the matrix spectral analyses in the literature. Finally, we show the merits of the proposed procedures via extensive experiments on both synthetic and real datasets. 
    more » « less

    Galaxy clustering measurements are a key probe of the matter density field in the Universe. With the era of precision cosmology upon us, surveys rely on precise measurements of the clustering signal for meaningful cosmological analysis. However, the presence of systematic contaminants can bias the observed galaxy number density, and thereby bias the galaxy two-point statistics. As the statistical uncertainties get smaller, correcting for these systematic contaminants becomes increasingly important for unbiased cosmological analysis. We present and validate a new method for understanding and mitigating both additive and multiplicative systematics in galaxy clustering measurements (two-point function) by joint inference of contaminants in the galaxy overdensity field (one-point function) using a maximum-likelihood estimator (MLE). We test this methodology with Kilo-Degree Survey-like mock galaxy catalogues and synthetic systematic template maps. We estimate the cosmological impact of such mitigation by quantifying uncertainties and possible biases in the inferred relationship between the observed and the true galaxy clustering signal. Our method robustly corrects the clustering signal to the sub-per cent level and reduces numerous additive and multiplicative systematics from $1.5 \sigma$ to less than $0.1\sigma$ for the scenarios we tested. In addition, we provide an empirical approach to identifying the functional form (additive, multiplicative, or other) by which specific systematics contaminate the galaxy number density. Even though this approach is tested and geared towards systematics contaminating the galaxy number density, the methods can be extended to systematics mitigation for other two-point correlation measurements.

    more » « less
  5. null (Ed.)
    A network may have weak signals and severe degree heterogeneity, and may be very sparse in one occurrence but very dense in another. SCORE (Ann. Statist. 43, 57–89, 2015) is a recent approach to network community detection. It accommodates severe degree heterogeneity and is adaptive to different levels of sparsity, but its performance for networks with weak signals is unclear. In this paper, we show that in a broad class of network settings where we allow for weak signals, severe degree heterogeneity, and a wide range of network sparsity, SCORE achieves prefect clustering and has the so-called “exponential rate” in Hamming clustering errors. The proof uses the most recent advancement on entry-wise bounds for the leading eigenvectors of the network adjacency matrix. The theoretical analysis assures us that SCORE continues to work well in the weak signal settings, but it does not rule out the possibility that SCORE may be further improved to have better performance in real applications, especially for networks with weak signals. As a second contribution of the paper, we propose SCORE+ as an improved version of SCORE. We investigate SCORE+ with 8 network data sets and found that it outperforms several representative approaches. In particular, for the 6 data sets with relatively strong signals, SCORE+ has similar performance as that of SCORE, but for the 2 data sets (Simmons, Caltech) with possibly weak signals, SCORE+ has much lower error rates. SCORE+ proposes several changes to SCORE. We carefully explain the rationale underlying each of these changes, using a mixture of theoretical and numerical study. 
    more » « less