skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: A Robust Spectral Clustering Algorithm for Sub-Gaussian Mixture Models with Outliers
We consider the problem of clustering data sets in the presence of arbitrary outliers. Traditional clustering algorithms such as k-means and spectral clustering are known to perform poorly for data sets contaminated with even a small number of outliers. In this paper, we develop a provably robust spectral clustering algorithm that applies a simple rounding scheme to denoise a Gaussian kernel matrix built from the data points and uses vanilla spectral clustering to recover the cluster labels of data points. We analyze the performance of our algorithm under the assumption that the “good” data points are generated from a mixture of sub-Gaussians (we term these “inliers”), whereas the outlier points can come from any arbitrary probability distribution. For this general class of models, we show that the misclassification error decays at an exponential rate in the signal-to-noise ratio, provided the number of outliers is a small fraction of the inlier points. Surprisingly, this derived error bound matches with the best-known bound for semidefinite programs (SDPs) under the same setting without outliers. We conduct extensive experiments on a variety of simulated and real-world data sets to demonstrate that our algorithm is less sensitive to outliers compared with other state-of-the-art algorithms proposed in the literature. Funding: G. A. Hanasusanto was supported by the National Science Foundation Grants NSF ECCS-1752125 and NSF CCF-2153606. P. Sarkar gratefully acknowledges support from the National Science Foundation Grants NSF DMS-1713082, NSF HDR-1934932 and NSF 2019844. Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2022.2317 .  more » « less
Award ID(s):
2019844 2342505 2343869
PAR ID:
10424351
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Operations Research
Volume:
71
Issue:
1
ISSN:
0030-364X
Page Range / eLocation ID:
224 to 244
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    We consider the problem of clustering with the longest-leg path distance (LLPD) metric, which is informative for elongated and irregularly shaped clusters. We prove finite-sample guarantees on the performance of clustering with respect to this metric when random samples are drawn from multiple intrinsically low-dimensional clusters in high-dimensional space, in the presence of a large number of highdimensional outliers. By combining these results with spectral clustering with respect to LLPD, we provide conditions under which the Laplacian eigengap statistic correctly determines the number of clusters for a large class of data sets, and prove guarantees on the labeling accuracy of the proposed algorithm. Our methods are quite general and provide performance guarantees for spectral clustering with any ultrametric. We also introduce an efficient, easy to implement approximation algorithm for the LLPD based on a multiscale analysis of adjacency graphs, which allows for the runtime of LLPD spectral clustering to be quasilinear in the number of data points. 
    more » « less
  2. null (Ed.)
    We consider the problem of clustering with the longest-leg path distance (LLPD) metric, which is informative for elongated and irregularly shaped clusters. We prove finite-sample guarantees on the performance of clustering with respect to this metric when random samples are drawn from multiple intrinsically low-dimensional clusters in high-dimensional space, in the presence of a large number of high-dimensional outliers. By combining these results with spectral clustering with respect to LLPD, we provide conditions under which the Laplacian eigengap statistic correctly determines the number of clusters for a large class of data sets, and prove guarantees on the labeling accuracy of the proposed algorithm. Our methods are quite general and provide performance guarantees for spectral clustering with any ultrametric. We also introduce an efficient, easy to implement approximation algorithm for the LLPD based on a multiscale analysis of adjacency graphs, which allows for the runtime of LLPD spectral clustering to be quasilinear in the number of data points. 
    more » « less
  3. We consider the problem of clustering with the longest-leg path distance (LLPD) metric, which is informative for elongated and irregularly shaped clusters. We prove finite-sample guarantees on the performance of clustering with respect to this metric when random samples are drawn from multiple intrinsically low-dimensional clusters in high-dimensional space, in the presence of a large number of high-dimensional outliers. By combining these results with spectral clustering with respect to LLPD, we provide conditions under which the Laplacian eigengap statistic correctly determines the number of clusters for a large class of data sets, and prove guarantees on the labeling accuracy of the proposed algorithm. Our methods are quite general and provide performance guarantees for spectral clustering with any ultrametric. We also introduce an efficient, easy to implement approximation algorithm for the LLPD based on a multiscale analysis of adjacency graphs, which allows for the runtime of LLPD spectral clustering to be quasilinear in the number of data points. 
    more » « less
  4. Abstract We develop a convex‐optimization clustering algorithm for heterogeneous financial networks, in the presence of arbitrary or even adversarial outliers. In the stochastic block model with heterogeneity parameters, we penalize nodes whose degree exhibit unusual behavior beyond inlier heterogeneity. We prove that under mild conditions, this method achieves exact recovery of the underlying clusters. In absence of any assumption on outliers, they are shown not to hinder the clustering of the inliers. We test the performance of the algorithm on semi‐synthetic heterogenous networks reconstructed to match aggregate data on the Korean financial sector. Our method allows for recovery of sub‐sectors with significantly lower error rates compared to existing algorithms. For overlapping portfolio networks, we uncover a clustering structure supporting diversification effects in investment management. 
    more » « less
  5. We give a combinatorial polynomial-time algorithm to find a maximum weight independent set in perfect graphs of bounded degree that do not contain a prism or a hole of length four as an induced subgraph. An even pair in a graph is a pair of vertices all induced paths between which are even. An even set is a set of vertices every two of which are an even pair. We show that every perfect graph that does not contain a prism or a hole of length four as an induced subgraph has a balanced separator which is the union of a bounded number of even sets, where the bound depends only on the maximum degree of the graph. This allows us to solve the maximum weight independent set problem using the well-known submodular function minimization algorithm. Funding: This work was supported by the Engineering and Physical Sciences Research Council [Grant EP/V002813/1]; the National Science Foundation [Grants DMS-1763817, DMS-2120644, and DMS-2303251]; and Alexander von Humboldt-Stiftung. 
    more » « less