Title: SoS-RSC: A Sum-of-Squares Polynomial Approach to Robustifying Subspace Clustering Algorithms
This paper addresses the problem of subspace clustering in the presence of outliers. Typically, this scenario is handled through a regularized optimization whose computational complexity scales polynomially with the size of the data. Further, the regularization terms need to be manually tuned to achieve optimal performance. To circumvent these difficulties, in this paper we propose an outlier removal algorithm based on evaluating a suitable sum-of-squares polynomial, computed directly from the data. This algorithm only requires performing two singular value decompositions of fixed size, and provides certificates on the probability of misclassifying outliers as inliers.
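The abstract describes the method only at a high level, so the following is a minimal, hedged Python sketch of the general idea: score every point by evaluating a sum-of-squares polynomial built from the data (here the quadratic $x^\top M^{-1}x$ obtained from the empirical moment matrix $M$) and flag points where the value is large. The normalization, the single SVD of the $d\times d$ moment matrix, the regularizer `reg`, and the quantile threshold are illustrative assumptions; the construction in the paper, which uses two fixed-size SVDs and supplies probabilistic certificates, differs in its details.

```python
import numpy as np

def sos_outlier_scores(X, reg=1e-6):
    """Score the columns of X (one data point per column) by evaluating a
    sum-of-squares polynomial built from the empirical moment matrix.
    Larger scores indicate points poorly explained by the bulk of the data,
    i.e. likely outliers. Illustrative sketch only."""
    # Normalize points to the unit sphere, as is common in subspace clustering.
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    # Empirical second-moment matrix M = (1/n) * sum_i x_i x_i^T.
    M = (Xn @ Xn.T) / Xn.shape[1]
    # One SVD of the (d x d) moment matrix gives a stable regularized inverse.
    U, s, _ = np.linalg.svd(M)
    M_inv = (U * (1.0 / (s + reg))) @ U.T
    # p(x) = x^T M_inv x is a sum of squares of the linear forms
    # ((s_k + reg)^{-1/2} * u_k^T x), hence an SoS polynomial of the data.
    return np.einsum('ij,jk,ki->i', Xn.T, M_inv, Xn)

# Usage: points near a 1-D subspace plus arbitrary outliers.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(3, 1)) @ rng.normal(size=(1, 200))  # on a line
outliers = rng.normal(size=(3, 20))                            # arbitrary points
scores = sos_outlier_scores(np.hstack([inliers, outliers]))
flagged = scores > np.quantile(scores, 0.9)  # flag top 10%; threshold is a free choice
```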
Award ID(s):
1646121
NSF-PAR ID:
10069445
Author(s) / Creator(s):
Date Published:
Journal Name:
2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Page Range / eLocation ID:
8033-8041
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Rare extragalactic objects can carry substantial information about the past, present, and future universe. Given the size of astronomical databases in the information era, it can be assumed that many outlier galaxies are included in existing and future astronomical databases. However, manual search for these objects is impractical due to the labour required, and therefore the ability to detect such objects largely depends on computer algorithms. This paper describes an unsupervised machine learning algorithm for automatic detection of outlier galaxy images, and its application to several Hubble Space Telescope fields. The algorithm does not require training, and is therefore not dependent on the preparation of clean training sets. Applying the algorithm to a large collection of galaxies detected a variety of outlier galaxy images. The algorithm is not perfect, in the sense that not all detected objects are indeed outliers, but it reduces the data set by two orders of magnitude, allowing practical manual identification. The catalogue contains 147 objects that would be very difficult to identify without automation.
  2. This paper investigates robust versions of the general empirical risk minimization algorithm, one of the core techniques underlying modern statistical methods. The success of empirical risk minimization rests on the fact that, for a ‘well-behaved’ stochastic process $\left\{ f(X),\ f\in \mathscr F\right\}$ indexed by a class of functions $\mathscr F$, the averages $\frac{1}{N}\sum_{j=1}^N f(X_j)$ evaluated over a sample $X_1,\ldots,X_N$ of i.i.d. copies of $X$ provide a good approximation to the expectations $\mathbb E f(X)$, uniformly over large classes $f\in \mathscr F$. However, this might no longer be true if the marginal distributions of the process are heavy tailed or if the sample contains outliers. We propose a version of empirical risk minimization based on the idea of replacing sample averages by robust proxies of the expectations, and obtain high-confidence bounds for the excess risk of the resulting estimators. In particular, we show that the excess risk of robust estimators can converge to $0$ at fast rates with respect to the sample size $N$, that is, rates faster than $N^{-1/2}$. We discuss implications of the main results for the linear and logistic regression problems and evaluate the numerical performance of the proposed methods on simulated and real data. (A minimal median-of-means sketch illustrating the robust-proxy idea appears after this list.)
  3. We consider the problem of clustering data sets in the presence of arbitrary outliers. Traditional clustering algorithms such as k-means and spectral clustering are known to perform poorly for data sets contaminated with even a small number of outliers. In this paper, we develop a provably robust spectral clustering algorithm that applies a simple rounding scheme to denoise a Gaussian kernel matrix built from the data points and uses vanilla spectral clustering to recover the cluster labels of the data points. We analyze the performance of our algorithm under the assumption that the “good” data points are generated from a mixture of sub-Gaussians (we term these “inliers”), whereas the outlier points can come from any arbitrary probability distribution. For this general class of models, we show that the misclassification error decays at an exponential rate in the signal-to-noise ratio, provided the number of outliers is a small fraction of the inlier points. Surprisingly, this derived error bound matches the best-known bound for semidefinite programs (SDPs) under the same setting without outliers. We conduct extensive experiments on a variety of simulated and real-world data sets to demonstrate that our algorithm is less sensitive to outliers compared with other state-of-the-art algorithms proposed in the literature. (A small sketch of this denoise-then-cluster idea appears after this list.) Funding: G. A. Hanasusanto was supported by the National Science Foundation Grants NSF ECCS-1752125 and NSF CCF-2153606. P. Sarkar gratefully acknowledges support from the National Science Foundation Grants NSF DMS-1713082, NSF HDR-1934932, and NSF 2019844. Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2022.2317.
  4. In this paper, we consider the distributed version of the Support Vector Machine (SVM) under the coordinator model, where all input data (i.e., points in [Formula: see text] space) of the SVM are arbitrarily distributed among [Formula: see text] nodes in some network with a coordinator that can communicate with all nodes. We investigate two variants of this problem, with and without outliers. For distributed SVM without outliers, we prove a lower bound on the communication complexity and give a distributed [Formula: see text]-approximation algorithm that reaches this lower bound, where [Formula: see text] is a user-specified small constant. For distributed SVM with outliers, we present a [Formula: see text]-approximation algorithm to explicitly remove the influence of outliers. Our algorithm is based on a deterministic distributed top-[Formula: see text] selection algorithm with communication complexity of [Formula: see text] in the coordinator model.
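The robust empirical risk minimization abstract above replaces sample averages with robust proxies of the expectation. One standard proxy of this kind is the median-of-means estimator; the sketch below is a generic illustration of that idea and is not claimed to be the exact estimator analyzed in that paper. The block count `n_blocks` is an assumed tuning parameter.

```python
import numpy as np

def median_of_means(values, n_blocks=20, seed=None):
    """Median-of-means proxy for E[f(X)]: split the sample into blocks,
    average within each block, and return the median of the block means.
    A few heavy-tailed observations or outliers corrupt only a few blocks,
    so the median of the block means remains a reliable estimate."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    blocks = np.array_split(values[rng.permutation(len(values))], n_blocks)
    return float(np.median([b.mean() for b in blocks]))

# Example: losses f(X_j) centered near 1.0 with a few gross outliers.
rng = np.random.default_rng(1)
losses = rng.normal(loc=1.0, size=1000)
losses[:5] = 1e4
print(np.mean(losses))              # inflated by the outliers
print(median_of_means(losses, 20))  # stays close to 1.0
```

In a robust ERM procedure along these lines, such a proxy would stand in for the sample average of the losses $f(X_j)$ inside the minimization over $f\in\mathscr F$.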
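The robust spectral clustering abstract above describes denoising a Gaussian kernel matrix with a simple rounding scheme before running vanilla spectral clustering. Below is a minimal sketch of that denoise-then-cluster idea; the bandwidth `gamma`, the threshold `tau`, and the use of scikit-learn's `SpectralClustering` are illustrative choices rather than the exact rounding scheme or guarantees from the paper.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

def rounded_kernel_spectral_clustering(X, n_clusters, gamma=1.0, tau=0.3):
    """Build a Gaussian (RBF) kernel, 'round' it to a 0/1 affinity by
    thresholding at tau, then run vanilla spectral clustering on the
    denoised affinity. Small kernel entries induced by far-away outliers
    are zeroed out by the rounding step."""
    K = rbf_kernel(X, gamma=gamma)      # K_ij = exp(-gamma * ||x_i - x_j||^2)
    A = (K >= tau).astype(float)        # simple rounding / thresholding
    np.fill_diagonal(A, 1.0)
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(A)

# Example: two well-separated clusters plus a few arbitrary outliers.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),
               rng.normal(5, 0.3, size=(100, 2)),
               rng.uniform(-10, 10, size=(10, 2))])
labels = rounded_kernel_spectral_clustering(X, n_clusters=2)
```

Thresholding suppresses the weak affinities that arbitrary outliers contribute, so the eigenvectors used by spectral clustering mostly see the clean block structure of the inlier clusters; the scattered outliers are then absorbed into one of the recovered clusters.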