We consider the problem of clustering data sets in the presence of arbitrary outliers. Traditional clustering algorithms such as k-means and spectral clustering are known to perform poorly for data sets contaminated with even a small number of outliers. In this paper, we develop a provably robust spectral clustering algorithm that applies a simple rounding scheme to denoise a Gaussian kernel matrix built from the data points and uses vanilla spectral clustering to recover the cluster labels of data points. We analyze the performance of our algorithm under the assumption that the “good” data points are generated from a mixture of sub-Gaussians (we term these “inliers”), whereas the outlier points can come from any arbitrary probability distribution. For this general class of models, we show that the misclassification error decays at an exponential rate in the signal-to-noise ratio, provided the number of outliers is a small fraction of the inlier points. Surprisingly, this derived error bound matches with the best-known bound for semidefinite programs (SDPs) under the same setting without outliers. We conduct extensive experiments on a variety of simulated and real-world data sets to demonstrate that our algorithm is less sensitive to outliers compared with other state-of-the-art algorithms proposed in the literature. Funding: G. A. Hanasusanto was supported by the National Science Foundation Grants NSF ECCS-1752125 and NSF CCF-2153606. P. Sarkar gratefully acknowledges support from the National Science Foundation Grants NSF DMS-1713082, NSF HDR-1934932 and NSF 2019844. Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2022.2317 . 
                        more » 
                        « less   
                    
                            
                            Clustering heterogeneous financial networks
                        
                    
    
            Abstract We develop a convex‐optimization clustering algorithm for heterogeneous financial networks, in the presence of arbitrary or even adversarial outliers. In the stochastic block model with heterogeneity parameters, we penalize nodes whose degree exhibit unusual behavior beyond inlier heterogeneity. We prove that under mild conditions, this method achieves exact recovery of the underlying clusters. In absence of any assumption on outliers, they are shown not to hinder the clustering of the inliers. We test the performance of the algorithm on semi‐synthetic heterogenous networks reconstructed to match aggregate data on the Korean financial sector. Our method allows for recovery of sub‐sectors with significantly lower error rates compared to existing algorithms. For overlapping portfolio networks, we uncover a clustering structure supporting diversification effects in investment management. 
        more » 
        « less   
        
    
    
                            - PAR ID:
- 10442026
- Publisher / Repository:
- Wiley-Blackwell
- Date Published:
- Journal Name:
- Mathematical Finance
- Volume:
- 34
- Issue:
- 2
- ISSN:
- 0960-1627
- Format(s):
- Medium: X Size: p. 425-466
- Size(s):
- p. 425-466
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            This paper addresses the problem of estimating the positions of points from distance measurements corrupted by sparse outliers. Specifically, we consider a setting with two types of nodes: anchor nodes, for which exact distances to each other are known, and target nodes, for which complete but corrupted distance measurements to the anchors are available. To tackle this problem, we propose a novel algorithm powered by Nystro m method and robust principal component analysis. Our method is computationally efficient as it processes only a localized subset of the distance matrix and does not require distance measurements between target nodes. Empirical evaluations on synthetic datasets, designed to mimic sensor localization, and on molecular experiments, demonstrate that our algorithm achieves accurate recovery with a modest number of anchors, even in the presence of high levels of sparse outliers.more » « less
- 
            null (Ed.)Abstract Non-negative matrix factorization and its extensions were applied to various areas (i.e., dimensionality reduction, clustering, etc.). When the original data are corrupted by outliers and noise, most of non-negative matrix factorization methods cannot achieve robust factorization and learn a subspace with binary codes. This paper puts forward a robust semi-supervised non-negative matrix factorization method for binary subspace learning, called RSNMF, for image clustering. For better clustering performance on the dataset contaminated by outliers and noise, we propose a weighted constraint on the noise matrix and impose manifold learning into non-negative matrix factorization. Moreover, we utilize the discrete hashing learning method to constrain the learned subspace, which can achieve a binary subspace from the original data. Experimental results validate the robustness and effectiveness of RSNMF in binary subspace learning and image clustering on the face dataset corrupted by Salt and Pepper noise and Contiguous Occlusion.more » « less
- 
            We consider the problem of multiway clustering in the presence of unknown degree heterogeneity. Such data problems arise commonly in applications such as recommendation systems, neuroimaging, community detection, and hypergraph partitions in social networks. The allowance of degree heterogeneity provides great flexibility in clustering models, but the extra complexity poses significant challenges in both statistics and computation. Here, we develop a degree-corrected tensor block model with estimation accuracy guarantees. We present the phase transition of clustering performance based on the notion of angle separability, and we characterize three signal-to-noise regimes corresponding to different statistical-computational behaviors. In particular, we demonstrate that an intrinsic statistical-to-computational gap emerges only for tensors of order three or greater. Further, we develop an efficient polynomial time algorithm that provably achieves exact clustering under mild signal conditions. The efficacy of our procedure is demonstrated through both simulations and analyses of Peru Legislation dataset.more » « less
- 
            Mining spatiotemporal mobility patterns is crucial for optimizing urban planning, enhancing transportation systems, and improving public safety by providing useful insights into human movement and behavior over space and time. As an unsupervised learning technique, time series clustering has gained considerable attention due to its efficiency. However, the existing literature has often overlooked the inherent characteristics of mobility data, including high-dimensionality, noise, outliers, and time distortions. This oversight can lead to potentially large computational costs and inaccurate patterns. To address these challenges, this paper proposes a novel neural network-based method integrating temporal autoencoder and dynamic time warping-based K-means clustering algorithm to mutually promote each other for mining spatiotemporal mobility patterns. Comparative results showed that our proposed method outperformed several time series clustering techniques in accurately identifying mobility patterns on both synthetic and real-world data, which provides a reliable foundation for data-driven decision-making. Furthermore, we applied the method to monthly county-level mobility data during the COVID-19 pandemic in the U.S., revealing significant differences in mobility changes between rural and urban areas, as well as the impact of public response and health considerations on mobility patterns.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
