The K-subspaces (KSS) method is a generalization of the K-means method for subspace clustering. In this work, we present a local convergence analysis and a recovery guarantee for KSS, assuming data are generated by the semi-random union of subspaces model, where N points are randomly sampled from K ≥ 2 overlapping subspaces. We show that if the initial assignment of the KSS method lies within a neighborhood of a true clustering, it converges at a superlinear rate and finds the correct clustering within O(log log N) iterations with high probability. Moreover, we propose a thresholding inner-product-based spectral method for initialization and prove that it produces a point in this neighborhood. We also present numerical results for the studied method to support our theoretical developments.
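As a point of reference for the abstract above, the alternating structure of KSS (assign each point to its nearest subspace, then refit each subspace by PCA) can be written as a minimal NumPy sketch. The random initialization here merely stands in for the thresholding inner-product-based spectral initialization analyzed in the work; the function name and parameters are illustrative, not the authors' code.

```python
import numpy as np

def kss(X, K, d, n_iter=50, seed=0):
    """Minimal K-subspaces (KSS) sketch.

    X: (D, N) data matrix with points as columns.
    K: number of subspaces, d: dimension of each subspace.
    Alternates between (1) assigning each point to the subspace with the
    smallest orthogonal-projection residual and (2) refitting each subspace
    as the top-d principal subspace of its assigned points.
    """
    rng = np.random.default_rng(seed)
    D, N = X.shape
    # Random orthonormal bases as initialization (a spectral initialization
    # would replace this step).
    U = [np.linalg.qr(rng.standard_normal((D, d)))[0] for _ in range(K)]
    labels = -np.ones(N, dtype=int)
    for _ in range(n_iter):
        # Assignment: the residual to subspace k is ||x||^2 - ||U_k^T x||^2,
        # so maximizing the projected norm minimizes the residual.
        proj = np.stack([np.linalg.norm(Uk.T @ X, axis=0) for Uk in U])  # (K, N)
        new_labels = np.argmax(proj, axis=0)
        if np.array_equal(new_labels, labels):
            break  # assignments stabilized
        labels = new_labels
        # Update: top-d left singular vectors of each cluster's data.
        for k in range(K):
            Xk = X[:, labels == k]
            if Xk.shape[1] >= d:  # keep the old basis for tiny clusters
                U[k] = np.linalg.svd(Xk, full_matrices=False)[0][:, :d]
    return labels, U
```

On union-of-subspaces data, repeated restarts of such a routine behave very differently depending on where they start, which is exactly the issue the convergence analysis and spectral initialization address.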
Subspace clustering using ensembles of K-subspaces
Abstract
Subspace clustering is the unsupervised grouping of points lying near a union of low-dimensional linear subspaces. Algorithms based directly on geometric properties of such data tend to either provide poor empirical performance, lack theoretical guarantees, or depend heavily on their initialization. We present a novel geometric approach to the subspace clustering problem that leverages ensembles of the K-subspaces (KSS) algorithm via the evidence accumulation clustering framework. Our algorithm, referred to as ensemble K-subspaces (EKSS), forms a co-association matrix whose (i, j)th entry is the number of times points i and j are clustered together by several runs of KSS with random initializations. We prove general recovery guarantees for any algorithm that forms an affinity matrix with entries close to a monotonic transformation of pairwise absolute inner products. We then show that a specific instance of EKSS results in an affinity matrix with entries of this form, and hence our proposed algorithm can provably recover subspaces under similar conditions to state-of-the-art algorithms. To the best of our knowledge, this is the first recovery guarantee for evidence accumulation clustering and for KSS variants. We show on synthetic data that our method performs well in the traditionally challenging settings of subspaces with large intersection, subspaces with small principal angles, and noisy data. Finally, we evaluate our algorithm on six common benchmark datasets and show that, unlike existing methods, EKSS achieves excellent empirical performance both when there are few points per subspace and when there are many.
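The co-association construction described in the abstract can be sketched directly. The helper below assumes some base KSS routine returning a label vector per run (such as the sketch earlier on this page); the parameter names and the default number of runs B are illustrative, not the authors' implementation. The full algorithm then clusters using the resulting affinity matrix.

```python
import numpy as np

def ekss_affinity(X, K, d, base_kss, B=100):
    """Co-association matrix from B runs of a randomly initialized
    K-subspaces routine.

    base_kss(X, K, d, seed) is assumed to return a length-N label vector.
    Entry (i, j) of the returned matrix is the fraction of runs in which
    points i and j were placed in the same cluster.
    """
    N = X.shape[1]
    A = np.zeros((N, N))
    for b in range(B):
        labels = np.asarray(base_kss(X, K, d, seed=b))  # fresh random init per run
        A += (labels[:, None] == labels[None, :])
    return A / B
```

For instance, with the earlier sketch one could pass `base_kss=lambda X, K, d, seed: kss(X, K, d, seed=seed)[0]`.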
- PAR ID:
- 10206908
- Date Published:
- Journal Name:
- Information and Inference: A Journal of the IMA
- ISSN:
- 2049-8772
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- Motivated by applications to classification problems on metric data, we study the Weighted Metric Clustering problem: given a metric d over n points and a k x k symmetric matrix A with non-negative entries, the goal is to find a k-partition of these points into clusters C1,...,Ck while minimizing the sum of A[i,j] * d(u,v) over all pairs of clusters Ci and Cj and all pairs of points u from Ci and v from Cj. Specific choices of A lead to Weighted Metric Clustering capturing well-studied graph partitioning problems in metric spaces, such as Min-Uncut, Min-k-Sum, Min-k-Cut, and more. Our main result is that Weighted Metric Clustering admits a polynomial-time approximation scheme (PTAS). Our algorithm handles all the above problems using the Sherali-Adams linear programming relaxation. This subsumes several prior works, unifies many of the techniques for various metric clustering objectives, and yields a PTAS for several new problems, including metric clustering on manifolds and a new family of hierarchical clustering objectives. Our experiments on the hierarchical clustering objective show that it better captures ground-truth structural information than Dasgupta's popular objective. (A sketch of the clustering objective appears after this list.)
- Discovering and clustering subspaces in high-dimensional data is a fundamental problem of machine learning with a wide range of applications in data mining, computer vision, and pattern recognition. Earlier methods divided the problem into two separate stages of finding the similarity matrix and finding clusters. Similar to some recent works, we integrate these two steps using a joint optimization approach. We make the following contributions: (i) we estimate the reliability of the cluster assignment for each point before assigning a point to a subspace. We group the data points into two groups of “certain” and “uncertain”, with the assignment of the latter group delayed until their subspace association certainty improves. (ii) We demonstrate that delayed association is better suited for clustering subspaces that have ambiguities, i.e., when subspaces intersect or data are contaminated with outliers/noise. (iii) We demonstrate experimentally that such delayed probabilistic association leads to a more accurate self-representation and final clusters. The proposed method has higher accuracy both for points that exclusively lie in one subspace and for those that lie on the intersection of subspaces. (iv) We show that delayed association leads to a large reduction in computational cost, since it allows for incremental spectral clustering. (An illustrative sketch of the certain/uncertain split appears after this list.)
- Principal Component Analysis (PCA) and Kernel Principal Component Analysis (KPCA) are fundamental methods in machine learning for dimensionality reduction. The former finds a low-dimensional linear approximation in finite dimensions, while the latter often does so in an infinite-dimensional Reproducing Kernel Hilbert Space (RKHS). In this paper, we present a geometric framework for computing the principal linear subspaces in both situations, as well as for the robust PCA case, that amounts to computing the intrinsic average on the space of all subspaces: the Grassmann manifold. Points on this manifold are defined as the subspaces spanned by K-tuples of observations. The intrinsic Grassmann average of these subspaces is shown to coincide with the principal components of the observations when they are drawn from a Gaussian distribution. We show similar results in the RKHS case and provide an efficient algorithm for computing the projection onto this average subspace. The result is a method akin to KPCA that is substantially faster. Further, we present a novel online version of KPCA using our geometric framework. Competitive performance of all our algorithms is demonstrated on a variety of real and synthetic data sets. (A simplified sketch of subspace averaging appears after this list.)
- We extend the self-organizing map (SOM) algorithm to the problem of visualizing data on Grassmann manifolds. In this setting, a collection of k points in n dimensions is represented by a k-dimensional subspace, e.g., via the singular value or QR decompositions. Data assembled in this way is challenging to visualize, given that abstract points on the Grassmannian do not reside in Euclidean space. The extension of the SOM algorithm to this geometric setting only requires that distances between two points can be measured and that any given point can be moved towards a presented pattern. The similarity between two points on the Grassmannian is measured in terms of the principal angles between subspaces, e.g., the chordal distance. Further, we employ a formula for moving one subspace towards another along the shortest path, i.e., the geodesic between two points on the Grassmannian. This enables a faithful implementation of the SOM approach for visualizing data consisting of k-dimensional subspaces of n-dimensional Euclidean space. We illustrate the resulting algorithm on a hyperspectral imaging application. (Sketches of the chordal distance and the geodesic step appear after this list.)
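The first related item above states the Weighted Metric Clustering objective informally; the sketch below evaluates that cost for a given partition. Whether cluster pairs are counted as ordered or unordered, and whether the i = j terms contribute, follows a literal reading of the abstract and may differ from the paper's exact convention.

```python
import numpy as np

def weighted_metric_cost(D, labels, A):
    """Cost of a k-partition under Weighted Metric Clustering (literal reading).

    D: (n, n) matrix of pairwise metric distances d(u, v).
    labels: length-n array of cluster indices in {0, ..., k-1}.
    A: (k, k) symmetric matrix of non-negative cluster-pair weights.
    """
    k = A.shape[0]
    total = 0.0
    for i in range(k):
        for j in range(k):
            # All distances between points of cluster Ci and points of cluster Cj.
            Dij = D[np.ix_(labels == i, labels == j)]
            total += A[i, j] * Dij.sum()
    return total
```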
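For the second item, one concrete way to picture the “certain”/“uncertain” split is a margin test on subspace residuals, sketched below. The relative-gap criterion and the margin parameter are illustrative assumptions, not the paper's actual reliability estimate, which is defined within its joint optimization.

```python
import numpy as np

def split_certain_uncertain(X, bases, margin=0.2):
    """Illustrative certainty test: a point is 'certain' when its best-fitting
    subspace beats the runner-up by a clear relative margin; otherwise its
    assignment would be delayed until the estimate improves.

    X: (D, N) data with points as columns.
    bases: list of (D, d_k) orthonormal subspace bases.
    Returns (certain_mask, tentative_labels).
    """
    sq_norms = np.sum(X ** 2, axis=0)
    # Residual of x w.r.t. subspace k: ||x||^2 - ||U_k^T x||^2.
    resid = np.stack([sq_norms - np.sum((Uk.T @ X) ** 2, axis=0) for Uk in bases])
    order = np.sort(resid, axis=0)
    best, second = order[0], order[1]
    certain = (second - best) > margin * (second + 1e-12)  # clear relative gap
    labels = np.argmin(resid, axis=0)
    return certain, labels
```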
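For the third item, a rough extrinsic stand-in for the construction it describes is easy to write: form subspaces spanned by random K-tuples of observations, average their orthogonal projectors, and take the dominant eigenvectors of the mean projector. The paper's intrinsic (geodesic) Grassmann average would replace this projector averaging, so the sketch only illustrates the objects involved.

```python
import numpy as np

def average_tuple_subspace(X, K, n_tuples=200, seed=0):
    """Extrinsic average of subspaces spanned by random K-tuples of columns of X.

    Averages the orthogonal projectors of the sampled K-dimensional subspaces
    and returns an orthonormal basis of the top-K eigenspace of the mean
    projector (a simplified stand-in for the intrinsic Grassmann average).
    """
    rng = np.random.default_rng(seed)
    D, N = X.shape
    P_mean = np.zeros((D, D))
    for _ in range(n_tuples):
        idx = rng.choice(N, size=K, replace=False)
        Q, _ = np.linalg.qr(X[:, idx])          # basis of the K-tuple's span
        P_mean += Q @ Q.T
    P_mean /= n_tuples
    eigvals, eigvecs = np.linalg.eigh(P_mean)   # ascending eigenvalues
    return eigvecs[:, -K:]                       # top-K eigenvectors
```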
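Finally, the last item relies on two geometric ingredients, a distance between subspaces and a way to move one subspace towards another; both can be sketched from the standard principal-angle and geodesic formulas. This is an illustrative sketch using the chordal distance and the SVD-based geodesic step, not the authors' implementation, and it assumes U^T V is invertible.

```python
import numpy as np

def chordal_distance(U, V):
    """Chordal distance between span(U) and span(V) for orthonormal bases U, V:
    the root-sum-square of the sines of the principal angles."""
    cosines = np.clip(np.linalg.svd(U.T @ V, compute_uv=False), 0.0, 1.0)
    return np.sqrt(np.sum(1.0 - cosines ** 2))

def move_towards(U, V, t):
    """Move span(U) a fraction t in [0, 1] along the Grassmann geodesic towards
    span(V), using the standard SVD-based formula (assumes U^T V is invertible).
    Returns an orthonormal basis of the intermediate subspace."""
    # Tangent direction: (I - U U^T) V (U^T V)^{-1} = W diag(s) Z^T.
    M = V @ np.linalg.inv(U.T @ V)
    W, s, Zt = np.linalg.svd(M - U @ (U.T @ M), full_matrices=False)
    theta = t * np.arctan(s)                     # scaled principal angles
    return (U @ Zt.T) @ np.diag(np.cos(theta)) @ Zt + (W @ np.diag(np.sin(theta))) @ Zt
```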