skip to main content


Title: Large sample spectral analysis of graph-based multi-manifold clustering
In this work we study statistical properties of graph-based algorithms for multi-manifold clustering (MMC). In MMC the goal is to retrieve the multi-manifold structure underlying a given Euclidean data set when this one is assumed to be obtained by sampling a distribution on a union of manifolds M = M1 ∪ · · · ∪ MN that may intersect with each other and that may have different dimensions. We investigate sufficient conditions that similarity graphs on data sets must satisfy in order for their corresponding graph Laplacians to capture the right geometric information to solve the MMC problem. Precisely, we provide high probability error bounds for the spectral approximation of a tensorized Laplacian on M with a suitable graph Laplacian built from the observations; the recovered tensorized Laplacian contains all geometric information of all the individual underlying manifolds. We provide an example of a family of similarity graphs, which we call annular proximity graphs with angle constraints, satisfying these sufficient conditions. We contrast our family of graphs with other constructions in the literature based on the alignment of tangent planes. Extensive numerical experiments expand the insights that our theory provides on the MMC problem.  more » « less
Award ID(s):
2005797 2023239
PAR ID:
10425247
Author(s) / Creator(s):
; ;
Editor(s):
Carreira-Perpinan, Miguel
Date Published:
Journal Name:
Journal of machine learning research
ISSN:
1533-7928
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The scattering transform is a multilayered, wavelet-based transform initially introduced as a mathematical model of convolutional neural networks (CNNs) that has played a foundational role in our understanding of these networks’ stability and invariance properties. In subsequent years, there has been widespread interest in extending the success of CNNs to data sets with non- Euclidean structure, such as graphs and manifolds, leading to the emerging field of geometric deep learning. In order to improve our understanding of the architectures used in this new field, several papers have proposed generalizations of the scattering transform for non-Euclidean data structures such as undirected graphs and compact Riemannian manifolds without boundary. Analogous to the original scattering transform, these works prove that these variants of the scattering transform have desirable stability and invariance properties and aim to improve our understanding of the neural networks used in geometric deep learning. In this paper, we introduce a general, unified model for geometric scattering on measure spaces. Our proposed framework includes previous work on compact Riemannian manifolds without boundary and undirected graphs as special cases but also applies to more general settings such as directed graphs, signed graphs, and manifolds with boundary. We propose a new criterion that identifies to which groups a useful representation should be invariant and show that this criterion is sufficient to guarantee that the scattering transform has desirable stability and invariance properties. Additionally, we consider finite measure spaces that are obtained from randomly sampling an unknown manifold. We propose two methods for constructing a data-driven graph on which the associated graph scattering transform approximates the scattering transform on the underlying manifold. Moreover, we use a diffusion-maps based approach to prove quantitative estimates on the rate of convergence of one of these approximations as the number of sample points tends to infinity. Lastly, we showcase the utility of our method on spherical images, a directed graph stochastic block model, and on high-dimensional single-cell data. 
    more » « less
  2. Several data analysis techniques employ similarity relationships between data points to uncover the intrinsic dimension and geometric structure of the underlying data-generating mechanism. In this paper we work under the model assumption that the data is made of random perturbations of feature vectors lying on a low-dimensional manifold. We study two questions: how to define the similarity relationships over noisy data points, and what is the resulting impact of the choice of similarity in the extraction of global geometric information from the underlying manifold. We provide concrete mathematical evidence that using a local regularization of the noisy data to define the similarity improves the ap- proximation of the hidden Euclidean distance between unperturbed points. Furthermore, graph-based objects constructed with the locally regularized similarity function satisfy bet- ter error bounds in their recovery of global geometric ones. Our theory is supported by numerical experiments that demonstrate that the gain in geometric understanding facili- tated by local regularization translates into a gain in classification accuracy in simulated and real data. 
    more » « less
  3. Several data analysis techniques employ similarity relationships between data points to uncover the intrinsic dimension and geometric structure of the underlying data-generating mechanism. In this paper we work under the model assumption that the data is made of random perturbations of feature vectors lying on a low-dimensional manifold. We study two questions: how to define the similarity relationships over noisy data points, and what is the resulting impact of the choice of similarity in the extraction of global geometric information from the underlying manifold. We provide concrete mathematical evidence that using a local regularization of the noisy data to define the similarity improves the approximation of the hidden Euclidean distance between unperturbed points. Furthermore, graph-based objects constructed with the locally regularized similarity function satisfy better error bounds in their recovery of global geometric ones. Our theory is supported by numerical experiments that demonstrate that the gain in geometric understanding facilitated by local regularization translates into a gain in classification accuracy in simulated and real data. 
    more » « less
  4. Convolutional neural networks (CNNs) are revolutionizing imaging science for two- and three-dimensional images over Euclidean domains. However, many data sets are intrinsically non-Euclidean and are better modeled through other mathematical structures, such as graphs or manifolds. This state of affairs has led to the development of geometric deep learning, which refers to a body of research that aims to translate the principles of CNNs to these non-Euclidean structures. In the process, various challenges have arisen, including how to define such geometric networks, how to compute and train them efficiently, and what are their mathematical properties. In this letter we describe the geometric wavelet scattering transform, which is a type of geometric CNN for graphs and manifolds consisting of alternating multiscale geometric wavelet transforms and nonlinear activation functions. As the name suggests, the geometric wavelet scattering transform is an adaptation of the Euclidean wavelet scattering transform, first introduced by S. Mallat, to graph and manifold data. Like its Euclidean counterpart, the geometric wavelet scattering transform has several desirable properties. In the manifold setting these properties include isometric invariance up to a user specified scale and stability to small diffeomorphisms. Numerical results on manifold and graph data sets, including graph and manifold classification tasks as well as others, illustrate the practical utility of the approach. 
    more » « less
  5. M. Ranzato ; A. Beygelzimer ; Y. Dauphin ; P.S. Liang ; J. Wortman Vaughan (Ed.)
    The null space of the k-th order Laplacian Lk, known as the {\em k-th homology vector space}, encodes the non-trivial topology of a manifold or a network. Understanding the structure of the homology embedding can thus disclose geometric or topological information from the data. The study of the null space embedding of the graph Laplacian L0 has spurred new research and applications, such as spectral clustering algorithms with theoretical guarantees and estimators of the Stochastic Block Model. In this work, we investigate the geometry of the k-th homology embedding and focus on cases reminiscent of spectral clustering. Namely, we analyze the {\em connected sum} of manifolds as a perturbation to the direct sum of their homology embeddings. We propose an algorithm to factorize the homology embedding into subspaces corresponding to a manifold's simplest topological components. The proposed framework is applied to the {\em shortest homologous loop detection} problem, a problem known to be NP-hard in general. Our spectral loop detection algorithm scales better than existing methods and is effective on diverse data such as point clouds and images. 
    more » « less