S-Isomap++: Multi manifold learning from streaming data
Manifold learning based methods have been widely used for non-linear dimensionality reduction (NLDR). However, in many practical settings, the need to process streaming data is a challenge for such methods, owing to the high computational complexity involved. Moreover, most methods operate under the assumption that the input data is sampled from a single manifold embedded in a high-dimensional space. We propose a method for streaming NLDR when the observed data is either sampled from multiple manifolds or irregularly sampled from a single manifold. We show that existing NLDR methods, such as Isomap, fail in such situations, primarily because they rely on smoothness and continuity of the underlying manifold, assumptions that are violated in the scenarios explored in this paper. However, the proposed algorithm is able to learn effectively in the presence of multiple, and potentially intersecting, manifolds, while allowing for the input data to arrive as a massive stream.
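As a rough illustration of the multi-manifold difficulty described above, the sketch below reproduces only the geodesic-estimation step of standard Isomap (a k-nearest-neighbour graph followed by shortest paths). It is not S-Isomap++ and not the authors' analysis; the toy data, the helper names (`knn_graph`, `geodesic_distances`, `half_circle`) and the choice `k=3` are all assumptions made for the example. It shows one simple failure mode: when samples come from two well-separated manifolds, the neighbourhood graph is disconnected and the cross-manifold geodesic distances are infinite.

```python
import heapq
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph with Euclidean edge weights."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths from source; unreachable nodes stay at inf."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def half_circle(cx, n=20):
    """n points on the upper unit half circle centred at (cx, 0) (hypothetical toy data)."""
    return [(cx + math.cos(math.pi * t / (n - 1)), math.sin(math.pi * t / (n - 1)))
            for t in range(n)]


# Two separate one-dimensional manifolds, far apart in the plane.
points = half_circle(0.0) + half_circle(10.0)
graph = knn_graph(points, k=3)
dist = geodesic_distances(graph, source=0)

within = dist[19]  # far end of the same arc: close to the arc length pi
across = dist[20]  # first point of the other arc: no connecting path
print(f"within-manifold geodesic: {within:.3f}")
print(f"cross-manifold geodesic:  {across}")
```

Within one arc the graph distance approximates the arc length rather than the straight-line distance of 2, which is the behaviour Isomap relies on. Across arcs the distance matrix contains infinities, so the subsequent multidimensional scaling step has no well-defined input without some extra handling.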
- Award ID(s): 1651475
- PAR ID: 10064701
- Date Published:
- Journal Name: IEEE Bigdata 2017
- Page Range / eLocation ID: 716 to 725
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Streaming adaptations of manifold learning based dimensionality reduction methods, such as Isomap, are based on the assumption that a small initial batch of observations is enough for exact learning of the manifold, while remaining streaming data instances can be cheaply mapped to this manifold. However, there are no theoretical results to show that this core assumption is valid. Moreover, such methods typically assume that the underlying data distribution is stationary and are not equipped to detect, or handle, sudden changes or gradual drifts in the distribution that may occur when the data is streaming. We present theoretical results to show that the quality of a manifold asymptotically converges as the size of data increases. We then show that a Gaussian Process Regression (GPR) model that uses a manifold-specific kernel function and is trained on an initial batch of sufficient size can closely approximate the state-of-the-art streaming Isomap algorithms, and the predictive variance obtained from the GPR prediction can be employed as an effective detector of changes in the underlying data distribution. Results on several synthetic and real data sets show that the resulting algorithm can effectively learn a lower-dimensional representation of high-dimensional data in a streaming setting, while identifying shifts in the generative distribution. For instance, key findings on a Gas sensor array data set show that our method can detect changes in the underlying data stream, triggered by real-world factors, such as the introduction of a new gas in the system, while efficiently mapping data on a low-dimensional manifold.
-
The manifold scattering transform is a deep feature extractor for data defined on a Riemannian manifold. It is one of the first examples of extending convolutional neural network-like operators to general manifolds. The initial work on this model focused primarily on its theoretical stability and invariance properties but did not provide methods for its numerical implementation except in the case of two-dimensional surfaces with predefined meshes. In this work, we present practical schemes, based on the theory of diffusion maps, for implementing the manifold scattering transform on datasets arising in naturalistic systems, such as single cell genetics, where the data is a high-dimensional point cloud modeled as lying on a low-dimensional manifold. We show that our methods are effective for signal classification and manifold classification tasks.
-
We propose a novel, angle-based path metric for the multi-manifold clustering problem. This metric, which we call the largest-angle path distance (LAPD), is computed as a bottleneck path distance in a graph constructed on d-simplices of data points. When data is sampled from a collection of d-dimensional manifolds which may intersect, the method can cluster the manifolds with high accuracy and automatically detect how many manifolds are present. By leveraging fast approximation schemes for bottleneck distance, this method exhibits quasi-linear computational complexity in the number of data points. In addition to being highly scalable, the method outperforms existing algorithms in numerous numerical experiments on intersecting manifolds, and exhibits robustness with respect to noise and curvature in the data.
-
Image datasets in specialized fields of science, such as biomedicine, are typically smaller than traditional machine learning datasets. As such, they present a problem for training many models. To address this challenge, researchers often attempt to incorporate priors, i.e., external knowledge, to help the learning procedure. Geometric priors, for example, offer to restrict the learning process to the manifold to which the data belong. However, learning on manifolds is sometimes computationally intensive to the point of being prohibitive. Here, we ask a provocative question: is machine learning on manifolds really more accurate than its linear counterpart to the extent that it is worth sacrificing significant speedup in computation? We answer this question through an extensive theoretical and experimental study of one of the most common learning methods for manifold-valued data: geodesic regression.
-
Gaussian processes (GPs) are very widely used for modeling of unknown functions or surfaces in applications ranging from regression to classification to spatial processes. Although there is an increasingly vast literature on applications, methods, theory and algorithms related to GPs, the overwhelming majority of this literature focuses on the case in which the input domain corresponds to a Euclidean space. However, particularly in recent years with the increasing collection of complex data, it is commonly the case that the input domain does not have such a simple form. For example, it is common for the inputs to be restricted to a non-Euclidean manifold, a case which forms the motivation for this article. In particular, we propose a general extrinsic framework for GP modeling on manifolds, which relies on embedding of the manifold into a Euclidean space and then constructing extrinsic kernels for GPs on their images. These extrinsic Gaussian processes (eGPs) are used as prior distributions for unknown functions in Bayesian inferences. Our approach is simple and general, and we show that the eGPs inherit fine theoretical properties from GP models in Euclidean spaces. We consider applications of our models to regression and classification problems with predictors lying in a large class of manifolds, including spheres, planar shape spaces, a space of positive definite matrices, and Grassmannians. Our models can be readily used by practitioners in biological sciences for various regression and classification problems, such as disease diagnosis or detection. Our work is also likely to have impact in spatial statistics when spatial locations are on the sphere or other geometric spaces.
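The LAPD entry in the list above describes its metric as a bottleneck path distance on a graph. For readers unfamiliar with the term, the following is a minimal sketch of an exact bottleneck (minimax) path computation on a small weighted graph, using a Dijkstra-style search in which a path costs its largest edge weight rather than the sum of its weights. It is not the LAPD construction itself: the graph on d-simplices, the angle-based weights, and the fast approximation schemes are not reproduced, and the toy graph and the function name `bottleneck_distances` are assumptions for illustration.

```python
import heapq
import math


def bottleneck_distances(graph, source):
    """Minimax path cost from source to every node.

    The cost of a path is its largest edge weight; the bottleneck distance
    is the smallest such cost over all paths between two nodes.
    """
    best = {node: math.inf for node in graph}
    best[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        cost, u = heapq.heappop(heap)
        if cost > best[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            candidate = max(cost, w)  # a path is only as good as its worst edge
            if candidate < best[v]:
                best[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return best


# Hypothetical undirected toy graph: the direct edge a-d is expensive,
# while the detour a-b-c-d uses only small edges.
graph = {
    "a": {"b": 1.0, "d": 5.0},
    "b": {"a": 1.0, "c": 2.0},
    "c": {"b": 2.0, "d": 1.5},
    "d": {"a": 5.0, "c": 1.5},
}
result = bottleneck_distances(graph, "a")
print(result)
```

In this toy graph the bottleneck distance from `a` to `d` is 2.0, set by the worst edge on the detour, even though that detour is longer in total than some alternatives would need to be under a sum-of-weights metric. This insensitivity to path length is what makes bottleneck distances attractive for clustering points that are linked by chains of small steps.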