skip to main content

Title: Advances in neural information processing systems
This paper extends robust principal component analysis (RPCA) to nonlinear manifolds. Suppose that the observed data matrix is the sum of a sparse component and a component drawn from some low dimensional manifold. Is it possible to separate them by using similar ideas as RPCA? Is there any benefit in treating the manifold as a whole as opposed to treating each local region independently? We answer these two questions affirmatively by proposing and analyzing an optimization framework that separates the sparse component from the manifold under noisy data. Theoretical error bounds are provided when the tangent spaces of the manifold satisfy certain incoherence conditions. We also provide a near optimal choice of the tuning parameters for the proposed optimization formulation with the help of a new curvature estimation method. The efficacy of our method is demonstrated on both synthetic and real datasets.
; ; ; ; ;
Award ID(s):
Publication Date:
Journal Name:
Advances in neural information processing systems
Sponsoring Org:
National Science Foundation
More Like this
  1. Sparse principal component analysis and sparse canonical correlation analysis are two essential techniques from high-dimensional statistics and machine learning for analyzing large-scale data. Both problems can be formulated as an optimization problem with nonsmooth objective and nonconvex constraints. Because nonsmoothness and nonconvexity bring numerical difficulties, most algorithms suggested in the literature either solve some relaxations of them or are heuristic and lack convergence guarantees. In this paper, we propose a new alternating manifold proximal gradient method to solve these two high-dimensional problems and provide a unified convergence analysis. Numerical experimental results are reported to demonstrate the advantages of our algorithm.
  2. Spectral clustering is one of the fundamental unsupervised learning methods and is widely used in data analysis. Sparse spectral clustering (SSC) imposes sparsity to the spectral clustering, and it improves the interpretability of the model. One widely adopted model for SSC in the literature is an optimization problem over the Stiefel manifold with nonsmooth and nonconvex objective. Such an optimization problem is very challenging to solve. Existing methods usually solve its convex relaxation or need to smooth its nonsmooth objective using certain smoothing techniques. Therefore, they were not targeting solving the original formulation of SSC. In this paper, we proposemore »a manifold proximal linear method (ManPL) that solves the original SSC formulation without twisting the model. We also extend the algorithm to solve multiple-kernel SSC problems, for which an alternating ManPL algorithm is proposed. Convergence and iteration complexity results of the proposed methods are established. We demonstrate the advantage of our proposed methods over existing methods via clustering of several data sets, including University of California Irvine and single-cell RNA sequencing data sets.« less
  3. Since the cost of labeling data is getting higher and higher, we hope to make full use of the large amount of unlabeled data and improve image classification effect through adding some unlabeled samples for training. In addition, we expect to uniformly realize two tasks, namely the clustering of the unlabeled data and the recognition of the query image. We achieve the goal by designing a novel sparse model based on manifold assumption, which has been proved to work well in many tasks. Based on the assumption that images of the same class lie on a sub-manifold and an imagemore »can be approximately represented as the linear combination of its neighboring data due to the local linear property of manifold, we proposed a sparse representation model on manifold. Specifically, there are two regularizations, i.e., a variant Trace lasso norm and the manifold Laplacian regularization. The first regularization term enables the representation coefficients satisfying sparsity between groups and density within a group. And the second term is manifold Laplacian regularization by which label can be accurately propagated from labeled data to unlabeled data. Augmented Lagrange Multiplier (ALM) scheme and Gauss Seidel Alternating Direction Method of Multiplier (GS-ADMM) are given to solve the problem numerically. We conduct some experiments on three human face databases and compare the proposed work with several state-of-the-art methods. For each subject, some labeled face images are randomly chosen for training for those supervised methods, and a small amount of unlabeled images are added to form the training set of the proposed approach. All experiments show our method can get better classification results due to the addition of unlabeled samples.« less
  4. Many modern big data applications feature large scale in both numbers of responses and predictors. Better statistical efficiency and scientific insights can be enabled by understanding the large-scale response-predictor association network structures via layers of sparse latent factors ranked by importance. Yet sparsity and orthogonality have been two largely incompatible goals. To accommodate both features, in this paper, we suggest the method of sparse orthogonal factor regression (SOFAR) via the sparse singular value decomposition with orthogonality constrained optimization to learn the underlying association networks, with broad applications to both unsupervised and supervised learning tasks, such as biclustering with sparse singularmore »value decomposition, sparse principal component analysis, sparse factor analysis, and spare vector autoregression analysis. Exploiting the framework of convexity-assisted nonconvex optimization, we derive nonasymptotic error bounds for the suggested procedure characterizing the theoretical advantages. The statistical guarantees are powered by an efficient SOFAR algorithm with convergence property. Both computational and theoretical advantages of our procedure are demonstrated with several simulations and real data examples.« less
  5. Gradient descent-based optimization methods underpin the parameter training of neural networks, and hence comprise a significant component in the impressive test results found in a number of applications. Introducing stochasticity is key to their success in practical problems, and there is some understanding of the role of stochastic gradient descent in this context. Momentum modifications of gradient descent such as Polyak’s Heavy Ball method (HB) and Nesterov’s method of accelerated gradients (NAG), are also widely adopted. In this work our focus is on understanding the role of momentum in the training of neural networks, concentrating on the common situation inmore »which the momentum contribution is fixed at each step of the algorithm. To expose the ideas simply we work in the deterministic setting. Our approach is to derive continuous time approximations of the discrete algorithms; these continuous time approximations provide insights into the mechanisms at play within the discrete algorithms. We prove three such approximations. Firstly we show that standard implementations of fixed momentum methods approximate a time-rescaled gradient descent flow, asymptotically as the learning rate shrinks to zero; this result does not distinguish momentum methods from pure gradient descent, in the limit of vanishing learning rate. We then proceed to prove two results aimed at understanding the observed practical advantages of fixed momentum methods over gradient descent, when implemented in the non-asymptotic regime with fixed small, but non-zero, learning rate. We achieve this by proving approximations to continuous time limits in which the small but fixed learning rate appears as a parameter; this is known as the method of modified equations in the numerical analysis literature, recently rediscovered as the high resolution ODE approximation in the machine learning context. In our second result we show that the momentum method is approximated by a continuous time gradient flow, with an additional momentum-dependent second order time-derivative correction, proportional to the learning rate; this may be used to explain the stabilizing effect of momentum algorithms in their transient phase. Furthermore in a third result we show that the momentum methods admit an exponentially attractive invariant manifold on which the dynamics reduces, approximately, to a gradient flow with respect to a modified loss function, equal to the original loss function plus a small perturbation proportional to the learning rate; this small correction provides convexification of the loss function and encodes additional robustness present in momentum methods, beyond the transient phase.« less