We introduce a generic scheme for accelerating gradient-based optimization methods in the sense of Nesterov. The approach, called Catalyst, builds upon the inexact accelerated proximal point algorithm for minimizing a convex objective function, and consists of approximately solving a sequence of well-chosen auxiliary problems, leading to faster convergence. One of the keys to achieving acceleration in theory and in practice is to solve these sub-problems with appropriate accuracy by using the right stopping criterion and the right warm-start strategy. We give practical guidelines for using Catalyst and present a comprehensive analysis of its global complexity. We show that Catalyst applies to a large class of algorithms, including gradient descent, block coordinate descent, incremental algorithms such as SAG, SAGA, SDCA, SVRG, MISO/Finito, and their proximal variants. For all of these methods, we establish faster rates using the Catalyst acceleration, for strongly convex and non-strongly convex objectives. We conclude with extensive experiments showing that acceleration is useful in practice, especially for ill-conditioned problems.
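As a rough illustration of the outer loop described above, the following Python sketch applies an inexact accelerated proximal point iteration to a small diagonal quadratic. The function name `catalyst_quadratic`, the fixed inner iteration budget (used here in place of a proper stopping criterion), and the constant extrapolation weight for the strongly convex case are assumptions made for illustration, not the authors' reference implementation. It is meant to show the structure of the scheme, not to demonstrate a speedup.

```python
def catalyst_quadratic(a, b, kappa, outer_iters=100, inner_iters=20):
    """Sketch of a Catalyst-style outer loop around plain gradient descent.

    Minimizes f(x) = sum_i (0.5 * a[i] * x[i]**2 - b[i] * x[i]) with a[i] > 0,
    whose exact minimizer is x[i] = b[i] / a[i].
    """
    mu, big_l = min(a), max(a)       # strong convexity / smoothness constants of f
    q = mu / (mu + kappa)
    # Assumed constant extrapolation weight for the strongly convex case.
    beta = (1.0 - q ** 0.5) / (1.0 + q ** 0.5)
    step = 1.0 / (big_l + kappa)     # step size for the kappa-regularized subproblem

    x = [0.0] * len(a)
    y = list(x)                      # extrapolated point (prox center)
    for _ in range(outer_iters):
        x_prev = x
        z = list(x_prev)             # warm start the subproblem at the previous iterate
        # Approximately solve: min_z f(z) + (kappa / 2) * ||z - y||^2
        for _ in range(inner_iters):
            z = [
                zi - step * (ai * zi - bi + kappa * (zi - yi))
                for zi, ai, bi, yi in zip(z, a, b, y)
            ]
        x = z
        # Nesterov-style extrapolation of the outer iterates.
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
    return x


if __name__ == "__main__":
    print(catalyst_quadratic([1.0, 100.0], [1.0, 1.0], kappa=10.0))
```

In this sketch `kappa` controls the trade-off the abstract alludes to: a larger value makes each auxiliary problem better conditioned and cheaper to solve, but slows the outer loop.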
Improved Pathwise Coordinate Descent for Power Penalties
Pathwise coordinate descent algorithms have been used to compute entire solution paths for lasso and other penalized regression problems quickly with great success. They improve upon cold start algorithms by solving the problems that make up the solution path sequentially for an ordered set of tuning parameter values, instead of solving each problem separately. However, extending pathwise coordinate descent algorithms to the more general bridge or power family of penalties is challenging. Faster algorithms for computing solution paths for these penalties are needed because these penalized regression problems can be nonconvex and especially burdensome to solve. In this article, we show that a reparameterization of these penalized regression problems is more amenable to pathwise coordinate descent algorithms. This allows us to improve computation of the mode-thresholding function for penalized regression problems in practice and introduce two separate pathwise algorithms. We show that either pathwise algorithm is faster than the corresponding cold start alternative, and demonstrate that different pathwise algorithms may be more likely to reach better solutions. Supplemental materials for this article are available online.
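To make the pathwise idea concrete, here is a minimal Python sketch of warm-started coordinate descent for the ordinary lasso, the baseline setting the abstract starts from. It does not implement the article's reparameterization or the mode-thresholding function for power penalties. The function names (`soft_threshold`, `lasso_cd`, `lasso_path`), the objective scaling (1/(2n)) * ||y - X beta||^2 + lam * ||beta||_1, and the fixed number of sweeps are assumptions chosen for illustration.

```python
def soft_threshold(z, t):
    """Lasso thresholding function: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, beta, sweeps=100):
    """Coordinate descent for the lasso, started from the given beta."""
    n, p = len(X), len(X[0])
    beta = list(beta)
    # Residual r = y - X beta, kept up to date after every coordinate update.
    r = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    for _ in range(sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm = sum(c * c for c in col) / n
            if norm == 0.0:
                continue  # an all-zero column cannot change the fit
            # Partial-residual correlation for coordinate j.
            rho = sum(c * ri for c, ri in zip(col, r)) / n + norm * beta[j]
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta != 0.0:
                r = [ri - c * delta for ri, c in zip(r, col)]
                beta[j] = new
    return beta


def lasso_path(X, y, lams):
    """Solve along decreasing lam, warm-starting each fit at the previous one."""
    beta = [0.0] * len(X[0])
    path = []
    for lam in sorted(lams, reverse=True):
        beta = lasso_cd(X, y, lam, beta)  # warm start instead of a cold start
        path.append((lam, list(beta)))
    return path


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0]]
    y = [3.0, -1.0]
    for lam, beta in lasso_path(X, y, [0.25, 2.0, 1.0]):
        print(lam, beta)
```

A cold start alternative would call `lasso_cd` with a zero vector for every value of `lam`; the pathwise version reuses the previous solution, which is typically close to the next one when the tuning parameter values are closely spaced.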
- Award ID(s): 2113079
- PAR ID: 10549785
- Publisher / Repository: Taylor and Francis Online
- Date Published:
- Journal Name: Journal of Computational and Graphical Statistics
- Volume: 33
- Issue: 1
- ISSN: 1061-8600
- Page Range / eLocation ID: 310 to 315
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Many interesting search problems can be formulated as bi-objective search problems, that is, search problems where two kinds of costs have to be minimized, for example, travel distance and time for transportation problems. Bi-objective search algorithms have to maintain the set of undominated paths from the start state to each state to compute the set of paths from the start state to the goal state that are not dominated by some other path from the start state to the goal state (called the Pareto-optimal solution set). Each time they find a new path to a state s, they perform a dominance check to determine whether this path dominates any of the previously found paths to s or whether any of the previously found paths to s dominates this path. Existing algorithms do not perform these checks efficiently. On the other hand, our Bi-Objective A* (BOA*) algorithm requires only constant time per check. In our experimental evaluation, we show that BOA* can run an order of magnitude (or more) faster than state-of-the-art bi-objective search algorithms, such as NAMOA*, NAMOA*dr, Bi-Objective Dijkstra, and Bidirectional Bi-Objective Dijkstra.
-
Many machine learning techniques sacrifice convenient computational structures to gain estimation robustness and modeling flexibility. However, by exploring the modeling structures, we find these "sacrifices" do not always require more computational effort. To shed light on such a "free-lunch" phenomenon, we study the square-root-Lasso (SQRT-Lasso) type regression problem. Specifically, we show that the nonsmooth loss functions of SQRT-Lasso type regression ease tuning effort and gain adaptivity to inhomogeneous noise, but are not necessarily more challenging than Lasso in computation. We can directly apply proximal algorithms (e.g., proximal gradient descent, proximal Newton, and proximal quasi-Newton algorithms) without worrying about the nonsmoothness of the loss function. Theoretically, we prove that the proximal algorithms combined with the pathwise optimization scheme enjoy fast convergence guarantees with high probability. Numerical results are provided to support our theory.
-
We present a new class of preconditioned iterative methods for solving linear systems of the form Ax = b. Our methods are based on constructing a low-rank Nyström approximation to A using sparse random matrix sketching. This approximation is used to construct a preconditioner, which itself is inverted quickly using additional levels of random sketching and preconditioning. We prove that the convergence of our methods depends on a natural average condition number of A, which improves as the rank of the Nyström approximation increases. Concretely, this allows us to obtain faster runtimes for a number of fundamental linear algebraic problems: 1. We show how to solve any n × n linear system that is well-conditioned except for k outlying large singular values in Õ(n^{2.065} + k^ω) time, improving on a recent result of [Dereziński, Yang, STOC 2024] for all k ≳ n^{0.78}. 2. We give the first Õ(n^2 + d_λ^ω) time algorithm for solving a regularized linear system (A + λI)x = b, where A is positive semidefinite with effective dimension d_λ = tr(A(A + λI)^{-1}). This problem arises in applications like Gaussian process regression. 3. We give faster algorithms for approximating Schatten p-norms and other matrix norms. For example, for the Schatten 1-norm (nuclear norm), we give an algorithm that runs in Õ(n^{2.11}) time, improving on an Õ(n^{2.18}) method of [Musco et al., ITCS 2018]. All results are proven in the real RAM model of computation. Interestingly, previous state-of-the-art algorithms for most of the problems above relied on stochastic iterative methods, like stochastic coordinate and gradient descent. Our work takes a completely different approach, instead leveraging tools from matrix sketching.
-
Sparsification of neural networks is one of the effective complexity reduction methods to improve efficiency and generalizability. We consider the problem of learning a one-hidden-layer convolutional neural network with ReLU activation function via gradient descent under sparsity-promoting penalties. It is known that when the input data is Gaussian distributed, no-overlap networks (without penalties) in regression problems with ground truth can be learned in polynomial time with high probability. We propose a relaxed variable splitting method integrating thresholding and gradient descent to overcome the non-smoothness in the loss function. The sparsity in network weight is realized during the optimization (training) process. We prove that under L1, L0, and transformed-L1 penalties, no-overlap networks can be learned with high probability, and the iterative weights converge to a global limit which is a transformation of the true weight under a novel thresholding operation. Numerical experiments confirm theoretical findings, and compare the accuracy and sparsity trade-off among the penalties.