Multi-objective optimization (MOO) aims to simultaneously optimize multiple conflicting objectives and has found important applications in machine learning, such as simultaneously minimizing classification and fairness losses. At an optimum, further improving one objective necessarily worsens at least one other objective, and decision-makers need to comprehensively explore multiple optima to pinpoint one final solution. We address the efficiency of exploring the Pareto front that contains all optima. First, stochastic multi-gradient descent (SMGD) is slow to converge to the Pareto front with large neural networks and datasets. Instead, we explore the Pareto front as a manifold from a few initial optima, based on a predictor-corrector method. Second, for each exploration step, the predictor iteratively solves a large-scale linear system that scales quadratically in the number of model parameters and requires one backpropagation to evaluate a second-order Hessian-vector product per iteration of the solver. We propose a Gauss-Newton approximation that scales linearly and requires only a first-order inner product per iteration. Third, we explore different linear system solvers, including MINRES and the conjugate gradient method, for approximately solving the linear systems. These innovations make the predictor-corrector method efficient for large networks and datasets. Experiments on a fair misinformation detection task show that 1) the predictor-corrector method can find Pareto fronts better than or similar to SMGD with less time, and 2) the proposed first-order method does not harm the quality of the Pareto front identified by the second-order method, while further reducing running time.
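As a rough illustration of the linear-solve machinery described above, the sketch below runs a matrix-free conjugate gradient solver in which each iteration needs only a Gauss-Newton product $J^\top(Jv)$, built entirely from first-order quantities. The toy least-squares problem and all function names are illustrative assumptions, not the paper's implementation.

```python
# Matrix-free conjugate gradient: the solver only ever sees a matvec callable,
# so an exact Hessian-vector product or a Gauss-Newton approximation can be
# swapped in without forming any matrix.  Toy data below is made up.
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-8, max_iter=100):
    """Solve A x = b using only matrix-vector products with A."""
    x = np.zeros_like(b)
    r = b - matvec(x)            # initial residual
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy least-squares loss 0.5 * ||J w - y||^2: the Gauss-Newton product J^T (J v)
# needs only first-order quantities (Jacobian-vector products), never a full Hessian.
rng = np.random.default_rng(0)
J = rng.standard_normal((50, 10))
y = rng.standard_normal(50)

def gauss_newton_hvp(v):
    return J.T @ (J @ v)

w = conjugate_gradient(gauss_newton_hvp, J.T @ y)   # solves the normal equations J^T J w = J^T y
```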
Low-Rank Univariate Sum of Squares Has No Spurious Local Minima
We study the problem of decomposing a polynomial $p$ into a sum of $r$ squares by minimizing the quadratically penalized objective $f_p(u) = \left\| \sum_{i=1}^{r} u_i^2 - p \right\|^2$. This objective is nonconvex and is equivalent to the rank-$r$ Burer-Monteiro factorization of a semidefinite program (SDP) encoding the sum of squares decomposition. We show that for all univariate polynomials $p$, if $r \geq 2$ then $f_p(u)$ has no spurious second-order critical points, showing that all local optima are also global optima. This is in contrast to previous work showing that for general SDPs, in addition to genericity conditions, $r$ has to be roughly the square root of the number of constraints (the degree of $p$) for there to be no spurious second-order critical points. Our proof uses tools from computational algebraic geometry and can be interpreted as constructing a certificate using the first- and second-order necessary conditions. We also show that by choosing a norm based on sampling equally-spaced points on the circle, the gradient $\nabla f_p$ can be computed in nearly linear time using fast Fourier transforms. Experimentally we demonstrate that this method has very fast convergence using first-order optimization algorithms such as L-BFGS, with near-linear scaling to million-degree polynomials.
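To make the FFT-based evaluation concrete, here is a minimal sketch, under the assumption that the sampling norm means averaging squared deviations at $N$ equally spaced points on the unit circle (the $N$-th roots of unity), so that each polynomial evaluation is a single FFT. The example polynomial, the choice $r = 2$, and $N = 16$ are made-up test data, not the paper's code.

```python
# Evaluate f_p(u) = || sum_i u_i^2 - p ||^2 in the sampled norm: one FFT of the
# zero-padded coefficient vector gives the polynomial's values at N equally
# spaced points on the unit circle.
import numpy as np

def eval_on_circle(coeffs, N):
    """Values of the polynomial at N equally spaced points on the unit circle (via FFT)."""
    padded = np.zeros(N, dtype=complex)
    padded[:len(coeffs)] = coeffs
    return np.fft.fft(padded)

def f_p(U, p_coeffs, N):
    """Sampled-norm objective: mean squared deviation of sum_i u_i^2 from p."""
    vals = sum(eval_on_circle(u, N) ** 2 for u in U) - eval_on_circle(p_coeffs, N)
    return np.mean(np.abs(vals) ** 2)

# Example: p(x) = (x^2 + 1)^2 = x^4 + 2x^2 + 1 is a sum of squares with r = 2.
p = np.array([1.0, 0.0, 2.0, 0.0, 1.0])        # coefficients, lowest degree first
U_exact = [np.array([1.0, 0.0, 1.0]),          # u_1 = x^2 + 1
           np.array([0.0, 0.0, 0.0])]          # u_2 = 0
print(f_p(U_exact, p, N=16))                   # ~0 at an exact decomposition
```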
- Award ID(s): 1835443
- PAR ID: 10490265
- Publisher / Repository: SIAM
- Date Published:
- Journal Name: SIAM Journal on Optimization
- Volume: 33
- Issue: 3
- ISSN: 1052-6234
- Page Range / eLocation ID: 2041 to 2061
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
We study a deep learning inspired formulation for the blind demodulation problem, which is the task of recovering two unknown vectors from their entrywise multiplication. We consider the case where the unknown vectors are in the range of known deep generative models, $G^{(1)}: \mathbb{R}^n \to \mathbb{R}^l$ and $G^{(2)}: \mathbb{R}^p \to \mathbb{R}^l$. In the case when the networks corresponding to the generative models are expansive, the weight matrices are random, and the dimension of the unknown vectors satisfies $l = \Omega(n^2 + p^2)$, up to log factors, we show that the empirical risk objective has a favorable landscape for optimization. That is, the objective function has a descent direction at every point outside of a small neighborhood around four hyperbolic curves. We also characterize the local maximizers of the empirical risk objective and, hence, show that there do not exist any other stationary points outside of these neighborhoods around the four hyperbolic curves and the set of local maximizers. We also implement a gradient descent scheme inspired by the geometry of the landscape of the objective function. In order to converge to a global minimizer, this gradient descent scheme exploits the fact that exactly one of the hyperbolic curves corresponds to the global minimizer, and thus points near this hyperbolic curve have a lower objective value than points close to the other spurious hyperbolic curves. We show that this gradient descent scheme can effectively remove distortions synthetically introduced to the MNIST dataset.
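A toy sketch of this empirical-risk formulation is given below: two small random ReLU generators, an observation formed by entrywise multiplication, and plain gradient descent on the latent codes. The network sizes, step size, and the use of vanilla (rather than geometry-aware) descent are all illustrative assumptions, not the paper's exact setup.

```python
# Recover latent codes (h, m) so that G1(h) * G2(m) matches the observation y,
# where G1 and G2 are one-layer random ReLU generators.
import numpy as np

rng = np.random.default_rng(1)
n, p, l = 5, 5, 200
W1 = rng.standard_normal((l, n)) / np.sqrt(l)   # G1: R^n -> R^l
W2 = rng.standard_normal((l, p)) / np.sqrt(l)   # G2: R^p -> R^l

G1 = lambda h: np.maximum(W1 @ h, 0.0)
G2 = lambda m: np.maximum(W2 @ m, 0.0)

h_true, m_true = rng.standard_normal(n), rng.standard_normal(p)
y = G1(h_true) * G2(m_true)                     # entrywise (Hadamard) product

def risk(h, m):
    return 0.5 * np.sum((G1(h) * G2(m) - y) ** 2)

# Plain gradient descent on the latent codes (ReLU subgradient = indicator of positivity).
h, m = rng.standard_normal(n), rng.standard_normal(p)
lr = 0.1
for _ in range(2000):
    r = G1(h) * G2(m) - y
    grad_h = W1.T @ ((r * G2(m)) * (W1 @ h > 0))
    grad_m = W2.T @ ((r * G1(h)) * (W2 @ m > 0))
    h -= lr * grad_h
    m -= lr * grad_m
print(risk(h, m))   # residual fit; the landscape analysis explains when this reaches a global minimum
```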
This paper addresses the problem of identification of error-in-variables switched linear models from experimental input/output data. This problem is known to be generically NP-hard and thus computationally expensive to solve. To address this difficulty, several relaxations have been proposed in the past few years. While solvable in polynomial time, these (convex) relaxations tend to scale poorly with the number of points and the number/order of the subsystems, effectively limiting their applicability to scenarios with a relatively small number of data points. To address this difficulty, in this paper we propose an efficient method that only requires performing one singular value decomposition per subsystem, on matrices whose size is independent of the number of points. The underlying idea is to obtain a sum-of-squares polynomial approximation of the support of each subsystem one at a time, and use these polynomials to segment the data into sets, each generated by a single subsystem. As shown in the paper, exploiting ideas from Christoffel functions allows for finding these polynomial approximations simply by performing SVDs. The parameters of each subsystem can then be identified from the segmented data using existing error-in-variables (EIV) techniques.
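The following sketch illustrates the Christoffel-function idea on toy data: lift each point to a monomial vector, form the empirical moment matrix, and score points by the inverse Christoffel function, so that points from a sparsely represented subsystem tend to stand out. The degree-1 lifting and the two-line data set are illustrative choices, not the paper's setup.

```python
# Inverse-Christoffel-function scores from the empirical moment matrix:
# points in densely sampled regions score low, outlying points score high.
import numpy as np

def monomials(X):
    """Degree-1 monomial lifting [1, x1, x2] of 2-D points (rows of X)."""
    return np.hstack([np.ones((len(X), 1)), X])

def christoffel_scores(X, reg=1e-8):
    V = monomials(X)
    M = V.T @ V / len(X)                                 # empirical moment matrix
    M_inv = np.linalg.pinv(M + reg * np.eye(M.shape[0]))
    return np.einsum('ij,jk,ik->i', V, M_inv, V)         # v(x)^T M^{-1} v(x) per point

# Toy data: most points near the line y = 2x, a few on y = -x (a second "subsystem").
rng = np.random.default_rng(2)
t = rng.standard_normal(120)
X = np.column_stack([t, 2 * t]) + 0.01 * rng.standard_normal((120, 2))
X[:20, 1] = -X[:20, 0]                                   # overwrite 20 points with the other line
scores = christoffel_scores(X)
print(scores[:20].mean(), scores[20:].mean())            # the minority subsystem tends to score higher
```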
Variance reduction techniques like SVRG provide simple and fast algorithms for optimizing a convex finite-sum objective. For nonconvex objectives, these techniques can also find a first-order stationary point (with small gradient). However, in nonconvex optimization it is often crucial to find a second-order stationary point (with small gradient and almost PSD Hessian). In this paper, we show that Stabilized SVRG (a simple variant of SVRG) can find an $\epsilon$-second-order stationary point using only $O(n^{2/3}/\epsilon^2 + n/\epsilon^{1.5})$ stochastic gradients. To the best of our knowledge, this is the first second-order guarantee for a simple variant of SVRG. The running time almost matches the known guarantees for finding $\epsilon$-first-order stationary points.
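For reference, here is a minimal plain-SVRG sketch on a least-squares finite sum, showing the variance-reduced gradient that Stabilized SVRG builds on; the stabilization and perturbation steps needed for the second-order guarantee are omitted, and the problem data and step size are arbitrary.

```python
# SVRG on f(w) = (1/n) sum_i 0.5 * (a_i^T w - b_i)^2.  The variance-reduced gradient is
# g = grad_i(w) - grad_i(w_snapshot) + full_grad(w_snapshot).
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(w, i):                      # gradient of the i-th summand 0.5 * (a_i^T w - b_i)^2
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):
    return A.T @ (A @ w - b) / n

w = np.zeros(d)
lr = 0.005
for epoch in range(50):
    w_snap = w.copy()
    mu = full_grad(w_snap)             # full gradient at the snapshot
    for _ in range(n):
        i = rng.integers(n)
        g = grad_i(w, i) - grad_i(w_snap, i) + mu   # variance-reduced stochastic gradient
        w -= lr * g
print(np.linalg.norm(full_grad(w)))    # gradient norm shrinks as the method converges
```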
We propose the use of PolyPIC transfers [10] to construct a second-order accurate discretization of the Navier-Stokes equations within a particle-in-cell framework on MAC grids. We investigate the accuracy of both APIC [16], [17], [8] and quadratic PolyPIC [10] transfers and demonstrate that they are suitable for constructing schemes converging with orders of approximately 1.5 and 2.5, respectively. We combine PolyPIC transfers with BDF-2 time integration and a splitting scheme for pressure and viscosity and demonstrate that the resulting scheme is second-order accurate. Prior high-order particle-in-cell schemes interpolate accelerations (not velocities) from the grid to particles and rely on moving least squares to transfer particle velocities to the computational grid. The proposed method instead transfers velocities to particles, which avoids the accumulation of noise on particle velocities but requires the polynomial reconstruction to be performed using polynomials that are one degree higher. Since this polynomial reconstruction occurs over the regular grid (rather than irregularly distributed particles), the resulting weighted least squares problem has a fixed sparse structure, can be solved efficiently in closed form, and is independent of particle coverage.
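As a one-dimensional illustration of reconstructing velocities over a regular grid, the sketch below fits a quadratic polynomial to nearby grid velocities by weighted least squares and evaluates it at a particle position. The B-spline weights, stencil, and test velocity field are illustrative assumptions, not the scheme in the paper.

```python
# 1-D grid-to-particle transfer sketch: weighted least-squares quadratic fit of grid
# velocities around a particle, evaluated at the particle position (unit grid spacing).
import numpy as np

def quadratic_bspline(x):
    """Standard quadratic B-spline kernel, supported on |x| < 1.5."""
    x = abs(x)
    if x < 0.5:
        return 0.75 - x**2
    if x < 1.5:
        return 0.5 * (1.5 - x)**2
    return 0.0

def grid_to_particle(xp, grid_x, grid_v):
    """Fit v(x) = c0 + c1 (x - xp) + c2 (x - xp)^2 by weighted LS; return its value at xp (= c0)."""
    w = np.array([quadratic_bspline(xi - xp) for xi in grid_x])
    mask = w > 0
    X = np.vander(grid_x[mask] - xp, 3, increasing=True)   # rows [1, dx, dx^2]
    W = np.diag(w[mask])
    c = np.linalg.solve(X.T @ W @ X, X.T @ W @ grid_v[mask])
    return c[0]

grid_x = np.arange(0.0, 8.0)            # unit-spaced 1-D grid nodes
grid_v = np.sin(grid_x)                 # a smooth test velocity field on the grid
print(grid_to_particle(3.3, grid_x, grid_v), np.sin(3.3))   # reconstruction vs. true value
```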