 Award ID(s):
 1717916
 Publication Date:
 NSFPAR ID:
 10099024
 Journal Name:
 International Conference on Machine Learning
 Sponsoring Org:
 National Science Foundation
More Like this

Many modern learning tasks involve fitting nonlinear models which are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Due to this overparameterization, the training loss may have infinitely many global minima and it is critical to understand the properties of the solutions found by firstorder optimization schemes such as (stochastic) gradient descent starting from different initializations. In this paper we demonstrate that when the loss has certain properties over a minimally small neighborhood of the initial point, first order methods such as (stochastic) gradient descent have a few intriguing properties: (1) the iterates converge at a geometric rate to a global optima even when the loss is nonconvex, (2) among all global optima of the loss the iterates converge to one with a near minimal distance to the initial point, (3) the iterates take a near direct route from the initial point to this global optimum. As part of our proof technique, we introduce a new potential function which captures the tradeoff between the loss function and the distance to the initial point as the iterations progress. The utility of our general theory is demonstrated for a variety of problemmore »

We develop a convex analytic approach to analyze finite width twolayer ReLU networks. We first prove that an optimal solution to the regularized training problem can be characterized as extreme points of a convex set, where simple solutions are encouraged via its convex geometrical properties. We then leverage this characterization to show that an optimal set of parameters yield linear spline interpolation for regression problems involving one dimensional or rankone data. We also characterize the classification decision regions in terms of a kernel matrix and minimum `1norm solutions. This is in contrast to Neural Tangent Kernel which is unable to explain predictions of finite width networks. Our convex geometric characterization also provides intuitive explanations of hidden neurons as autoencoders. In higher dimensions, we show that the training problem can be cast as a finite dimensional convex problem with infinitely many constraints. Then, we apply certain convex relaxations and introduce a cuttingplane algorithm to globally optimize the network. We further analyze the exactness of the relaxations to provide conditions for the convergence to a global optimum. Our analysis also shows that optimal network parameters can be also characterized as interpretable closedform formulas in some practically relevant special cases.

We propose a primaldual based framework for analyzing the global optimality of nonconvex lowrank matrix recovery. Our analysis are based on the restricted strongly convex and smooth conditions, which can be verified for a broad family of loss functions. In addition, our analytic framework can directly handle the widelyused incoherence constraints through the lens of duality. We illustrate the applicability of the proposed framework to matrix completion and onebit matrix completion, and prove that all these problems have no spurious local minima. Our results not only improve the sample complexity required for characterizing the global optimality of matrix completion, but also resolve an open problem in Ge et al. (2017) regarding onebit matrix completion. Numerical experiments show that primaldual based algorithm can successfully recover the global optimum for various lowrank problems.

We describe the convex semiinfinite dual of the twolayer vectoroutput ReLU neural network training problem. This semiinfinite dual admits a finite dimensional representation, but its support is over a convex set which is difficult to characterize. In particular, we demonstrate that the nonconvex neural network training problem is equivalent to a finitedimensional convex copositive program. Our work is the first to identify this strong connection between the global optima of neural networks and those of copositive programs. We thus demonstrate how neural networks implicitly attempt to solve copositive programs via seminonnegative matrix factorization, and draw key insights from this formulation. We describe the first algorithms for provably finding the global minimum of the vector output neural network training problem, which are polynomial in the number of samples for a fixed data rank, yet exponential in the dimension. However, in the case of convolutional architectures, the computational complexity is exponential in only the filter size and polynomial in all other parameters. We describe the circumstances in which we can find the global optimum of this neural network training problem exactly with softthresholded SVD, and provide a copositive relaxation which is guaranteed to be exact for certain classes of problems, and whichmore »

Abstract Topology optimization by optimally distributing materials in a given domain requires nongradient optimizers to solve highly complicated problems. However, with hundreds of design variables or more involved, solving such problems would require millions of Finite Element Method (FEM) calculations whose computational cost is huge and impractical. Here we report Selfdirected Online Learning Optimization (SOLO) which integrates Deep Neural Network (DNN) with FEM calculations. A DNN learns and substitutes the objective as a function of design variables. A small number of training data is generated dynamically based on the DNN’s prediction of the optimum. The DNN adapts to the new training data and gives better prediction in the region of interest until convergence. The optimum predicted by the DNN is proved to converge to the true global optimum through iterations. Our algorithm was tested by four types of problems including compliance minimization, fluidstructure optimization, heat transfer enhancement and truss optimization. It reduced the computational time by 2 ~ 5 orders of magnitude compared with directly using heuristic methods, and outperformed all stateoftheart algorithms tested in our experiments. This approach enables solving large multidimensional optimization problems.