

Title: On the Convergence of Continuous Constrained Optimization for Structure Learning
Recently, structure learning of directed acyclic graphs (DAGs) has been formulated as a continuous optimization problem by leveraging an algebraic characterization of acyclicity. The constrained problem is solved using the augmented Lagrangian method (ALM) which is often preferred to the quadratic penalty method (QPM) by virtue of its standard convergence result that does not require the penalty coefficient to go to infinity, hence avoiding ill-conditioning. However, the convergence properties of these methods for structure learning, including whether they are guaranteed to return a DAG solution, remain unclear, which might limit their practical applications. In this work, we examine the convergence of ALM and QPM for structure learning in the linear, nonlinear, and confounded cases. We show that the standard convergence result of ALM does not hold in these settings, and demonstrate empirically that its behavior is akin to that of the QPM which is prone to ill-conditioning. We further establish the convergence guarantee of QPM to a DAG solution, under mild conditions. Lastly, we connect our theoretical results with existing approaches to help resolve the convergence issue, and verify our findings in light of an empirical comparison of them.
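The constrained program discussed above follows the NOTEARS line of work: minimize a data-fit loss over a weighted adjacency matrix W subject to the algebraic acyclicity constraint h(W) = tr(exp(W ∘ W)) − d = 0. The following is a minimal sketch of the augmented Lagrangian scheme the abstract refers to, not the authors' implementation; the loss, the inner optimizer, and the schedules for rho and alpha are illustrative choices.

```python
# Minimal sketch of NOTEARS-style structure learning via the augmented Lagrangian
# method (ALM); dropping the multiplier update recovers the quadratic penalty method.
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize

def alm_structure_learning(X, max_outer=15, rho=1.0, alpha=0.0, h_tol=1e-8):
    n, d = X.shape

    def augmented_lagrangian(w_flat):
        # Objective loss + alpha*h + (rho/2)*h^2 and its gradient, over a flattened W.
        W = w_flat.reshape(d, d)
        R = X - X @ W
        loss = 0.5 / n * (R ** 2).sum()
        E = expm(W * W)                      # exp(W ∘ W)
        h = np.trace(E) - d                  # acyclicity measure; zero iff the graph of W is a DAG
        obj = loss + alpha * h + 0.5 * rho * h ** 2
        grad = -X.T @ R / n + (alpha + rho * h) * (E.T * 2.0 * W)
        return obj, grad.ravel()

    w = np.zeros(d * d)
    for _ in range(max_outer):
        w = minimize(augmented_lagrangian, w, jac=True, method="L-BFGS-B").x
        h = np.trace(expm(w.reshape(d, d) ** 2)) - d
        if abs(h) < h_tol:
            break
        alpha += rho * h      # multiplier update (ALM); omit to recover the QPM
        rho *= 10.0           # the penalty coefficient still has to grow in practice
    return w.reshape(d, d)

# Toy usage: a two-variable linear SEM x1 -> x2 with coefficient 2.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(1000)
x2 = 2.0 * x1 + 0.5 * rng.standard_normal(1000)
print(np.round(alm_structure_learning(np.column_stack([x1, x2])), 2))
```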
Award ID(s): 2134901
NSF-PAR ID: 10380982
Editor(s): Camps-Valls, Gustau; Ruiz, Francisco J.; Valera, Isabel
Journal Name: Proceedings of Machine Learning Research
Volume: 2022
ISSN: 2640-3498
Page Range / eLocation ID: 8176-8198
Sponsoring Org: National Science Foundation
More Like this
  1. The CP tensor decomposition is used in applications such as machine learning and signal processing to discover latent low-rank structure in multidimensional data. Computing a CP decomposition via an alternating least squares (ALS) method reduces the problem to several linear least squares problems. The standard way to solve these linear least squares subproblems is to use the normal equations, which inherit special tensor structure that can be exploited for computational efficiency. However, the normal equations are sensitive to numerical ill-conditioning, which can compromise the results of the decomposition. In this paper, we develop versions of the CP-ALS algorithm using the QR decomposition and the singular value decomposition, which are more numerically stable than the normal equations, to solve the linear least squares problems. Our algorithms utilize the tensor structure of the CP-ALS subproblems efficiently, have the same complexity as the standard CP-ALS algorithm when the input is dense and the rank is small, and are shown via examples to produce more stable results when ill-conditioning is present. Our MATLAB implementation achieves the same running time as the standard algorithm for small ranks, and we show that the new methods can obtain lower approximation error. 
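As a small, self-contained illustration of the numerical point above (not the paper's CP-ALS code), the sketch below compares solving a single ill-conditioned least-squares problem via the normal equations versus a QR factorization; the matrix sizes and conditioning are arbitrary choices.

```python
# Why QR-based least-squares solves are more robust than the normal equations:
# forming A^T A squares the condition number of A.
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 10
# Build an ill-conditioned matrix by prescribing its singular values.
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -8, n)                     # condition number ~ 1e8
A = U @ np.diag(s) @ V.T
x_true = rng.standard_normal(n)
b = A @ x_true

# Normal equations: condition number of A^T A is ~ 1e16.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# QR-based solve of the same least-squares problem.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

print("normal equations error:", np.linalg.norm(x_ne - x_true))
print("QR solve error:        ", np.linalg.norm(x_qr - x_true))
```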
  2. Deep neural networks (DNNs) have shown their success as high-dimensional function approximators in many applications; however, training DNNs can be challenging in general. DNN training is commonly phrased as a stochastic optimization problem whose challenges include non-convexity, non-smoothness, insufficient regularization, and complicated data distributions. Hence, the performance of DNNs on a given task depends crucially on tuning hyperparameters, especially learning rates and regularization parameters. In the absence of theoretical guidelines or prior experience on similar tasks, this requires solving many training problems, which can be time-consuming and demanding on computational resources. This can limit the applicability of DNNs to problems with non-standard, complex, and scarce datasets, e.g., those arising in many scientific applications. To remedy the challenges of DNN training, we propose slimTrain, a stochastic optimization method for training DNNs with reduced sensitivity to the choice of hyperparameters and fast initial convergence. The central idea of slimTrain is to exploit the separability inherent in many DNN architectures; that is, we separate the DNN into a nonlinear feature extractor followed by a linear model. This separability allows us to leverage recent advances made for solving large-scale, linear, ill-posed inverse problems. Crucially, for the linear weights, slimTrain does not require a learning rate and automatically adapts the regularization parameter. Since our method operates on mini-batches, its computational overhead per iteration is modest. In our numerical experiments, slimTrain outperforms existing DNN training methods with the recommended hyperparameter settings and reduces the sensitivity of DNN training to the remaining hyperparameters.
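A hedged sketch of the separability idea described above, not the slimTrain implementation: the network is split into a nonlinear feature extractor and a final linear layer, and the linear weights are obtained on each mini-batch by a regularized least-squares solve rather than a learning-rate-based update. The feature extractor here is a fixed random nonlinear map standing in for the trainable part, and the value lam is a placeholder for the parameter slimTrain adapts automatically.

```python
# Variable-projection-style update for the final linear layer on one mini-batch.
import numpy as np

rng = np.random.default_rng(1)
d_in, d_feat, d_out, batch = 20, 64, 3, 32

W_hidden = rng.standard_normal((d_in, d_feat)) / np.sqrt(d_in)

def features(X):
    # Nonlinear feature extractor (stand-in); its weights would be updated by SGD.
    return np.tanh(X @ W_hidden)

def solve_linear_weights(Z, Y, lam):
    """Ridge solve for the last layer: argmin_W ||Z W - Y||^2 + lam * ||W||^2."""
    d = Z.shape[1]
    A = np.vstack([Z, np.sqrt(lam) * np.eye(d)])
    B = np.vstack([Y, np.zeros((d, Y.shape[1]))])
    W, *_ = np.linalg.lstsq(A, B, rcond=None)
    return W

X = rng.standard_normal((batch, d_in))
Y = rng.standard_normal((batch, d_out))
Z = features(X)
W_out = solve_linear_weights(Z, Y, lam=1e-2)   # lam is adapted automatically in slimTrain
print("batch residual:", np.linalg.norm(Z @ W_out - Y))
```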
  3. A parameter identification inverse problem in the form of nonlinear least squares is considered. In the absence of stability, the frozen iteratively regularized Gauss–Newton (FIRGN) algorithm is proposed and its convergence is justified under what we call a generalized normal solvability condition. The penalty term is constructed based on a semi-norm generated by a linear operator, yielding greater flexibility in the use of qualitative and quantitative a priori information available for each particular model. Unlike previously known theoretical results on the FIRGN method, our convergence analysis does not rely on any nonlinearity conditions and is applicable to a large class of nonlinear operators. In our study, we leverage the nature of ill-posedness in order to establish convergence in the noise-free case. For noise-contaminated data, we show that, at least theoretically, the process does not require a stopping rule and is no longer semi-convergent. Numerical simulations for a parameter estimation problem in epidemiology illustrate the efficiency of the algorithm.
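The following is a schematic reading of one frozen iteratively regularized Gauss–Newton iteration, not the authors' algorithm or code: the Jacobian is evaluated once at the initial guess ("frozen"), the penalty uses a semi-norm ||B(x − x0)|| generated by a linear operator B, and the regularization parameter decreases geometrically. The forward operator F, the operator B, and the schedule are toy choices for illustration.

```python
# Frozen, iteratively regularized Gauss-Newton iteration on a tiny toy problem.
import numpy as np

def F(x):
    """Toy nonlinear forward operator."""
    return np.array([x[0] ** 2 + x[1], np.sin(x[0]) + x[1] ** 2])

def jacobian(x):
    return np.array([[2.0 * x[0], 1.0],
                     [np.cos(x[0]), 2.0 * x[1]]])

x_true = np.array([0.7, -0.3])
y = F(x_true)                          # synthetic noise-free data
x0 = np.array([0.6, -0.2])             # initial guess / prior
B = np.eye(2)                          # linear operator generating the semi-norm

J0 = jacobian(x0)                      # Jacobian frozen at the initial guess
x, alpha = x0.copy(), 1.0
for k in range(40):
    lhs = J0.T @ J0 + alpha * B.T @ B
    rhs = J0.T @ (F(x) - y) + alpha * B.T @ (B @ (x - x0))
    x = x - np.linalg.solve(lhs, rhs)
    alpha *= 0.7                       # geometrically decreasing regularization
print("estimate:", x, "truth:", x_true)
```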
  4. In practical instances of nonconvex matrix factorization, the rank of the true solution r^{\star} is often unknown, so the rank r of the model can be over-specified as r > r^{\star}. This over-parameterized regime of matrix factorization significantly slows down the convergence of local search algorithms, from a linear rate with r = r^{\star} to a sublinear rate when r > r^{\star}. We propose an inexpensive preconditioner for the matrix sensing variant of nonconvex matrix factorization that restores the convergence rate of gradient descent back to linear, even in the over-parameterized case, while also making it agnostic to possible ill-conditioning in the ground truth. Classical gradient descent in a neighborhood of the solution slows down due to the need for the model matrix factor to become singular. Our key result is that this singularity can be corrected by \ell_{2} regularization with a specific range of values for the damping parameter. In fact, a good damping parameter can be inexpensively estimated from the current iterate. The resulting algorithm, which we call preconditioned gradient descent or PrecGD, is stable under noise and converges linearly to an information-theoretically optimal error bound. Our numerical experiments find that PrecGD works equally well in restoring the linear convergence of other variants of nonconvex matrix factorization in the over-parameterized regime.
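A minimal sketch of a preconditioned gradient step of the kind described above, applied to a toy over-parameterized PSD factorization (a special case of matrix sensing with full observations). The damping choice lam = ||XX^T − M||_F is an illustrative stand-in for the estimate from the current iterate, not necessarily the rule used in the paper, and eta and the iteration count are arbitrary.

```python
# Preconditioned gradient descent for over-parameterized low-rank factorization.
import numpy as np

rng = np.random.default_rng(2)
d, r_true, r = 10, 2, 5                       # over-parameterized: r > r_true
U = rng.standard_normal((d, r_true))
M = U @ U.T                                   # low-rank PSD ground truth

X = 0.01 * rng.standard_normal((d, r))        # small random initialization
eta = 0.2
for t in range(3000):
    R = X @ X.T - M                           # residual of the factorization
    lam = max(np.linalg.norm(R), 1e-12)       # damping estimated from the current iterate
    grad = R @ X                              # gradient of 0.25 * ||X X^T - M||_F^2
    S = X.T @ X + lam * np.eye(r)             # damped Gram matrix (the preconditioner)
    X = X - eta * np.linalg.solve(S, grad.T).T
print("final error ||X X^T - M||_F =", np.linalg.norm(X @ X.T - M))
```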
  5. Physics-based simulations provide a path to overcome the lack of observational data hampering a holistic understanding of earthquake faulting and crustal deformation across the vastly varying space–time scales governing the seismic cycle. However, simulations of sequences of earthquakes and aseismic slip (SEAS) including the complex geometries and heterogeneities of the subsurface are challenging. We present a symmetric interior penalty discontinuous Galerkin (SIPG) method to perform SEAS simulations accounting for the aforementioned challenges. Due to the discontinuous nature of the approximation, the spatial discretization natively provides a means to impose boundary and interface conditions. The method accommodates 2-D and 3-D domains, is of arbitrary order, handles subelement variations in material properties and supports isoparametric elements, that is, high-order representations of the exterior boundaries, interior material interfaces and embedded faults. We provide an open-source reference implementation, Tandem, that utilizes highly efficient kernels for evaluating the SIPG linear and bilinear forms, is inherently parallel and well suited to perform high-resolution simulations on large-scale distributed memory architectures. Additional flexibility and efficiency are provided by optionally defining the displacement evaluation via a discrete Green's function approach, exploiting advantages of both the boundary integral and volumetric methods. The optional discrete Green's functions are evaluated once in a pre-computation stage using algorithmically optimal and scalable sparse parallel solvers and pre-conditioners. We illustrate the characteristics of the SIPG formulation via an extensive suite of verification problems (analytic, manufactured and code comparison) for elastostatic and quasi-dynamic problems. Our verification suite demonstrates that high-order convergence of the discrete solution can be achieved in space and time and highlights the benefits of using a high-order representation of the displacement, material properties and geometries. We apply Tandem to realistic demonstration models consisting of a 2-D SEAS multifault scenario on a shallowly dipping normal fault with four curved splay faults, and a 3-D intersecting multifault scenario of elastostatic instantaneous displacement of the 2019 Ridgecrest, CA, earthquake sequence. We exploit the curvilinear geometry representation in both application examples and elucidate the importance of accurate stress (or displacement gradient) representation on-fault. This study entails several methodological novelties. We derive a sharp bound on the smallest value of the SIPG penalty ensuring stability for isotropic, elastic materials; define a new flux to incorporate embedded faults in a standard SIPG scheme; employ a hybrid multilevel pre-conditioner for the discrete elasticity problem; and demonstrate that curvilinear elements are specifically beneficial for volumetric SEAS simulations. We show that our method can be applied for solving interesting geophysical problems using massively parallel computing. Finally, this is the first time a discontinuous Galerkin method is published for the numerical simulations of SEAS, opening new avenues to pursue extreme scale 3-D SEAS simulations in the future.
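A hedged sketch (not Tandem's implementation) of the discrete Green's function idea mentioned above: the static system is solved once per fault degree of freedom in a pre-computation stage, after which displacement evaluation during time stepping reduces to a matrix-vector product. The operator A, the coupling B, and all sizes are stand-ins for the assembled SIPG matrices.

```python
# Discrete Green's function pre-computation for cheap displacement evaluation.
import numpy as np

rng = np.random.default_rng(3)
n_volume, n_fault = 200, 15

# Stand-ins for the assembled (SPD) elasticity operator and the fault-to-volume coupling.
Q, _ = np.linalg.qr(rng.standard_normal((n_volume, n_volume)))
A = Q @ np.diag(np.linspace(1.0, 10.0, n_volume)) @ Q.T
B = rng.standard_normal((n_volume, n_fault))

# Pre-computation stage: one volumetric solve per fault degree of freedom.
G = np.linalg.solve(A, B)              # columns are the discrete Green's functions

# During time stepping, displacement evaluation is just a matrix-vector product.
slip = rng.standard_normal(n_fault)    # current fault slip (toy values)
u = G @ slip
print("displacement norm:", np.linalg.norm(u))
```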