Abstract: A parameter identification inverse problem in the form of nonlinear least squares is considered. In the absence of stability, the frozen iteratively regularized Gauss–Newton (FIRGN) algorithm is proposed and its convergence is justified under what we call a generalized normal solvability condition. The penalty term is constructed from a semi-norm generated by a linear operator, yielding greater flexibility in the use of qualitative and quantitative a priori information available for each particular model. Unlike previously known theoretical results on the FIRGN method, our convergence analysis does not rely on any nonlinearity conditions and is applicable to a large class of nonlinear operators. In our study, we leverage the nature of the ill-posedness in order to establish convergence in the noise-free case. For noise-contaminated data, we show that, at least theoretically, the process does not require a stopping rule and is no longer semi-convergent. Numerical simulations for a parameter estimation problem in epidemiology illustrate the efficiency of the algorithm.
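As a rough sketch of the iteration behind the method described above, the following implements one common form of a frozen IRGN update with penalty semi-norm ||L(x − x0)||: the Jacobian is evaluated once at the initial guess and reused in every step. The forward map F, frozen Jacobian J0, penalty operator L, and the regularization schedule are placeholders and are not taken from the paper.

```python
import numpy as np

def firgn(F, J0, L, y_delta, x0, alphas):
    """Illustrative frozen IRGN iteration (not the paper's exact scheme).

    F       : callable, forward operator F(x) -> data vector
    J0      : Jacobian of F evaluated (and frozen) at the initial guess x0
    L       : linear operator defining the penalty semi-norm ||L(x - x0)||
    y_delta : (possibly noisy) data
    x0      : initial guess / a priori estimate
    alphas  : decreasing sequence of regularization parameters
    """
    x = x0.copy()
    for alpha in alphas:
        # Linearize around x with the *frozen* Jacobian J0 and solve the
        # regularized normal equations for the Gauss-Newton step.
        A = J0.T @ J0 + alpha * (L.T @ L)
        b = J0.T @ (F(x) - y_delta) + alpha * (L.T @ L) @ (x - x0)
        x = x - np.linalg.solve(A, b)
    return x
```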
On the Convergence of Continuous Constrained Optimization for Structure Learning
Recently, structure learning of directed acyclic graphs (DAGs) has been formulated as a continuous optimization problem by leveraging an algebraic characterization of acyclicity. The constrained problem is solved using the augmented Lagrangian method (ALM), which is often preferred to the quadratic penalty method (QPM) by virtue of its standard convergence result that does not require the penalty coefficient to go to infinity, hence avoiding ill-conditioning. However, the convergence properties of these methods for structure learning, including whether they are guaranteed to return a DAG solution, remain unclear, which might limit their practical applications. In this work, we examine the convergence of ALM and QPM for structure learning in the linear, nonlinear, and confounded cases. We show that the standard convergence result of ALM does not hold in these settings, and demonstrate empirically that its behavior is akin to that of the QPM, which is prone to ill-conditioning. We further establish the convergence guarantee of QPM to a DAG solution under mild conditions. Lastly, we connect our theoretical results with existing approaches to help resolve the convergence issue, and verify our findings through an empirical comparison of these approaches.
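For concreteness, the sketch below shows one way the ALM/QPM loop can look in the linear case when acyclicity is enforced through the trace-of-matrix-exponential function h(W) = tr(e^{W∘W}) − d. The loss, update schedule, and all constants are illustrative assumptions, not the exact setup analyzed in the paper; dropping the multiplier update gives the pure quadratic penalty method.

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize

def structure_learning_alm(X, rho0=1.0, rho_max=1e16, h_tol=1e-8, n_outer=100):
    """Toy ALM/QPM loop for linear structure learning (illustrative only)."""
    n, d = X.shape
    alpha, rho = 0.0, rho0          # Lagrange multiplier and penalty coefficient
    w = np.zeros(d * d)

    def h_and_grad(W):
        E = expm(W * W)             # elementwise square inside the matrix exponential
        return np.trace(E) - d, 2.0 * E.T * W

    def objective(w_flat):
        W = w_flat.reshape(d, d)
        R = X - X @ W
        loss = 0.5 / n * (R ** 2).sum()
        g_loss = -X.T @ R / n
        h, g_h = h_and_grad(W)
        # Augmented Lagrangian: loss + alpha*h + (rho/2)*h^2  (QPM: keep alpha = 0)
        obj = loss + alpha * h + 0.5 * rho * h ** 2
        grad = g_loss + (alpha + rho * h) * g_h
        return obj, grad.ravel()

    for _ in range(n_outer):
        w = minimize(objective, w, jac=True, method="L-BFGS-B").x
        h_val, _ = h_and_grad(w.reshape(d, d))
        if h_val <= h_tol or rho >= rho_max:
            break
        alpha += rho * h_val        # dual update (ALM); omit for pure QPM
        rho *= 10.0                 # increase penalty coefficient
    return w.reshape(d, d)
```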
- Award ID(s):
- 2134901
- PAR ID:
- 10380982
- Editor(s):
- Camps-Valls, Gustau; Ruiz, Francisco J.; Valera, Isabel
- Date Published:
- Journal Name:
- Proceedings of Machine Learning Research
- Volume:
- 2022
- ISSN:
- 2640-3498
- Page Range / eLocation ID:
- 8176 - 8198
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
The CP tensor decomposition is used in applications such as machine learning and signal processing to discover latent low-rank structure in multidimensional data. Computing a CP decomposition via an alternating least squares (ALS) method reduces the problem to several linear least squares problems. The standard way to solve these linear least squares subproblems is to use the normal equations, which inherit special tensor structure that can be exploited for computational efficiency. However, the normal equations are sensitive to numerical ill-conditioning, which can compromise the results of the decomposition. In this paper, we develop versions of the CP-ALS algorithm using the QR decomposition and the singular value decomposition, which are more numerically stable than the normal equations, to solve the linear least squares problems. Our algorithms utilize the tensor structure of the CP-ALS subproblems efficiently, have the same complexity as the standard CP-ALS algorithm when the input is dense and the rank is small, and are shown via examples to produce more stable results when ill-conditioning is present. Our MATLAB implementation achieves the same running time as the standard algorithm for small ranks, and we show that the new methods can obtain lower approximation error.
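As a minimal illustration of the two solution strategies, the sketch below solves the mode-1 ALS subproblem min_A ||X1 − A Z^T||_F, with Z the column-wise Khatri–Rao product of the other two factors, once via the normal equations and once via a thin QR factorization. It does not reproduce the structured QR/SVD algorithms developed in the paper, and all names are placeholders.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Khatri-Rao product of B (J x R) and C (K x R) -> (J*K x R)."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

# One ALS subproblem for the first factor: min_A || X1 - A Z^T ||_F,
# where X1 is the mode-1 unfolding of the tensor (I x J*K) and Z = khatri_rao(B, C).

def solve_normal_equations(X1, B, C):
    # Exploits the structure Z^T Z = (C^T C) * (B^T B) (Hadamard product),
    # but squares the condition number of the subproblem.
    G = (C.T @ C) * (B.T @ B)
    return np.linalg.solve(G.T, (X1 @ khatri_rao(B, C)).T).T

def solve_qr(X1, B, C):
    # Naive QR-based solve: forms Z explicitly (unlike the structured algorithms
    # in the paper) but avoids the squared conditioning of the normal equations.
    Z = khatri_rao(B, C)                 # (J*K) x R, assumed tall
    Q, Rfac = np.linalg.qr(Z)            # thin QR factorization
    return np.linalg.solve(Rfac, Q.T @ X1.T).T
```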
-
This paper introduces LSEMINK, an effective modified Newton–Krylov algorithm geared toward minimizing the log-sum-exp function for a linear model. Problems of this kind arise commonly, for example, in geometric programming and multinomial logistic regression. Although the log-sum-exp function is smooth and convex, standard line-search Newton-type methods can become inefficient because the quadratic approximation of the objective function can be unbounded from below. To circumvent this, LSEMINK modifies the Hessian by adding a shift in the row space of the linear model. We show that the shift renders the quadratic approximation to be bounded from below and that the overall scheme converges to a global minimizer under mild assumptions. Our convergence proof also shows that all iterates are in the row space of the linear model, which can be attractive when the model parameters do not have an intuitive meaning, as is common in machine learning. Since LSEMINK uses a Krylov subspace method to compute the search direction, it only requires matrix-vector products with the linear model, which is critical for large-scale problems. Our numerical experiments on image classification and geometric programming illustrate that LSEMINK considerably reduces the time-to-solution and increases the scalability compared to geometric programming and natural gradient descent approaches. It has significantly faster initial convergence than standard Newton–Krylov methods, which is particularly attractive in applications like machine learning. In addition, LSEMINK is more robust to ill-conditioning arising from the nonsmoothness of the problem. We share our MATLAB implementation at a GitHub repository (https://github.com/KelvinKan/LSEMINK).
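A minimal sketch of the shifted Newton–Krylov idea, assuming the generic objective f(x) = logsumexp(A @ x): the Newton system is shifted by mu·AᵀA (a shift in the row space of A) and solved inexactly with CG using only matrix–vector products. The shift value, CG budget, and the absence of a line search are simplifications; this is not the actual LSEMINK algorithm.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def logsumexp_newton_krylov(A, x0, mu=1e-2, max_iter=50, tol=1e-8):
    """Shifted, inexact Newton iteration for f(x) = logsumexp(A @ x),
    assuming a minimizer exists (illustrative only)."""
    x = x0.copy()
    for _ in range(max_iter):
        z = A @ x
        z -= z.max()                        # stabilize the exponentials
        p = np.exp(z); p /= p.sum()         # softmax probabilities
        grad = A.T @ p
        if np.linalg.norm(grad) < tol:
            break

        # Matvec with the *shifted* Hessian model
        # (A^T (diag(p) - p p^T) A + mu A^T A) v -- only products with A needed.
        def hessvec(v):
            Av = A @ v
            return A.T @ (p * Av - p * (p @ Av)) + mu * (A.T @ Av)

        H = LinearOperator((x.size, x.size), matvec=hessvec)
        step, _ = cg(H, -grad, maxiter=20)  # inexact Newton direction via CG
        x = x + step                         # no line search in this sketch
    return x
```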
-
Bayesian Networks (BNs) represent conditional probability relations among a set of random variables (nodes) in the form of a directed acyclic graph (DAG), and have found diverse applications in knowledge discovery. We study the problem of learning the sparse DAG structure of a BN from continuous observational data. The central problem can be modeled as a mixed-integer program with an objective function composed of a convex quadratic loss function and a regularization penalty subject to linear constraints. The optimal solution to this mathematical program is known to have desirable statistical properties under certain conditions. However, the state-of-the-art optimization solvers are not able to obtain provably optimal solutions to the existing mathematical formulations for medium-size problems within reasonable computational times. To address this difficulty, we tackle the problem from both computational and statistical perspectives. On the one hand, we propose a concrete early stopping criterion to terminate the branch-and-bound process in order to obtain a near-optimal solution to the mixed-integer program, and establish the consistency of this approximate solution. On the other hand, we improve the existing formulations by replacing the linear “big-M” constraints that represent the relationship between the continuous and binary indicator variables with second-order conic constraints. Our numerical results demonstrate the effectiveness of the proposed approaches.
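To make the coupling constraints concrete, the toy cvxpy model below encodes the relationship between continuous edge coefficients and binary arc indicators with big-M constraints. The acyclicity constraints and the exact penalty from the paper are omitted, all constants are invented for illustration, and the conic strengthening is only indicated in a comment.

```python
import numpy as np
import cvxpy as cp

# Toy mixed-integer formulation with big-M coupling (illustrative only).
rng = np.random.default_rng(0)
n, d, M, lam = 50, 4, 10.0, 0.5
X = rng.standard_normal((n, d))

W = cp.Variable((d, d))                  # edge coefficients
g = cp.Variable((d, d), boolean=True)    # arc indicators (g[j, k] = 1 if edge j -> k)

loss = cp.sum_squares(X - X @ W) / n
constraints = [
    W <= M * g,        # big-M: W[j, k] can be nonzero only when g[j, k] = 1
    W >= -M * g,
    cp.diag(W) == 0,
    cp.diag(g) == 0,
]
# The strengthening discussed in the abstract replaces these linear big-M
# constraints with (rotated) second-order cone constraints of a perspective
# type, which typically give tighter continuous relaxations in branch-and-bound.

prob = cp.Problem(cp.Minimize(loss + lam * cp.sum(g)), constraints)
# prob.solve()  # requires a mixed-integer-capable solver to be installed
```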
-
Abstract In this article, we exploit the similarities between Tikhonov regularization and Bayesian hierarchical models to propose a regularization scheme that acts like a distributed Tikhonov regularization where the amount of regularization varies from component to component, and a computationally efficient numerical scheme that is suitable for large-scale problems. In the standard formulation, Tikhonov regularization compensates for the inherent ill-conditioning of linear inverse problems by augmenting the data fidelity term measuring the mismatch between the data and the model output with a scaled penalty functional. The selection of the scaling of the penalty functional is the core problem in Tikhonov regularization. If an estimate of the amount of noise in the data is available, a popular way is to use the Morozov discrepancy principle, stating that the scaling parameter should be chosen so as to guarantee that the norm of the data fitting error is approximately equal to the norm of the noise in the data. A too small value of the regularization parameter would yield a solution that fits to the noise (too weak regularization) while a too large value would lead to an excessive penalization of the solution (too strong regularization). In many applications, it would be preferable to apply distributed regularization, replacing the regularization scalar by a vector valued parameter, so as to allow different regularization for different components of the unknown, or for groups of them. Distributed Tikhonov-inspired regularization is particularly well suited when the data have significantly different sensitivity to different components, or to promote sparsity of the solution. The numerical scheme that we propose, while exploiting the Bayesian interpretation of the inverse problem and identifying the Tikhonov regularization with the maximum a posteriori estimation, requires no statistical tools. A clever combination of numerical linear algebra and numerical optimization tools makes the scheme computationally efficient and suitable for problems where the matrix is not explicitly available. Moreover, in the case of underdetermined problems, passing through the adjoint formulation in data space may lead to substantial reduction in computational complexity.
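The contrast between scalar and distributed (component-wise) regularization can be summarized in a few lines. The sketch below solves both small problems directly via the normal equations with an invented weighting vector theta; it does not reflect the Bayesian hierarchical machinery or the matrix-free solver proposed in the article.

```python
import numpy as np

def tikhonov(A, b, lam):
    """Standard Tikhonov: argmin ||A x - b||^2 + lam * ||x||^2."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

def distributed_tikhonov(A, b, theta):
    """Component-wise regularization: argmin ||A x - b||^2 + sum_j theta_j * x_j^2,
    with theta_j > 0 chosen per component (or per group of components)."""
    return np.linalg.solve(A.T @ A + np.diag(theta), A.T @ b)

# Example: penalize the second block of unknowns more heavily than the first,
# e.g. when the data are much less sensitive to those components.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
x_true = np.concatenate([np.ones(5), 0.1 * np.ones(5)])
b = A @ x_true + 0.01 * rng.standard_normal(40)
theta = np.concatenate([1e-3 * np.ones(5), 1e-1 * np.ones(5)])

x_scalar = tikhonov(A, b, 1e-2)
x_dist = distributed_tikhonov(A, b, theta)
```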