-
We study inexact fixed-point proximity algorithms for solving a class of sparse regularization problems involving the ℓ0 norm. Specifically, the ℓ0 model has an objective function that is the sum of a convex fidelity term and a Moreau envelope of the ℓ0 norm regularization term. Such an ℓ0 model is non-convex. Existing exact algorithms for solving these problems require closed-form formulas for the proximity operators of the convex functions involved in the objective function. When such formulas are not available, numerical computation of the proximity operator becomes inevitable, which leads to inexact iteration algorithms. We investigate in this paper how the numerical error at every step of the iteration should be controlled to ensure global convergence of the inexact algorithms. We establish a theoretical result guaranteeing that the sequence generated by the proposed inexact algorithm converges to a local minimizer of the optimization problem. We implement the proposed algorithms for three applications of practical importance in machine learning and image science: regression, classification, and image deblurring. The numerical results demonstrate the convergence of the proposed algorithm and confirm that local minimizers of the ℓ0 models found by the proposed inexact algorithm outperform global minimizers of the corresponding ℓ1 models in terms of approximation accuracy and sparsity of the solutions.
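The following is a minimal sketch, not the authors' fixed-point proximity scheme: it alternates the proximity operator of the ℓ0 term (componentwise hard thresholding) with an approximate proximity step on a least-squares fidelity term, computed by a few inner gradient iterations whose count grows with the outer iteration so that the per-step error shrinks. The least-squares fidelity, the parameter values, and the update order are illustrative assumptions.

```python
# Sketch only: an inexact proximal iteration for a model of the form
#   min_x  F(x) + (l0-type regularization),
# with F(x) = 0.5*||Ax - b||^2 as a stand-in convex fidelity term.
import numpy as np

def hard_threshold(z, lam, beta):
    # proximity operator of (lam/beta)*||.||_0, applied componentwise:
    # keep entries with |z_i| > sqrt(2*lam/beta), zero out the rest
    out = z.copy()
    out[np.abs(z) < np.sqrt(2.0 * lam / beta)] = 0.0
    return out

def inexact_prox_fidelity(A, b, z, beta, n_inner):
    # approximate argmin_x 0.5*||Ax - b||^2 + (beta/2)*||x - z||^2
    # by n_inner gradient steps instead of a closed-form solve
    x = z.copy()
    L = np.linalg.norm(A, 2) ** 2 + beta   # Lipschitz constant of the gradient
    for _ in range(n_inner):
        grad = A.T @ (A @ x - b) + beta * (x - z)
        x -= grad / L
    return x

def inexact_l0_solver(A, b, lam=0.1, beta=1.0, n_outer=200):
    x = np.zeros(A.shape[1])
    for k in range(n_outer):
        y = inexact_prox_fidelity(A, b, x, beta, n_inner=k + 1)  # inexact step, tightened over time
        x = hard_threshold(y, lam, beta)                         # exact l0 proximity step
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((40, 100))
    x_true = np.zeros(100)
    x_true[:5] = 1.0
    b = A @ x_true
    x_hat = inexact_l0_solver(A, b)
    print("nonzeros in the computed solution:", np.count_nonzero(np.abs(x_hat) > 1e-6))
```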
-
Regularized learning problems in Banach spaces, which often minimize the sum of a data fidelity term in one Banach norm and a regularization term in another Banach norm, are challenging to solve. We construct a direct sum space based on the Banach spaces of the fidelity term and the regularization term, and recast the objective function as the norm of a quotient space of the direct sum space. We then express the original regularized problem as an optimization problem in the dual space of the direct sum space: finding the maximum of a linear function on a convex polytope, which may be solved by linear programming. A solution of the original problem is then obtained from a solution of the dual problem by using related extremal properties of norming functionals. Numerical experiments demonstrate that the proposed duality approach is effective for solving the regularized learning problems.
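As a hedged illustration of the linear-programming connection (not the paper's quotient-space duality construction), the sketch below treats the special case in which both norms are polyhedral: a sup-norm fidelity term plus an ℓ1 regularization term. The resulting problem is itself a linear program over a convex polytope; the matrix A, data b, and parameter lam are placeholders.

```python
# min_x ||Ax - b||_inf + lam * ||x||_1, rewritten as a linear program.
import numpy as np
from scipy.optimize import linprog

def solve_linf_l1(A, b, lam):
    m, n = A.shape
    # variables: x (free), s (bounds on |x|), t (bound on the sup-norm residual)
    c = np.concatenate([np.zeros(n), lam * np.ones(n), [1.0]])
    Z = np.zeros((m, n))
    I = np.eye(n)
    A_ub = np.block([
        [ A,  Z, -np.ones((m, 1))],     #  Ax - b <= t
        [-A,  Z, -np.ones((m, 1))],     # -(Ax - b) <= t
        [ I, -I,  np.zeros((n, 1))],    #  x <= s
        [-I, -I,  np.zeros((n, 1))],    # -x <= s
    ])
    b_ub = np.concatenate([b, -b, np.zeros(2 * n)])
    bounds = [(None, None)] * n + [(0, None)] * (n + 1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((30, 60))
    b = A @ np.concatenate([np.ones(3), np.zeros(57)])
    x = solve_linf_l1(A, b, lam=0.05)
    print("nonzero entries:", np.count_nonzero(np.abs(x) > 1e-6))
```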
-
We studied the use of deep neural networks (DNNs) in the numerical solution of the oscillatory Fredholm integral equation of the second kind. It is known that the solution of the equation exhibits certain oscillatory behavior due to the oscillation of the kernel. It was pointed out recently that standard DNNs favor low-frequency functions, and as a result they often produce poor approximations of functions containing high-frequency components. We addressed this issue in this study. We first developed a numerical method for solving the equation with a DNN as the approximate solution, designing a numerical quadrature tailored to computing oscillatory integrals that involve DNNs. We proved that the error of the DNN approximate solution of the equation is bounded by the training loss and the quadrature error. We then proposed a multigrade deep learning (MGDL) model to overcome the spectral bias issue of neural networks. Numerical experiments demonstrate that the MGDL model is effective in extracting multiscale information of the oscillatory solution and in overcoming the spectral bias issue from which a standard DNN model suffers.
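A small sketch of the quadrature ingredient, under assumptions not taken from the paper: the kernel is taken to be exp(iκ|x−t|) on [0,1], the candidate solution u is any callable (a trained DNN would slot in here; a plain cosine stands in), and the rule is generic panel-wise Gauss-Legendre whose resolution scales with κ, rather than the tailored quadrature constructed by the authors.

```python
# Applying the integral operator of a second-kind Fredholm equation
#   u(x) - (Ku)(x) = f(x),  (Ku)(x) = \int_0^1 exp(i*kappa*|x - t|) u(t) dt,
# to a candidate solution u with a kappa-aware composite quadrature.
import numpy as np

def oscillatory_quadrature(kappa, points_per_panel=10):
    # the number of panels grows with kappa so each wavelength is resolved
    n_panels = max(4, int(np.ceil(kappa / (2 * np.pi))) * 4)
    nodes, weights = np.polynomial.legendre.leggauss(points_per_panel)
    t, w = [], []
    edges = np.linspace(0.0, 1.0, n_panels + 1)
    for a, b in zip(edges[:-1], edges[1:]):
        t.append(0.5 * (b - a) * nodes + 0.5 * (a + b))   # map [-1,1] to [a,b]
        w.append(0.5 * (b - a) * weights)
    return np.concatenate(t), np.concatenate(w)

def apply_operator(u, x, kappa):
    # evaluate (Ku)(x) at the points x by quadrature
    t, w = oscillatory_quadrature(kappa)
    kernel = np.exp(1j * kappa * np.abs(x[:, None] - t[None, :]))
    return kernel @ (w * u(t))

if __name__ == "__main__":
    kappa = 200.0
    u = lambda t: np.cos(3 * np.pi * t)     # stand-in for a DNN approximate solution
    x = np.linspace(0.0, 1.0, 5)
    print(u(x) - apply_operator(u, x, kappa))   # values of u - Ku at the test points
```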
-
Sparsity of a learning solution is a desirable feature in machine learning. Certain reproducing kernel Banach spaces (RKBSs) are appropriate hypothesis spaces for sparse learning methods. The goal of this paper is to understand what kind of RKBSs can promote sparsity of learning solutions. We consider two typical learning models in an RKBS: the minimum norm interpolation (MNI) problem and the regularization problem. We first establish an explicit representer theorem for solutions of these problems, which represents the extreme points of the solution set as linear combinations of the extreme points of the subdifferential set of the norm function, which is data-dependent. We then propose sufficient conditions on the RKBS under which the explicit representation of the solutions can be transformed into a sparse kernel representation having fewer terms than the number of observed data. Under the proposed sufficient conditions, we investigate the role of the regularization parameter in the sparsity of the regularized solutions. We further show that two specific RKBSs, the sequence space ℓ1(N) and the measure space, admit sparse representer theorems for both the MNI and regularization models.
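To make the sparse-representation point concrete in the simplest setting, here is a hedged sketch of minimum norm interpolation over a finite kernel dictionary in the ℓ1 norm, recast as a linear program via the standard positive/negative split (not the paper's representer-theorem machinery); the Laplacian kernel, the 6 interpolation points, and the 40 candidate centers are made-up illustration data.

```python
# min ||c||_1  subject to  K c = y: a solution can be taken with no more
# nonzero coefficients than there are interpolation conditions.
import numpy as np
from scipy.optimize import linprog

def l1_minimum_norm_interpolation(K, y):
    m, n = K.shape
    # split c = p - q with p, q >= 0; minimize 1^T p + 1^T q subject to K(p - q) = y
    c_obj = np.ones(2 * n)
    A_eq = np.hstack([K, -K])
    res = linprog(c_obj, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
    pq = res.x
    return pq[:n] - pq[n:]

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    x_obs = np.linspace(0.0, 1.0, 6)        # 6 observed points
    centers = np.linspace(0.0, 1.0, 40)     # 40 candidate kernel centers
    K = np.exp(-np.abs(x_obs[:, None] - centers[None, :]) / 0.2)  # Laplacian kernel
    y = np.sin(2 * np.pi * x_obs)
    c = l1_minimum_norm_interpolation(K, y)
    print("nonzero coefficients:", np.count_nonzero(np.abs(c) > 1e-8), "out of", c.size)
```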
-
We explore the convergence of deep neural networks with the popular ReLU activation function as the depth of the networks tends to infinity. To this end, we introduce the notions of activation domains and activation matrices of a ReLU network. By replacing applications of the ReLU activation function with multiplications by activation matrices on activation domains, we obtain an explicit expression of the ReLU network. We then identify the convergence of the ReLU networks with the convergence of a class of infinite products of matrices, and study sufficient and necessary conditions for convergence of these infinite products. As a result, we establish necessary conditions for ReLU networks to converge: the sequence of weight matrices must converge to the identity matrix and the sequence of bias vectors must converge to zero as the depth of the ReLU networks increases to infinity. Moreover, we obtain sufficient conditions, in terms of the weight matrices and bias vectors at the hidden layers, for pointwise convergence of deep ReLU networks. These results provide mathematical insight into the convergence of deep neural networks. Experiments are conducted to verify the results and to illustrate their potential usefulness in the initialization of deep neural networks.
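The activation-matrix viewpoint admits a direct numerical check. The sketch below (random small widths, weights near the identity and small biases, chosen only to echo the necessary conditions stated above) verifies that replacing each ReLU application by multiplication with a 0/1 diagonal activation matrix reproduces the ordinary forward pass.

```python
# For a fixed input, each ReLU layer acts as multiplication by a diagonal
# activation matrix, so on that input's activation domain the network is an
# explicit composition of affine maps.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(weights, biases, x):
    for W, b in zip(weights, biases):
        x = relu(W @ x + b)
    return x

def forward_with_activation_matrices(weights, biases, x):
    out = x
    for W, b in zip(weights, biases):
        pre = W @ out + b
        D = np.diag((pre > 0).astype(float))   # activation matrix of this layer
        out = D @ pre                          # equals relu(pre) on this activation domain
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    depth, width = 6, 5
    # weights close to the identity and small biases, in the spirit of the
    # necessary conditions for convergence described above
    weights = [np.eye(width) + 0.1 * rng.standard_normal((width, width)) for _ in range(depth)]
    biases = [0.01 * rng.standard_normal(width) for _ in range(depth)]
    x = rng.standard_normal(width)
    print(np.allclose(forward(weights, biases, x),
                      forward_with_activation_matrices(weights, biases, x)))
```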
-
We consider a regularization problem whose objective function consists of a convex fidelity term and a regularization term determined by the ℓ1 norm composed with a linear transform. Empirical results show that regularization with the ℓ1 norm can promote sparsity of a regularized solution. The goal of this paper is to understand theoretically the effect of the regularization parameter on the sparsity of the regularized solutions. We establish a characterization of the sparsity, under the transform matrix, of the solution. When the objective function is block-separable, or an error bound of the regularized solution to a known function is available, the resulting characterization can be taken as a regularization parameter choice strategy with which the regularization problem has a solution with a sparsity of a certain level. When the objective function is not block-separable, we propose an iterative algorithm that simultaneously determines the regularization parameter and its corresponding solution with a prescribed sparsity level. Moreover, we study choices of the regularization parameter that allow the regularization term to alleviate ill-posedness and promote sparsity of the resulting regularized solution. Numerical experiments demonstrate that the proposed algorithm is effective and efficient, and that the choices of the regularization parameters can balance the sparsity of the regularized solution against its approximation of the minimizer of the fidelity function.
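The sketch below is not the paper's algorithm: for the special case where the linear transform is the identity, it pairs a generic ISTA solver with a bisection search over the regularization parameter until the regularized solution reaches a prescribed number of nonzeros, conveying only the general idea of tuning the parameter to a target sparsity level. A, b, and the target sparsity are placeholder data.

```python
# Adjust lam so that the solution of  min_x 0.5*||Ax - b||^2 + lam*||x||_1
# has (approximately) a prescribed number of nonzero entries.
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, b, lam, n_iter=500):
    # standard iterative soft-thresholding for the l1-regularized least squares problem
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x

def lambda_for_sparsity(A, b, target_nnz, lo=1e-6, hi=None, n_bisect=40):
    # for lam >= max|A^T b| the solution is zero, so that bounds the search interval
    hi = hi if hi is not None else np.max(np.abs(A.T @ b))
    for _ in range(n_bisect):
        lam = 0.5 * (lo + hi)
        x = ista(A, b, lam)
        nnz = np.count_nonzero(np.abs(x) > 1e-8)
        if nnz > target_nnz:
            lo = lam     # too many nonzeros: increase regularization
        else:
            hi = lam     # at or below the target: decrease regularization
    return lam, x

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    A = rng.standard_normal((50, 120))
    x_true = np.zeros(120)
    x_true[rng.choice(120, 6, replace=False)] = rng.standard_normal(6)
    b = A @ x_true + 0.01 * rng.standard_normal(50)
    lam, x = lambda_for_sparsity(A, b, target_nnz=6)
    print("lambda:", lam, "nonzeros:", np.count_nonzero(np.abs(x) > 1e-8))
```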