skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Deep Learning meets Nonparametric Regression: Are Weight-Decayed DNNs Locally Adaptive?
We study the theory of neural network (NN) from the lens of classical nonparametric regression problems with a focus on NN's ability to adaptively estimate functions with heterogeneous smoothness -- a property of functions in Besov or Bounded Variation (BV) classes. Existing work on this problem requires tuning the NN architecture based on the function spaces and sample sizes. We consider a "Parallel NN" variant of deep ReLU networks and show that the standard weight decay is equivalent to promoting the ℓp-sparsity (0<1) of the coefficient vector of an end-to-end learned function bases, i.e., a dictionary. Using this equivalence, we further establish that by tuning only the weight decay, such Parallel NN achieves an estimation error arbitrarily close to the minimax rates for both the Besov and BV classes. Notably, it gets exponentially closer to minimax optimal as the NN gets deeper. Our research sheds new lights on why depth matters and how NNs are more powerful than kernel methods.  more » « less
Award ID(s):
2134214
PAR ID:
10467172
Author(s) / Creator(s):
;
Publisher / Repository:
International Conference on Learning Representation (ICLR-23)
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Agrawal, Shipra; Roth, Aaron (Ed.)
    Tree-based methods are popular nonparametric tools for capturing spatial heterogeneity and making predictions in multivariate problems. In unsupervised learning, trees and their ensembles have also been applied to a wide range of statistical inference tasks, such as multi-resolution sketching of distributional variations, localization of high-density regions, and design of efficient data compression schemes. In this paper, we study the spatial adaptation property of Bayesian tree-based methods in the unsupervised setting, with a focus on the density estimation problem. We characterize spatial heterogeneity of the underlying density function by using anisotropic Besov spaces, region-wise anisotropic Besov spaces, and two novel function classes as their extensions. For two types of commonly used prior distributions on trees under the context of unsupervised learning—the optional P{ó}lya tree (Wong and Ma, 2010) and the Dirichlet prior (Lu et al., 2013)—we calculate posterior concentration rates when the density function exhibits different types of heterogeneity. In specific, we show that the posterior concentration rate for trees is near minimax over the anisotropic Besov space. The rate is adaptive in the sense that to achieve such a rate we do not need any prior knowledge of the parameters of the Besov space. 
    more » « less
  2. Agrawal, Shipra; Roth, Aaron (Ed.)
    Tree-based methods are popular nonparametric tools for capturing spatial heterogeneity and making predictions in multivariate problems. In unsupervised learning, trees and their ensembles have also been applied to a wide range of statistical inference tasks, such as multi-resolution sketching of distributional variations, localization of high-density regions, and design of efficient data compression schemes. In this paper, we study the spatial adaptation property of Bayesian tree-based methods in the unsupervised setting, with a focus on the density estimation problem. We characterize spatial heterogeneity of the underlying density function by using anisotropic Besov spaces, region-wise anisotropic Besov spaces, and two novel function classes as their extensions. For two types of commonly used prior distributions on trees under the context of unsupervised learning—the optional P{ó}lya tree (Wong and Ma, 2010) and the Dirichlet prior (Lu et al., 2013)—we calculate posterior concentration rates when the density function exhibits different types of heterogeneity. In specific, we show that the posterior concentration rate for trees is near minimax over the anisotropic Besov space. The rate is adaptive in the sense that to achieve such a rate we do not need any prior knowledge of the parameters of the Besov space. 
    more » « less
  3. We study the problem of estimating an unknown function from noisy data using shallow ReLU neural networks. The estimators we study minimize the sum of squared data-fitting errors plus a regularization term proportional to the squared Euclidean norm of the network weights. This minimization corresponds to the common approach of training a neural network with weight decay. We quantify the performance (mean-squared error) of these neural network estimators when the data-generating function belongs to the second-order Radon-domain bounded variation space. This space of functions was recently proposed as the natural function space associated with shallow ReLU neural networks. We derive a minimax lower bound for the estimation problem for this function space and show that the neural network estimators are minimax optimal up to logarithmic factors. This minimax rate is immune to the curse of dimensionality. We quantify an explicit gap between neural networks and linear methods (which include kernel methods) by deriving a linear minimax lower bound for the estimation problem, showing that linear methods necessarily suffer the curse of dimensionality in this function space. As a result, this paper sheds light on the phenomenon that neural networks seem to break the curse of dimensionality. 
    more » « less
  4. GaN-on-GaN vertical diode is a promising device for next-generation power electronics. Its breakdown voltage (BV) is limited by edge termination designs such as guard rings. The design space of guard rings is huge and it is difficult to optimize manually. In this paper, we propose an effective inverse design strategy to co-optimize BV and (V F Q) −1 , where BV, V F , and Q are the breakdown voltage, forward voltage, and reserve capacitive charge of the diode, respectively. Using rapid Technology Computer-Aided-Design (TCAD) simulations, neural network (NN), and Pareto front generation, a GaN-on-GaN diode is optimized within 24 hours. We can obtain structures with 200V higher BV at medium (V F Q) −1 or find a nearly ideal BV structure with 25% higher BV 2 /R on compared to the best randomly generated TCAD data. 
    more » « less
  5. We initiate the study of smoothed analysis for the sequential probability assignment problem with contexts. We study information-theoretically optimal minmax rates as well as a framework for algorithmic reduction involving the maximum likelihood estimator oracle. Our approach establishes a general-purpose reduction from minimax rates for sequential probability assignment for smoothed adversaries to minimax rates for transductive learning. This leads to optimal (logarithmic) fast rates for parametric classes and classes with finite VC dimension. On the algorithmic front, we develop an algorithm that efficiently taps into the MLE oracle, for general classes of functions. We show that under general conditions this algorithmic approach yields sublinear regret. 
    more » « less