Title: On Convergence and Generalization of Dropout Training
We study dropout in two-layer neural networks with rectified linear unit (ReLU) activations. Under mild overparametrization and assuming that the limiting kernel can separate the data distribution with a positive margin, we show that dropout training with logistic loss achieves $$\epsilon$$-suboptimality in test error in $$O(1/\epsilon)$$ iterations.
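To make the object of study concrete, the following is a minimal sketch of dropout training for a two-layer ReLU network with logistic loss: at each SGD step a Bernoulli mask is applied to the hidden layer (inverted dropout). This is not the authors' code; the data, hidden width, dropout rate, learning rate, and the choice to keep the second layer fixed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: samples, input dimension, hidden width (overparametrized).
n, d, m = 200, 10, 512
p_drop, lr, steps = 0.5, 0.1, 2000

X = rng.standard_normal((n, d))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])               # labels in {-1, +1} with a margin direction

W = rng.standard_normal((m, d)) / np.sqrt(d)        # trained first layer
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)    # fixed second layer (a common simplification)

def logistic_loss(margin):
    # numerically stable log(1 + exp(-margin))
    return np.logaddexp(0.0, -margin)

for t in range(steps):
    i = rng.integers(n)
    x, yi = X[i], y[i]
    mask = rng.binomial(1, 1.0 - p_drop, size=m) / (1.0 - p_drop)  # inverted dropout mask
    z = W @ x                                       # pre-activations
    h = np.maximum(z, 0.0) * mask                   # ReLU features with dropout
    out = a @ h
    # gradient of the logistic loss with respect to W (second layer held fixed)
    g_out = -yi / (1.0 + np.exp(yi * out))
    g_z = g_out * a * mask * (z > 0)
    W -= lr * np.outer(g_z, x)
    if (t + 1) % 500 == 0:
        margins = y * (np.maximum(X @ W.T, 0.0) @ a)
        print(f"step {t+1}: avg logistic loss {logistic_loss(margins).mean():.3f}")

# evaluation with dropout turned off
pred = np.sign(np.maximum(X @ W.T, 0.0) @ a)
print("classification error:", np.mean(pred != y))
```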
Award ID(s): 1943251
PAR ID: 10213691
Author(s) / Creator(s):
Date Published:
Journal Name: Advances in Neural Information Processing Systems
ISSN: 1049-5258
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. We investigate the capacity control provided by dropout in various machine learning problems. First, we study dropout for matrix completion, where it induces a distribution-dependent regularizer that equals the weighted trace-norm of the product of the factors. In deep learning, we show that the distribution-dependent regularizer due to dropout directly controls the Rademacher complexity of the underlying class of deep neural networks. These developments enable us to give concrete generalization error bounds for the dropout algorithm both in matrix completion and in training deep neural networks. (A toy sketch of dropout for matrix factorization appears after this list.)
  2. We consider the problem of estimating the spectral density of the normalized adjacency matrix of an $$n$$-node undirected graph. We provide a randomized algorithm that, with $$O(n\epsilon^{-2})$$ queries to a degree and neighbor oracle and in $$O(n\epsilon^{-3})$$ time, estimates the spectrum up to $$\epsilon$$ accuracy in the Wasserstein-1 metric. This improves on previous state-of-the-art methods, including an $$O(n\epsilon^{-7})$$-time algorithm from [Braverman et al., STOC 2022] and, for sufficiently small $$\epsilon$$, a $$2^{O(\epsilon^{-1})}$$-time method from [Cohen-Steiner et al., KDD 2018]. To achieve this result, we introduce a new notion of graph sparsification, which we call \emph{nuclear sparsification}. We provide an $$O(n\epsilon^{-2})$$-query and $$O(n\epsilon^{-2})$$-time algorithm for computing $$O(n\epsilon^{-2})$$-sparse nuclear sparsifiers. We show that this bound is optimal in both its sparsity and query complexity, and we separate our results from the related notion of additive spectral sparsification. Of independent interest, we show that our sparsification method also yields the first \emph{deterministic} algorithm for spectral density estimation that scales linearly with $$n$$ (sublinear in the representation size of the graph). (A small worked example of the quantity being estimated appears after this list.)
  3. We determine the exact minimax rate of a Gaussian sequence model under bounded convex constraints, purely in terms of the local geometry of the given constraint set $$K$$. Our main result shows that the minimax risk (up to constant factors) under the squared $$L_2$$ loss is given by $$\epsilon^{*2} \wedge \operatorname{diam}(K)^2$$ with \begin{align*} \epsilon^* = \sup \bigg\{\epsilon : \frac{\epsilon^2}{\sigma^2} \leq \log M^{\operatorname{loc}}(\epsilon)\bigg\}, \end{align*} where $$\log M^{\operatorname{loc}}(\epsilon)$$ denotes the local entropy of the set $$K$$, and $$\sigma^2$$ is the variance of the noise. We utilize our abstract result to re-derive known minimax rates for some special sets $$K$$ such as hyperrectangles, ellipses, and more generally quadratically convex orthosymmetric sets. Finally, we extend our results to the unbounded case with known $$\sigma^2$$ to show that the minimax rate in that case is $$\epsilon^{*2}$$. (A one-dimensional sanity check of this rate appears after this list.)
  4. Random dropout has become a standard regularization technique in artificial neural networks (ANNs), but it is currently unknown whether an analogous mechanism exists in biological neural networks (BioNNs). If it does, its structure is likely to be optimized by hundreds of millions of years of evolution, which may suggest novel dropout strategies in large-scale ANNs. We propose that the brain's serotonergic fibers (axons) meet some of the expected criteria because of their ubiquitous presence, stochastic structure, and ability to grow throughout the individual's lifespan. Since the trajectories of serotonergic fibers can be modeled as paths of anomalous diffusion processes, in this proof-of-concept study we investigated a dropout algorithm based on the superdiffusive fractional Brownian motion (FBM). The results demonstrate that serotonergic fibers can potentially implement a dropout-like mechanism in brain tissue, supporting neuroplasticity. They also suggest that mathematical theories of the structure and dynamics of serotonergic fibers can contribute to the design of dropout algorithms in ANNs. (A toy sketch of an FBM-driven dropout mask appears after this list.)
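For item 1, here is a minimal sketch, not the authors' algorithm, of dropout applied to a rank-$$r$$ matrix factorization for matrix completion: at each step the $$r$$ latent dimensions are dropped independently, which is the mechanism whose induced regularizer the paper relates to a weighted trace norm. The matrix sizes, rank, observation rate, dropout rate, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy matrix-completion setup; sizes, rank, and hyperparameters are illustrative.
n1, n2, r_true, r = 60, 50, 3, 10
M = rng.standard_normal((n1, r_true)) @ rng.standard_normal((r_true, n2))
observed = rng.random((n1, n2)) < 0.3            # Bernoulli observation mask

U = 0.1 * rng.standard_normal((n1, r))
V = 0.1 * rng.standard_normal((n2, r))
p_drop, lr, steps = 0.5, 0.02, 3000

rows, cols = np.nonzero(observed)
for t in range(steps):
    k = rng.integers(len(rows))
    i, j = rows[k], cols[k]
    b = rng.binomial(1, 1.0 - p_drop, size=r) / (1.0 - p_drop)  # drop latent dimensions
    err = (U[i] * b) @ V[j] - M[i, j]
    U[i] -= lr * err * b * V[j]                  # SGD on the squared loss of the masked prediction
    V[j] -= lr * err * b * U[i]

rmse = np.sqrt(np.mean((U @ V.T - M)[~observed] ** 2))
print("held-out RMSE:", rmse)
```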
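For item 2, the following small example computes the quantity being estimated exactly on a toy graph: the empirical distribution of the eigenvalues of the normalized adjacency matrix $$D^{-1/2} A D^{-1/2}$$. It only pins down the object of study; it does not implement the paper's sublinear-time estimator or nuclear sparsification, and the Erdos-Renyi graph is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy undirected graph: Erdos-Renyi adjacency matrix.
n, p_edge = 300, 0.05
A = np.triu(rng.random((n, n)) < p_edge, k=1).astype(float)
A = A + A.T

deg = A.sum(axis=1)
deg[deg == 0] = 1.0                               # guard against isolated vertices
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
A_norm = D_inv_sqrt @ A @ D_inv_sqrt              # normalized adjacency; eigenvalues lie in [-1, 1]

# Exact spectral density (eigenvalue histogram); the paper approximates this
# up to epsilon in Wasserstein-1 distance without full eigendecomposition.
eigvals = np.linalg.eigvalsh(A_norm)
hist, edges = np.histogram(eigvals, bins=20, range=(-1.0, 1.0), density=True)
for lo, hi, h in zip(edges[:-1], edges[1:], hist):
    print(f"[{lo:+.2f}, {hi:+.2f}): {h:.3f}")
```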
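For item 3, a quick one-dimensional sanity check may help to read the rate. This heuristic is added here and is not taken from the paper; it treats the constant in the local-entropy definition loosely and only checks that the formula recovers the classical bounded-normal-mean rate up to constants.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Heuristic check of $\epsilon^{*2} \wedge \operatorname{diam}(K)^2$ for
% $K = [-\tau,\tau] \subset \mathbb{R}$ (bounded normal mean).
Take $K = [-\tau, \tau]$, so $\operatorname{diam}(K) = 2\tau$. For $\epsilon \le \tau$,
the set $K \cap B(x, \epsilon)$ is an interval of length between $\epsilon$ and $2\epsilon$,
so the number of points in it separated by a constant fraction of $\epsilon$ is bounded
above and below by absolute constants; hence $\log M^{\operatorname{loc}}(\epsilon) \asymp 1$.
The condition $\epsilon^2/\sigma^2 \le \log M^{\operatorname{loc}}(\epsilon)$ therefore holds
up to $\epsilon$ of order $\sigma$ (or until $\epsilon$ exceeds the scale of $K$), so
\begin{align*}
\epsilon^{*2} \wedge \operatorname{diam}(K)^2 \;\asymp\; \sigma^2 \wedge \tau^2,
\end{align*}
which is the classical minimax rate for estimating a bounded normal mean.
\end{document}
```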
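For item 4, here is a toy sketch of a dropout-like mask driven by superdiffusive fractional Brownian motion; the construction is my own illustration, not the authors' model. A 2-D FBM path with Hurst exponent $$H > 1/2$$ is sampled exactly via the Cholesky factor of the FBM covariance, and hidden units whose grid cells the path visits are dropped for that step. The grid size, Hurst exponent, and path length are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def fbm_path(n_steps, hurst):
    """Exact 1-D fractional Brownian motion at t = 1..n_steps via Cholesky of its covariance."""
    t = np.arange(1, n_steps + 1, dtype=float)
    cov = 0.5 * (t[:, None] ** (2 * hurst) + t[None, :] ** (2 * hurst)
                 - np.abs(t[:, None] - t[None, :]) ** (2 * hurst))
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n_steps))
    return L @ rng.standard_normal(n_steps)

def to_grid(path, grid):
    """Map continuous path coordinates onto grid indices 0..grid-1."""
    lo, hi = path.min(), path.max()
    return np.clip(((path - lo) / (hi - lo + 1e-12) * (grid - 1)).round().astype(int),
                   0, grid - 1)

def fbm_dropout_mask(grid=24, hurst=0.8, n_steps=300):
    """Drop the units of a grid x grid layer whose cells a 2-D superdiffusive FBM path visits."""
    x = to_grid(fbm_path(n_steps, hurst), grid)   # independent FBM in each coordinate
    y = to_grid(fbm_path(n_steps, hurst), grid)
    mask = np.ones((grid, grid))
    mask[x, y] = 0.0                              # visited cells are dropped
    return mask

mask = fbm_dropout_mask()
print("fraction of units dropped:", 1.0 - mask.mean())
```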