-
We study the training of one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime, where the network's biases are initialized to a constant rather than zero. We prove that under such initialization, the network has sparse activation throughout the entire training process, which enables fast training via sparsity-exploiting computational methods. We show that with this initialization the network possesses a different limiting kernel, which we call the bias-generalized NTK, and we study several properties of networks with this new kernel. First, we characterize the gradient descent dynamics: the sparse network converges as fast as the dense one, in contrast to prior work suggesting that sparse networks converge more slowly, and our result improves the width previously required to ensure convergence. Second, we study generalization: we establish a width-sparsity dependence that yields a sparsity-dependent Rademacher complexity and generalization bound; to our knowledge, this is the first sparsity-dependent generalization result obtained via Rademacher complexity. Finally, we study the smallest eigenvalue of this new kernel and identify a data-dependent region where we can derive a much sharper lower bound than the previously known worst-case bound, which in turn improves the generalization bound.
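A minimal sketch of why constant-bias initialization yields sparse activation at initialization (assumptions: standard Gaussian weights and unit-norm inputs; the constant B, width m, and dimensions are illustrative choices, not the paper's exact settings, and the paper's harder claim that sparsity persists throughout training is not shown here):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 32, 4096, 100      # input dim, width, sample count (illustrative)
B = 2.0                      # constant bias magnitude (illustrative)

W = rng.standard_normal((m, d))   # first-layer weights ~ N(0, 1)
b = -B * np.ones(m)               # biases initialized to the constant -B
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs

pre = X @ W.T + b            # pre-activations, shape (n, m)
active = pre > 0             # ReLU activation pattern

# For unit-norm x and w ~ N(0, I_d), w.x ~ N(0, 1), so each neuron fires
# with probability P(g > B) for g ~ N(0, 1): a Gaussian tail that shrinks
# rapidly as B grows, giving sparse activation.
tail = 0.5 * math.erfc(B / math.sqrt(2))
print(f"empirical fraction active: {active.mean():.4f}")
print(f"Gaussian tail P(g > B):    {tail:.4f}")
```

With B = 2 the tail is about 2.3%, so only a small fraction of the m neurons is active on any given input.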
-
We consider the problem of training a multi-layer over-parametrized neural network to minimize the empirical risk induced by a loss function. In the typical setting of over-parametrization, the network width m is much larger than the data dimension d and the number of training samples n (m = poly(n, d)), which induces a prohibitively large weight matrix W ∈ ℝ^{m×m} per layer. Naively, one has to pay O(m²) time to read the weight matrix and evaluate the neural network function in both the forward and backward computation. In this work, we show how to reduce the training cost per iteration. Specifically, we propose a framework that pays the m² cost only in the initialization phase and achieves a truly subquadratic cost in m per iteration, i.e., m^{2-Ω(1)}. Our result has implications beyond standard over-parametrization theory: it can be viewed as designing an efficient data structure on top of a pre-trained large model to further speed up fine-tuning, a core procedure for deploying large language models (LLMs).
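The high-level idea can be sketched for a single shifted-ReLU layer: only a small set S(x) of neurons fires on any input x, so once S(x) is found, a forward pass needs work proportional to |S(x)| rather than to the full width. A toy sketch follows (assumption: the paper obtains the subquadratic cost via data structures, e.g. half-space range search, that locate the fired neurons quickly; the brute-force threshold scan below merely stands in for that search, and the shift τ and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 8192, 64
tau = np.sqrt(0.4 * np.log(m))   # shift so only a small fraction fires

W = rng.standard_normal((m, d))         # hidden weights, rows ~ N(0, I_d)
a = rng.choice([-1.0, 1.0], size=m)     # output weights

def forward(x):
    # Stand-in for the range-search data structure: find fired neurons.
    pre = W @ x
    fired = np.flatnonzero(pre > tau)   # sparse set S(x) of active neurons
    # Once S(x) is known, only |S(x)| * d work is needed instead of m * d.
    return a[fired] @ (pre[fired] - tau), fired

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
y, fired = forward(x)
print(f"{fired.size} of {m} neurons fired ({fired.size / m:.2%})")
```

Only the fired neurons need their weights read or updated in a gradient step, which is where the per-iteration savings over the naive O(m²) cost come from.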
-
Breakdown voltage (BV) is arguably one of the most critical parameters for power devices. While avalanche breakdown prevails in silicon and silicon carbide devices, it is lacking in many wide bandgap (WBG) and ultra-wide bandgap (UWBG) devices, such as the gallium nitride high-electron-mobility transistor and existing UWBG devices, due to the deployment of junction-less device structures or the inherent material challenges of forming p-n junctions. This paper starts with a survey of avalanche and non-avalanche breakdown mechanisms in WBG and UWBG devices, followed by the distinction between static and dynamic BV. Various BV characterization methods, including static and pulsed I–V sweeps, unclamped and clamped inductive switching, and continuous overvoltage switching, are comparatively introduced. The device physics behind the time- and frequency-dependent BV, as well as the enabling device structures for avalanche breakdown, are also discussed. The paper concludes by identifying research gaps in understanding the breakdown of WBG and UWBG power devices.