 Award ID(s):
 2107189
 NSFPAR ID:
 10336943
 Date Published:
 Journal Name:
 Advances in neural information processing systems
 ISSN:
 10495258
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

It has become standard to solve NLP tasks by fine-tuning pretrained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of this empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK)—which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization—describes fine-tuning of pretrained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pretrained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.
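As a rough illustration of the kernel view described above (not the paper's Adam or prompting extensions), the empirical NTK of a toy two-layer ReLU network can be written in closed form from per-parameter gradients. A minimal numpy sketch; the function name and shapes are illustrative, not from the paper:

```python
import numpy as np

def entk(X, W, v):
    """Empirical NTK of f(x) = v^T relu(W x) / sqrt(m) at parameters (W, v)."""
    m = W.shape[0]
    pre = X @ W.T                       # (n, m) pre-activations
    act = np.maximum(pre, 0.0)          # ReLU features
    mask = (pre > 0).astype(float)      # ReLU derivative
    # gradient w.r.t. v contributes act(x) . act(x') / m
    K_v = act @ act.T / m
    # gradient w.r.t. W contributes (x . x') * sum_i v_i^2 mask_i(x) mask_i(x') / m
    K_W = (X @ X.T) * ((mask * v) @ (mask * v).T) / m
    return K_v + K_W

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))             # 6 toy inputs
W = rng.normal(size=(32, 3))            # hidden layer, width m = 32
v = rng.normal(size=32)                 # output layer
K = entk(X, W, v)                       # 6x6 symmetric PSD Gram matrix
```

If fine-tuning stays in the kernel regime, predictions evolve as kernel regression with `K`; the point of the paper is characterizing when that approximation holds for pretrained LMs.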

Motivated by both theory and practice, we study how random pruning of the weights affects a neural network's neural tangent kernel (NTK). In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version. The equivalence is established under two cases. The first main result studies the infinite-width asymptotic. It is shown that given a pruning probability, for fully-connected neural networks with the weights randomly pruned at the initialization, as the width of each layer grows to infinity sequentially, the NTK of the pruned neural network converges to the limiting NTK of the original network with some extra scaling. If the network weights are rescaled appropriately after pruning, this extra scaling can be removed. The second main result considers the finite-width case. It is shown that to ensure the NTK's closeness to the limit, the dependence of width on the sparsity parameter is asymptotically linear, as the NTK's gap to its limit goes down to zero. Moreover, if the pruning probability is set to zero (i.e., no pruning), the bound on the required width matches the bound for fully-connected neural networks in previous works up to logarithmic factors. The proof of this result requires developing a novel analysis of a network structure which we call mask-induced pseudo-networks. Experiments are provided to evaluate our results.
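The rescaling remark can be sketched numerically: dropping each weight independently with probability $p$ and multiplying survivors by $1/\sqrt{1-p}$ preserves the second moment of the pre-activations in expectation over the initialization, which is the scaling the abstract says removes the extra factor in the limiting NTK. A minimal numpy sketch; the helper name is ours, not the paper's:

```python
import numpy as np

def prune_rescale(W, p, rng):
    """Drop each weight independently with prob. p; rescale survivors by 1/sqrt(1-p)."""
    mask = rng.random(W.shape) >= p
    return W * mask / np.sqrt(1.0 - p)

rng = np.random.default_rng(1)
W = rng.normal(size=(50_000, 4))        # very wide layer, i.i.d. N(0, 1) init
x = np.array([1.0, -2.0, 0.5, 3.0])
s_full = np.mean((W @ x) ** 2)                            # ~ ||x||^2 = 14.25
s_pruned = np.mean((prune_rescale(W, 0.5, rng) @ x) ** 2)
# after rescaling, pruned pre-activation second moments match the dense ones
```

Without the `1/sqrt(1-p)` factor, `s_pruned` would concentrate around `(1-p) * s_full`, which is exactly the "extra scaling" the first main result describes.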

Abstract: We study infinite limits of neural network quantum states ($\infty$-NNQS), which exhibit representation power through ensemble statistics, as well as tractable gradient descent dynamics. Ensemble averages of entanglement entropies are expressed in terms of neural network correlators, and architectures that exhibit volume-law entanglement are presented. The analytic calculation of the entanglement entropy bound is tractable because the ensemble statistics simplify in the Gaussian process limit. A general framework is developed for studying the gradient descent dynamics of NNQS, using a quantum state neural tangent kernel (QS-NTK). For $\infty$-NNQS the training dynamics simplifies, since the QS-NTK becomes deterministic and constant. An analytic solution is derived for quantum state supervised learning, which allows an $\infty$-NNQS to recover any target wavefunction. Numerical experiments on finite and infinite NNQS in the transverse field Ising model and Fermi-Hubbard model demonstrate excellent agreement with theory. $\infty$-NNQS opens up new opportunities for studying entanglement and training dynamics in other physics applications, such as finding ground states.
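The payoff of a deterministic, constant kernel can be illustrated generically: when the kernel is frozen, gradient descent on a squared loss over the predictions has an exact closed form, which is the structure behind the analytic supervised-learning solution mentioned above. A minimal sketch with a synthetic PSD kernel standing in for the QS-NTK (not an actual NNQS):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
A = rng.normal(size=(n, n))
K = A @ A.T / n + 0.1 * np.eye(n)        # stand-in for a frozen, deterministic kernel
y = rng.normal(size=n)                   # supervised targets (e.g. wavefunction amplitudes)
eta = 0.5 / np.linalg.eigvalsh(K).max()  # stable step size

# kernel gradient descent on squared loss: f <- f + eta * K (y - f)
f = np.zeros(n)
for _ in range(200):
    f = f + eta * K @ (y - f)

# closed form: the residual contracts as (I - eta K)^t applied to (y - f_0)
f_closed = y - np.linalg.matrix_power(np.eye(n) - eta * K, 200) @ y
```

Because `K` is strictly positive definite here, every residual mode decays geometrically, so the iterates recover the target exactly in the limit, mirroring the claim that an $\infty$-NNQS can recover any target wavefunction.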
Meila, Marina; Zhang, Tong (Ed.) Federated Learning (FL) is an emerging learning scheme that allows different distributed clients to train deep neural networks together without data sharing. Neural networks have become popular due to their unprecedented success. To the best of our knowledge, the theoretical guarantees of FL concerning neural networks with explicit forms and multi-step updates are unexplored. Nevertheless, training analysis of neural networks in FL is non-trivial for two reasons: first, the objective loss function being optimized is non-smooth and non-convex, and second, the updates are not even taken in the gradient direction. Existing convergence results for gradient-descent-based methods rely heavily on the fact that the gradient direction is used for updating. This paper presents a new class of convergence analysis for FL, Federated Neural Tangent Kernel (FL-NTK), which corresponds to over-parameterized ReLU neural networks trained by gradient descent in FL and is inspired by the analysis of the Neural Tangent Kernel (NTK). Theoretically, FL-NTK converges to a global-optimal solution at a linear rate with properly tuned learning parameters. Furthermore, with proper distributional assumptions, FL-NTK can also achieve good generalization. The proposed theoretical analysis scheme can be generalized to more complex neural networks.
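A toy version of the FL setup helps fix ideas: with one full-batch local step per round, federated averaging reduces to gradient descent on the average objective, and on least-squares clients it converges linearly to the global optimum. This is a hedged sketch of the setting, not the FL-NTK analysis itself; all names and the noiseless data are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_clients, n_local = 5, 4, 30
w_true = rng.normal(size=d)
# each client holds its own noiseless least-squares data (A_c, A_c w_true)
data = [(A, A @ w_true) for A in (rng.normal(size=(n_local, d)) for _ in range(n_clients))]

# global optimum of the summed objective
A_all = np.vstack([A for A, _ in data])
b_all = np.concatenate([b for _, b in data])
w_star, *_ = np.linalg.lstsq(A_all, b_all, rcond=None)

# conservative step size from the worst local curvature
eta = 0.5 / max(np.linalg.eigvalsh(A.T @ A / n_local).max() for A, _ in data)

w = np.zeros(d)
for _ in range(300):
    # one local gradient step per client, then server-side averaging
    updates = [w - eta * A.T @ (A @ w - b) / n_local for A, b in data]
    w = np.mean(updates, axis=0)
```

With multiple local steps per round (the multi-step regime the abstract emphasizes), the averaged update is no longer a gradient of any single objective, which is exactly why the standard gradient-direction arguments break and a kernel-based analysis is needed.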
