Abstract: We analyze the dynamics of finite-width effects in wide but finite feature-learning neural networks. Starting from a dynamical mean field theory description of infinite-width deep neural network kernel and prediction dynamics, we provide a characterization of the fluctuations of the dynamical mean field theory order parameters over random initializations of the network weights. Our results are perturbative in width but, unlike prior analyses, non-perturbative in the strength of feature learning. We find that once the mean field/µP parameterization is adopted, the leading finite-size effect on the dynamics is to introduce initialization variance into the predictions and feature kernels of the networks. In the lazy limit of network training, all kernels are random but static in time and the prediction variance has a universal form. However, in the rich, feature-learning regime, the fluctuations of the kernels and predictions are dynamically coupled, with a variance that can be computed self-consistently. In two-layer networks, we show how feature learning can dynamically reduce the variance of the final tangent kernel and final network predictions. We also show how initialization variance can slow down online learning in wide but finite networks. In deeper networks, kernel variance can dramatically accumulate through subsequent layers at large feature-learning strengths, but feature learning continues to improve the signal-to-noise ratio of the feature kernels. In discrete time, we demonstrate that large-learning-rate phenomena such as edge-of-stability effects can be well captured by infinite-width dynamics and that initialization variance can decrease dynamically. For convolutional neural networks trained on CIFAR-10, we empirically find significant corrections to both the bias and variance of network dynamics due to finite width.
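The claim that, under the mean field/µP parameterization, the leading finite-width effect is initialization variance in the predictions can be checked with a small numerical experiment. The NumPy sketch below is our own illustration rather than the paper's code: the helper name `mean_field_two_layer`, the tanh nonlinearity, and the chosen widths are assumptions. It builds an untrained two-layer network with a 1/N readout scaling and verifies that the variance of the output over random initializations decays roughly as 1/N, so N times the variance stays approximately constant.

```python
# Minimal sketch (not the paper's code): under the mean-field / muP readout
# scaling f(x) = a . tanh(W x) / N, the variance of f(x) over random
# initializations at a fixed input should shrink like 1/N.
import numpy as np

def mean_field_two_layer(x, width, rng):
    """Untrained two-layer tanh network in mean-field parameterization."""
    d = x.shape[0]
    W = rng.standard_normal((width, d)) / np.sqrt(d)   # input weights, O(1) preactivations
    a = rng.standard_normal(width)                     # readout weights, O(1) entries
    return a @ np.tanh(W @ x) / width                  # 1/N readout scaling

x = np.random.default_rng(0).standard_normal(16)

for width in (128, 512, 2048, 8192):
    outputs = [mean_field_two_layer(x, width, np.random.default_rng(seed))
               for seed in range(1000)]
    var = np.var(outputs)
    # N * var should be roughly constant if the prediction variance is O(1/N).
    print(f"N={width:5d}   var over inits={var:.3e}   N*var={width * var:.3f}")
```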
Feature learning and generalization in deep networks with orthogonal weights
Abstract: Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations which are independent of depth, to leading order in inverse width. Moreover, we demonstrate numerically that, at initialization, all correlators involving the neural tangent kernel (NTK) and its descendants at leading order in inverse width, which govern the evolution of observables during training, saturate at a depth of , rather than growing without bound as in the case of Gaussian initializations. We speculate that this structure preserves finite-width feature learning while reducing overall noise, thus improving both generalization and training speed in deep networks with depth comparable to width. We provide some experimental justification by relating empirical measurements of the NTK to the superior performance of deep non-linear orthogonal networks trained under full-batch gradient descent on the MNIST and CIFAR-10 classification tasks.
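As a concrete illustration of the depth dependence described above, the following NumPy sketch (our own, under stated assumptions; it is not the authors' code, and the width, depth, seed counts, and helper names are chosen for illustration) propagates a single input through a critical tanh network with zero bias and sigma_w = 1, and compares how the normalized fluctuations of the layerwise kernel q_l = ||h_l||^2 / N over random initializations behave with depth for Gaussian versus orthogonal weights.

```python
# Minimal sketch: ensemble fluctuations of q_l = ||h_l||^2 / N in a critical
# tanh network (sigma_w = 1, zero bias), Gaussian vs. orthogonal weights.
# Expectation from the abstract: normalized fluctuations grow roughly linearly
# with depth for Gaussian weights but stay roughly flat for orthogonal weights,
# to leading order in inverse width.
import numpy as np

def random_orthogonal(n, rng):
    # Orthogonal matrix from the QR decomposition of a Gaussian matrix;
    # the sign fix makes the distribution Haar-uniform.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def kernel_profile(x, depth, rng, orthogonal):
    # Propagate one input and record q_l at every layer for one initialization.
    n = x.shape[0]
    h = x.copy()
    qs = []
    for _ in range(depth):
        W = random_orthogonal(n, rng) if orthogonal else rng.standard_normal((n, n)) / np.sqrt(n)
        h = np.tanh(W @ h)                      # critical tanh layer, no bias
        qs.append(h @ h / n)
    return np.array(qs)

width, depth, n_seeds = 128, 64, 200
x = np.random.default_rng(0).standard_normal(width)
x *= np.sqrt(width) / np.linalg.norm(x)          # start from q_0 = 1

for orthogonal in (False, True):
    profiles = np.stack([kernel_profile(x, depth, np.random.default_rng(s), orthogonal)
                         for s in range(n_seeds)])
    # Normalized (relative) fluctuation of q_l across the initialization ensemble.
    rel_var = profiles.var(axis=0) / profiles.mean(axis=0) ** 2
    label = "orthogonal" if orthogonal else "Gaussian"
    print(f"{label:>10s}  rel. var(q_l) at depths 8/32/64: "
          f"{rel_var[7]:.2e} {rel_var[31]:.2e} {rel_var[63]:.2e}")
```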
- Award ID(s): 2019786
- PAR ID: 10648141
- Publisher / Repository: IOP Publishing
- Date Published:
- Journal Name: Machine Learning: Science and Technology
- Volume: 6
- Issue: 3
- ISSN: 2632-2153
- Page Range / eLocation ID: 035027
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
