Title: Neural Scaling Laws From Large-N Field Theory: Solvable Model Beyond the Ridgeless Limit
Abstract: Many machine learning models based on neural networks exhibit scaling laws: their performance scales as power laws with respect to the sizes of the model and training data set. We use large-N field theory methods to solve a model recently proposed by Maloney, Roberts and Sully which provides a simplified setting to study neural scaling laws. Our solution extends the result in this latter paper to general nonzero values of the ridge parameter, which are essential to regularize the behavior of the model. In addition to obtaining new and more precise scaling laws, we also uncover a duality transformation at the diagrams level which explains the symmetry between model and training data set sizes. The same duality underlies recent efforts to design neural networks to simulate quantum field theories.
Award ID(s): 2412880
PAR ID: 10580736
Author(s) / Creator(s):
Publisher / Repository: IOP Publishing
Date Published:
Journal Name: Machine Learning: Science and Technology
ISSN: 2632-2153
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. When training deep neural networks, a model's generalization error is often observed to follow a power scaling law dependent on both the model size and the data size. Perhaps the best-known example of such scaling laws is that of transformer-based large language models (LLMs), where networks with billions of parameters are trained on trillions of tokens of text. Yet, despite sustained widespread interest, a rigorous understanding of why transformer scaling laws exist is still missing. To answer this question, we establish novel statistical estimation and mathematical approximation theories for transformers when the input data are concentrated on a low-dimensional manifold. Our theory predicts a power law between the generalization error and both the training data size and the network size for transformers, where the power depends on the intrinsic dimension d of the training data. Notably, the constructed model architecture is shallow, requiring only logarithmic depth in d. By leveraging low-dimensional data structures under a manifold hypothesis, we are able to explain transformer scaling laws in a way which respects the data geometry. Moreover, we test our theory with empirical observation by training LLMs on natural language datasets. We find the observed empirical scaling laws closely agree with our theoretical predictions. Taken together, these results rigorously show the intrinsic dimension of data to be a crucial quantity affecting transformer scaling laws in both theory and practice.
  2. Neural network-based emulators for the inference of stellar parameters and elemental abundances represent an increasingly popular methodology in modern spectroscopic surveys. However, these approaches are often constrained by their emulation precision and domain transfer capabilities. Greater generalizability has previously been achieved only with significantly larger model architectures, as demonstrated by Transformer-based models in natural language processing. This observation aligns with neural scaling laws, where model performance predictably improves with increased model size, computational resources allocated to model training, and training data volume. In this study, we demonstrate that these scaling laws also apply to Transformer-based spectral emulators in astronomy. Building upon our previous work with TransformerPayne and incorporating Maximum Update Parametrization techniques from natural language models, we provide training guidelines for scaling models to achieve optimal performance. Our results show that within the explored parameter space, clear scaling relationships emerge. These findings suggest that optimal computational resource allocation requires balanced scaling. Specifically, given a tenfold increase in training compute, achieving an optimal seven-fold reduction in mean squared error necessitates an approximately 2.5-fold increase in dataset size and a 3.8-fold increase in model size. This study establishes a foundation for developing spectral foundational models with enhanced domain transfer capabilities. 
  3. As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing capacity and the size of original (human or natural) training data. Yet, the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increased amounts of synthesized data. In this paper we ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus? Will future models still improve, or be doomed to degenerate up to total (model) collapse? We develop a theoretical framework of model collapse through the lens of scaling laws. We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with number of generations, the "un-learning" of skills, and grokking when mixing human and synthesized data. Our theory is validated by large-scale experiments with a transformer on an arithmetic task and text generation using the large language model Llama2.
  4. As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing capacity and the size of original (human or natural) training data. Yet, the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increased amounts of synthesized data. In this paper we ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus? Will future models still improve, or be doomed to degenerate up to total (model) collapse? We develop a theoretical framework of model collapse through the lens of scaling laws. We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with number of generations, "un-learning" of skills, and grokking when mixing human and synthesized data. Our theory is validated by large-scale experiments with a transformer on an arithmetic task and text generation using the large language model Llama2.
  5. Farhat, C (Ed.)
    We present a machine learning framework capable of consistently inferring mathematical expressions of hyperelastic energy functionals for incompressible materials from sparse experimental data and physical laws. To achieve this goal, we propose a polyconvex neural additive model (PNAM) that enables us to express the hyperelastic model in a learnable feature space while enforcing polyconvexity. An upshot of this feature space obtained via the PNAM is that (1) it is spanned by a set of univariate basis functions that can be re‐parametrized with a more complex mathematical form, and (2) the resultant elasticity model is guaranteed to fulfill polyconvexity, which ensures that the acoustic tensor remains elliptic for any deformation. To further improve the interpretability, we use genetic programming to convert each univariate basis into a compact mathematical expression. The resultant multi‐variable mathematical models obtained from this proposed framework are not only more interpretable but are also proven to fulfill physical laws. By controlling the compactness of the learned symbolic form, the machine learning‐generated mathematical model also requires fewer arithmetic operations than its deep neural network counterparts during deployment. This latter attribute is crucial for scaling large‐scale simulations where the constitutive responses of every integration point must be updated within each incremental time step. We compare our proposed model discovery framework against other state‐of‐the‐art alternatives to assess the robustness and efficiency of the training algorithms and examine the trade‐off among interpretability, accuracy, and precision of the learned symbolic hyperelastic models obtained from different approaches. Our numerical results suggest that our approach extrapolates well outside the training data regime due to the precise incorporation of physics‐based knowledge.