NSF PAR Search | NSF Public Access Repository

Compute-Optimal LLMs Provably Generalize Better With Scale

Finzi, M; Kapoor, S; Granziol, D; Gu, A; De Sa, C; Kolter, JZ; Wilson, AG (April 2025, International Conference on Learning Representations)

Free, publicly-accessible full text available April 24, 2026
Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices

Potapczynski, A; Qiu, S; Finzi, M; Ferri, C; Chen, Z; Goldblum, M; Bruss, B; De Sa, C; Wilson, AG (December 2024, Advances in Neural Information Processing Systems)

Dense linear layers are the dominant computational bottleneck in large neural networks, presenting a critical need for more efficient alternatives. Previous efforts focused on a small number of hand-crafted structured matrices and neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws when both the model size and training examples are optimally allocated. In this work, we present a unifying framework that enables searching among all linear operators expressible via an Einstein summation. This framework encompasses many previously proposed structures, such as low-rank, Kronecker, Tensor-Train, Block Tensor-Train (BTT), and Monarch, along with many novel structures. To analyze the framework, we develop a taxonomy of all such operators based on their computational and algebraic properties and show that differences in the compute-optimal scaling laws are mostly governed by a small number of variables that we introduce. Namely, a small ω (which measures parameter sharing) and large ψ (which measures the rank) reliably led to better scaling laws. Guided by the insight that full-rank structures that maximize parameters per unit of compute perform the best, we propose BTT-MoE, a novel Mixture-of-Experts (MoE) architecture obtained by sparsifying computation in the BTT structure. In contrast to the standard sparse MoE for each entire feed-forward network, BTT-MoE learns an MoE in every single linear layer of the model, including the projection matrices in the attention blocks. We find BTT-MoE provides a substantial compute-efficiency gain over dense layers and standard MoE.
Full Text Available
Fine-Tuned Language Models Generate Stable Inorganic Materials as Text

Gruver, N; Sriram, A; Madotto, A; Wilson, AG; Zitnick, LC; Ulissi, Z (May 2024, International Conference on Learning Representations)

We propose fine-tuning large language models for generation of stable materials. While unorthodox, fine-tuning large language models on text-encoded atomistic data is simple to implement yet reliable, with around 90% of sampled structures obeying physical constraints on atom positions and charges. Using energy above hull calculations from both learned ML potentials and gold-standard DFT calculations, we show that our strongest model (fine-tuned LLaMA-2 70B) can generate materials predicted to be metastable at about twice the rate (49% vs 28%) of CDVAE, a competing diffusion model. Because of text prompting's inherent flexibility, our models can simultaneously be used for unconditional generation of stable material, infilling of partial structures and text-conditional generation. Finally, we show that language models' ability to capture key symmetries of crystal structures improves with model scale, suggesting that the biases of pretrained LLMs are surprisingly well-suited for atomistic data.
Full Text Available
Should We Learn Most Likely Functions or Parameters?

Qiu, S; Rudner, T; Kapoor, S; Wilson, AG (December 2023, Advances in Neural Information Processing Systems)

Standard regularized training procedures correspond to maximizing a posterior distribution over parameters, known as maximum a posteriori (MAP) estimation. However, model parameters are of interest only insomuch as they combine with the functional form of a model to provide a function that can make good predictions. Moreover, the most likely parameters under the parameter posterior do not generally correspond to the most likely function induced by the parameter posterior. In fact, we can re-parametrize a model such that any setting of parameters can maximize the parameter posterior. As an alternative, we investigate the benefits and drawbacks of directly estimating the most likely function implied by the model and the data. We show that this procedure leads to pathological solutions when using neural networks and prove conditions under which the procedure is well-behaved, as well as a scalable approximation. Under these conditions, we find that function-space MAP estimation can lead to flatter minima, better generalization, and improved robustness to overfitting.
Full Text Available
Function-Space Regularization in Neural Networks

Rudner, T; Kapoor, S; Qiu, S; Wilson, AG (July 2023, International Conference on Machine Learning)

Parameter-space regularization in neural network optimization is a fundamental tool for improving generalization. However, standard parameter-space regularization methods make it challenging to encode explicit preferences about desired predictive functions into neural network training. In this work, we approach regularization in neural networks from a probabilistic perspective and show that by viewing parameter-space regularization as specifying an empirical prior distribution over the model parameters, we can derive a probabilistically well-motivated regularization technique that allows explicitly encoding information about desired predictive functions into neural network training. This method—which we refer to as function-space empirical Bayes (FS-EB)—includes both parameter- and function-space regularization, is mathematically simple, easy to implement, and incurs only minimal computational overhead compared to standard regularization techniques. We evaluate the utility of this regularization technique empirically and demonstrate that the proposed method leads to near-perfect semantic shift detection, highly-calibrated predictive uncertainty estimates, successful task adaption from pre-trained models, and improved generalization under covariate shift.
Full Text Available
Protein Design with Guided Discrete Diffusion

Gruver, N; Stanton, S; Frey, N; Rudner, T; Hotzel, I; Lafrance-Vanasse, J; Rajpal, A; Cho, K; Wilson, AG (December 2023, Advances in Neural Information Processing Systems)

Full Text Available
Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders

Stanton, S; Maddox, W; Gruver, N; Maffettone, P; Delaney, E; Greenside, P; Wilson, AG (July 2022, International Conference on Machine Learning)
Dangers of Bayesian Model Averaging under Covariate Shift

Izmailov, P; Nicholson, P; Lotfi, S; Wilson, AG (December 2021, Neural Information Processing Systems)

Full Text Available
SKIing on Simplices: Kernel Interpolation on the Permutohedral Lattice for Scalable Gaussian Processes

Kapoor, S; Finzi, M; Wang, A; Wilson, AG (January 2021, International Conference on Machine Learning (ICML))
Full Text Available
Generalizing Convolutional Neural Networks for Equivariance to Lie Groups on Arbitrary Continuous Data

Finzi, M; Stanton, S; Izmailov, P; Wilson, AG (January 2020, International Conference on Machine Learning)

The translation equivariance of convolutional layers enables convolutional neural networks to generalize well on image problems. While translation equivariance provides a powerful inductive bias for images, we often additionally desire equivariance to other transformations, such as rotations, especially for non-image data. We propose a general method to construct a convolutional layer that is equivariant to transformations from any specified Lie group with a surjective exponential map. Incorporating equivariance to a new group requires implementing only the group exponential and logarithm maps, enabling rapid prototyping. Showcasing the simplicity and generality of our method, we apply the same model architecture to images, ball-and-stick molecular data, and Hamiltonian dynamical systems. For Hamiltonian systems, the equivariance of our models is especially impactful, leading to exact conservation of linear and angular momentum.
Full Text Available
