skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on April 24, 2026

Title: Looped Transformers for Length Generalization
Recent work has shown that Transformers trained from scratch can successfully solve various arithmetic and algorithmic tasks, such as adding numbers and computing parity. While these Transformers generalize well on unseen inputs of the same length, they struggle with length generalization, i.e., handling inputs of unseen lengths. In this work, we demonstrate that looped Transformers with an adaptive number of steps significantly improve length generalization. We focus on tasks with a known iterative solution, involving multiple iterations of a RASP-L operation—a length-generalizable operation that can be expressed by a finite-sized Transformer. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.  more » « less
Award ID(s):
2339978
PAR ID:
10596476
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
The Thirteenth International Conference on Learning Representations
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Deep Reinforcement Learning (DRL) has shown im- pressive performance on domains with visual inputs, in particular various games. However, the agent is usually trained on a fixed environment, e.g. a fixed number of levels. A growing mass of evidence suggests that these trained models fail to generalize to even slight variations of the environments they were trained on. This paper advances the hypothesis that the lack of generalization is partly due to the input representation, and explores how rotation, cropping and translation could increase generality. We show that a cropped, translated and rotated observation can get better generalization on unseen levels of two-dimensional arcade games from the GVGAI framework. The generality of the agents is evaluated on both human-designed and procedurally generated levels. 
    more » « less
  2. Generalization to unseen data is a key desideratum for deep networks, but its re- lation to classification accuracy is unclear. Using a minimalist vision dataset and a measure of generalizability, we show that popular networks, from deep con- volutional networks (CNNs) to transformers, vary in their power to extrapolate to unseen classes both across layers and across architectures. Accuracy is not a good predictor of generalizability, and generalization varies non-monotonically with layer depth. Our code is available at github.com/dyballa/generalization 
    more » « less
  3. Transformers, although first designed for sequence processing, can also handle unordered sets like point cloud data. Additionally, contrastive pretraining has emerged as a successful technique in image processing but remains unexplored for point cloud data. We develop and integrate a new point cloud pretraining technique inspired by the Simple Framework for Contrastive Learning (SimCLR) into the Set Transformer (ST) and Point Cloud Transformer (PCT) architectures and explore model performance using a novel 3D body scan dataset and the canonical datasets ShapeNet and ModelNet. For the 3D body scan dataset, this integration boosts initial training performance and maintains overall higher performance for classification tasks, and demonstrates better stability/convergence for regression tasks in comparison to non-pretrained (Naïve] counterparts. Furthermore, experiments examining strong generalization (relative performance on previously unseen classes) show improvement for pretrained models compared to Naïve models. Consistent benefits across tasks and data sets are observed based on additional experiments performed on the ShapeNet core dataset. Overall, we show how contrastive pretraining for point cloud data is a viable strategy for improving the performance of Transformers on downstream tasks and accelerating the training process. 
    more » « less
  4. Selecting suitable architecture parameters and training hyperparameters is essential for enhancing machine learning (ML) model performance. Several recent empirical studies conduct large-scale correlational analysis on neural networks (NNs) to search for effective generalization metrics that can guide this type of model selection. Effective metrics are typically expected to correlate strongly with test performance. In this paper, we expand on prior analyses by examining generalization-metric-based model selection with the following objectives: (i) focusing on natural language processing (NLP) tasks, as prior work primarily concentrates on computer vision (CV) tasks; (ii) considering metrics that directly predict test error instead of the generalization gap; (iii) exploring metrics that do not need access to data to compute. From these objectives, we are able to provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics. Our analyses consider (I) hundreds of Transformers trained in different settings, in which we systematically vary the amount of data, the model size and the optimization hyperparameters, (II) a total of 51 pretrained Transformers from eight families of Huggingface NLP models, including GPT2, BERT, etc., and (III) a total of 28 existing and novel generalization metrics. Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks, exhibiting stronger correlations than other, more popular metrics. To further examine these metrics, we extend prior formulations relying on power law (PL) spectral distributions to exponential (EXP) and exponentially-truncated power law (E-TPL) families. 
    more » « less
  5. Transformers have demonstrated effectiveness in in-context solving data-fitting problems from various (latent) models, as reported by Garg et al. (2022). However, the absence of an inherent iterative structure in the transformer architecture presents a challenge in emulating the iterative algorithms, which are commonly employed in traditional machine learning methods. To address this, we propose the utilization of looped transformer architecture and its associated training methodology, with the aim of incorporating iterative characteristics into the transformer architectures. Experimental results suggest that the looped transformer achieves performance comparable to the standard transformer in solving various data-fitting problems, while utilizing less than 10% of the parameter count. 
    more » « less