This content will become publicly available on December 2, 2026

Title: Exact Expressive Power of Transformers with Padding
Chain of thought is a natural inference-time method for increasing the computational power of transformer-based large language models (LLMs), but comes at the cost of sequential decoding. Are there more efficient alternatives to expand a transformer's expressive power without adding parameters? We consider transformers with padding tokens as a form of parallelizable test-time compute. We show that averaging-hard-attention, masked-pre-norm transformers with polynomial padding recognize precisely the class FO-uniform TC^0 of extremely parallelizable problems. While the TC^0 upper bound was known, proving a matching lower bound had been elusive. Further, our novel analysis reveals the precise expanded power of padded transformers when coupled with another form of inference-time compute, namely dynamically increasing depth via looping. Our core technical contribution is to show how padding helps bring the notions of complete problems and reductions, which have been a cornerstone of classical complexity theory, to the formal study of transformers. Armed with this new tool, we prove that padded transformers with O(log^d n) looping on inputs of length n recognize exactly the class FO-uniform TC^d of moderately parallelizable problems. Thus, padding and looping together systematically expand transformers' expressive power: with polylogarithmic looping, polynomially padded transformers recognize precisely the class FO-uniform NC, the best that could be expected without losing parallelism (unless NC = P). Our results thus motivate further exploration of padding and looping as parallelizable alternatives to chain of thought for test-time compute.
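To make the two inference-time knobs concrete, here is a minimal PyTorch sketch of padding and looping at inference time. It uses a standard softmax-attention encoder rather than the paper's averaging-hard-attention, masked-pre-norm model, and all dimensions, the vocabulary, and the pad-token id are illustrative assumptions.

    # Minimal sketch of the two inference-time knobs studied above: appending
    # blank padding tokens (parallel "workspace", no new parameters) and
    # looping a fixed layer stack (dynamically increasing depth). Standard
    # softmax attention here, NOT the paper's averaging-hard-attention,
    # masked-pre-norm model; all sizes and the pad id are assumptions.
    import math
    import torch
    import torch.nn as nn

    PAD_ID, VOCAB, DIM = 0, 128, 64

    class PaddedLoopedEncoder(nn.Module):
        def __init__(self) -> None:
            super().__init__()
            self.embed = nn.Embedding(VOCAB, DIM)
            layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
            self.block = nn.TransformerEncoder(layer, num_layers=2)
            self.readout = nn.Linear(DIM, 2)  # accept/reject for language recognition

        def forward(self, ids: torch.Tensor, pad_len: int, loops: int) -> torch.Tensor:
            # Padding: extend the input with pad tokens, all processed in parallel
            # in one forward pass (contrast with sequential chain-of-thought decoding).
            pad = torch.full((ids.shape[0], pad_len), PAD_ID, dtype=torch.long)
            x = self.embed(torch.cat([ids, pad], dim=1))
            # Looping: reuse the same block to grow effective depth at test time.
            for _ in range(loops):
                x = self.block(x)
            return self.readout(x[:, -1])  # decide from the last position

    model = PaddedLoopedEncoder().eval()
    n = 16
    ids = torch.randint(1, VOCAB, (1, n))
    # Polynomial (n^2) padding with polylogarithmic looping, the regime that
    # reaches FO-uniform NC in the result above.
    with torch.no_grad():
        logits = model(ids, pad_len=n * n, loops=math.ceil(math.log2(n)) ** 2)
    print(logits.shape)  # torch.Size([1, 2])

Unlike chain of thought, nothing here is decoded sequentially: padding widens and looping deepens the same parallel computation.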
Award ID(s):
1922658
PAR ID:
10649787
Author(s) / Creator(s):
Publisher / Repository:
39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems—the first exact characterization of a type of transformer in terms of standard complexity classes. Together, this provides a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.
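A minimal sketch of the intermediate-generation loop this result analyzes. The step function below is a stand-in for a transformer decoder's next-token computation; everything here is illustrative, not the paper's formal model.

    # Chain-of-thought decoding: emit num_steps intermediate tokens, each fed
    # back into the context, before the final answer. The knob the result
    # studies is num_steps (logarithmic, linear, or polynomial in |prompt|).
    from typing import Callable, List

    def decode_with_scratchpad(
        step: Callable[[List[int]], int],  # next-token function given the sequence so far
        prompt: List[int],
        num_steps: int,
    ) -> int:
        seq = list(prompt)
        for _ in range(num_steps):
            seq.append(step(seq))  # generate and condition on an intermediate token
        return step(seq)           # the final answer token

    # Toy next-token rule: running parity of the sequence. A real decoder with
    # enough steps can likewise thread state through its own generated output.
    print(decode_with_scratchpad(lambda s: sum(s) % 2, prompt=[1, 0, 1], num_steps=8))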
  2. A $(β, δ, Δ)$-padded decomposition of an edge-weighted graph $G = (V, E, w)$ is a stochastic decomposition into clusters of diameter at most $Δ$ such that, for every vertex $v \in V$, the probability that $\mathrm{ball}_G(v, γΔ)$ is entirely contained in the cluster containing $v$ is at least $e^{-βγ}$ for every $γ \in [0, δ]$. Padded decompositions have been studied for decades and have found numerous applications, including metric embedding, multicommodity flow-cut gap, multicut, and zero-extension problems, to name a few. In these applications, the parameter $β$, called the padding parameter, is the most important, since it determines the distortion or approximation ratios. For general graphs with $n$ vertices, $β = Θ(\log n)$. Klein, Plotkin, and Rao showed that $K_r$-minor-free graphs have padding parameter $β = O(r^3)$, a significant improvement over general graphs when $r$ is a constant. A long-standing conjecture is to construct a padded decomposition for $K_r$-minor-free graphs with padding parameter $β = O(\log r)$. Despite decades of research, the best-known result is $β = O(r)$, even for graphs with treewidth at most $r$. In this work, we make significant progress toward the aforementioned conjecture by showing that graphs with treewidth $\mathrm{tw}$ admit a padded decomposition with padding parameter $O(\log \mathrm{tw})$, which is tight. As corollaries, we obtain an exponential improvement in the dependency on treewidth in a host of algorithmic applications: an $O(\sqrt{\log n \cdot \log(\mathrm{tw})})$ flow-cut gap, a max-flow/min-multicut ratio of $O(\log(\mathrm{tw}))$, an $O(\log(\mathrm{tw}))$ approximation for the 0-extension problem, an $\ell_\infty^{O(\log n)}$ embedding with distortion $O(\log \mathrm{tw})$, and an $O(\log \mathrm{tw})$ bound on the integrality gap for the uniform sparsest cut.
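For reference, the padding guarantee above can be stated in one line; this merely restates the abstract's definition, with $C(v)$ (a notation introduced here) denoting the cluster containing $v$:

    \[
      \Pr\bigl[\, \mathrm{ball}_G(v, \gamma\Delta) \subseteq C(v) \,\bigr] \;\ge\; e^{-\beta\gamma}
      \qquad \text{for all } v \in V \text{ and all } \gamma \in [0, \delta].
    \]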
  3. Among the most challenging traffic-analysis attacks to confound are those leveraging the sizes of objects downloaded over the network. In this paper we systematically analyze this problem under realistic constraints regarding the padding overhead that the object store is willing to incur. We give algorithms to compute privacy-optimal padding schemes—specifically that minimize the network observer’s information gain from a downloaded object’s padded size—in several scenarios of interest: per-object padding, in which the object store responds to each request for an object with the same padded copy; per-request padding, in which the object store pads an object anew each time it serves that object; and a scenario unlike the previous ones in that the object store is unable to leverage a known distribution over the object queries. We provide constructions for privacy-optimal padding in each case, compare them to recent contenders in the research literature, and evaluate their performance on practical datasets. 
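To illustrate the quantity being minimized (not the paper's optimal constructions), the following sketch computes an observer's information gain under a deterministic per-object scheme. The object sizes, query distribution, and the pad-to-next-power-of-two rule are assumptions for illustration.

    # Observer's information gain: the mutual information I(O; S) between the
    # requested object O and its padded size S. For a deterministic per-object
    # scheme, H(S | O) = 0, so I(O; S) = H(S), the entropy of the padded-size
    # distribution. All data below are illustrative assumptions.
    import math
    from collections import defaultdict

    objects = {"a.css": 1200, "b.js": 1300, "c.png": 9000, "d.html": 500}   # name -> size (bytes)
    query_prob = {"a.css": 0.4, "b.js": 0.3, "c.png": 0.2, "d.html": 0.1}  # known query distribution

    def pad_next_pow2(size: int) -> int:
        # Per-object padding: each object is always served at one fixed padded size.
        return 1 << (size - 1).bit_length()

    def information_gain(pad) -> float:
        size_prob = defaultdict(float)
        for name, p in query_prob.items():
            size_prob[pad(objects[name])] += p
        return -sum(p * math.log2(p) for p in size_prob.values())  # H(S) in bits

    print(f"leakage: {information_gain(pad_next_pow2):.3f} bits")

Padding everything to one maximum size drives this leakage to zero, which is why the overhead constraint is what makes the optimization nontrivial.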
  4. Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight into the power of transformers using known results in complexity theory. For example, if L ≠ P (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture's high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.
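For intuition about the circuit class involved, here is a toy threshold gate and a depth-one MAJORITY circuit. This is an illustrative sketch only; the simulation of a transformer by such circuits in the result above is far more involved.

    # A threshold gate fires iff a weighted sum of its inputs meets a
    # threshold; constant-depth circuits of such gates form the class TC^0.
    from typing import Sequence

    def threshold_gate(inputs: Sequence[int], weights: Sequence[int], theta: int) -> int:
        return int(sum(w * x for w, x in zip(weights, inputs)) >= theta)

    def majority(bits: Sequence[int]) -> int:
        # MAJORITY is computable by a single threshold gate (depth 1).
        return threshold_gate(bits, [1] * len(bits), (len(bits) + 1) // 2)

    print(majority([1, 0, 1, 1, 0]))  # -> 1 (three of five bits are set)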
  5. State-space models (SSMs) have emerged as a potential alternative architecture for building large language models (LLMs) compared to the previously ubiquitous transformer architecture. One theoretical weakness of transformers is that they cannot express certain kinds of sequential computation and state tracking (Merrill & Sabharwal, 2023), which SSMs are explicitly designed to address via their close architectural similarity to recurrent neural networks (RNNs). But do SSMs truly have an advantage (over transformers) in expressive power for state tracking? Surprisingly, the answer is no. Our analysis reveals that the expressive power of SSMs is limited very similarly to transformers: SSMs cannot express computation outside the complexity class TC^0. In particular, this means they cannot solve simple state-tracking problems like permutation composition. It follows that SSMs are provably unable to accurately track chess moves with certain notation, evaluate code, or track entities in a long narrative. To supplement our formal analysis, we report experiments showing that Mamba-style SSMs indeed struggle with state tracking. Thus, despite its recurrent formulation, the "state" in an SSM is an illusion: SSMs have similar expressiveness limitations to non-recurrent models like transformers, which may fundamentally limit their ability to solve real-world state-tracking problems.
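For concreteness, here is the state-tracking problem the abstract names, permutation composition (the word problem over $S_5$), as a short sketch. The particular permutations in the stream are arbitrary examples.

    # Composing a stream of permutations of {0,...,4} is the canonical
    # state-tracking problem outside TC^0, so neither transformers nor the
    # analyzed SSMs can express it. The stream below is an arbitrary example.
    from typing import Tuple

    Perm = Tuple[int, ...]

    def compose(p: Perm, q: Perm) -> Perm:
        # Apply p first, then q: (q o p)(i) = q[p[i]].
        return tuple(q[p[i]] for i in range(len(p)))

    identity: Perm = (0, 1, 2, 3, 4)
    stream = [(1, 0, 2, 3, 4), (0, 2, 1, 3, 4), (4, 1, 2, 3, 0)]

    state = identity
    for perm in stream:               # inherently sequential state update
        state = compose(state, perm)
    print(state)                      # net permutation after the whole stream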