Recent theoretical work has identified surprisingly simple reasoning problems, such as checking whether two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. In practice, however, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., to generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and that polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems, the first exact characterization of a type of transformer in terms of standard complexity classes. Together, this provides a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.
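An illustrative sketch (ours, not from the paper) of the kind of regular-language task the abstract says linear decoding steps unlock: simulating a finite-state machine is trivial when you may update a state once per token, which is exactly what one intermediate "decoding step" per input symbol provides. The parity automaton below is a hypothetical example.

```python
def run_fsm(transitions, start, accepting, tokens):
    """Run a DFA given as {(state, symbol): next_state}."""
    state = start
    for tok in tokens:
        state = transitions[(state, tok)]  # one sequential state update per token
    return state in accepting

# Parity of 1s: a simple regular language, solved here with one
# state update per symbol (a hypothetical example automaton).
parity = {('even', '0'): 'even', ('even', '1'): 'odd',
          ('odd', '0'): 'odd', ('odd', '1'): 'even'}

print(run_fsm(parity, 'even', {'odd'}, "1101"))  # three 1s -> True
```

A transformer that must answer immediately cannot, in general, perform this left-to-right fold; a chain of thought lets it externalize the intermediate states as tokens.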
The Parallelism Tradeoff: Limitations of Log-Precision Transformers
Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight into the power of transformers using known results in complexity theory. For example, if L ≠ P (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture's high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.
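A hedged illustration of one task the abstract mentions: general context-free-grammar membership (grammar given as part of the input) is P-complete, so under L ≠ P it cannot be decided in logarithmic space. The classic sequential algorithm is CYK over a grammar in Chomsky normal form; the grammar and strings below are our own toy example, not from the paper.

```python
def cyk(word, rules, start):
    """CYK membership test. rules: list of (lhs, rhs) where rhs is
    either a terminal character or a pair of nonterminals (CNF)."""
    n = len(word)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):  # length-1 spans from terminal rules
        for lhs, rhs in rules:
            if rhs == ch:
                table[i][i].add(lhs)
    for span in range(2, n + 1):   # longer spans from binary rules
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for lhs, rhs in rules:
                    if (isinstance(rhs, tuple)
                            and rhs[0] in table[i][k]
                            and rhs[1] in table[k + 1][j]):
                        table[i][j].add(lhs)
    return start in table[0][n - 1]

# Toy CNF grammar for a^n b^n (n >= 1):
# S -> A B | A S1 ; S1 -> S B ; A -> 'a' ; B -> 'b'
rules = [('S', ('A', 'B')), ('S', ('A', 'S1')), ('S1', ('S', 'B')),
         ('A', 'a'), ('B', 'b')]
print(cyk("aabb", rules, 'S'))  # True
print(cyk("aab", rules, 'S'))   # False
```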
Award ID(s): 1922658
NSF-PAR ID: 10437680
Journal Name: TACL
Sponsoring Org: National Science Foundation
More Like this


One way to interpret the reasoning power of transformer-based language models is to describe the types of logical rules they can resolve over some input text. Recently, Chiang et al. (2023) showed that finite-precision transformers can be equivalently expressed in a generalization of first-order logic. However, finite-precision transformers are a weak transformer variant because, as we show, a single head can only attend to a constant number of tokens and, in particular, cannot represent uniform attention. Since attending broadly is a core capability for transformers, we ask whether a minimally more expressive model that can attend universally can also be characterized in logic. To this end, we analyze transformers whose forward pass is computed in log n precision on contexts of length n. We prove that any log-precision transformer can be equivalently expressed as a first-order logic sentence that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. This is the tightest known upper bound and first logical characterization of log-precision transformers.
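A rough sketch of what a majority-vote quantifier adds to first-order logic: besides "there exists a position i" and "for all positions i", a sentence may assert that a predicate holds at more than half of the positions. The helper names below are illustrative, not notation from the paper.

```python
def exists(positions, pred):
    """Standard existential quantifier over positions."""
    return any(pred(i) for i in positions)

def forall(positions, pred):
    """Standard universal quantifier over positions."""
    return all(pred(i) for i in positions)

def majority(positions, pred):
    """Majority-vote quantifier: pred holds at strictly more
    than half of the positions."""
    positions = list(positions)
    return sum(pred(i) for i in positions) * 2 > len(positions)

w = "10110"
n = range(len(w))
print(majority(n, lambda i: w[i] == '1'))   # 3 of 5 positions -> True
print(forall(n, lambda i: w[i] in '01'))    # True
```

Majority quantifiers let such sentences express counting properties (e.g., "more 1s than 0s") that plain first-order logic over strings cannot.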

This paper addresses the problem of creating abstract transformers automatically. The method we present automates the construction of static analyzers in a fashion similar to the way yacc automates the construction of parsers. Our method treats the problem as a program-synthesis problem. The user provides specifications of (i) the concrete semantics of a given operation op, (ii) the abstract domain A to be used by the analyzer, and (iii) the semantics of a domain-specific language L in which the abstract transformer is to be expressed. As output, our method creates an abstract transformer for op in abstract domain A, expressed in L (an "L transformer for op over A"). Moreover, the abstract transformer obtained is a most-precise L transformer for op over A; that is, there is no other L transformer for op over A that is strictly more precise. We implemented our method in a tool called AMURTH. We used AMURTH to create sets of replacement abstract transformers for those used in two existing analyzers, and obtained essentially identical performance. However, when we compared the existing transformers with the transformers obtained using AMURTH, we discovered that four of the existing transformers were unsound, which demonstrates the risk of using manually created transformers.
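For readers unfamiliar with the term, here is a hand-written abstract transformer for integer addition over the interval domain, the kind of artifact the abstract says AMURTH synthesizes automatically (this toy version is ours, not AMURTH's output or its DSL).

```python
from typing import NamedTuple

class Interval(NamedTuple):
    """Abstract value [lo, hi] representing all integers in that range."""
    lo: int
    hi: int

def add_transformer(a: Interval, b: Interval) -> Interval:
    """Most-precise interval transformer for concrete '+':
    if x is in [a.lo, a.hi] and y is in [b.lo, b.hi], then x + y
    always lies in [a.lo + b.lo, a.hi + b.hi], and every value in
    that interval is attainable."""
    return Interval(a.lo + b.lo, a.hi + b.hi)

print(add_transformer(Interval(1, 3), Interval(-2, 5)))  # Interval(lo=-1, hi=8)
```

The soundness bugs the paper reports arise precisely because transformers like this are usually written by hand for each (operation, domain) pair; a subtly wrong bound makes the whole analyzer unsound.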

State-space models (SSMs) have emerged as a potential alternative architecture for building large language models (LLMs) compared to the previously ubiquitous transformer architecture. One theoretical weakness of transformers is that they cannot express certain kinds of sequential computation and state tracking (Merrill & Sabharwal, 2023), which SSMs are explicitly designed to address via their close architectural similarity to recurrent neural networks (RNNs). But do SSMs truly have an advantage (over transformers) in expressive power for state tracking? Surprisingly, the answer is no. Our analysis reveals that the expressive power of SSMs is limited very similarly to transformers: SSMs cannot express computation outside the complexity class 𝖳𝖢⁰. In particular, this means they cannot solve simple state-tracking problems like permutation composition. It follows that SSMs are provably unable to accurately track chess moves with certain notation, evaluate code, or track entities in a long narrative. To supplement our formal analysis, we report experiments showing that Mamba-style SSMs indeed struggle with state tracking. Thus, despite its recurrent formulation, the "state" in an SSM is an illusion: SSMs have similar expressiveness limitations to non-recurrent models like transformers, which may fundamentally limit their ability to solve real-world state-tracking problems.
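The permutation-composition problem named in the abstract can be sketched concretely: given a word of permutations of five elements, compute their overall composition (the S5 word problem). It is trivial sequentially, one composition per step, which is what makes it a natural probe of whether a model's "state" is doing real work. This sketch is ours, for illustration.

```python
def compose(p, q):
    """Apply permutation p first, then q (both as tuples of indices)."""
    return tuple(q[p[i]] for i in range(len(p)))

def track(perms):
    """Fold a sequence of permutations into its overall effect."""
    state = tuple(range(5))  # start at the identity permutation
    for p in perms:
        state = compose(state, p)
    return state

swap01 = (1, 0, 2, 3, 4)   # transposition of elements 0 and 1
cycle = (1, 2, 3, 4, 0)    # 5-cycle

print(track([swap01, swap01]))  # swap twice -> identity: (0, 1, 2, 3, 4)
```

Because the answer after n steps depends on the entire history in an order-sensitive way, a model confined to 𝖳𝖢⁰ cannot compute it for arbitrary n, whereas a genuine recurrent state handles it in one pass.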

Low-power line-frequency transformers are common across several applications, but these transformers are found to have large standby losses, low efficiencies, large weights, and high costs. In this paper we optimize a power conversion stage for this and other applications where an unregulated, isolated converter is needed. The selected topology uses low-impedance passives to reduce their size; to address this case we analyze operation of a lower-Q resonant tank. Lastly, because of the application of low-power line-frequency transformers, we optimize around a typical usage cycle to minimize the yearly average power loss. For the application of a Class 2 line-frequency transformer, model results are accurate to within 12.1% of simulation, and optimization produced a design that was experimentally validated to achieve a yearly average power loss of 285.1 mW, a peak efficiency of 99.26%, and a standby loss of 155 mW.
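A back-of-the-envelope sketch of the optimization objective described above: yearly average power loss as a duty-cycle-weighted mean of loaded and standby losses. The usage fractions and the loaded-loss figure below are hypothetical placeholders, not the paper's measured values.

```python
def yearly_average_loss_mw(segments):
    """segments: list of (fraction_of_year, loss_mw) pairs;
    fractions must sum to 1. Returns the time-weighted mean loss."""
    assert abs(sum(f for f, _ in segments) - 1.0) < 1e-9
    return sum(f * loss for f, loss in segments)

# Hypothetical cycle: loaded 10% of the year at 1500 mW loss,
# in standby the remaining 90% at 155 mW:
cycle = [(0.10, 1500.0), (0.90, 155.0)]
print(yearly_average_loss_mw(cycle))  # 289.5 (mW)
```

Weighting by a realistic usage cycle is what lets the design trade a little peak efficiency for much lower standby loss, since standby dominates the year.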