Title: Self-Attention Networks Can Process Bounded Hierarchical Languages
Despite their impressive performance in NLP, self-attention networks were recently proved to be limited for processing formal languages with hierarchical structure, such as Dyck-k, the language consisting of well-nested parentheses of k types. This suggested that natural language can be approximated well with models that are too weak for formal languages, or that the role of hierarchy and recursion in natural language might be limited. We qualify this implication by proving that self-attention networks can process Dyck-(k, D), the subset of Dyck-k with depth bounded by D, which arguably better captures the bounded hierarchical structure of natural language. Specifically, we construct a hard-attention network with D+1 layers and O(log k) memory size (per token per layer) that recognizes Dyck-(k, D), and a soft-attention network with two layers and O(log k) memory size that generates Dyck-(k, D). Experiments show that self-attention networks trained on Dyck-(k, D) generalize to longer inputs with near-perfect accuracy, and also verify the theoretical memory advantage of self-attention networks over recurrent networks.
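To make the target language concrete, below is a minimal stack-based Dyck-(k, D) recognizer in Python. This is our illustration of the language's definition, not the paper's construction; the paper's contribution is showing that self-attention layers can implement this check.

```python
def is_dyck_k_D(tokens, k, D):
    """Membership test for Dyck-(k, D): well-nested brackets of k types
    whose nesting depth never exceeds D. Tokens are (kind, is_open)
    pairs with kind in range(k)."""
    stack = []
    for kind, is_open in tokens:
        assert 0 <= kind < k
        if is_open:
            stack.append(kind)
            if len(stack) > D:               # depth bound exceeded
                return False
        elif not stack or stack.pop() != kind:
            return False                     # unmatched or mismatched close
    return not stack                         # nothing may remain open

# "( [ ] )" with kinds 0='(' and 1='[' is in Dyck-(2, 2) but not Dyck-(2, 1)
word = [(0, True), (1, True), (1, False), (0, False)]
assert is_dyck_k_D(word, k=2, D=2)
assert not is_dyck_k_D(word, k=2, D=1)
```

Because the stack never grows past depth D, the recognizer's entire state fits in D·⌈log₂ k⌉ bits, which gives some intuition for why bounding the depth makes the language feasible for fixed-memory models.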
Award ID(s): 2134059
PAR ID: 10378910
Author(s) / Creator(s):
Date Published:
Journal Name: Proceedings of the conference Association for Computational Linguistics Meeting
ISSN: 0736-587X
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when two conditions are met: the formal language should capture the dependency structures present in natural language, and it should remain within the computational limitations of the model architecture. We experiment with pre-pretraining (training on formal language before natural language) on transformers and find that formal languages capturing hierarchical dependencies indeed enable language models to achieve lower loss on natural language and better linguistic generalization compared to other formal languages. We also find modest support for the hypothesis that the formal language should fall within the computational limitations of the architecture. Strikingly, pre-pretraining reduces loss more efficiently than training on a matched amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. Finally, we give mechanistic evidence of transfer from formal to natural language: attention heads acquired during pre-pretraining remain crucial for the model's performance on syntactic evaluations.
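As a concrete picture of the setup, here is a minimal sketch of generating a corpus from a formal language with hierarchical dependencies; bounded-depth Dyck is used purely as an example, and the abstract does not specify the authors' exact language, sampler, or hyperparameters.

```python
import random

def sample_bounded_dyck(k, max_depth, max_len, rng):
    """Sample one string from a bounded-depth Dyck language over k bracket
    types, emitted as space-separated tokens like '(3' and ')3'."""
    out, stack = [], []
    while len(out) + len(stack) < max_len:    # leave room to close everything
        can_open = len(stack) < max_depth
        if stack and (not can_open or rng.random() < 0.5):
            out.append(f"){stack.pop()}")     # close the innermost bracket
        else:
            kind = rng.randrange(k)
            stack.append(kind)
            out.append(f"({kind}")            # open a new bracket
    out.extend(f"){t}" for t in reversed(stack))  # close remaining brackets
    return " ".join(out)

rng = random.Random(0)
corpus = [sample_bounded_dyck(k=8, max_depth=4, max_len=64, rng=rng)
          for _ in range(10_000)]
print(corpus[0])
```

A pre-pretraining run would tokenize such strings and train with the usual language-modeling objective before switching to natural-language text.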
  2.
    Graph compression or sparsification is a basic information-theoretic and computational question. A major open problem in this research area is whether $$(1+\epsilon)$$-approximate cut-preserving vertex sparsifiers with size close to the number of terminals exist. As a step towards this goal, we initiate the study of a thresholded version of the problem: for a given parameter $$c$$, find a smaller graph, which we call a connectivity-$$c$$ mimicking network, which preserves connectivity among $$k$$ terminals exactly up to the value of $$c$$. We show that connectivity-$$c$$ mimicking networks of size $$O(kc^4)$$ exist and can be found in time $$m(c\log n)^{O(c)}$$. We also give a separate algorithm that constructs such graphs of size $$k \cdot O(c)^{2c}$$ in time $$mc^{O(c)}\log^{O(1)}n$$. These results lead to the first offline data structures for answering fully dynamic $$c$$-edge-connectivity queries for $$c \ge 4$$ in polylogarithmic time per query, as well as more efficient algorithms for survivable network design on bounded-treewidth graphs.
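One concrete reading of "preserves connectivity exactly up to the value of $$c$$": for terminals $$s, t$$, the mimicking network $$H$$ must satisfy $$\min(c, \lambda_G(s,t)) = \min(c, \lambda_H(s,t))$$. Below is a hedged sketch of that thresholded check via unit-capacity max-flow that stops after $$c$$ augmenting paths; it illustrates the guarantee being preserved, not the paper's construction algorithm, and it simplifies the guarantee to terminal pairs.

```python
from collections import deque

def thresholded_connectivity(adj, s, t, c):
    """min(c, edge connectivity between s and t) in an undirected graph,
    via Edmonds-Karp with unit capacities, stopping after c paths.
    `adj` maps each vertex to the set of its neighbors."""
    cap = {(u, v): 1 for u in adj for v in adj[u]}   # one unit per direction
    flow = 0
    while flow < c:
        parent, queue = {s: None}, deque([s])
        while queue and t not in parent:             # BFS for an augmenting path
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break                 # fewer than c edge-disjoint s-t paths exist
        v = t
        while parent[v] is not None:                 # push one unit along the path
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1
    return flow

G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(thresholded_connectivity(G, 1, 4, c=2))        # -> 1: edge (3,4) is a bottleneck
```

A candidate mimicking network can then be validated by checking that this value agrees with the original graph's for every terminal pair.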
  3. Transformer interpretability aims to understand the algorithm implemented by a learned Transformer by examining various aspects of the model, such as the weight matrices or the attention patterns. In this work, through a combination of theoretical results and carefully controlled experiments on synthetic data, we take a critical view of methods that exclusively focus on individual parts of the model rather than consider the network as a whole. We consider a simple synthetic setup of learning a (bounded) Dyck language. Theoretically, we show that the set of models that (exactly or approximately) solve this task satisfies a structural characterization derived from ideas in formal languages (the pumping lemma). We use this characterization to show that the set of optima is qualitatively rich; in particular, the attention pattern of a single layer can be “nearly randomized” while preserving the functionality of the network. We also show via extensive experiments that these constructions are not merely a theoretical artifact: even with severe constraints on the architecture of the model, vastly different solutions can be reached via standard training. Thus, interpretability claims based on inspecting individual heads or weight matrices in the Transformer can be misleading.
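The structural point, that very different internals can compute the same function, can be illustrated in ordinary code as well. Below, an explicit-stack recognizer and a tuple-state automaton for a bounded Dyck language agree on every input despite being organized differently; this is our illustration, not the paper's characterization.

```python
import random

K, D = 2, 2   # bracket types and depth bound (illustrative values)

def run_stack(word):
    """Recognizer A: explicit stack; rejects on overflow or mismatch."""
    stack = []
    for kind, is_open in word:
        if is_open:
            stack.append(kind)
            if len(stack) > D:
                return False
        elif not stack or stack.pop() != kind:
            return False
    return not stack

def run_automaton(word):
    """Recognizer B: the same language as a finite automaton whose states
    are the at-most-D-deep stack contents, encoded as tuples."""
    state = ()
    for kind, is_open in word:
        if is_open and len(state) < D:
            state = state + (kind,)
        elif not is_open and state and state[-1] == kind:
            state = state[:-1]
        else:
            return False
    return state == ()

rng = random.Random(0)
for _ in range(1000):   # structurally different programs, identical behavior
    w = [(rng.randrange(K), rng.random() < 0.5) for _ in range(rng.randrange(8))]
    assert run_stack(w) == run_automaton(w)
```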
  4. Many program analyses need to reason about pairs of matching actions, such as call/return, lock/unlock, or set-field/get-field. The family of Dyck languages $$\{D_k\}$$, where $$D_k$$ has $$k$$ kinds of parenthesis pairs, can be used to model matching actions as balanced parentheses. Consequently, many program-analysis problems can be formulated as Dyck-reachability problems on edge-labeled digraphs. Interleaved Dyck-reachability (InterDyck-reachability), denoted $$D_k \odot D_k$$-reachability, is a natural extension of Dyck-reachability that allows one to formulate program-analysis problems that involve multiple kinds of matching-action pairs. Unfortunately, the general InterDyck-reachability problem is undecidable. In this paper, we study variants of InterDyck-reachability on bidirected graphs, where for each edge ⟨p,q⟩ labeled by an open parenthesis $$(_a$$, there is an edge ⟨q,p⟩ labeled by the corresponding close parenthesis $$)_a$$, and vice versa. Language-reachability on a bidirected graph has proven to be useful both (1) in its own right, as a way to formalize many program-analysis problems, such as pointer analysis, and (2) as a relaxation method that uses a fast algorithm to over-approximate language-reachability on a directed graph. However, unlike its directed counterpart, the complexity of bidirected InterDyck-reachability remains open. We establish the first decidable variant (i.e., $$D_1 \odot D_1$$-reachability) of bidirected InterDyck-reachability, in which each of the two Dyck languages is restricted to a single kind of parenthesis pair. In particular, we show that the bidirected $$D_1 \odot D_1$$ problem is in PTIME. We also show that when each Dyck language is extended to $$k$$ different kinds of parentheses (i.e., $$D_k \odot D_k$$-reachability with $$k \ge 2$$), the problem is NP-hard (and therefore much harder). We have implemented the polynomial-time algorithm for bidirected $$D_1 \odot D_1$$-reachability. It provides a new over-approximation method for bidirected $$D_k \odot D_k$$-reachability, in the sense that a $$D_k \odot D_k$$-reachability instance can first be relaxed to bidirected $$D_1 \odot D_1$$-reachability, and the resulting bidirected $$D_1 \odot D_1$$-reachability problem is then solved precisely. We compare this $$D_1 \odot D_1$$-reachability-based approach against another known over-approximating $$D_k \odot D_k$$-reachability algorithm. Surprisingly, we found that the over-approximation approach based on bidirected $$D_1 \odot D_1$$-reachability computes more precise solutions, even though the $$D_1 \odot D_1$$ formalism is inherently less expressive than the $$D_k \odot D_k$$ formalism.
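For intuition about the single-parenthesis setting, here is a hedged sketch of bidirected $$D_1$$-reachability as a search over (node, open-count) states. The counter is capped so the search terminates, making this a bounded illustration only; it is not the paper's polynomial-time algorithm for the interleaved $$D_1 \odot D_1$$ problem.

```python
from collections import deque

def d1_reachable(edges, src, dst, max_count=64):
    """Bidirected D1-reachability sketch. `edges` is a list of (u, v, label)
    with label '(' or ')'; the bidirected reverse edge with the matching
    label is added automatically. dst is reachable from src if some path's
    label sequence is balanced. States are (node, open count), with the
    count capped at max_count so the search terminates."""
    adj = {}
    for u, v, lab in edges:
        adj.setdefault(u, []).append((v, lab))
        adj.setdefault(v, []).append((u, ')' if lab == '(' else '('))
    seen = {(src, 0)}
    queue = deque(seen)
    while queue:
        node, depth = queue.popleft()
        if node == dst and depth == 0:
            return True                     # balanced path found
        for nxt, lab in adj.get(node, []):
            d = depth + 1 if lab == '(' else depth - 1
            if 0 <= d <= max_count and (nxt, d) not in seen:
                seen.add((nxt, d))
                queue.append((nxt, d))
    return False

# a -(-> b -)-> c: the path's labels "()" are balanced, so a reaches c
print(d1_reachable([("a", "b", "("), ("b", "c", ")")], "a", "c"))   # True
```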
  5. Motivated by an attempt to understand the formation and development of (human) language, we introduce a "distributed compression" problem. In our problem, a sequence of pairs of players from a set of $$K$$ players are chosen and tasked to communicate messages drawn from an unknown distribution $$Q$$. Arguably, languages are created and evolve to compress frequently occurring messages, and we focus on this aspect. The only knowledge that players have about the distribution $$Q$$ comes from previously drawn samples, but these samples differ from player to player. The only common knowledge between the players is restricted to a common prior distribution $$P$$ and some constant number of bits of information (such as a learning algorithm). Letting $$T_\epsilon$$ denote the number of iterations it would take for a typical player to obtain an $$\epsilon$$-approximation to $$Q$$ in total variation distance, we ask whether $$T_\epsilon$$ iterations suffice to compress the messages down roughly to their entropy, and we give a partial positive answer. We show that a natural uniform algorithm can compress the communication down to an average cost per message of $$O(H(Q) + \log(D(P \| Q) + O(1)))$$ in $$\tilde{O}(T_\epsilon)$$ iterations while allowing for $$O(\epsilon)$$-error, where $$D(\cdot \| \cdot)$$ denotes the KL-divergence between distributions. For large divergences this compares favorably with the static algorithm that ignores all samples and compresses down to $$H(Q) + D(P \| Q)$$ bits, while not requiring the $$T_\epsilon \cdot K$$ iterations it would take players to develop optimal but separate compressions for each pair of players. Along the way we introduce a "data-structural" view of the task of communicating with a natural language and show that our natural algorithm can also be implemented by an efficient data structure, whose storage is comparable to the storage requirements of $$Q$$ and whose query complexity is comparable to the length of the message to be compressed. Our results give a plausible mathematical analogy to the mechanisms by which human languages are created and evolve, and in particular highlight the possibility of coordination towards a joint task (agreeing on a language) while engaging in distributed learning.
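The gap between the two baselines in the abstract can be checked with toy arithmetic: coding messages drawn from $$Q$$ with a code built for the prior $$P$$ costs the entropy plus a divergence term in expectation, while coding against a good estimate of $$Q$$ approaches $$H(Q)$$ alone. The distributions below are our own toy values, and the snippet evaluates ideal Shannon code lengths, not the paper's protocol.

```python
import math

Q = {"a": 0.7, "b": 0.2, "c": 0.1}   # true message distribution (unknown to players)
P = {"a": 0.1, "b": 0.2, "c": 0.7}   # shared prior

def entropy(dist):
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)

def cross_entropy(q, code_dist):
    """Expected bits per message when messages ~ q use a code built for
    code_dist: sum over m of q(m) * log2(1 / code_dist(m))."""
    return sum(qm * -math.log2(code_dist[m]) for m, qm in q.items())

h = entropy(Q)
static = cross_entropy(Q, P)         # the static baseline: always code against P
print(f"H(Q)           = {h:.3f} bits")           # ~1.157
print(f"static cost    = {static:.3f} bits")      # ~2.841 = H(Q) + divergence
print(f"divergence gap = {static - h:.3f} bits")  # what learning can recover
# The abstract's uniform algorithm pays only about H(Q) plus a logarithm of
# the divergence term once players have seen ~T_eps samples each.
```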