Title: Self-Attention Networks Can Process Bounded Hierarchical Languages
Despite their impressive performance in NLP, self-attention networks were recently proved to be limited for processing formal languages with hierarchical structure, such as Dyck-k, the language consisting of well-nested parentheses of k types. This suggested that natural language can be approximated well with models that are too weak for formal languages, or that the role of hierarchy and recursion in natural language might be limited. We qualify this implication by proving that self-attention networks can process Dyck-(k, D), the subset of Dyck-k with depth bounded by D, which arguably better captures the bounded hierarchical structure of natural language. Specifically, we construct a hard-attention network with D+1 layers and O(log k) memory size (per token per layer) that recognizes Dyck-(k, D), and a soft-attention network with two layers and O(log k) memory size that generates Dyck-(k, D). Experiments show that self-attention networks trained on Dyck-(k, D) generalize to longer inputs with near-perfect accuracy, and also verify the theoretical memory advantage of self-attention networks over recurrent networks.
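To make the object of study concrete, below is a minimal stack-based reference recognizer for Dyck-(k, D) in Python. It is only a ground-truth membership check, using the hypothetical convention that tokens are (type, is_open) pairs; it is not the hard- or soft-attention construction from the paper.

```python
# Ground-truth recognizer for Dyck-(k, D): well-nested strings over k bracket
# types whose nesting depth never exceeds D. A plain stack check, not the
# paper's attention-based construction.
def is_dyck_k_D(tokens, k, D):
    """tokens: iterable of (bracket_type, is_open) pairs with 0 <= bracket_type < k."""
    stack = []
    for typ, is_open in tokens:
        if not 0 <= typ < k:
            return False
        if is_open:
            stack.append(typ)
            if len(stack) > D:                    # depth bound D exceeded
                return False
        else:
            if not stack or stack[-1] != typ:     # unmatched or wrong closing bracket
                return False
            stack.pop()
    return not stack                              # all brackets must be closed

# "( [ ] )" with k = 2, D = 2 is accepted:
print(is_dyck_k_D([(0, True), (1, True), (1, False), (0, False)], k=2, D=2))
```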
Award ID(s):
2134059
NSF-PAR ID:
10378910
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the conference Association for Computational Linguistics Meeting
ISSN:
0736-587X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The best known solutions for k-message broadcast in dynamic networks of size n require Ω(nk) rounds. In this paper, we see if these bounds can be improved by smoothed analysis. To do so, we study perhaps the most natural randomized algorithm for disseminating tokens in this setting: at every time step, choose a token to broadcast randomly from the set of tokens you know. We show that with even a small amount of smoothing (i.e., one random edge added per round), this natural strategy solves k-message broadcast in Õ(n+k³) rounds, with high probability, beating the best known bounds for k = o(√n) and matching the Ω(n+k) lower bound for static networks for k = O(n^{1/3}) (ignoring logarithmic factors). In fact, the main result we show is even stronger and more general: given 𝓁-smoothing (i.e., 𝓁 random edges added per round), this simple strategy terminates in O(kn^{2/3}log^{1/3}(n)𝓁^{-1/3}) rounds. We then prove that this analysis is close to tight with an almost-matching lower bound. To better understand the impact of smoothing on information spreading, we next turn our attention to static networks, proving a tight bound of Õ(k√n) rounds to solve k-message broadcast, which is better than what our strategy can achieve in the dynamic setting. This confirms the intuition that although smoothed analysis reduces the difficulties induced by changing graph structures, it does not eliminate them altogether. Finally, we apply tools developed to support our smoothed analysis to prove an optimal result for k-message broadcast in so-called well-mixed networks in the absence of smoothing. By comparing this result to an existing lower bound for well-mixed networks, we establish a formal separation between oblivious and strongly adaptive adversaries with respect to well-mixed token spreading, partially resolving an open question on the impact of adversary strength on the k-message broadcast problem.
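For intuition about the strategy in item 1, here is a toy Python simulation. It makes assumptions that are not in the abstract: the adversarial dynamic graph is stood in for by a fresh random ring each round (a real adversary is worst-case), 𝓁-smoothing is modeled as 𝓁 uniformly random extra edges, and the names (broadcast_rounds, know, etc.) are hypothetical. Varying l_smooth shows the qualitative trend the analysis quantifies: more smoothing, fewer rounds.

```python
import random

# Toy simulation: every round, each node broadcasts a uniformly random token
# it already knows to its current neighbors; smoothing adds random extra edges.
def broadcast_rounds(n, k, l_smooth, seed=0):
    rng = random.Random(seed)
    know = [set() for _ in range(n)]
    for t in range(k):                           # each token starts at one node
        know[rng.randrange(n)].add(t)
    rounds = 0
    while any(len(s) < k for s in know):
        rounds += 1
        order = list(range(n))
        rng.shuffle(order)                       # stand-in "adversarial" ring
        edges = {(order[i], order[(i + 1) % n]) for i in range(n)}
        for _ in range(l_smooth):                # l-smoothing: random extra edges
            u, v = rng.sample(range(n), 2)
            edges.add((u, v))
        msgs = [rng.choice(sorted(s)) if s else None for s in know]
        for u, v in edges:                       # messages travel both ways
            if msgs[u] is not None:
                know[v].add(msgs[u])
            if msgs[v] is not None:
                know[u].add(msgs[v])
    return rounds

print(broadcast_rounds(n=64, k=8, l_smooth=1))
```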
  2. Motivated by an attempt to understand the formation and development of (human) language, we introduce a "distributed compression" problem. In our problem, a sequence of pairs of players from a set of K players is chosen and tasked to communicate messages drawn from an unknown distribution Q. Arguably languages are created and evolve to compress frequently occurring messages, and we focus on this aspect. The only knowledge that players have about the distribution Q is from previously drawn samples, but these samples differ from player to player. The only common knowledge between the players is restricted to a common prior distribution P and some constant number of bits of information (such as a learning algorithm). Letting T_ε denote the number of iterations it would take for a typical player to obtain an ε-approximation to Q in total variation distance, we ask whether T_ε iterations suffice to compress the messages down roughly to their entropy, and we give a partial positive answer. We show that a natural uniform algorithm can compress the communication down to an average cost per message of O(H(Q) + log(D(P || Q) + O(1))) in Õ(T_ε) iterations while allowing for O(ε)-error, where D(· || ·) denotes the KL-divergence between distributions. For large divergences this compares favorably with the static algorithm that ignores all samples and compresses down to H(Q) + D(P || Q) bits, while not requiring the T_ε · K iterations it would take players to develop optimal but separate compressions for each pair of players. Along the way we introduce a "data-structural" view of the task of communicating with a natural language and show that our natural algorithm can also be implemented by an efficient data structure, whose storage is comparable to the storage requirements of Q and whose query complexity is comparable to the lengths of the messages to be compressed. Our results give a plausible mathematical analogy to the mechanisms by which human languages are created and evolve, and in particular highlight the possibility of coordination towards a joint task (agreeing on a language) while engaging in distributed learning.
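As an illustration of the compression idea in item 2 (not the paper's algorithm): a player can blend the shared prior P with its own empirical samples of Q and charge a Shannon-style code length of roughly −log₂(estimate) bits per message, so frequent messages get short codes and the average cost drifts toward H(Q) as samples accumulate. The mixing weight and smoothing floor below are arbitrary assumptions.

```python
import math
from collections import Counter

# Shannon-style code length under a blend of the common prior P and the
# player's own empirical estimate of Q (illustration only).
def code_length_bits(message, prior, samples, mix=0.5):
    counts = Counter(samples)
    empirical = counts[message] / len(samples) if samples else 0.0
    estimate = (1 - mix) * prior.get(message, 1e-12) + mix * empirical
    return -math.log2(max(estimate, 1e-12))       # floor avoids log(0)

prior = {"a": 0.7, "b": 0.2, "c": 0.1}            # shared prior P
samples = ["c", "c", "b", "c", "a", "c"]          # private draws from unknown Q
print(code_length_bits("c", prior, samples))      # frequent under Q -> short code
```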
  3. Transformer interpretability aims to understand the algorithm implemented by a learned Transformer by examining various aspects of the model, such as the weight matrices or the attention patterns. In this work, through a combination of theoretical results and carefully controlled experiments on synthetic data, we take a critical view of methods that exclusively focus on individual parts of the model, rather than consider the network as a whole. We consider a simple synthetic setup of learning a (bounded) Dyck language. Theoretically, we show that the set of models that (exactly or approximately) solve this task satisfy a structural characterization derived from ideas in formal languages (the pumping lemma). We use this characterization to show that the set of optima is qualitatively rich; in particular, the attention pattern of a single layer can be “nearly randomized”, while preserving the functionality of the network. We also show via extensive experiments that these constructions are not merely a theoretical artifact: even with severe constraints to the architecture of the model, vastly different solutions can be reached via standard training. Thus, interpretability claims based on inspecting individual heads or weight matrices in the Transformer can be misleading. 
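The pumping-lemma flavor behind item 3's characterization can be seen on the language itself. The toy check below (one bracket type, hypothetical names, not the paper's structural result) verifies that repeating a well-nested block that already respects the depth budget keeps a string inside bounded Dyck.

```python
# Dyck-(1, D) membership: one bracket type, nesting depth bounded by D.
def in_dyck1_D(s, D):
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0 or depth > D:
            return False
    return depth == 0

# Pumping the balanced block y at its original depth never exceeds the bound:
x, y, z, D = "((", "()", "))", 3
print(all(in_dyck1_D(x + y * i + z, D) for i in range(6)))   # True
```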
  4. We use a recently developed synchronous Spiking Neural Network (SNN) model to study the problem of learning hierarchically-structured concepts. We introduce an abstract data model that describes simple hierarchical concepts. We define a feed-forward layered SNN model, with learning modeled using Oja’s local learning rule, a well known biologically-plausible rule for adjusting synapse weights. We define what it means for such a network to recognize hierarchical concepts; our notion of recognition is robust, in that it tolerates a bounded amount of noise. Then, we present a learning algorithm by which a layered network may learn to recognize hierarchical concepts according to our robust definition. We analyze correctness and performance rigorously; the amount of time required to learn each concept, after learning all of the sub-concepts, is approximately O((1/(ηk))(ℓ_max log(k) + 1/ε) + b log(k)), where k is the number of sub-concepts per concept, ℓ_max is the maximum hierarchical depth, η is the learning rate, ε describes the amount of uncertainty allowed in robust recognition, and b describes the amount of weight decrease for "irrelevant" edges. An interesting feature of this algorithm is that it allows the network to learn sub-concepts in a highly interleaved manner. This algorithm assumes that the concepts are presented in a noise-free way; we also extend these results to accommodate noise in the learning process. Finally, we give a simple lower bound saying that, in order to recognize concepts with hierarchical depth two with noise-tolerance, a neural network should have at least two layers. The results in this paper represent first steps in the theoretical study of hierarchical concepts using SNNs. The cases studied here are basic, but they suggest many directions for extensions to more elaborate and realistic cases.
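Item 4's analysis is built around Oja's local rule, which is standard and compact enough to state in code. The sketch below applies the textbook update w ← w + η·y·(x − y·w), with y = w·x, to one repeated input pattern; the layered spiking model, noise tolerance, and timing analysis of the paper are not modeled here.

```python
# One step of Oja's rule: w <- w + eta * y * (x - y * w), where y = w . x.
def oja_step(w, x, eta=0.1):
    y = sum(wi * xi for wi, xi in zip(w, x))      # postsynaptic response
    return [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]

w = [0.5, 0.5, 0.5]
for _ in range(200):
    w = oja_step(w, [1.0, 1.0, 0.0])              # repeated "concept" input
print([round(wi, 2) for wi in w])                 # ~[0.71, 0.71, 0.0]: unit vector
                                                  # aligned with the input
```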