Title: TDLR: Top Semantic-Down Syntactic Language Representation
Language understanding involves processing text with both the grammatical and common-sense contexts of the text fragments. The text “I went to the grocery store and brought home a car” requires both the grammatical context (syntactic) and common-sense context (semantic) to capture the oddity in the sentence. Contextualized text representations learned by Language Models (LMs) are expected to capture a variety of syntactic and semantic contexts from large amounts of training data corpora. Recent work such as ERNIE has shown that infusing the knowledge contexts, where they are available in LMs, results in significant performance gains on General Language Understanding Evaluation (GLUE) benchmark tasks. However, to our knowledge, no knowledge-aware model has attempted to infuse knowledge through top-down semantics-driven syntactic processing (e.g., common-sense to grammatical) and directly operated on the attention mechanism that LMs leverage to learn the data context. We propose a learning framework, Top-Down Language Representation (TDLR), to infuse common-sense semantics into LMs. In our implementation, we build on BERT for its rich syntactic knowledge and use the knowledge graphs ConceptNet and WordNet to infuse semantic knowledge.
Award ID(s):
2133842
PAR ID:
10429264
Journal Name:
NeurIPS'22 Workshop on All Things Attention: Bridging Different Perspectives on Attention
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We introduce The Benchmark of Linguistic Minimal Pairs (BLiMP), a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. BLiMP consists of 67 individual datasets, each containing 1,000 minimal pairs, that is, pairs of minimally different sentences that contrast in grammatical acceptability and isolate a specific phenomenon in syntax, morphology, or semantics. We generate the data according to linguist-crafted grammar templates, and human aggregate agreement with the labels is 96.4%. We evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs by observing whether they assign a higher probability to the acceptable sentence in each minimal pair. We find that state-of-the-art models identify morphological contrasts related to agreement reliably, but they struggle with some subtle semantic and syntactic phenomena, such as negative polarity items and extraction islands.
  2. With the advent of pre-trained language models (LMs), increasing research efforts have been focusing on infusing commonsense and domain-specific knowledge to prepare LMs for downstream tasks. These works attempt to leverage knowledge graphs, the de facto standard of symbolic knowledge representation, along with pre-trained LMs. While existing approaches leverage external knowledge, it remains an open question how to jointly incorporate knowledge graphs represented in varying contexts — from local (e.g., sentence), document-level, to global knowledge, to enable knowledge-rich and interpretable exchange across contexts. In addition, incorporating varying contexts can especially benefit long document understanding tasks that leverage pre-trained LMs, typically bounded by the input sequence length. In light of these challenges, we propose KALM, a language model that jointly leverages knowledge in local, document-level, and global contexts for long document understanding. KALM firstly encodes long documents and knowledge graphs into the three knowledge-aware context representations. KALM then processes each context with context-specific layers. These context-specific layers are followed by a ContextFusion layer that facilitates knowledge exchange to derive an overarching document representation. Extensive experiments demonstrate that KALM achieves state-of-the-art performance on three long document understanding tasks across 6 datasets/settings. Further analyses reveal that the three knowledge-aware contexts are complementary and they all contribute to model performance, while the importance and information exchange patterns of different contexts vary on different tasks and datasets. 
  3. NLP is currently dominated by language models like RoBERTa which are pretrained on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? To explore this question, we adopt five styles of evaluation: classifier probing, information-theoretic probing, unsupervised relative acceptability judgments, unsupervised language model knowledge probing, and fine-tuning on NLU tasks. We then draw learning curves that track the growth of these different measures of model ability with respect to pretraining data volume using the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M and 1B words. We find that these LMs require only about 10M to 100M words to learn to reliably encode most syntactic and semantic features we test. They need a much larger quantity of data in order to acquire enough commonsense knowledge and other skills required to master typical downstream NLU tasks. The results suggest that, while the ability to encode linguistic features is almost certainly necessary for language understanding, it is likely that other, unidentified, forms of knowledge are the major drivers of recent improvements in language understanding among large pretrained models.
  4. One hallmark of human reasoning is that we can bring to bear a diverse web of common-sense knowledge in any situation. The vastness of our knowledge poses a challenge for the practical implementation of reasoning systems as well as for our cognitive theories: how do people represent their common-sense knowledge? On the one hand, our best models of sophisticated reasoning are top-down, making use primarily of symbolically-encoded knowledge. On the other, much of our understanding of the statistical properties of our environment may arise in a bottom-up fashion, for example through associationist learning mechanisms. Indeed, recent advances in AI have enabled the development of billion-parameter language models that can scour for patterns in gigabytes of text from the web, picking up a surprising amount of common-sense knowledge along the way, but they fail to learn the structure of coherent reasoning. We propose combining these approaches, by embedding language-model-backed primitives into a state-of-the-art probabilistic programming language (PPL). On two open-ended reasoning tasks, we show that our PPL models with neural knowledge components characterize the distribution of human responses more accurately than the neural language models alone, raising interesting questions about how people might use language as an interface to common-sense knowledge, and suggesting that building probabilistic models with neural language-model components may be a promising approach for more human-like AI.
  5. Like all domains of cognition, language processing is affected by top–down knowledge. Classic evidence for this is missing blatant errors in the signal. In sentence comprehension, one instance is failing to notice word order errors, such as transposed words in the middle of a sentence: “you that read wrong” (Mirault et al., 2018). Our brains seem to fix such errors, since they are incompatible with our grammatical knowledge, but how do our brains do this? Following behavioral work on inner transpositions, we flashed four-word sentences for 300 ms using rapid parallel visual presentation (Snell and Grainger, 2017). We compared magnetoencephalography responses to fully grammatical and reversed sentences (24 human participants: 21 females, 4 males). The left lateral language cortex robustly distinguished grammatical and reversed sentences starting at 213 ms. Thus, the influence of grammatical knowledge began rapidly after visual word form recognition (Tarkiainen et al., 1999). At the earliest stage of this neural “sentence superiority effect,” inner transpositions patterned between grammatical and reversed sentences, showing evidence that the brain initially “noticed” the error. However, 100 ms later, inner transpositions became indistinguishable from grammatical sentences, suggesting that at this point the brain had “fixed” the error. These results show that after a single glance at a sentence, syntax impacts our neural activity almost as quickly as higher-level object recognition is assumed to take place (Cichy et al., 2014). The earliest stage involves detailed comparisons between the bottom–up input and grammatical knowledge, while shortly afterward, top–down knowledge can override an error in the stimulus.