Title: Knowledge Enhanced Contextual Word Representations
Contextual word representations, typically trained on unstructured, unlabeled text, do not contain any explicit grounding to real world entities and are often unable to remember facts about those entities. We propose a general method to embed multiple knowledge bases (KBs) into large scale models, and thereby enhance their representations with structured, human-curated knowledge. For each KB, we first use an integrated entity linker to retrieve relevant entity embeddings, then update contextual word representations via a form of word-to-entity attention. In contrast to previous approaches, the entity linkers and self-supervised language modeling objective are jointly trained end-to-end in a multitask setting that combines a small amount of entity linking supervision with a large amount of raw text. After integrating WordNet and a subset of Wikipedia into BERT, the knowledge enhanced BERT (KnowBert) demonstrates improved perplexity, ability to recall facts as measured in a probing task, and downstream performance on relationship extraction, entity typing, and word sense disambiguation. KnowBert’s runtime is comparable to BERT’s and it scales to large KBs.
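The word-to-entity attention step described in the abstract can be illustrated with a minimal sketch. It assumes single-head scaled dot-product attention followed by a residual update, and that candidate entity embeddings have already been retrieved and share the word vectors' dimensionality. None of these details are stated in the abstract, and the function name `word_to_entity_attention` is hypothetical rather than taken from KnowBert's code.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def word_to_entity_attention(words: List[Vector], entities: List[Vector]) -> List[Vector]:
    """Illustrative only: each word attends over the candidate entity
    embeddings and adds the attention-weighted entity summary to itself."""
    scale = math.sqrt(len(entities[0]))
    updated = []
    for w in words:
        # Attention weights of this word over all candidate entities.
        weights = softmax([dot(w, e) / scale for e in entities])
        # Weighted average of the entity embeddings.
        summary = [
            sum(wt * e[i] for wt, e in zip(weights, entities))
            for i in range(len(w))
        ]
        # Residual update (an assumption of this sketch).
        updated.append([wi + si for wi, si in zip(w, summary)])
    return updated


if __name__ == "__main__":
    words = [[1.0, 0.0], [0.0, 1.0]]
    entities = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    for vec in word_to_entity_attention(words, entities):
        print([round(v, 3) for v in vec])
```

In this toy setup each word vector is pulled toward the entity embedding it most resembles. The entity linker and the joint multitask training mentioned in the abstract are outside the scope of the sketch.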
Award ID(s):
1817183
NSF-PAR ID:
10180482
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Page Range / eLocation ID:
43 to 54
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In standard methodology for natural language processing, entities in text are typically embedded in dense vector spaces with pre-trained models. The embeddings produced this way are effective when fed into downstream models, but they require end-task fine-tuning and are fundamentally difficult to interpret. In this paper, we present an approach to creating entity representations that are human readable and achieve high performance on entity-related tasks out of the box. Our representations are vectors whose values correspond to posterior probabilities over fine-grained entity types, indicating the confidence of a typing model’s decision that the entity belongs to the corresponding type. We obtain these representations using a fine-grained entity typing model, trained either on supervised ultra-fine entity typing data (Choi et al., 2018) or distantly-supervised examples from Wikipedia. On entity probing tasks involving recognizing entity identity, our embeddings used in parameter-free downstream models achieve competitive performance with ELMo- and BERT-based embeddings in trained models. We also show that it is possible to reduce the size of our type set in a learning-based way for particular domains. Finally, we show that these embeddings can be post-hoc modified through a small number of rules to incorporate domain knowledge and improve performance.
  2. Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but few explorations have been done on understanding the text content of a query or a document. This paper studies leveraging a recently-proposed contextual neural language model, BERT, to provide deeper text understanding for IR. Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural languages. Combining the text understanding ability with search knowledge leads to an enhanced pre-trained BERT model that can benefit related search tasks where training data are limited.
  3. Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but few explorations have been done on understanding the text content of a query or a document. This paper studies leveraging a recently-proposed contextual neural language model, BERT, to provide deeper text understanding for IR. Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural languages. Combining the text understanding ability with search knowledge leads to an enhanced pre-trained BERT model that can benefit related search tasks where training data are limited.
  4. The commonsense natural language inference (CNLI) tasks aim to select the most likely follow-up statement to a contextual description of ordinary, everyday events and facts. Current approaches to transfer learning of CNLI models across tasks require large amounts of labeled data from the new task. This paper presents a way to reduce this need for additional annotated training data from the new task by leveraging symbolic knowledge bases, such as ConceptNet. We formulate a teacher-student framework for mixed symbolic-neural reasoning, with the large-scale symbolic knowledge base serving as the teacher and a trained CNLI model as the student. This hybrid distillation process involves two steps. The first step is a symbolic reasoning process. Given a collection of unlabeled data, we use an abductive reasoning framework based on Grenander's pattern theory to create weakly labeled data. Pattern theory is an energy-based graphical probabilistic framework for reasoning among random variables with varying dependency structures. In the second step, the weakly labeled data, along with a fraction of the labeled data, is used to transfer-learn the CNLI model into the new task. The goal is to reduce the fraction of labeled data required. We demonstrate the efficacy of our approach by using three publicly available datasets (OpenBookQA, SWAG, and HellaSWAG) and evaluating three CNLI models (BERT, LSTM, and ESIM) that represent different tasks. We show that, on average, we achieve 63% of the top performance of a fully supervised BERT model with no labeled data. With only 1000 labeled samples, we can improve this performance to 72%. Interestingly, without training, the teacher mechanism itself has significant inference power. The pattern theory framework achieves 32.7% accuracy on OpenBookQA, outperforming transformer-based models such as GPT (26.6%), GPT-2 (30.2%), and BERT (27.1%) by a significant margin. We demonstrate that the framework can be generalized to successfully train neural CNLI models using knowledge distillation under unsupervised and semi-supervised learning settings. Our results show that it outperforms all unsupervised and weakly supervised baselines and some early supervised approaches, while offering competitive performance with fully supervised baselines. Additionally, we show that the abductive learning framework can be adapted for other downstream tasks, such as unsupervised semantic textual similarity, unsupervised sentiment classification, and zero-shot text classification, without significant modification to the framework. Finally, user studies show that the generated interpretations enhance its explainability by providing key insights into its reasoning mechanism.
  5. Pre-trained language models (PLMs) aim to learn universal language representations by conducting self-supervised training tasks on large-scale corpora. Since PLMs capture word semantics in different contexts, the quality of word representations highly depends on word frequency, which usually follows a heavy-tailed distribution in the pre-training corpus. Therefore, the embeddings of rare words on the tail are usually poorly optimized. In this work, we focus on enhancing language model pre-training by leveraging definitions of the rare words in dictionaries (e.g., Wiktionary). To incorporate a rare word definition as a part of input, we fetch its definition from the dictionary and append it to the end of the input text sequence. In addition to training with the masked language modeling objective, we propose two novel self-supervised pre-training tasks on word and sentence-level alignment between input text sequence and rare word definitions to enhance language modeling representations with dictionary knowledge. We evaluate the proposed Dict-BERT model on the language understanding benchmark GLUE and eight specialized domain benchmark datasets. Extensive experiments demonstrate that Dict-BERT can significantly improve the understanding of rare words and boost model performance on various NLP downstream tasks.