Pre-trained language models (PLMs) aim to learn universal language representations by conducting self-supervised training tasks on large-scale corpora. Since PLMs capture word semantics in different contexts, the quality of word representations highly depends on word frequency, which usually follows a heavy-tailed distributions in the pre-training corpus. Therefore, the embeddings of rare words on the tail are usually poorly optimized. In this work, we focus on enhancing language model pre-training by leveraging definitions of the rare words in dictionaries (e.g., Wiktionary). To incorporate a rare word definition as a part of input, we fetch its definition from the dictionary and append it to the end of the input text sequence. In addition to training with the masked language modeling objective, we propose two novel self-supervised pre-training tasks on word and sentence-level alignment between input text sequence and rare word definitions to enhance language modeling representation with dictionary. We evaluate the proposed Dict-BERT model on the language understanding benchmark GLUE and eight specialized domain benchmark datasets. Extensive experiments demonstrate that Dict-BERT can significantly improve the understanding of rare words and boost model performance on various NLP downstream tasks.
more »
« less
On the Power of Pre-Trained Text Representations: Models and Applications in Text Mining
Recent years have witnessed the enormous success of text representation learning in a wide range of text mining tasks. Earlier word embedding learning approaches represent words as fixed low-dimensional vectors to capture their semantics. The word embeddings so learned are used as the input features of task-specific models. Recently, pre-trained language models (PLMs), which learn universal language representations via pre-training Transformer-based neural models on large-scale text corpora, have revolutionized the natural language processing (NLP) field. Such pre-trained representations encode generic linguistic features that can be transferred to almost any text-related applications. PLMs outperform previous task-specific models in many applications as they only need to be fine-tuned on the target corpus instead of being trained from scratch. In this tutorial, we introduce recent advances in pre-trained text embeddings and language models, as well as their applications to a wide range of text mining tasks. Specifically, we first overview a set of recently developed self-supervised and weakly-supervised text embedding methods and pre-trained language models that serve as the fundamentals for downstream tasks. We then present several new methods based on pre-trained text embeddings and language models for various text mining applications such as topic discovery and text classification. We focus on methods that are weakly-supervised, domain-independent, language-agnostic, effective and scalable for mining and discovering structured knowledge from large-scale text corpora. Finally, we demonstrate with real world datasets how pre-trained text representations help mitigate the human annotation burden and facilitate automatic, accurate and efficient text analyses.
more »
« less
- PAR ID:
- 10311079
- Date Published:
- Journal Name:
- KDD'21:The 27th {ACM} {SIGKDD} Conference on Knowledge Discovery and Data Mining, August 14-18, 2021
- Volume:
- 2020
- Issue:
- 1
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
null (Ed.)External knowledge is often useful for natural language understanding tasks. We introduce a contextual text representation model called Conceptual-Contextual (CC) embeddings, which incorporates structured knowledge into text representations. Unlike entity embedding methods, our approach encodes a knowledge graph into a context model. CC embeddings can be easily reused for a wide range of tasks in a similar fashion to pre-trained language models. Our model effectively encodes the huge UMLS database by leveraging semantic generalizability. Experiments on electronic health records (EHRs) and medical text processing benchmarks showed our model gives a major boost to the performance of supervised medical NLP tasks.more » « less
-
Mining a set of meaningful and distinctive topics automatically from massive text corpora has broad applications. Existing topic models, however, typically work in a purely unsupervised way, which often generate topics that do not fit users’ particular needs and yield suboptimal performance on downstream tasks. We propose a new task, discriminative topic mining, which leverages a set of user-provided category names to mine discriminative topics from text corpora. This new task not only helps a user understand clearly and distinctively the topics he/she is most interested in, but also benefits directly keyword-driven classification tasks. We develop CatE, a novel category-name guided text embedding method for discriminative topic mining, which effectively leverages minimal user guidance to learn a discriminative embedding space and discover category representative terms in an iterative manner. We conduct a comprehensive set of experiments to show that CatE mines highquality set of topics guided by category names only, and benefits a variety of downstream applications including weakly-supervised classification and lexical entailment direction identification.more » « less
-
Traditionally, many text-mining tasks treat individual word-tokens as the finest meaningful semantic granularity. However, in many languages and specialized corpora, words are composed by concatenating semantically meaningful subword structures. Word-level analysis cannot leverage the semantic information present in such subword structures. With regard to word embedding techniques, this leads to not only poor embeddings for infrequent words in long-tailed text corpora but also weak capabilities for handling out-of-vocabulary words. In this paper we propose MorphMine for unsupervised morpheme segmentation. MorphMine applies a parsimony criterion to hierarchically segment words into the fewest number of morphemes at each level of the hierarchy. This leads to longer shared morphemes at each level of segmentation. Experiments show that MorphMine segments words in a variety of languages into human-verified morphemes. Additionally, we experimentally demonstrate that utilizing MorphMine morphemes to enrich word embeddings consistently improves embedding quality on a variety of of embedding evaluations and a downstream language modeling task.more » « less
-
The recent rapid development in Natural Language Processing (NLP) has greatly en- hanced the effectiveness of Intelligent Tutoring Systems (ITS) as tools for healthcare education. These systems hold the potential to improve health-related quality of life (HRQoL) outcomes, especially for populations with limited English reading and writing skills. However, despite the progress in pre-trained multilingual NLP models, there exists a noticeable research gap when it comes to code-switching within the medical context. Code-switching is a prevalent phenomenon in multilingual communities where individuals seamlessly transition between languages during conversations. This presents a distinctive challenge for healthcare ITS aimed at serving multilin- gual communities, as it demands a thorough understanding of and accurate adaptation to code- switching, which has thus far received limited attention in research. The hypothesis of our work asserts that the development of an ITS for healthcare education, culturally appropriate to the Hispanic population with frequent code-switching practices, is both achievable and pragmatic. Given that text classification is a core problem to many tasks in ITS, like sentiment analysis, topic classification, and smart replies, we target text classification as the application domain to validate our hypothesis. Our model relies on pre-trained word embeddings to offer rich representations for understand- ing code-switching medical contexts. However, training such word embeddings, especially within the medical domain, poses a significant challenge due to limited training corpora. In our approach to address this challenge, we identify distinct English and Spanish embeddings, each trained on medical corpora, and subsequently merge them into a unified vector space via space transforma- tion. In our study, we demonstrate that singular value decomposition (SVD) can be used to learn a linear transformation (a matrix), which aligns monolingual vectors from two languages in a single meta-embedding. As an example, we assessed the similarity between the words “cat” and “gato” both before and after alignment, utilizing the cosine similarity metric. Prior to alignment, these words exhibited a similarity score of 0.52, whereas after alignment, the similarity score increased to 0.64. This example illustrates that aligning the word vectors in a meta-embedding enhances the similarity between these words, which share the same meaning in their respective languages. To assess the quality of the representations in our meta-embedding in the context of code-switching, we employed a neural network to conduct text classification tasks on code-switching datasets. Our results demonstrate that, compared to pre-trained multilingual models, our model can achieve high performance in text classification tasks while utilizing significantly fewer parameters.more » « less