Recent years have witnessed the enormous success of text representation
learning in a wide range of text mining tasks. Earlier
word embedding learning approaches represent words as fixed
low-dimensional vectors to capture their semantics. The word embeddings
so learned are used as the input features of task-specific
models. Recently, pre-trained language models (PLMs), which learn
universal language representations via pre-training Transformer-based
neural models on large-scale text corpora, have revolutionized
the natural language processing (NLP) field. Such pre-trained
representations encode generic linguistic features that can be transferred
to almost any text-related application. PLMs outperform
previous task-specific models in many applications as they only
need to be fine-tuned on the target corpus instead of being trained
from scratch.
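As a minimal illustration of this contrast (our sketch, not part of the tutorial material; it assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint), a PLM assigns the same surface word different vectors in different contexts, whereas a fixed word embedding assigns one vector per word type:

```python
# Minimal sketch: contextual representations from a pre-trained language model.
# Assumes the Hugging Face `transformers` package and the public
# `bert-base-uncased` checkpoint; these are illustrative choices only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    # Locate the first sub-token of `word` in the tokenized sentence.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index(tokenizer.tokenize(word)[0])
    return hidden[idx]

# Unlike a static embedding, the vector for "bank" depends on its context.
v_river = word_vector("He sat on the bank of the river.", "bank")
v_money = word_vector("She deposited cash at the bank.", "bank")
print(torch.cosine_similarity(v_river, v_money, dim=0))
```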
In this tutorial, we introduce recent advances in pre-trained text
embeddings and language models, as well as their applications to a
wide range of text mining tasks. Specifically, we first overview a
set of recently developed self-supervised and weakly-supervised
text embedding methods and pre-trained language models that
serve as the foundation for downstream tasks. We then present
several new methods based on pre-trained text embeddings and
language models for various text mining applications such as topic
discovery and text classification. We focus on methods that are
weakly-supervised, domain-independent, language-agnostic, effective
and scalable for mining and discovering structured knowledge
from large-scale text corpora. Finally, we demonstrate with real-world
datasets how pre-trained text representations help mitigate
the human annotation burden and facilitate automatic, accurate
and efficient text analyses.
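As a generic illustration of how pre-trained representations reduce the annotation burden (a similarity-based sketch of weakly-supervised classification, not one of the specific methods presented in the tutorial; it assumes the sentence-transformers package and its public all-MiniLM-L6-v2 checkpoint), documents can be assigned to classes using only the class names as supervision:

```python
# Hypothetical sketch: weakly supervised text classification using only class
# names, no labeled documents. Assumes the `sentence-transformers` package and
# the public `all-MiniLM-L6-v2` checkpoint; both are illustrative choices.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

class_names = ["sports", "politics", "technology"]  # the only supervision
documents = [
    "The striker scored twice in the final minutes of the match.",
    "Lawmakers debated the new budget proposal late into the night.",
]

# Embed class names and documents into the same vector space, then assign
# each document to the class name with the highest cosine similarity.
class_vecs = encoder.encode(class_names, convert_to_tensor=True)
doc_vecs = encoder.encode(documents, convert_to_tensor=True)
scores = util.cos_sim(doc_vecs, class_vecs)  # shape: (n_docs, n_classes)
for doc, row in zip(documents, scores):
    print(doc, "->", class_names[int(row.argmax())])
```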