Instead of mining coherent topics from a given text corpus in a completely
unsupervised manner, seed-guided topic discovery methods
leverage user-provided seed words to extract distinctive and
coherent topics so that the mined topics can better cater to the
user’s interest. To model the semantic correlation between words
and seeds for discovering topic-indicative terms, existing seed-guided
approaches utilize different types of context signals, such as
document-level word co-occurrences, sliding window-based local
contexts, and generic linguistic knowledge brought by pre-trained
language models. In this work, we analyze and show empirically
that each type of context information has its own value and limitations
in modeling word semantics under seed guidance, but combining the
three types of contexts (i.e., word embeddings learned from local
contexts, pre-trained language model representations obtained
from general-domain training, and topic-indicative sentences retrieved
based on seed information) allows them to complement
each other for discovering quality topics. We propose an iterative
framework, SeedTopicMine, which jointly learns from the three
types of contexts and gradually fuses their context signals via an
ensemble ranking process. Under various sets of seeds and on multiple
datasets, SeedTopicMine consistently yields more coherent
and accurate topics than existing seed-guided topic discovery approaches.
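The abstract's ensemble ranking step can be illustrated with a minimal sketch: each context source produces its own ranked list of candidate terms for a seed, and the lists are fused by mean reciprocal rank. The rankings, term names, and the choice of reciprocal-rank fusion below are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: fuse per-context term rankings for one seed.
# Each context source (local-context embeddings, PLM representations,
# retrieved topic-indicative sentences) yields its own ranked list;
# terms are scored by mean reciprocal rank across the sources.

def fuse_rankings(rankings):
    """Fuse several ranked term lists by mean reciprocal rank."""
    scores = {}
    for ranking in rankings:
        for pos, term in enumerate(ranking, start=1):
            scores[term] = scores.get(term, 0.0) + 1.0 / pos
    n = len(rankings)  # average over the number of context sources
    return sorted(scores, key=lambda t: scores[t] / n, reverse=True)

# Illustrative rankings for the seed "sports" from three context types.
emb_rank = ["basketball", "football", "game", "season"]
plm_rank = ["football", "basketball", "league", "game"]
ret_rank = ["basketball", "league", "football", "coach"]

fused = fuse_rankings([emb_rank, plm_rank, ret_rank])
```

A term ranked highly by all three sources (here "basketball") rises to the top, while a term favored by only one source is demoted, which is the complementarity the abstract describes.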