Title: Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding
Unsupervised, real-time discovery of stories, each a set of correlated news articles, helps people digest massive news streams without expensive human annotation. Existing studies on unsupervised online story discovery commonly represent news articles with symbolic or graph-based embeddings and incrementally cluster them into stories. Recent large language models are expected to improve these embeddings further, but straightforwardly adopting them by indiscriminately encoding all the information in an article is ineffective for text-rich and evolving news streams. In this work, we propose a novel thematic embedding that uses an off-the-shelf pretrained sentence encoder to dynamically represent articles and stories by considering their shared temporal themes. To realize this idea for unsupervised online story discovery, we introduce USTORY, a scalable framework with two main techniques, theme- and time-aware dynamic embedding and novelty-aware adaptive clustering, fueled by lightweight story summaries. A thorough evaluation on real news data sets demonstrates that USTORY achieves higher story discovery performance than baselines while remaining robust and scalable across various streaming settings.
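The general pattern the abstract refers to, embedding each article and incrementally clustering it into a story, can be illustrated with a minimal sketch. This is not the USTORY algorithm: it assumes article embeddings are already available as plain vectors (for example, from a pretrained sentence encoder), uses a fixed cosine-similarity threshold as a stand-in for novelty detection, and omits the theme- and time-aware weighting and the story summaries described above. The names `cosine`, `assign_stories`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_stories(embeddings, threshold=0.8):
    """Assign each incoming article embedding to a story, in arrival order.

    Assumption: an article whose best similarity to every existing story
    centroid falls below `threshold` is treated as novel and starts a new story.
    """
    centroids = []  # running mean embedding of each story
    counts = []     # number of articles assigned to each story
    labels = []     # story index per article, in arrival order
    for vec in embeddings:
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is None or sims[best] < threshold:
            # Novel article: open a new story seeded with this embedding.
            centroids.append(list(vec))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Known story: fold the article into the running mean.
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], vec)]
            counts[best] = n + 1
            labels.append(best)
    return labels
```

Calling `assign_stories` on a short stream of two-dimensional toy vectors returns one story index per article. In a real stream, the vectors would come from the sentence encoder and older stories would typically be expired over time, which this sketch does not model.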
1956151 1741317 1704532
Proc. 2023 ACM SIGIR Int. Conf. on Research and Development in Information Retrieval 
802 to 811
Unsupervised Story Discovery, Mining Continuous News Streams, Scalable Thematic Embedding, Text Mining, Text Embedding
Taipei, Taiwan
National Science Foundation
