Title: PDSum: Prototype-driven Continuous Summarization of Evolving Multi-document Sets Stream
Summarizing text-rich documents has long been studied in the literature, but most existing efforts have focused on summarizing a static and predefined multi-document set. With the rapid development of online platforms for generating and distributing text-rich documents, there is an urgent need to continuously summarize dynamically evolving multi-document sets, where the composition of documents and sets changes over time. This is especially challenging, as the summarization should not only be effective in incorporating relevant, novel, and distinctive information from each concurrent multi-document set, but also efficient enough to serve online applications. In this work, we propose a new summarization problem, Evolving Multi-Document sets stream Summarization (EMDS), and introduce a novel unsupervised algorithm, PDSum, built on the idea of prototype-driven continuous summarization. PDSum builds a lightweight prototype of each multi-document set and exploits it to adapt to new documents while preserving accumulated knowledge from previous documents. To produce updated summaries, the most representative sentences for each multi-document set are extracted by measuring their similarity to the prototypes. A thorough evaluation with real multi-document set streams demonstrates that PDSum outperforms state-of-the-art unsupervised multi-document summarization algorithms on EMDS in terms of relevance, novelty, and distinctiveness, and is also robust across evaluation settings.
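The prototype-driven idea in the abstract can be illustrated with a minimal sketch: each multi-document set keeps a lightweight prototype embedding that is blended with newly arriving documents, and summary sentences are extracted by cosine similarity to it. The decay rate, summary length, and plain-average update below are illustrative assumptions, not the authors' PDSum implementation.

```python
# Minimal sketch of prototype-driven continuous summarization (EMDS).
# Illustrative only: the decay rate `alpha`, summary length `k`, and
# mean-embedding update are assumptions, not the PDSum algorithm itself.
import numpy as np

class SetPrototype:
    """Lightweight prototype of one evolving multi-document set."""

    def __init__(self, dim: int, alpha: float = 0.1):
        self.proto = np.zeros(dim)  # accumulated knowledge from past documents
        self.alpha = alpha          # weight given to newly arriving documents
        self.initialized = False

    def update(self, sent_embs: np.ndarray) -> None:
        """Adapt to new documents while preserving accumulated knowledge."""
        batch_mean = sent_embs.mean(axis=0)
        if not self.initialized:
            self.proto, self.initialized = batch_mean, True
        else:
            self.proto = (1 - self.alpha) * self.proto + self.alpha * batch_mean

    def extract(self, sentences: list[str], sent_embs: np.ndarray, k: int = 3) -> list[str]:
        """Pick the k sentences most similar (cosine) to the prototype."""
        sims = sent_embs @ self.proto / (
            np.linalg.norm(sent_embs, axis=1) * np.linalg.norm(self.proto) + 1e-9
        )
        return [sentences[i] for i in np.argsort(-sims)[:k]]
```

Because the prototype is a single vector per set, updating it and scoring sentences against it stays cheap as new documents stream in, which matches the efficiency requirement the abstract emphasizes.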
Award ID(s):
1956151 1741317 1704532 2019897
PAR ID:
10467098
Author(s) / Creator(s):
Corporate Creator(s):
Conference / Proceedings:
Proceedings of the ACM Web Conference 2023 (WWW '23)
Publisher / Repository:
ACM
Date Published:
Edition / Version:
1
ISBN:
9781450394161
Page Range / eLocation ID:
1650–1661
Subject(s) / Keyword(s):
Prototype-driven Continuous Summarization, Evolving Multi-document Sets Stream, text stream mining
Format(s):
Medium: X
Location:
Austin, TX, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Carlotta Demeniconi, Ian Davidson (Eds.):
    Multi-document summarization, which summarizes a set of documents with a small number of phrases or sentences, provides a concise and critical essence of the documents. Existing multi-document summarization methods ignore the fact that there often exist many relevant documents that provide surrounding background knowledge, which can help generate a salient and discriminative summary for a given set of documents. In this paper, we propose a novel method, SUMDocS (Surrounding-aware Unsupervised Multi-Document Summarization), which incorporates rich surrounding (topically related) documents to help improve the quality of extractive summarization without human supervision. Specifically, we propose a joint optimization algorithm to unify global novelty (i.e., category-level frequent and discriminative), local consistency (i.e., locally frequent, co-occurring), and local saliency (i.e., salient from its surroundings) such that the obtained summary captures the characteristics of the target documents. Extensive experiments on news and scientific domains demonstrate the superior performance of our method when the unlabeled surrounding corpus is utilized.
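A hedged sketch of how the three SUMDocS signals might be combined when scoring a candidate sentence follows; the term-frequency proxies and equal weights are assumptions for illustration, not the paper's actual joint optimization.

```python
# Illustrative scoring of one candidate sentence with the three signals
# SUMDocS describes. The TF-based proxies and equal weights `w` are
# assumptions, not the paper's joint optimization.
from collections import Counter

def score_sentence(tokens: list[str],
                   target_tf: Counter,    # term counts in the target set
                   surround_tf: Counter,  # term counts in surrounding docs
                   cooccur: Counter,      # co-occurrence counts in target set
                   w=(1.0, 1.0, 1.0)) -> float:
    # Global novelty: frequent in the target set yet discriminative
    # against the surrounding corpus.
    novelty = sum(max(target_tf[t] - surround_tf[t], 0) for t in tokens)
    # Local consistency: terms that frequently co-occur within the set.
    consistency = sum(cooccur[t] for t in tokens)
    # Local saliency: contrast of target-set frequency with surroundings.
    saliency = sum(target_tf[t] / (1 + surround_tf[t]) for t in tokens)
    return w[0] * novelty + w[1] * consistency + w[2] * saliency
```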
  2. Social media's explosive growth has resulted in a massive influx of electronic documents influencing various facets of daily life. However, the enormous and complex nature of this content makes extracting valuable insights challenging. Long document summarization emerges as a pivotal technique in this context, serving to distill extensive texts into concise and comprehensible summaries. This paper presents a novel three-stage pipeline for effective long document summarization. The proposed approach combines unsupervised and supervised learning techniques, efficiently handling large document sets while requiring minimal computational resources. Our methodology introduces a unique process for forming semantic chunks through spectral dynamic segmentation, effectively reducing redundancy and repetitiveness in the summarization process. Contrary to previous methods, our approach aligns each semantic chunk with the entire summary paragraph, allowing the abstractive summarization model to process documents without truncation and enabling the summarization model to deduce missing information from other chunks. To enhance summary generation, we utilize a sophisticated rewrite model based on Bidirectional and Auto-Regressive Transformers (BART), rearranging and reformulating summary constructs to improve their fluidity and coherence. Empirical studies conducted on long documents from the Webis-TLDR-17 dataset demonstrate that our approach significantly enhances the efficiency of abstractive summarization transformers. The contributions of this paper thus offer significant advancements in the field of long document summarization, providing a novel and effective methodology for summarizing extensive texts in the context of social media.
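The chunk-forming stage can be sketched with off-the-shelf spectral clustering over sentence embeddings; the paper's spectral dynamic segmentation is more involved, and the `embed` function and fixed chunk count here are assumptions.

```python
# Illustrative "semantic chunk" formation: spectrally cluster sentence
# embeddings by pairwise similarity. The `embed` function and fixed
# `n_chunks` are assumptions; the paper's spectral dynamic segmentation
# adapts the segmentation rather than fixing the chunk count.
import numpy as np
from sklearn.cluster import SpectralClustering

def semantic_chunks(sentences: list[str], embed, n_chunks: int = 5):
    embs = embed(sentences)                       # (n_sentences, dim) array
    affinity = np.clip(embs @ embs.T, 0.0, None)  # nonnegative similarity
    labels = SpectralClustering(n_clusters=n_chunks,
                                affinity="precomputed").fit_predict(affinity)
    return [[s for s, c in zip(sentences, labels) if c == k]
            for k in range(n_chunks)]
```

Grouping topically related sentences this way is what lets each chunk, rather than a truncated prefix of the document, be fed to the abstractive model.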
  3. Extractive summarization is an important natural language processing approach used for document compression, improved reading comprehension, key phrase extraction, indexing, query set generation, and other analytics approaches. Extractive summarization has specific advantages over abstractive summarization in that it preserves style, specific text elements, and compound phrases that may be more directly associated with the text. In this article, the relative effectiveness of extractive summarization is considered on two widely different corpora: (1) a set of works of fiction (100 total, mainly novels) available from Project Gutenberg, and (2) a large set of news articles (3,000) for which a ground-truthed summarization (gold standard) is provided by the authors of the articles. Both sets were evaluated using five different Python Sumy algorithms and compared quantitatively to randomly generated summarizations. Two functional approaches to assessing summarization efficacy are introduced: applying a query set to both the original documents and their summaries, and using document classification on a 12-class set to compare different summarization approaches. The results, unsurprisingly, show considerable differences consistent with the different nature of these two data sets. The LSA and Luhn summarization approaches were most effective on the database of fiction, while all five summarization approaches were similarly effective on the database of articles. Overall, the Luhn approach was deemed the most generally relevant among those tested.
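Since the study compares Python Sumy algorithms, two of the summarizers it evaluates (LSA and Luhn) can be run in a few lines; the input text and sentence count below are placeholders.

```python
# Running two of the compared Sumy summarizers (LSA and Luhn) on plain
# text. The input string and sentence count are placeholders.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer

text = "..."  # document to summarize
parser = PlaintextParser.from_string(text, Tokenizer("english"))

for Summarizer in (LsaSummarizer, LuhnSummarizer):
    summary = Summarizer()(parser.document, sentences_count=5)
    print(Summarizer.__name__, [str(s) for s in summary])
```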
  4.
    Text categorization is an essential task in Web content analysis. Considering the ever-evolving Web data and new emerging categories, instead of the laborious supervised setting, in this paper we focus on the minimally supervised setting, which aims to categorize documents effectively with only a couple of seed documents annotated per category. We recognize that texts collected from the Web are often structure-rich, i.e., accompanied by various metadata. One can easily organize the corpus into a text-rich network, joining raw text documents with document attributes, high-quality phrases, and label surface names as nodes, and their associations as edges. Such a network provides a holistic view of the corpus's heterogeneous data sources and enables joint optimization for network-based analysis and deep textual model training. We therefore propose a novel framework for minimally supervised categorization by learning from the text-rich network. Specifically, we jointly train two modules with different inductive biases: a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning. Each module generates pseudo training labels from the unlabeled document set, and the two modules mutually enhance each other by co-training with pooled pseudo labels. We test our model on two real-world datasets. On a challenging e-commerce product categorization dataset with 683 categories, our experiments show that, given only three seed documents per category, our framework achieves an accuracy of about 92%, significantly outperforming all compared methods; this is within 2% of a supervised BERT model trained on about 50K labeled documents.
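The co-training loop described above can be sketched schematically; the module objects and their methods below are hypothetical interfaces for illustration, not the paper's actual framework.

```python
# Schematic co-training loop in the spirit of the framework above.
# `text_mod` and `net_mod` with `fit` / `pseudo_label` methods are
# hypothetical interfaces, not the paper's actual API.
def co_train(text_mod, net_mod, seed_labels: dict, unlabeled_docs: list, rounds: int = 5):
    labeled = dict(seed_labels)  # a few seed documents per category
    for _ in range(rounds):
        # Train each module with its own inductive bias on current labels.
        text_mod.fit(labeled)
        net_mod.fit(labeled)
        # Each module pseudo-labels the unlabeled set; here the pooling
        # rule keeps only documents on which the two modules agree.
        p_text = text_mod.pseudo_label(unlabeled_docs)
        p_net = net_mod.pseudo_label(unlabeled_docs)
        pooled = {d: p_text[d] for d in unlabeled_docs
                  if d in p_text and p_text.get(d) == p_net.get(d)}
        labeled = {**seed_labels, **pooled}
    return text_mod, net_mod
```

The agreement-based pooling shown here is one plausible choice; the key point is that each round feeds both modules a shared, expanded pseudo-labeled set so they reinforce each other.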
  5. We investigate pre-training techniques for abstractive multi-document summarization (MDS), which is much less studied than single-document summarization. Though recent work has demonstrated the effectiveness of highlighting information salience when designing pre-training strategies, such models struggle to generate abstractive and reflective summaries, which are critical properties for MDS. To this end, we present PELMS, a pre-trained model whose pre-training objectives, based on semantic coherence heuristics and faithfulness constraints over unlabeled multi-document inputs, promote the generation of concise, fluent, and faithful summaries. To support the training of PELMS, we compile MultiPT, a multi-document pre-training corpus containing over 93 million documents forming more than 3 million unlabeled topic-centric document clusters, covering diverse genres such as product reviews, news, and general knowledge. We perform extensive evaluation of PELMS in low-shot settings on a wide range of MDS datasets. Our approach consistently outperforms competitive baselines in overall informativeness, abstractiveness, coherence, and faithfulness, and with minimal fine-tuning can match the performance of much larger language models (e.g., GPT-4).
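One way unlabeled documents might be grouped into topic-centric clusters for such pre-training is a simple greedy assignment by embedding similarity; the threshold and `embed` function below are assumptions for illustration, not the MultiPT construction recipe.

```python
# Hedged sketch of grouping unlabeled documents into topic-centric
# clusters for multi-document pre-training. The greedy rule, similarity
# threshold, and `embed` function are assumptions, not how MultiPT was
# actually built.
import numpy as np

def topic_clusters(docs: list[str], embed, threshold: float = 0.6):
    centroids, clusters = [], []
    for doc in docs:
        v = embed(doc)
        v = v / (np.linalg.norm(v) + 1e-9)  # unit-normalize the embedding
        # Attach to the most similar existing cluster, else start a new one.
        if centroids:
            sims = [c @ v for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                clusters[best].append(doc)
                n = len(clusters[best])
                # Running mean keeps the centroid cheap to maintain.
                centroids[best] = ((n - 1) * centroids[best] + v) / n
                continue
        centroids.append(v)
        clusters.append([doc])
    return clusters
```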