Title: SUMDocS: Surrounding-aware Unsupervised Multi-Document Summarization
Multi-document summarization, which summarizes a set of documents with a small number of phrases or sentences, provides a concise and critical essence of the documents. Existing multi-document summarization methods ignore the fact that there often exist many relevant documents that provide surrounding background knowledge, which can help generate a salient and discriminative summary for a given set of documents. In this paper, we propose a novel method, SUMDocS (Surrounding-aware Unsupervised Multi-Document Summarization), which incorporates rich surrounding (topically related) documents to help improve the quality of extractive summarization without human supervision. Specifically, we propose a joint optimization algorithm to unify global novelty (i.e., category-level frequent and discriminative), local consistency (i.e., locally frequent, co-occurring), and local saliency (i.e., salient from its surroundings) such that the obtained summary captures the characteristics of the target documents. Extensive experiments on news and scientific domains demonstrate the superior performance of our method when the unlabeled surrounding corpus is utilized.
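As a rough illustration of the surrounding-aware idea described above, the sketch below scores candidate sentences by combining three unsupervised signals and keeps the top-ranked ones. It is not the SUMDocS implementation: the TF-IDF features, centroid comparisons, and the linear weights alpha/beta/gamma are assumptions made purely for illustration.

```python
# Illustrative sketch of a surrounding-aware extractive scorer (not the SUMDocS code).
# target_sents and surrounding_sents are lists of sentence strings.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(target_sents, surrounding_sents, k=3, alpha=1.0, beta=1.0, gamma=1.0):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(target_sents + surrounding_sents)
    X_t, X_s = X[:len(target_sents)], X[len(target_sents):]

    # Local consistency: closeness to the other target sentences (locally frequent terms).
    consistency = cosine_similarity(X_t, np.asarray(X_t.mean(axis=0))).ravel()

    # Local saliency: distance from the surrounding (background) documents.
    saliency = 1.0 - cosine_similarity(X_t, np.asarray(X_s.mean(axis=0))).ravel()

    # Global novelty: a crude stand-in using the average IDF weight of a sentence's terms.
    idf = vec.idf_.reshape(1, -1)
    novelty = np.array([X_t[i].multiply(idf).sum() / max(X_t[i].nnz, 1)
                        for i in range(X_t.shape[0])])
    novelty = novelty / (novelty.max() + 1e-9)

    score = alpha * novelty + beta * consistency + gamma * saliency
    return [target_sents[i] for i in np.argsort(-score)[:k]]
```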
Award ID(s):
1956151 1704532 1741317
PAR ID:
10311073
Author(s) / Creator(s):
Editor(s):
Carlotta Demeniconi, Ian Davidson
Date Published:
Journal Name:
Proceedings of the 2021 SIAM International Conference on Data Mining (SDM 2021)
Volume:
2021
Issue:
1
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Proc. 2023 The Web Conf. (Ed.)
    Summarizing text-rich documents has been long studied in the literature, but most of the existing efforts have been made to summarize a static and predefined multi-document set. With the rapid development of online platforms for generating and distributing text-rich documents, there arises an urgent need for continuously summarizing dynamically evolving multi-document sets where the composition of documents and sets is changing over time. This is especially challenging as the summarization should be not only effective in incorporating relevant, novel, and distinctive information from each concurrent multi-document set, but also efficient in serving online applications. In this work, we propose a new summarization problem, Evolving Multi-Document sets stream Summarization (EMDS), and introduce a novel unsupervised algorithm PDSum with the idea of prototype-driven continuous summarization. PDSum builds a lightweight prototype of each multi-document set and exploits it to adapt to new documents while preserving accumulated knowledge from previous documents. To update the summaries, the most representative sentences for each multi-document set are extracted by measuring their similarities to the prototypes. A thorough evaluation with real multi-document set streams demonstrates that PDSum outperforms state-of-the-art unsupervised multi-document summarization algorithms in EMDS in terms of relevance, novelty, and distinctiveness and is also robust to various evaluation settings.
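A minimal sketch of the prototype idea described in this abstract, assuming a sentence-embedding function is available; the exponential-moving-average update and the class interface are illustrative choices, not the PDSum implementation.

```python
# Illustrative sketch of prototype-driven continuous summarization (not the PDSum code).
# embed(sentence) -> np.ndarray is assumed to exist; the EMA decay is hypothetical.
import numpy as np

class PrototypeSummarizer:
    def __init__(self, dim, decay=0.9):
        self.prototype = np.zeros(dim)   # lightweight per-set prototype
        self.decay = decay
        self.sentences, self.embeddings = [], []

    def update(self, new_sentences, embed):
        vecs = np.stack([embed(s) for s in new_sentences])
        # Blend new evidence into the prototype while preserving accumulated knowledge.
        self.prototype = self.decay * self.prototype + (1 - self.decay) * vecs.mean(axis=0)
        self.sentences += new_sentences
        self.embeddings += list(vecs)

    def summary(self, k=3):
        E = np.stack(self.embeddings)
        # Pick the sentences most similar to the current prototype.
        sims = E @ self.prototype / (
            np.linalg.norm(E, axis=1) * np.linalg.norm(self.prototype) + 1e-9)
        return [self.sentences[i] for i in np.argsort(-sims)[:k]]
```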
  2. Extractive text summarization aims at extracting the most representative sentences from a given document as its summary. To extract a good summary from a long text document, sentence embedding plays an important role. Recent studies have leveraged graph neural networks to capture the inter-sentential relationship (e.g., the discourse graph) to learn contextual sentence embedding. However, those approaches neither consider multiple types of inter-sentential relationships (e.g., semantic similarity & natural connection), nor model intra-sentential relationships (e.g., semantic & syntactic relationships among words). To address these problems, we propose a novel Multiplex Graph Convolutional Network (Multi-GCN) to jointly model different types of relationships among sentences and words. Based on Multi-GCN, we propose a Multiplex Graph Summarization (Multi-GraS) model for extractive text summarization. Finally, we evaluate the proposed models on the CNN/DailyMail benchmark dataset to demonstrate the effectiveness of our method.
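The following is an illustrative multiplex graph-convolution layer over sentence nodes, not the Multi-GCN/Multi-GraS code; the relation-specific linear maps and the mean fusion across relation types are assumptions.

```python
# Illustrative sketch of a multiplex graph convolution over sentence nodes.
import torch
import torch.nn as nn

class MultiplexGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        # One weight matrix per relation type (e.g., semantic similarity, discourse link).
        self.weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_relations)])

    def forward(self, x, adjs):
        # x: (num_sentences, in_dim); adjs: one row-normalized (N, N) adjacency per relation.
        outputs = [torch.relu(adj @ w(x)) for adj, w in zip(adjs, self.weights)]
        return torch.stack(outputs).mean(dim=0)   # fuse the per-relation views
```

Stacking such layers and feeding the resulting sentence embeddings to a scoring head would give a simple extractive selector in this spirit.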
  3. Social media's explosive growth has resulted in a massive influx of electronic documents influencing various facets of daily life. However, the enormous and complex nature of this content makes extracting valuable insights challenging. Long document summarization emerges as a pivotal technique in this context, serving to distill extensive texts into concise and comprehensible summaries. This paper presents a novel three-stage pipeline for effective long document summarization. The proposed approach combines unsupervised and supervised learning techniques, efficiently handling large document sets while requiring minimal computational resources. Our methodology introduces a unique process for forming semantic chunks through spectral dynamic segmentation, effectively reducing redundancy and repetitiveness in the summarization process. Contrary to previous methods, our approach aligns each semantic chunk with the entire summary paragraph, allowing the abstractive summarization model to process documents without truncation and enabling the summarization model to deduce missing information from other chunks. To enhance the summary generation, we utilize a sophisticated rewrite model based on Bidirectional and Auto-Regressive Transformers (BART), rearranging and reformulating summary constructs to improve their fluidity and coherence. Empirical studies conducted on the long documents from the Webis-TLDR-17 dataset demonstrate that our approach significantly enhances the efficiency of abstractive summarization transformers. The contributions of this paper thus offer significant advancements in the field of long document summarization, providing a novel and effective methodology for summarizing extensive texts in the context of social media.
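A minimal sketch of a chunk-then-rewrite pipeline in the spirit of this abstract; simple fixed-size word chunking stands in for spectral dynamic segmentation, and the facebook/bart-large-cnn checkpoint is only an example, not the paper's system.

```python
# Illustrative chunk-then-rewrite sketch (not the paper's pipeline).
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_long(document, chunk_words=400):
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    # Summarize each chunk independently so no single pass needs to truncate the document.
    partials = [summarizer(c, max_length=80, min_length=20, truncation=True)[0]["summary_text"]
                for c in chunks]
    # A second pass over the concatenated partial summaries plays the role of the rewrite model.
    return summarizer(" ".join(partials), max_length=120, min_length=40,
                      truncation=True)[0]["summary_text"]
```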
  4. Long document summarization systems are critical for domains with lengthy and jargon-laden text, yet they present significant challenges to researchers and developers with limited computing resources. Existing solutions mainly focus on efficient attentions or divide-and-conquer strategies. The former reduces theoretical time complexity but is still memory-heavy. The latter methods sacrifice global context, leading to uninformative and incoherent summaries. This work aims to leverage the memory-efficient nature of divide-and-conquer methods while preserving global context. Concretely, our framework AWESOME uses two novel mechanisms: (1) External memory mechanisms track previously encoded document segments and their corresponding summaries, to enhance global document understanding and summary coherence. (2) Global salient content is further identified beforehand to augment each document segment to support its summarization. Extensive experiments on diverse genres of text, including government reports, meeting transcripts, screenplays, scientific papers, and novels, show that AWESOME produces summaries with improved informativeness, faithfulness, and coherence over competitive baselines on longer documents, while having a smaller GPU memory footprint.
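A minimal sketch of divide-and-conquer summarization with an external memory of earlier segment summaries, in the spirit of the two mechanisms described above; the summarize_segment callable, the salient-sentence list, and the memory budget are placeholders, not the AWESOME implementation.

```python
# Illustrative divide-and-conquer summarization with a running memory (not the AWESOME code).
def summarize_with_memory(segments, summarize_segment, salient_sentences, memory_budget=3):
    memory, outputs = [], []
    for seg in segments:
        # Augment each segment with globally salient content plus the most recent
        # segment summaries so every local pass still sees global context.
        context = " ".join(salient_sentences + memory[-memory_budget:])
        outputs.append(summarize_segment(context + " " + seg))
        memory.append(outputs[-1])   # track what has already been summarized
    return " ".join(outputs)
```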
  5.
    Query Biased Summarization (QBS) aims to produce a summary of the documents retrieved against a query to reduce the human effort of inspecting their full-text content. Typical summarization approaches extract a document text snippet that has term overlap with the query and show it to the searcher. While snippets show relevant information in a document, to the best of our knowledge, there does not exist a summarization system that shows which relevant concepts are missing from a document. Our study focuses on reducing user effort in finding relevant documents by exposing users to omitted relevant information. To this end, we use a classical approach, DSPApprox, to find terms or phrases relevant to a query. Then we identify which terms or phrases are missing from a document, present them in a search interface, and ask crowd workers to judge document relevance based on snippets and missing information. Experimental results show both benefits and limitations of this approach.
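A minimal sketch of surfacing query-relevant terms that a document omits; a simple TF-IDF ranking over the retrieved documents stands in for DSPApprox, and the function names and parameters are illustrative assumptions.

```python
# Illustrative sketch: report query-relevant terms that are absent from a document.
from sklearn.feature_extraction.text import TfidfVectorizer

def missing_relevant_terms(query, retrieved_docs, target_doc, top_k=10):
    vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    vec.fit(retrieved_docs)
    # Crude relevance ranking: TF-IDF weight of each term in the query plus retrieved text.
    scores = vec.transform([" ".join(retrieved_docs) + " " + query]).toarray()[0]
    terms = vec.get_feature_names_out()
    relevant = [terms[i] for i in scores.argsort()[::-1][:top_k]]
    doc_lower = target_doc.lower()
    # Terms judged relevant to the query but not mentioned in this document.
    return [t for t in relevant if t not in doc_lower]
```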