NSF PAR Search | NSF Public Access Repository

Do Multi-Document Summarization Models Synthesize?

https://doi.org/10.1162/tacl_a_00687

DeYoung, Jay; Martinez, Stephanie C; Marshall, Iain J; Wallace, Byron C (September 2024, Transactions of the Association for Computational Linguistics)
Louis, Annie (Ed.)
Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key aspect, e.g., a synopsis of film reviews written about a particular movie should reflect the average critic consensus. As a more consequential example, narrative summaries that accompany biomedical systematic reviews of clinical trial results should accurately summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this sort of synthesis? We run experiments over opinion and evidence synthesis datasets using a suite of summarization models, from fine-tuned transformers to GPT-4. We find that existing models partially perform synthesis, but imperfectly: Even the best performing models are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., ratio of positive to negative reviews). We propose a simple, general, effective method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate.
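The select-or-abstain step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `select_or_abstain`, the use of the mean as the "expected aggregate measure", the `score_fn` scorer, and the `tolerance` threshold are all assumptions made for the example.

```python
from statistics import mean
from typing import Callable, Optional, Sequence


def select_or_abstain(
    candidates: Sequence[str],
    input_scores: Sequence[float],
    score_fn: Callable[[str], float],
    tolerance: float = 0.1,
) -> Optional[str]:
    """Return the candidate whose score is closest to the mean input score.

    Returns None (abstains) when there is nothing to compare or when even the
    best candidate is further than `tolerance` from the target.
    All names and the mean/tolerance rule are hypothetical.
    """
    if not candidates or not input_scores:
        return None
    # Assumed aggregate: the average of the per-input scores (e.g., sentiment).
    target = mean(input_scores)
    best = min(candidates, key=lambda c: abs(score_fn(c) - target))
    if abs(score_fn(best) - target) > tolerance:
        return None  # no candidate reflects the inputs well enough
    return best


# Toy scorer: a lookup table standing in for a real classifier (assumption).
toy_scores = {
    "Critics loved it.": 0.9,
    "Reviews were mixed.": 0.5,
    "Critics panned it.": 0.1,
}

picked = select_or_abstain(list(toy_scores), [0.8, 0.4, 0.3], toy_scores.get)
abstained = select_or_abstain(["Critics loved it."], [0.1, 0.2], toy_scores.get)
```

In this toy run the inputs average to about 0.5, so the mixed summary is picked; in the second call the only candidate is far from the target, so the function abstains.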
Full Text Available
Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models

Yun, Hye Sun; Pogrebitskiy, David; Marshall, Iain James; Wallace, Byron C (August 2024, Proceedings of the 9th Machine Learning for Healthcare Conference (MLHC), Proceedings of Machine Learning Research)
Deshpande, Kaivalya; Fiterau, Madalina; Joshi, Shalmali; Lipton, Zachary; Ranganath, Rajesh; Urteaga, Iñigo (Ed.)
Full Text Available
Evaluating the Factuality of Zero-shot Summarizers Across Varied Domains

Ramprasad, Sanjana; Krishna, Kundan; Lipton, Zachary; Wallace, Byron (March 2024, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL))

Recent work has shown that large language models (LLMs) are capable of generating summaries zero-shot (i.e., without explicit supervision) that, under human assessment, are often comparable or even preferred to manually composed reference summaries. However, this prior work has focussed almost exclusively on evaluating news article summarization. How do zero-shot summarizers perform in other (potentially more specialized) domains? In this work we evaluate zero-shot generated summaries across specialized domains including biomedical articles and legal bills (in addition to standard news benchmarks for reference). We focus especially on the factuality of outputs. We acquire annotations from domain experts to identify inconsistencies in summaries and systematically categorize these errors. We analyze whether the prevalence of a given domain in the pretraining corpus affects extractiveness and faithfulness of generated summaries of articles in this domain. We release all collected annotations to facilitate additional research toward measuring and realizing factually accurate summarization, beyond news articles. (The dataset can be downloaded from https://anonymous.4open.science/r/zero_shot_faceval_domains-9B83)
Full Text Available
Evaluating the Factuality of Zero-shot Summarizers Across Varied Domains

Ramprasad, Sanjana; Krishna, Kundan; Lipton, Zachary; Wallace, Byron (March 2024, Association for Computational Linguistics)
Graham, Yvette; Purver, Matthew (Ed.)
Full Text Available
Detection and Measurement of Syntactic Templates in Generated Text

https://doi.org/10.18653/v1/2024.emnlp-main.368

Shaib, Chantal; Elazar, Yanai; Li, Junyi Jessy; Wallace, Byron C (January 2024, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024))

The diversity of text can be measured beyond word-level features; however, existing diversity evaluation focuses primarily on word-level features. Here we propose a method for evaluating diversity over syntactic features to characterize general repetition in models, beyond frequent n-grams. Specifically, we define syntactic templates (e.g., strings comprising parts-of-speech) and show that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference texts. We find that most (76%) templates in model-generated text can be found in pre-training data (compared to only 35% of human-authored text), and are not overwritten during fine-tuning or alignment processes such as RLHF. The connection between templates in generated text and the pre-training data allows us to analyze syntactic templates in models where we do not have the pre-training data. We also find that templates as features are able to differentiate between models, tasks, and domains, and are useful for qualitatively evaluating common model constructions. Finally, we demonstrate the use of templates as a useful tool for analyzing style memorization of training data in LLMs.
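One way to picture the idea of syntactic templates is as part-of-speech n-grams that recur across texts. The sketch below is only an illustration under stated assumptions: the inputs are already POS-tagged, and the template length `n`, the `min_texts` threshold, and the helper names `pos_ngrams`, `find_templates`, and `template_rate` are hypothetical rather than taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Template = Tuple[str, ...]


def pos_ngrams(tags: Sequence[str], n: int) -> List[Template]:
    """All contiguous length-n tag sequences in one tagged text."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def find_templates(
    tagged_texts: Sequence[Sequence[str]], n: int = 4, min_texts: int = 2
) -> Dict[Template, int]:
    """Map each POS n-gram to the number of texts it appears in,
    keeping those found in at least `min_texts` texts (assumed threshold)."""
    doc_freq: Counter = Counter()
    for tags in tagged_texts:
        doc_freq.update(set(pos_ngrams(tags, n)))  # count once per text
    return {t: c for t, c in doc_freq.items() if c >= min_texts}


def template_rate(
    tagged_texts: Sequence[Sequence[str]], templates: Dict[Template, int], n: int
) -> float:
    """Fraction of texts that contain at least one template."""
    if not tagged_texts:
        return 0.0
    hits = sum(
        1 for tags in tagged_texts
        if any(g in templates for g in pos_ngrams(tags, n))
    )
    return hits / len(tagged_texts)


# Toy pre-tagged texts (Penn Treebank-style tags, made up for illustration).
texts = [
    ["DT", "JJ", "NN", "VBZ", "DT", "NN"],
    ["DT", "JJ", "NN", "VBZ", "JJ"],
    ["NN", "VBZ", "RB"],
]
templates = find_templates(texts, n=3, min_texts=2)
rate = template_rate(texts, templates, n=3)
```

Comparing `template_rate` on model outputs against the same statistic on human-written references would give a rough analogue of the rate comparison the abstract describes.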
Full Text Available
USB: A Unified Summarization Benchmark Across Tasks and Domains

https://doi.org/10.18653/v1/2023.findings-emnlp.592

Krishna, Kundan; Gupta, Prakhar; Ramprasad, Sanjana; Wallace, Byron; Bigham, Jeffrey; Lipton, Zachary (January 2023, Proceedings of the Findings of the Association for Computational Linguistics (EMNLP))

While the NLP community has produced numerous summarization benchmarks, none provide the rich annotations required to simultaneously address many important problems related to control and reliability. We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports 8 interrelated tasks: (i) extractive summarization; (ii) abstractive summarization; (iii) topic-based summarization; (iv) compressing selected sentences into a one-line summary; (v) surfacing evidence for a summary sentence; (vi) predicting the factual accuracy of a summary sentence; (vii) identifying unsubstantiated spans in a summary sentence; (viii) correcting factual errors in summaries. We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models. For factuality-related tasks, we also evaluate existing heuristics to create training data and find that training on them results in worse performance than training on 20× less human-labeled data. Our articles draw from 6 domains, facilitating cross-domain analysis. On some tasks, the amount of training data matters more than the domain where it comes from, while for other tasks training specifically on data from the target domain, even if limited, is more beneficial.
Full Text Available
Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations

https://doi.org/10.18653/v1/2023.acl-long.549

Wang, Lucy Lu; Otmakhova, Yulia; DeYoung, Jay; Truong, Thinh Hung; Kuehl, Bailey; Bransom, Erin; Wallace, Byron (January 2023, Proceedings of the Association for Computational Linguistics (ACL))

Full Text Available
Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews

https://doi.org/10.18653/v1/2023.emnlp-main.626

Yun, Hye; Marshall, Iain; Trikalinos, Thomas; Wallace, Byron (January 2023, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP))

Medical systematic reviews play a vital role in healthcare decision making and policy. However, their production is time-consuming, limiting the availability of high-quality and up-to-date evidence summaries. Recent advancements in LLMs offer the potential to automatically generate literature reviews on demand, addressing this issue. However, LLMs sometimes generate inaccurate (and potentially misleading) texts by hallucination or omission. In healthcare, this can make LLMs unusable at best and dangerous at worst. We conducted 16 interviews with international systematic review experts to characterize the perceived utility and risks of LLMs in the specific context of medical evidence reviews. Experts indicated that LLMs can assist in the writing process by drafting summaries, generating templates, distilling information, and crosschecking information. They also raised concerns regarding confidently composed but inaccurate LLM outputs and other potential downstream harms, including decreased accountability and proliferation of low-quality reviews. Informed by this qualitative analysis, we identify criteria for rigorous evaluation of biomedical LLMs aligned with domain expert views.
Full Text Available
Automatically Summarizing Evidence from Clinical Trials: A Prototype Highlighting Current Challenges

Ramprasad, Sanjana; Mcinerney, Jered; Marshall, Iain; Wallace, Byron (January 2023, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations)

In this work we present TrialsSummarizer, a system that aims to automatically summarize evidence presented in the set of randomized controlled trials most relevant to a given query. Building on prior work, the system retrieves trial publications matching a query specifying a combination of condition, intervention(s), and outcome(s), and ranks these according to sample size and estimated study quality. The top-k such studies are passed through a neural multi-document summarization system, yielding a synopsis of these trials. We consider two architectures: A standard sequence-to-sequence model based on BART, and a multi-headed architecture intended to provide greater transparency and controllability to end-users. Both models produce fluent and relevant summaries of evidence retrieved for queries, but their tendency to introduce unsupported statements renders them inappropriate for use in this domain at present. The proposed architecture may help users verify outputs, allowing users to trace generated tokens back to inputs. The demonstration video can be found at https://vimeo.com/735605060. The prototype, source code, and model weights are available at: https://sanjanaramprasad.github.io/trials-summarizer/
Full Text Available
Overview of MSLR2022: A Shared Task on Multi-document Summarization for Literature Reviews

Wang, Lucy Lu; DeYoung, Jay; Wallace, Byron (January 2022, Proceedings of the Third Workshop on Scholarly Document Processing at COLING)

We provide an overview of the MSLR2022 shared task on multi-document summarization for literature reviews. The shared task was hosted at the Third Scholarly Document Processing (SDP) Workshop at COLING 2022. For this task, we provided data consisting of gold summaries extracted from review papers along with the groups of input abstracts that were synthesized into these summaries, split into two subtasks. In total, six teams participated, making 10 public submissions, 6 to the Cochrane subtask and 4 to the MS^2 subtask. The top scoring systems reported over 2 points ROUGE-L improvement on the Cochrane subtask, though performance improvements are not consistently reported across all automated evaluation metrics; qualitative examination of the results also suggests the inadequacy of current evaluation metrics for capturing factuality and consistency on this task. Significant work is needed to improve system performance, and more importantly, to develop better methods for automatically evaluating performance on this task.
Full Text Available
