NSF PAR Search | NSF Public Access Repository


Do Multi-Document Summarization Models Synthesize?

https://doi.org/10.1162/tacl_a_00687

DeYoung, Jay; Martinez, Stephanie C; Marshall, Iain J; Wallace, Byron C (September 2024, Transactions of the Association for Computational Linguistics)
Louis, Annie (Ed.)
Abstract: Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key aspect, e.g., a synopsis of film reviews written about a particular movie should reflect the average critic consensus. As a more consequential example, narrative summaries that accompany biomedical systematic reviews of clinical trial results should accurately summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this sort of synthesis? We run experiments over opinion and evidence synthesis datasets using a suite of summarization models, from fine-tuned transformers to GPT-4. We find that existing models partially perform synthesis, but imperfectly: Even the best performing models are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., ratio of positive to negative reviews). We propose a simple, general, effective method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate.
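The select-or-abstain step described at the end of this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `select_or_abstain`, the `tolerance` threshold, and the keyword-based `toy_sentiment` scorer are all hypothetical stand-ins, and the diverse candidates are assumed to have been generated already.

```python
from typing import Callable, Optional, Sequence


def toy_sentiment(text: str) -> float:
    """Hypothetical scorer: fraction of sentiment words that are positive."""
    positive = {"good", "great"}
    negative = {"bad", "poor"}
    words = text.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    if pos + neg == 0:
        return 0.5  # no sentiment words: treat as neutral
    return pos / (pos + neg)


def select_or_abstain(
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
    target: float,
    tolerance: float,
) -> Optional[str]:
    """Return the candidate whose score is closest to `target`.

    Returns None (abstains) when no candidate is within `tolerance`
    of the expected aggregate measure. `tolerance` is an assumed knob,
    not a parameter taken from the paper.
    """
    best: Optional[str] = None
    best_gap = float("inf")
    for text in candidates:
        gap = abs(score_fn(text) - target)
        if gap < best_gap:
            best, best_gap = text, gap
    return best if best_gap <= tolerance else None


candidates = [
    "great good film",
    "good but bad pacing",
    "bad poor film",
]
# The inputs are assumed to average to a sentiment of 0.6.
chosen = select_or_abstain(candidates, toy_sentiment, target=0.6, tolerance=0.2)
```

Here `target` plays the role of the expected aggregate measure for the inputs (for example, the mean review sentiment), and returning `None` corresponds to abstaining.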
Full Text Available
Automatically Extracting Numerical Results from RCTs with LLMs

Yun, Hye Sun; Pogrebitskiy, David; Marshall, Iain J; Wallace, Byron C (August 2024, Machine Learning for Healthcare (MLHC))

Full Text Available
Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models

Yun, Hye Sun; Pogrebitskiy, David; Marshall, Iain James; Wallace, Byron C (August 2024, Proceedings of the 9th Machine Learning for Healthcare Conference (MLHC), Proceedings of Machine Learning Research)
Deshpande, Kaivalya; Fiterau, Madalina; Joshi, Shalmali; Lipton, Zachary; Ranganath, Rajesh; Urteaga, Iñigo (Ed.)
Full Text Available
Jointly Extracting Interventions, Outcomes, and Findings from RCT Reports with LLMs

Wadhwa, Somin; DeYoung, Jay; Nye, Benjamin; Amir, Silvio; Wallace, Byron C (August 2024, Machine Learning for Healthcare (MLHC))

Full Text Available
Evaluating the Zero-shot Robustness of Instruction-tuned Language Models

Sun, Jiuding; Shaib, Chantal; Wallace, Byron C (May 2024, International Conference on Learning Representations)

Instruction fine-tuning has recently emerged as a promising approach for improving the zero-shot capabilities of Large Language Models (LLMs) on new tasks. This technique has shown particular strength in improving the performance of modestly sized LLMs, sometimes inducing performance competitive with much larger model variants. In this paper, we ask two questions: (1) How sensitive are instruction-tuned models to the particular phrasings of instructions, and (2) How can we make them more robust to such natural language variation? To answer the former, we collect a set of 319 instructions manually written by NLP practitioners for over 80 unique tasks included in widely used benchmarks, and we evaluate the variance and average performance of these instructions as compared to instruction phrasings observed during instruction fine-tuning. We find that using novel (unobserved) but appropriate instruction phrasings consistently degrades model performance, sometimes substantially so. Further, such natural instructions yield a wide variance in downstream performance, despite their semantic equivalence. Put another way, instruction-tuned models are not especially robust to instruction re-phrasings. We propose a simple method to mitigate this issue by introducing "soft prompt" embedding parameters and optimizing these to maximize the similarity between representations of semantically equivalent instructions. We show that this method consistently improves the robustness of instruction-tuned models.
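The comparison of average performance and variance across instruction phrasings mentioned in this abstract can be sketched as follows. This is only an illustration under assumptions: the function name `phrasing_stats` and the accuracy values are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, pvariance
from typing import Dict, Tuple


def phrasing_stats(scores: Dict[str, float]) -> Tuple[float, float]:
    """Return (mean, population variance) of task scores across phrasings."""
    values = list(scores.values())
    return mean(values), pvariance(values)


# Placeholder accuracies for one task, keyed by instruction phrasing.
observed = {"Classify the sentiment.": 0.80, "Label the sentiment.": 0.80}
unobserved = {"Is this review happy or sad?": 0.80, "Judge the tone.": 0.60}

observed_mean, observed_var = phrasing_stats(observed)
unobserved_mean, unobserved_var = phrasing_stats(unobserved)
```

A lower mean and a higher variance for the unobserved phrasings would correspond to the kind of sensitivity the abstract reports.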
Full Text Available
InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification

Trienes, Jan; Joseph, Sebastian; Schlötterer, Jörg; Seifert, Christin; Lo, Kyle; Xu, Wei; Wallace, Byron C; Li, Jessy (August 2024, Association for Computational Linguistics)

Full Text Available
FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence

Joseph, Sebastian Antony; Chen, Lily; Trienes, Jan; Göke, Hannah Louisa; Coers, Monika; Xu, Wei; Wallace, Byron C; Li, Junyi Jessy (August 2024, Association for Computational Linguistics)

Full Text Available
On-the-fly Definition Augmentation of LLMs for Biomedical NER

Munnangi, Monica; Feldman, Sergey; Wallace, Byron; Amir, Silvio; Hope, Tom; Naik, Aakanksha (June 2024, North American Chapter of the Association for Computational Linguistics)

Full Text Available
Evaluating the Factuality of Zero-shot Summarizers Across Varied Domains

Ramprasad, Sanjana; Krishna, Kundan; Lipton, Zachary; Wallace, Byron (March 2024, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL))

Recent work has shown that large language models (LLMs) are capable of generating summaries zero-shot (i.e., without explicit supervision) that, under human assessment, are often comparable or even preferred to manually composed reference summaries. However, this prior work has focused almost exclusively on evaluating news article summarization. How do zero-shot summarizers perform in other (potentially more specialized) domains? In this work we evaluate zero-shot generated summaries across specialized domains, including biomedical articles and legal bills (in addition to standard news benchmarks for reference). We focus especially on the factuality of outputs. We acquire annotations from domain experts to identify inconsistencies in summaries and systematically categorize these errors. We analyze whether the prevalence of a given domain in the pretraining corpus affects extractiveness and faithfulness of generated summaries of articles in this domain. We release all collected annotations to facilitate additional research toward measuring and realizing factually accurate summarization, beyond news articles. (The dataset can be downloaded from https://anonymous.4open.science/r/zero_shot_faceval_domains-9B83.)
Full Text Available
Evaluating the Factuality of Zero-shot Summarizers Across Varied Domains

Ramprasad, Sanjana; Krishna, Kundan; Lipton, Zachary; Wallace, Byron (March 2024, European Chapter of the Association for Computational Linguistics (EACL))

Full Text Available
