-
Recent work has shown that large language models (LLMs) are capable of generating summaries zero-shot—i.e., without explicit supervision—that, under human assessment, are often comparable to, or even preferred over, manually composed reference summaries. However, this prior work has focused almost exclusively on evaluating news article summarization. How do zero-shot summarizers perform in other (potentially more specialized) domains? In this work we evaluate zero-shot generated summaries across specialized domains, including biomedical articles and legal bills (in addition to standard news benchmarks for reference). We focus especially on the factuality of outputs. We acquire annotations from domain experts to identify inconsistencies in summaries and systematically categorize these errors. We analyze whether the prevalence of a given domain in the pretraining corpus affects the extractiveness and faithfulness of generated summaries of articles in this domain. We release all collected annotations to facilitate additional research toward measuring and realizing factually accurate summarization, beyond news articles. (The dataset can be downloaded from https://anonymous.4open.science/r/zero_shot_faceval_domains-9B83.)
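As a rough illustration of the extractiveness analysis mentioned above, the sketch below computes extractive fragment coverage and density (in the style of Grusky et al., 2018) for a summary against its source article. This is a minimal, self-contained approximation, not the authors' evaluation code, and the example texts are invented.

```python
# Illustrative sketch of extractiveness metrics (fragment coverage and density);
# a greedy longest-shared-fragment match, not the authors' released evaluation code.

def extractive_fragments(article_tokens, summary_tokens):
    """Greedily match maximal shared fragments of the summary against the article."""
    fragments, i = [], 0
    while i < len(summary_tokens):
        best = []
        j = 0
        while j < len(article_tokens):
            if summary_tokens[i] == article_tokens[j]:
                k = 0
                while (i + k < len(summary_tokens) and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                if k > len(best):
                    best = summary_tokens[i:i + k]
                j += k
            else:
                j += 1
        if best:
            fragments.append(best)
            i += len(best)
        else:
            i += 1
    return fragments

def coverage_and_density(article, summary):
    a, s = article.lower().split(), summary.lower().split()
    frags = extractive_fragments(a, s)
    coverage = sum(len(f) for f in frags) / len(s)       # fraction of copied tokens
    density = sum(len(f) ** 2 for f in frags) / len(s)   # average squared fragment length
    return coverage, density

# Invented toy example, just to show the call.
cov, dens = coverage_and_density(
    "the bill amends the public health service act to fund rural clinics",
    "the bill funds rural clinics")
print(f"coverage={cov:.2f} density={dens:.2f}")
```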
-
With artificial intelligence systems increasingly applied in consequential domains, researchers have begun to ask how AI systems ought to act in ethically charged situations where even humans lack consensus. In the Moral Machine project, researchers crowdsourced answers to Trolley Problems concerning autonomous vehicles. Subsequently, Noothigattu et al. (2018) proposed inferring linear functions that approximate each individual's preferences and aggregating these linear models by averaging parameters across the population. In this paper, we examine this averaging mechanism, focusing on fairness concerns and strategic effects. We investigate a simple setting where the population consists of two groups, the minority constitutes an α < 0.5 share of the population, and within-group preferences are homogeneous. Focusing on the fraction of contested cases where the minority group prevails, we make the following observations: (a) even when all parties report their preferences truthfully, the fraction of disputes where the minority prevails is less than proportionate in α; (b) the degree of sub-proportionality grows more severe as the level of disagreement between the groups increases; (c) when parties report preferences strategically, pure strategy equilibria do not always exist; and (d) whenever a pure strategy equilibrium exists, the majority group prevails 100% of the time. These findings raise concerns about the stability and fairness of averaging as a mechanism for aggregating diverging voices. Finally, we discuss alternatives, including randomized dictatorship and median-based mechanisms.
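The averaging mechanism and the minority-prevails statistic lend themselves to a quick simulation. The sketch below is illustrative only: the preference vectors, the case distribution, and the value of α are arbitrary choices, not quantities from the paper.

```python
# Illustrative simulation of parameter averaging in the two-group setting:
# each group has a homogeneous linear preference, and the aggregate decision
# is taken by the alpha-weighted average of the two parameter vectors.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.3                                  # minority share of the population (assumed)
w_min = np.array([1.0, 0.2, -0.5])           # minority preference vector (assumed)
w_maj = np.array([-0.8, 1.0, 0.3])           # majority preference vector (assumed)
w_avg = alpha * w_min + (1 - alpha) * w_maj  # averaged model used for decisions

cases = rng.normal(size=(100_000, 3))        # random pairwise-comparison features
min_votes = np.sign(cases @ w_min)
maj_votes = np.sign(cases @ w_maj)
avg_decisions = np.sign(cases @ w_avg)

contested = min_votes != maj_votes           # cases where the two groups disagree
minority_win_rate = np.mean(avg_decisions[contested] == min_votes[contested])
print(f"alpha = {alpha:.2f}; minority prevails on "
      f"{minority_win_rate:.1%} of contested cases")
```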
-
Rates of missing data often depend on record-keeping policies and thus may change across times and locations, even when the underlying features are comparatively stable. In this paper, we introduce the problem of Domain Adaptation under Missingness Shift (DAMS). Here, (labeled) source data and (unlabeled) target data would be exchangeable but for different missing data mechanisms. We show that if missing data indicators are available, DAMS reduces to covariate shift. Addressing cases where such indicators are absent, we establish the following theoretical results for underreporting completely at random: (i) covariate shift is violated (adaptation is required); (ii) the optimal linear source predictor can perform arbitrarily worse on the target domain than always predicting the mean; (iii) the optimal target predictor can be identified, even when the missingness rates themselves are not; and (iv) for linear models, a simple analytic adjustment yields consistent estimates of the optimal target parameters. In experiments on synthetic and semi-synthetic data, we demonstrate the promise of our methods when assumptions hold. Finally, we discuss a rich family of future extensions.
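A toy simulation can make the setting concrete. The sketch below assumes one reading of underreporting completely at random (observed entries are zeroed out independently at a domain-specific rate) and simply contrasts a source-fit linear model with a predict-the-mean baseline on the target; the data-generating process and missingness rates are illustrative, not from the paper.

```python
# Toy simulation of missingness shift: the feature-label relationship is stable,
# but entries are underreported (zeroed out) at different rates in source and target.
# All rates and distributions below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
n, d = 50_000, 5
beta = rng.normal(size=d)

def sample(n, miss_rate):
    x = rng.binomial(1, 0.5, size=(n, d)).astype(float)   # underlying features
    y = x @ beta + 0.1 * rng.normal(size=n)                # stable relationship
    mask = rng.binomial(1, 1 - miss_rate, size=(n, d))     # underreporting mask
    return x * mask, y                                     # zeros hide true values

xs, ys = sample(n, miss_rate=0.1)   # source: mild underreporting
xt, yt = sample(n, miss_rate=0.6)   # target: heavy underreporting

# Ordinary least squares fit on the (observed) source data, applied to the target.
w, *_ = np.linalg.lstsq(np.c_[np.ones(n), xs], ys, rcond=None)
pred_t = np.c_[np.ones(n), xt] @ w

print(f"target MSE, source-fit model:  {np.mean((yt - pred_t) ** 2):.3f}")
print(f"target MSE, predict-the-mean:  {np.mean((yt - ys.mean()) ** 2):.3f}")
```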
-
While the NLP community has produced numerous summarization benchmarks, none provide the rich annotations required to simultaneously address many important problems related to control and reliability. We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports 8 interrelated tasks: (i) extractive summarization; (ii) abstractive summarization; (iii) topic-based summarization; (iv) compressing selected sentences into a one-line summary; (v) surfacing evidence for a summary sentence; (vi) predicting the factual accuracy of a summary sentence; (vii) identifying unsubstantiated spans in a summary sentence; (viii) correcting factual errors in summaries. We compare various methods on this benchmark and discover that, on multiple tasks, moderately sized fine-tuned models consistently outperform much larger few-shot prompted language models. For factuality-related tasks, we also evaluate existing heuristics for creating training data and find that training on them yields worse performance than training on 20× less human-labeled data. Our articles are drawn from 6 domains, facilitating cross-domain analysis. On some tasks, the amount of training data matters more than the domain it comes from, while for other tasks training specifically on data from the target domain, even if limited, is more beneficial.
-
Despite the emergence of principled methods for domain adaptation under label shift, their sensitivity to shifts in class-conditional distributions remains precariously underexplored. Meanwhile, popular deep domain adaptation heuristics tend to falter when faced with label proportion shifts. While several papers modify these heuristics in attempts to handle label proportion shifts, inconsistencies in evaluation standards, datasets, and baselines make it difficult to gauge the current best practices. In this paper, we introduce RLSbench, a large-scale benchmark for relaxed label shift, consisting of >500 distribution shift pairs spanning vision, tabular, and language modalities, with varying label proportions. Unlike existing benchmarks, which primarily focus on shifts in the class-conditional p(x|y), our benchmark also focuses on label marginal shifts. First, we assess 13 popular domain adaptation methods, demonstrating more widespread failures under label proportion shifts than were previously known. Next, we develop an effective two-step meta-algorithm that is compatible with most domain adaptation heuristics: (i) pseudo-balance the data at each epoch; and (ii) adjust the final classifier with a target label distribution estimate. The meta-algorithm improves existing domain adaptation heuristics under large label proportion shifts, often by 2–10% accuracy points, while having minimal effect (<0.5%) when label proportions do not shift. We hope that these findings and the availability of RLSbench will encourage researchers to rigorously evaluate proposed methods in relaxed label shift settings.
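A hedged sketch of the two-step recipe, written around a generic PyTorch classifier, is shown below. The function names, the inverse-frequency resampling scheme, and the logit correction are one plausible instantiation of steps (i) and (ii), not the RLSbench implementation.

```python
# Sketch of the two-step meta-algorithm: (i) pseudo-balance classes each epoch,
# (ii) adjust a trained classifier toward an estimated target label marginal.
# Function names and details are illustrative assumptions, not the paper's code.
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def pseudo_balanced_loader(dataset, pseudo_labels, num_classes, batch_size=64):
    """Step (i): resample each epoch so (pseudo-)classes appear roughly equally often."""
    counts = np.bincount(pseudo_labels, minlength=num_classes).clip(min=1)
    weights = 1.0 / counts[pseudo_labels]                  # inverse-frequency weights
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

def adjust_logits(logits, target_marginal, train_marginal):
    """Step (ii): shift logits by the log prior ratio, i.e. reweight p(y|x) by p_t(y)/p_s(y)."""
    correction = torch.log(torch.as_tensor(target_marginal / train_marginal,
                                           dtype=logits.dtype))
    return logits + correction
```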
-
We introduce the problem of domain adaptation under Open Set Label Shift (OSLS), where the label distribution can change arbitrarily and a new class may arrive during deployment, but the class-conditional distributions p(x|y) are domain-invariant. OSLS subsumes domain adaptation under label shift and Positive-Unlabeled (PU) learning. The learner's goals here are two-fold: (a) estimate the target label distribution, including the novel class; and (b) learn a target classifier. First, we establish necessary and sufficient conditions for identifying these quantities. Second, motivated by advances in label shift and PU learning, we propose practical methods for both tasks that leverage black-box predictors. Unlike typical Open Set Domain Adaptation (OSDA) problems, which tend to be ill-posed and amenable only to heuristics, OSLS offers a well-posed problem amenable to more principled machinery. Experiments across numerous semi-synthetic benchmarks on vision, language, and medical datasets demonstrate that our methods consistently outperform OSDA baselines, achieving 10–25% improvements in target domain accuracy. Finally, we analyze the proposed methods, establishing finite-sample convergence to the true label marginal and convergence to the optimal classifier for linear models in a Gaussian setup.
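For concreteness, the sketch below shows the standard black-box label-shift ingredient (BBSE-style moment matching) that such methods build on. It covers only the known classes; the novel-class estimation and PU-learning components of the full OSLS procedure are omitted, and the function is an illustrative approximation rather than the paper's code.

```python
# Black-box label-marginal estimation (BBSE-style): solve C @ p_t = mu_t, where C is
# the source confusion matrix of a fixed classifier and mu_t is the distribution of
# its predictions on unlabeled target data. Known classes only; novel class omitted.
import numpy as np

def estimate_target_marginal(source_preds, source_labels, target_preds, num_classes):
    """Estimate p_t(y) from hard predictions of a fixed black-box source classifier."""
    # C[i, j] approximates p_s(predict = i, true label = j) on held-out source data.
    C = np.zeros((num_classes, num_classes))
    for pred, label in zip(source_preds, source_labels):
        C[pred, label] += 1
    C /= len(source_labels)

    # Distribution of the classifier's predictions on the unlabeled target data.
    mu_t = np.bincount(target_preds, minlength=num_classes) / len(target_preds)

    # Solve the moment-matching system; clip and renormalize to a valid distribution.
    p_t = np.linalg.lstsq(C, mu_t, rcond=None)[0]
    p_t = np.clip(p_t, 0, None)
    return p_t / p_t.sum()
```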