NSF PAR Search | NSF Public Access Repository

Behavioral Analysis of Information Salience in Large Language Models

https://doi.org/10.18653/v1/2025.findings-acl.1204

Trienes, Jan; Schlötterer, Jörg; Li, Junyi Jessy; Seifert, Christin (July 2025, Findings of the Association for Computational Linguistics: ACL 2025)

Large Language Models (LLMs) excel at text summarization, a task that requires models to select content based on its importance. However, the exact notion of salience that LLMs have internalized remains unclear. To bridge this gap, we introduce an explainable framework to systematically derive and investigate information salience in LLMs through their summarization behavior. Using length-controlled summarization as a behavioral probe into the content selection process, and tracing the answerability of Questions Under Discussion throughout, we derive a proxy for how models prioritize information. Our experiments on 13 models across four datasets reveal that LLMs have a nuanced, hierarchical notion of salience, generally consistent across model families and sizes. While models show highly consistent behavior and hence salience patterns, this notion of salience cannot be accessed through introspection, and only weakly correlates with human perceptions of information salience.
Free, publicly-accessible full text available July 1, 2026
Do *they* mean ‘us’? Interpreting Referring Expression variation under Intergroup Bias

https://doi.org/10.18653/v1/2024.findings-emnlp.571

Govindarajan, Venkata S; Zang, Matianyu; Mahowald, Kyle; Beaver, David; Li, Junyi Jessy (November 2024, Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics)

The variations between in-group and out-group speech (intergroup bias) are subtle and could underlie many social phenomena like stereotype perpetuation and implicit bias. In this paper, we model intergroup bias as a tagging task on English sports comments from forums dedicated to fandom for NFL teams. We curate a dataset of over 6 million game-time comments from opposing perspectives (the teams in the game), each comment grounded in a non-linguistic description of the events that precipitated these comments (live win probabilities for each team). Expert and crowd annotations justify modeling the bias through tagging of implicit and explicit referring expressions and reveal the rich, contextual understanding of language and the world required for this task. For large-scale analysis of intergroup variation, we use LLMs for automated tagging, and discover that LLMs occasionally perform better when prompted with linguistic descriptions of the win probability at the time of the comment, rather than numerical probability. Further, large-scale tagging of comments using LLMs uncovers linear variations in the form of referent across win probabilities that distinguish in-group and out-group utterances.
Full Text Available
Using Natural Language Explanations to Rescale Human Judgments

Wadhwa, Manya; Chen, Jifan; Li, Junyi Jessy; Durrett, Greg (July 2024, First Conference on Language Modeling (COLM))

Full Text Available
Which questions should I answer? Salience Prediction of Inquisitive Questions

https://doi.org/10.18653/v1/2024.emnlp-main.1114

Wu, Yating; Mangla, Ritika Rajesh; Dimakis, Alex; Durrett, Greg; Li, Junyi Jessy (January 2024, Proceedings of the Conference on Empirical Methods in Natural Language Processing (published by Association for Computational Linguistics))

Full Text Available
InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification

https://doi.org/10.18653/v1/2024.acl-long.234

Trienes, Jan; Joseph, Sebastian; Schlötterer, Jörg; Seifert, Christin; Lo, Kyle; Xu, Wei; Wallace, Byron; Li, Junyi Jessy (January 2024, Association for Computational Linguistics)

Full Text Available
FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence

https://doi.org/10.18653/v1/2024.acl-long.459

Joseph, Sebastian; Chen, Lily; Trienes, Jan; Göke, Hannah; Coers, Monika; Xu, Wei; Wallace, Byron; Li, Junyi Jessy (January 2024, Association for Computational Linguistics)

Full Text Available
Learning to Refine with Fine-Grained Natural Language Feedback

https://doi.org/10.18653/v1/2024.findings-emnlp.716

Wadhwa, Manya; Zhao, Xinyu; Li, Junyi Jessy; Durrett, Greg (January 2024, Findings of the Association for Computational Linguistics: EMNLP 2024)

Full Text Available
Detection and Measurement of Syntactic Templates in Generated Text

https://doi.org/10.18653/v1/2024.emnlp-main.368

Shaib, Chantal; Elazar, Yanai; Li, Junyi Jessy; Wallace, Byron C (January 2024, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024))

The diversity of text can be measured beyond word-level features; however, existing diversity evaluation focuses primarily on word-level features. Here we propose a method for evaluating diversity over syntactic features to characterize general repetition in models, beyond frequent n-grams. Specifically, we define syntactic templates (e.g., strings comprising parts-of-speech) and show that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference texts. We find that most (76%) templates in model-generated text can be found in pre-training data (compared to only 35% of human-authored text), and are not overwritten during fine-tuning or alignment processes such as RLHF. The connection between templates in generated text and the pre-training data allows us to analyze syntactic templates in models where we do not have the pre-training data. We also find that templates as features are able to differentiate between models, tasks, and domains, and are useful for qualitatively evaluating common model constructions. Finally, we demonstrate the use of templates as a useful tool for analyzing style memorization of training data in LLMs.
Full Text Available
Multilingual Simplification of Medical Texts

https://doi.org/10.18653/v1/2023.emnlp-main.1037

Joseph, Sebastian; Kazanas, Kathryn; Reina, Keziah; Ramanathan, Vishnesh; Xu, Wei; Wallace, Byron; Li, Junyi (December 2023, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing)

Automated text simplification aims to produce simple versions of complex texts. This task is especially useful in the medical domain, where the latest medical findings are typically communicated via complex and technical articles. This creates barriers for laypeople seeking access to up-to-date medical findings, consequently impeding progress on health literacy. Most existing work on medical text simplification has focused on monolingual settings, with the result that such evidence would be available in only one language (most often, English). This work addresses this limitation via multilingual simplification, i.e., directly simplifying complex texts into simplified texts in multiple languages. We introduce MultiCochrane, the first sentence-aligned multilingual text simplification dataset for the medical domain in four languages: English, Spanish, French, and Farsi. We evaluate fine-tuned and zero-shot models across these languages with extensive human assessments and analyses. Although models can generate viable simplified texts, we identify several outstanding challenges that this dataset might be used to address.
Full Text Available
QUDeval: The Evaluation of Questions Under Discussion Discourse Parsing

https://doi.org/10.18653/v1/2023.emnlp-main.325

Wu, Yating; Mangla, Ritika; Durrett, Greg; Li, Junyi Jessy (December 2023, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing)

Questions Under Discussion (QUD) is a versatile linguistic framework in which discourse progresses by continuously asking questions and answering them. Automatic parsing of a discourse to produce a QUD structure thus entails a complex question generation task: given a document and an answer sentence, generate a question that satisfies linguistic constraints of QUD and can be grounded in an anchor sentence in prior context. These questions are known to be curiosity-driven and open-ended. This work introduces the first framework for the automatic evaluation of QUD parsing, instantiating the theoretical constraints of QUD in a concrete protocol. We present QUDeval, a dataset of fine-grained evaluation of 2,190 QUD questions generated from both fine-tuned systems and LLMs. Using QUDeval, we show that satisfying all constraints of QUD is still challenging for modern LLMs, and that existing evaluation metrics poorly approximate parser quality. Encouragingly, human-authored QUDs are scored highly by our human evaluators, suggesting that there is headroom for further progress on language modeling to improve both QUD parsing and QUD evaluation.
Full Text Available