NSF PAR Search | NSF Public Access Repository


Large Language Models are Capable of Offering Cognitive Reappraisal, if Guided

Zhan, Hongli; Zheng, Allen; Lee, Yoon Kyung; Suh, Jina; Li, Junyi Jessy; Ong, Desmond C (October 2025, First Conference on Language Modeling (COLM 2024))

Large language models (LLMs) have offered new opportunities for emotional support, and recent work has shown that they can produce empathic responses to people in distress. However, long-term mental well-being requires emotional self-regulation, where a one-time empathic response falls short. This work takes a first step by engaging with cognitive reappraisals, a strategy from psychology practitioners that uses language to targetedly change negative appraisals that an individual makes of the situation; such appraisals are known to sit at the root of human emotional experience. We hypothesize that psychologically grounded principles could enable such advanced psychology capabilities in LLMs, and design RESORT, which consists of a series of reappraisal constitutions across multiple dimensions that can be used as LLM instructions. We conduct a first-of-its-kind expert evaluation (by clinical psychologists with M.S. or Ph.D. degrees) of an LLM's zero-shot ability to generate cognitive reappraisal responses to medium-length social media messages asking for support. This fine-grained evaluation showed that even LLMs at the 7B scale guided by RESORT are capable of generating empathic responses that can help users reappraise their situations.
Free, publicly-accessible full text available October 7, 2026
SPRI: Aligning Large Language Models with Context-Situated Principles

Zhan, Hongli; Azmat, Muneeza; Horesh, Raya; Li, Junyi Jessy; Yurochkin, Mikhail (July 2025, Proceedings of the Forty-Second International Conference on Machine Learning (ICML 2025))

Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real-time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task that lead to performance on par with expert-crafted ones; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness.
Free, publicly-accessible full text available July 13, 2026
Behavioral Analysis of Information Salience in Large Language Models

https://doi.org/10.18653/v1/2025.findings-acl.1204

Trienes, Jan; Schlötterer, Jörg; Li, Junyi Jessy; Seifert, Christin (July 2025, Findings of the Association for Computational Linguistics: ACL 2025)

Large Language Models (LLMs) excel at text summarization, a task that requires models to select content based on its importance. However, the exact notion of salience that LLMs have internalized remains unclear. To bridge this gap, we introduce an explainable framework to systematically derive and investigate information salience in LLMs through their summarization behavior. Using length-controlled summarization as a behavioral probe into the content selection process, and tracing the answerability of Questions Under Discussion throughout, we derive a proxy for how models prioritize information. Our experiments on 13 models across four datasets reveal that LLMs have a nuanced, hierarchical notion of salience, generally consistent across model families and sizes. While models show highly consistent behavior and hence salience patterns, this notion of salience cannot be accessed through introspection, and only weakly correlates with human perceptions of information salience.
Free, publicly-accessible full text available July 1, 2026
A Linguistic Strategy to Measure Negative Affective Polarization Through Text Content

https://doi.org/10.1177/0261927X251338360

Garzón-Velandia, Diana Camila; Pennebaker, James W (May 2025, Journal of Language and Social Psychology)

Three studies developed and validated a linguistic dictionary to measure negative affective polarization in English and Spanish political texts. It captures three dimensions: negative affect, delegitimization, and political context. In the first study, two independent judges evaluated the candidate words, and reliability indicators were calculated, showing acceptable values for short texts (.572 in English, .541 in Spanish) and higher values for larger corpora (.964 in English, .957 in Spanish). The second study tested discriminant validity by comparing negative affective polarization scores in social media comments on politics and entertainment. Results showed significantly higher polarization scores in political content, confirming the dictionary's validity. The third study compared the dictionary to an existing online polarization measure, finding greater coverage and alignment with the construct. Additionally, polarization scores were higher in texts containing hate speech than in those without it. The findings suggest that the dictionary has strong psychometric properties in both languages, making it a valuable tool for analyzing online content, particularly social media comments. It can be used as an independent measure or as input for machine and deep learning models.
Is It JUST Semantics? A Case Study of Discourse Particle Understanding in LLMs

https://doi.org/10.18653/v1/2025.findings-acl.1117

Sheffield, William Berkeley; Misra, Kanishka; Pyatkin, Valentina; Deo, Ashwini; Mahowald, Kyle; Li, Junyi Jessy (January 2025, Findings of the Association for Computational Linguistics: ACL 2025)

Discourse particles are crucial elements that subtly shape the meaning of text. These words, often polyfunctional, give rise to nuanced and often quite disparate semantic/discourse effects, as exemplified by the diverse uses of the particle *just* (e.g., exclusive, temporal, emphatic). This work investigates the capacity of LLMs to distinguish the fine-grained senses of English *just*, a well-studied example in formal semantics, using data meticulously created and labeled by expert linguists. Our findings reveal that while LLMs exhibit some ability to differentiate between broader categories, they struggle to fully capture more subtle nuances, highlighting a gap in their understanding of discourse particles.
Full Text Available
Do *they* mean ‘us’? Interpreting Referring Expression variation under Intergroup Bias

https://doi.org/10.18653/v1/2024.findings-emnlp.571

Govindarajan, Venkata S; Zang, Matianyu; Mahowald, Kyle; Beaver, David; Li, Junyi Jessy (November 2024, Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics)

The variations between in-group and out-group speech (intergroup bias) are subtle and could underlie many social phenomena like stereotype perpetuation and implicit bias. In this paper, we model intergroup bias as a tagging task on English sports comments from forums dedicated to fandom for NFL teams. We curate a dataset of over 6 million game-time comments from opposing perspectives (the teams in the game), each comment grounded in a non-linguistic description of the events that precipitated these comments (live win probabilities for each team). Expert and crowd annotations justify modeling the bias through tagging of implicit and explicit referring expressions and reveal the rich, contextual understanding of language and the world required for this task. For large-scale analysis of intergroup variation, we use LLMs for automated tagging, and discover that LLMs occasionally perform better when prompted with linguistic descriptions of the win probability at the time of the comment, rather than numerical probability. Further, large-scale tagging of comments using LLMs uncovers linear variations in the form of referent across win probabilities that distinguish in-group and out-group utterances.
Full Text Available
Large Language Models Produce Responses Perceived to be Empathic

https://doi.org/10.1109/ACII63134.2024.00012

Lee, Yoon Kyung; Suh, Jina; Zhan, Hongli; Li, Junyi Jessy; Ong, Desmond C (September 2024, IEEE)

Large Language Models (LLMs) have demonstrated surprising performance on many tasks, including writing supportive messages that display empathy. Here, we had these models generate empathic messages in response to posts describing common life experiences, such as workplace situations, parenting, relationships, and other anxiety- and anger-eliciting situations. Across two studies (N=192, 202), we showed human raters a variety of responses written by several models (GPT4 Turbo, Llama2, and Mistral), and had people rate these responses on how empathic they seemed to be. We found that LLM-generated responses were consistently rated as more empathic than human-written responses. Linguistic analyses also show that these models write in distinct, predictable “styles”, in terms of their use of punctuation, emojis, and certain words. These results highlight the potential of using LLMs to enhance human peer support in contexts where empathy is important.
Full Text Available
Language Models (Mostly) Do Not Consider Emotion Triggers When Predicting Emotion

Singh, Smriti; Caragea, Cornelia; Li, Junyi Jessy (June 2024, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies)

Situations and events evoke emotions in humans, but to what extent do they inform the predictions of emotion detection models? This work investigates how well human-annotated emotion triggers correlate with features that models deem salient in their prediction of emotions. First, we introduce a novel dataset, EmoTrigger, consisting of 900 social media posts sourced from three different datasets; these were annotated by experts for emotion triggers with high agreement. Using EmoTrigger, we evaluate the ability of large language models (LLMs) to identify emotion triggers, and conduct a comparative analysis of the features considered important for these tasks between LLMs and fine-tuned models. Our analysis reveals that emotion triggers are largely not considered salient features by emotion prediction models; instead, there is an intricate interplay between various features and the task of emotion detection.
Full Text Available
Sarcasm Detection in a Disaster Context

Sosea, Tiberiu; Li, Junyi Jessy; Caragea, Cornelia (May 2024, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024))

During natural disasters, people often use social media platforms such as Twitter to ask for help, to provide information about the disaster situation, or to express contempt about the unfolding event or public policies and guidelines. This contempt is in some cases expressed as sarcasm or irony. Understanding this form of speech in a disaster-centric context is essential to improving natural language understanding of disaster-related tweets. In this paper, we introduce HurricaneSARC, a dataset of 15,000 tweets annotated for intended sarcasm, and provide a comprehensive investigation of sarcasm detection using pre-trained language models. Our best model is able to obtain as much as 0.70 F1 on our dataset. We also demonstrate that the performance on HurricaneSARC can be improved by leveraging intermediate task transfer learning.
Full Text Available
Detection and Measurement of Syntactic Templates in Generated Text

https://doi.org/10.18653/v1/2024.emnlp-main.368

Shaib, Chantal; Elazar, Yanai; Li, Junyi Jessy; Wallace, Byron C (January 2024, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024))

The diversity of text can be measured beyond word-level features; however, existing diversity evaluation focuses primarily on word-level features. Here we propose a method for evaluating diversity over syntactic features to characterize general repetition in models, beyond frequent n-grams. Specifically, we define syntactic templates (e.g., strings comprising parts-of-speech) and show that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference texts. We find that most (76%) templates in model-generated text can be found in pre-training data (compared to only 35% of human-authored text), and are not overwritten during fine-tuning or alignment processes such as RLHF. The connection between templates in generated text and the pre-training data allows us to analyze syntactic templates in models where we do not have the pre-training data. We also find that templates as features are able to differentiate between models, tasks, and domains, and are useful for qualitatively evaluating common model constructions. Finally, we demonstrate the use of templates as a useful tool for analyzing style memorization of training data in LLMs.
Full Text Available
