NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory

Mireshghallah, Niloofar; Kim, Hyunwoo; Zhou, Xuhui; Tsvetkov, Yulia; Sap, Maarten; Shokri, Reza; Choi, Yejin (May 2024, International Conference on Learning Representations)

Existing efforts on quantifying privacy implications for large language models (LLMs) solely focus on measuring leakage of training data. In this work, we shed light on the often-overlooked interactive settings where an LLM receives information from multiple sources and generates an output to be shared with other entities, creating the potential of exposing sensitive input data in inappropriate contexts. In these scenarios, humans nat- urally uphold privacy by choosing whether or not to disclose information depending on the context. We ask the question “Can LLMs demonstrate an equivalent discernment and reasoning capability when considering privacy in context?” We propose CONFAIDE, a benchmark grounded in the theory of contextual integrity and designed to identify critical weaknesses in the privacy reasoning capabilities of instruction-tuned LLMs. CONFAIDE consists of four tiers, gradually increasing in complexity, with the final tier evaluating contextual privacy reasoning and theory of mind capabilities. Our experiments show that even commercial models such as GPT-4 and ChatGPT reveal private information in contexts that humans would not, 39% and 57% of the time, respectively, highlighting the urgent need for a new direction of privacy-preserving approaches as we demonstrate a larger underlying problem stemmed in the models’ lack of reasoning capabilities.
more » « less
Full Text Available
SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents

https://doi.org/10.18653/v1/2024.acl-long.698

Wang, Ruiyi; Yu, Haofei; Zhang, Wenxin; Qi, Zhengyang; Sap, Maarten; Bisk, Yonatan; Neubig, Graham; Zhu, Hao (January 2024, Association for Computational Linguistics)

Full Text Available
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents

Zhou, Xuhui; Zhu, Hao; Mathur, Leena; Zhang, Ruohong; Yu, Haofei; Qi, Zhengyang; Morency, Louis-Philippe; Bisk, Yonatan; Fried, Daniel; Neubig, Graham; et al (January 2024, The Twelfth International Conference on Learning Representations)

Full Text Available
NLPositionality: Characterizing Design Biases of Datasets and Models

https://doi.org/10.18653/v1/2023.acl-long.505

Santy, Sebastin; Liang, Jenny; Le Bras, Ronan; Reinecke, Katharina; Sap, Maarten (January 2023, Association for Computational Linguistics)
COBRA Frames: Contextual Reasoning about Effects and Harms of Offensive Statements

https://doi.org/10.18653/v1/2023.findings-acl.392

Zhou, Xuhui; Zhu, Hao; Yerukola, Akhila; Davidson, Thomas; Hwang, Jena D.; Swayamdipta, Swabha; Sap, Maarten (January 2023, Findings of the Association for Computational Linguistics: ACL 2023)

Full Text Available
Misinfo Reaction Frames: Reasoning about Readers’ Reactions to News Headlines

https://doi.org/10.18653/v1/2022.acl-long.222

Gabriel, Saadia; Hallinan, Skyler; Sap, Maarten; Nguyen, Pemi; Roesner, Franziska; Choi, Eunsol; Choi, Yejin (January 2022, ACL)

Full Text Available
Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts

https://doi.org/10.18653/v1/2021.emnlp-main.397

Baheti, Ashutosh; Sap, Maarten; Ritter, Alan; Riedl, Mark (January 2021, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing)

Full Text Available
Challenges in Automated Debiasing for Toxic Language Detection

https://doi.org/10.18653/v1/2021.eacl-main.274

Zhou, Xuhui; Sap, Maarten; Swayamdipta, Swabha; Choi, Yejin; Smith, Noah (January 2021, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume)

Biased associations have been a challenge in the development of classifiers for detecting toxic language, hindering both fairness and accuracy. As potential solutions, we investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English). Our comprehensive experiments establish that existing methods are limited in their ability to prevent biased behavior in current toxicity detectors. We then propose an automatic, dialect-aware data correction method, as a proof-of-concept. Despite the use of synthetic labels, this method reduces dialectal associations with toxicity. Overall, our findings show that debiasing a model trained on biased toxic language data is not as effective as simply relabeling the data to remove existing biases.
more » « less
Full Text Available
SOCIAL CHEMISTRY 101: Learning to Reason about Social and Moral Norms

Forbes, Maxwell; Hwang, Jena D; Shwartz, Vered; Sap, Maarten; Choi, Yejin (October 2020, EMNLP)

Full Text Available
DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts

https://doi.org/10.18653/v1/2021.acl-long.522

Liu, Alisa; Sap, Maarten; Lu, Ximing; Swayamdipta, Swabha; Bhagavatula, Chandra; Smith, Noah A.; Choi, Yejin (January 2021, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers))

Despite recent advances in natural language generation, it remains challenging to control attributes of generated text. We propose DExperts: Decoding-time Experts, a decoding-time method for controlled text generation that combines a pretrained language model with “expert” LMs and/or “anti-expert” LMs in a product of experts. Intuitively, under the ensemble, tokens only get high probability if they are considered likely by the experts, and unlikely by the anti-experts. We apply DExperts to language detoxification and sentiment-controlled generation, where we outperform existing controllable generation methods on both automatic and human evaluations. Moreover, because DExperts operates only on the output of the pretrained LM, it is effective with (anti-)experts of smaller size, including when operating on GPT-3. Our work highlights the promise of tuning small LMs on text with (un)desirable attributes for efficient decoding-time steering.
more » « less
Full Text Available

« Prev Next »

Search for: All records