NSF PAR Search | NSF Public Access Repository

OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models

Cheng, Ziheng; Huang, Yixiao; Xu, Hui; Sojoudi, Somayeh; Zhao, Xuandong; Song, Dawn; Mei, Song (September 2025, Conference on Neural Information Processing Systems)

Free, publicly-accessible full text available September 18, 2026
HADES: Range-Filtered Private Aggregation on Public Data

Liu, Xiaoyuan; Trieu, Ni; Gupta, Trinabh; Ahmad, Ishtiyaque; Song, Dawn (July 2025, International Conference on Very Large Data Bases (VLDB) 2025)

Free, publicly-accessible full text available July 3, 2026
DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks

https://doi.org/10.1109/SP61157.2025.00250

Liu, Yupei; Jia, Yuqi; Jia, Jinyuan; Song, Dawn; Gong, Neil Zhenqiang (May 2025, IEEE)

Free, publicly-accessible full text available May 12, 2026
Advancing science- and evidence-based AI policy

https://doi.org/10.1126/science.adu8449

Bommasani, Rishi; Arora, Sanjeev; Chayes, Jennifer; Choi, Yejin; Cuéllar, Mariano-Florentino; Fei-Fei, Li; Ho, Daniel E; Jurafsky, Dan; Koyejo, Sanmi; Lakkaraju, Hima; et al (July 2025, Science)

Policy must be informed by, but also facilitate the generation of, scientific evidence
Free, publicly-accessible full text available July 31, 2026
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models

https://doi.org/10.18653/v1/2024.emnlp-main.732

Zeng, Yi; Sun, Weiyu; Huynh, Tran; Song, Dawn; Li, Bo; Jia, Ruoxi (November 2024, Association for Computational Linguistics)

Full Text Available
RedCode: Risky Code Execution and Generation Benchmark for Code Agents

Guo, Chengquan; Liu, Xun; Xie, Chulin; Zhou, Andy; Zeng, Yi; Lin, Zinan; Song, Dawn; Li, Bo (December 2024, Proceedings of the Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS))

With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding and software development, safety and security concerns, such as generating or executing malicious code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations on the safety of code agents, we propose RedCode, an evaluation platform with benchmarks grounded in four key principles: real interaction with systems, holistic evaluation of unsafe code generation and execution, diverse input formats, and high-quality safety scenarios and tests. RedCode consists of two parts to evaluate agents’ safety in unsafe code execution and generation: (1) RedCode-Exec provides challenging code prompts in Python as inputs, aiming to evaluate code agents’ ability to recognize and handle unsafe code. We then map the Python code to other programming languages (e.g., Bash) and natural text summaries or descriptions for evaluation, leading to a total of over 4,000 testing instances. We provide 25 types of critical vulnerabilities spanning various domains, such as websites, file systems, and operating systems. We provide a Docker sandbox environment to evaluate the execution capabilities of code agents and design corresponding evaluation metrics to assess their execution results. (2) RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code or software. Our empirical findings, derived from evaluating three agent frameworks based on 19 LLMs, provide insights into code agents’ vulnerabilities. For instance, evaluations on RedCode-Exec show that agents are more likely to reject executing unsafe operations on the operating system, but are less likely to reject executing technically buggy code, indicating high risks. Unsafe operations described in natural text lead to a lower rejection rate than those in code format. Additionally, evaluations on RedCode-Gen reveal that more capable base models and agents with stronger overall coding abilities, such as GPT4, tend to produce more sophisticated and effective harmful software. Our findings highlight the need for stringent safety evaluations for diverse code agents. Our dataset and code are publicly available at https://github.com/AI-secure/RedCode.
Full Text Available
GREATS: Online Selection of High-Quality Data for LLM Training in Every Iteration

Wang, Jiachen T; Wu, Tong; Song, Dawn; Mittal, Prateek; Jia, Ruoxi (September 2024, Conference on Neural Information Processing Systems (NeurIPS))

Full Text Available
GRATH: Gradual Self-Truthifying for Large Language Models

Chen, Weixin; Song, Dawn; Li, Bo (July 2024, International Conference on Machine Learning (ICML 2024))

Truthfulness is paramount for large language models (LLMs) as they are increasingly deployed in real-world applications. However, existing LLMs still struggle with generating truthful content, as evidenced by their modest performance on benchmarks like TruthfulQA. To address this issue, we propose GRAdual self-truTHifying (GRATH), a novel post-processing method to enhance truthfulness of LLMs. GRATH utilizes out-of-domain question prompts to generate pairwise truthfulness training data with each pair containing a question and its correct and incorrect answers, and then optimizes the model via direct preference optimization (DPO) to learn from the truthfulness difference between answer pairs. GRATH iteratively refines truthfulness data and updates the model, leading to a gradual improvement in model truthfulness in a self-supervised manner. Empirically, we evaluate GRATH using different 7B-LLMs and compare with LLMs with similar or even larger sizes on benchmark datasets. Our results show that GRATH effectively improves LLMs’ truthfulness without compromising other core capabilities. Notably, GRATH achieves state-of-the-art performance on TruthfulQA, with MC1 accuracy of 54.71% and MC2 accuracy of 69.10%, which even surpass those on 70B-LLMs. The code is available at https://github.com/chenweixin107/GRATH.
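
As an illustration of the preference-optimization step the abstract mentions, the following is a minimal sketch of the standard DPO loss for a single (correct answer, incorrect answer) pair. It is not taken from the GRATH repository: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made here for illustration, and the log-probabilities are assumed to be summed token log-probabilities of each answer under the trained model and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    "chosen" is the correct answer and "rejected" the incorrect one.
    beta scales how strongly the policy is pushed away from the reference.
    """
    # How much more the policy favors each answer than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raises the correct answer and lowers the incorrect one: lower loss.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

In this sketch the loss falls below log(2) only when the policy favors the correct answer over the incorrect one more strongly than the reference model does, which is the "truthfulness difference" signal the abstract describes learning from.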
Full Text Available
AIR-BENCH 2024: A Safety Benchmark based on Regulation and Policies Specified Risk Categories

Zeng, Yi; Yang, Yu; Zhou, Andy; Tan, Jeffrey; Tu, Yuheng; Mai, Yifan; Klyman, Kevin; Pan, Minzhou; Jia, Ruoxi; Song, Dawn; et al (January 2025, International Conference on Learning Representations (ICLR))

Free, publicly-accessible full text available January 22, 2026
Boosting Alignment for Post-Unlearning Text-to-Image Generative Models

Ko, Myeongseob; Li, Henry; Wang, Zhun; Patsenker, Jonathan; Wang, Jiachen T; Li, Qinbin; Jin, Ming; Song, Dawn; Jia, Ruoxi (December 2024, Conference on Neural Information Processing Systems (NeurIPS))

Full Text Available
