NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

Jiang, Fengqing; Xu, Zhangchen; Niu, Luyao; Lin, Bill Yuchen; Poovendran, Radha (February 2025, The 39th Annual AAAI Conference on Artificial Intelligence)

Large language models (LLMs) are expected to follow in- structions from users and engage in conversations. Tech- niques to enhance LLMs’ instruction-following capabilities typically fine-tune them using data structured according to a predefined chat template. Although chat templates are shown to be effective in optimizing LLM performance, their impact on safety alignment of LLMs has been less understood, which is crucial for deploying LLMs safely at scale. In this paper, we investigate how chat templates affect safety alignment of LLMs. We identify a common vulnerability, named ChatBug, that is introduced by chat templates. Our key insight to identify ChatBug is that the chat templates provide a rigid format that need to be followed by LLMs, but not by users. Hence, a malicious user may not necessar- ily follow the chat template when prompting LLMs. Instead, malicious users could leverage their knowledge of the chat template and accordingly craft their prompts to bypass safety alignments of LLMs. We study two attacks to exploit the ChatBug vulnerability. Additionally, we demonstrate that the success of multiple existing attacks can be attributed to the ChatBug vulnerability. We show that a malicious user can exploit the ChatBug vulnerability of eight state-of-the- art (SOTA) LLMs and effectively elicit unintended responses from these models. Moreover, we show that ChatBug can be exploited by existing jailbreak attacks to enhance their at- tack success rates. We investigate potential countermeasures to ChatBug. Our results show that while adversarial train- ing effectively mitigates the ChatBug vulnerability, the vic- tim model incurs significant performance degradation. These results highlight the trade-off between safety alignment and helpfulness. Developing new methods for instruction tuning to balance this trade-off is an open and critical direction for future research.
more » « less
Free, publicly-accessible full text available February 25, 2026
Who is Responsible? Explaining Safety Violations in Multi-Agent Cyber-Physical Systems

https://doi.org/10.1109/ICAA64256.2024.00012

Niu, Luyao; Zhang, Hongchao; Sahabandu, Dinuka; Ramasubramanian, Bhaskar; Clark, Andrew; Poovendran, Radha (October 2024, IEEE)

Full Text Available
EDC: Effective and Efficient Dialog Comprehension for Dialog State Tracking

Lu, Q; Ramasubramanian, B; Poovendran, Radha (June 2024, In Proceedings of findings of 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2024, findings))

Full Text Available
ACE: A model poisoning attack on contribution evaluation methods in federated learning

Xu, Z; Jiang, F; Niu, L; Jia, J; Li, Bo; Poovendran, Radha (August 2024, 33rd USENIX Security Symposium (USENIX Security 24))

Full Text Available
SafeDecoding: Defending against jailbreak attacks via safety-aware decoding

Xu, Z; Jiang, F; Niu, L; Jia, J; Li, Bo; Poovendran, Radha (August 2024, Annual Meeting of the Association for Computational Linguistics (ACL))

Full Text Available
ArtPrompt: ASCII art-based jailbreak attacks against aligned LLMs

Jiang, F; Xu, Z; Niu, L; Xiang, Z; Li, Bo; Poovendran, Radha (August 2024, Annual Meeting of the Association for Computational Linguistics (ACL))

Full Text Available
ArtPrompt: ASCII art-based jailbreak attacks against aligned LLMs

Jiang, F; Xu, Z; Niu, L; Xiang, Z; Li, Bo; Poovendran, Radha (August 2024, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15157–15173)

Safety is critical to the usage of large language models (LLMs). Multiple techniques such as data filtering and supervised fine tuning have been developed to strengthen LLM safety. However, currently known techniques presume that corpora used for safety alignment of LLMs are solely interpreted by semantics. This assumption, however, does not hold in real-world applications, which leads to severe vulnerabilities in LLMs. For example, users of forums often use ASCII art, a form of text-based art, to convey image information. In this paper, we propose a novel ASCII art-based jailbreak attack and introduce a comprehensive benchmark Vision-in-Text Challenge (VITC) to evaluate the capabilities of LLMs in recognizing prompts that cannot be solely interpreted by semantics. We show that five SOTA LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts provided in the form of ASCII art. Based on this observation, we develop the jailbreak attack ArtPrompt, which leverages the poor performance of LLMs in recognizing ASCII art to bypass safety measures and elicit undesired behaviors from LLMs. ArtPrompt only requires black-box access to the victim LLMs, making it a practical attack. We evaluate ArtPrompt on five SOTA LLMs, and show that ArtPrompt can effectively and efficiently induce undesired behaviors from all five LLMs. Our code is available at https: //github.com/uw-nsl/ArtPrompt.
more » « less
Full Text Available
Risk-Aware Distributed Multi-Agent Reinforcement Learning

https://doi.org/10.23919/ACC60939.2024.10644829

Al_Maruf, Abdullah; Niu, Luyao; Ramasubramanian, Bhaskar; Clark, Andrew; Poovendran, Radha (July 2024, IEEE)

Full Text Available
Fault Tolerant Neural Control Barrier Functions for Robotic Systems under Sensor Faults and Attacks

https://doi.org/10.1109/ICRA57147.2024.10610491

Zhang, Hongchao; Niu, Luyao; Clark, Andrew; Poovendran, Radha (May 2024, IEEE)

Full Text Available
Poster: Double-Dip: Thwarting Label-Only Membership Inference Attacks with Transfer Learning and Randomization

Rajabi, A; Pimple, R; Janardhanan, A; Asokraj, S; Ramasubramanian, B; Poovendran, Radha (July 2024, ACM Asia Conference on Computer and Communications Security (ACM AsiaCCS))

Full Text Available

« Prev Next »

Search for: All records