A First Look at Toxicity Injection Attacks on Open-domain Chatbots

Weeks, Connor; Cheruvu, Aravind; Abdullah, Sifat Muhammad; Kanchi, Shravya; Yao, Daphne; Viswanath, Bimal

doi:10.1145/3627106.3627122

Citation Details

A First Look at Toxicity Injection Attacks on Open-domain Chatbots

Chatbot systems have improved significantly because of the advances made in language modeling. These machine learning systems follow an end-to-end data-driven learning paradigm and are trained on large conversational datasets. Imperfections or harmful biases in the training datasets can cause the models to learn toxic behavior, and thereby expose their users to harmful responses. Prior work has focused on measuring the inherent toxicity of such chatbots, by devising queries that are more likely to produce toxic responses. In this work, we ask the question: How easy or hard is it to inject toxicity into a chatbot after deployment? We study this in a practical scenario known as Dialog-based Learning (DBL), where a chatbot is periodically trained on recent conversations with its users after deployment. A DBL setting can be exploited to poison the training dataset for each training cycle. Our attacks would allow an adversary to manipulate the degree of toxicity in a model and also enable control over what type of queries can trigger a toxic response. Our fully automated attacks only require LLM-based software agents masquerading as (malicious) users to inject high levels of toxicity. We systematically explore the vulnerability of popular chatbot pipelines to this threat. Lastly, we show that several existing toxicity mitigation strategies (designed for chatbots) can be significantly weakened by adaptive attackers. more »

Award ID(s):: 2231002

PAR ID:: 10488549

Author(s) / Creator(s):: Weeks, Connor; Cheruvu, Aravind; Abdullah, Sifat Muhammad; Kanchi, Shravya; Yao, Daphne; Viswanath, Bimal

Publisher / Repository:: Proceedings of the 39th Annual Computer Security Applications Conference (ACSAC)

Date Published:: 2023-12-04

Journal Name:: Proceedings of the 39th Annual Computer Security Applications Conference (ACSAC)

ISBN:: 9798400708862

Page Range / eLocation ID:: 521 to 534

Format(s):: Medium: X

Location:: Austin TX USA

Sponsoring Org:: National Science Foundation

Conference Paper:
https://doi.org/10.1145/3627106.3627122

More Like this