BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models

Zeng, Yi; Sun, Weiyu; Huynh, Tran; Song, Dawn; Li, Bo; Jia, Ruoxi

doi:10.18653/v1/2024.emnlp-main.732

Citation Details

Award ID(s): 2424127

PAR ID: 10628157

Author(s) / Creator(s): Zeng, Yi; Sun, Weiyu; Huynh, Tran; Song, Dawn; Li, Bo; Jia, Ruoxi

Publisher / Repository: Association for Computational Linguistics

Date Published: 2024-11-12

Page Range / eLocation ID: 13189 to 13215

Format(s): Medium: X

Location: Miami, Florida, USA

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript
Conference Paper:
https://doi.org/10.18653/v1/2024.emnlp-main.732
