Vaccine: Perturbation-aware alignment for large language model

Huang, Tiansheng; Hu, Sihao; Liu, Ling

Citation Details

The new paradigm of fine-tuning-as-a-service introduces a new attack surface for Large Language Models (LLMs): a few harmful data uploaded by users can easily trick the fine-tuning to produce an alignment-broken model. We conduct an empirical analysis and uncover a harmful embedding drift phenomenon, showing a probable cause of the alignment-broken effect. Inspired by our findings, we propose Vaccine, a perturbation-aware alignment technique to mitigate the security risk of users fine-tuning. The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbation to them in the alignment phase. This enables the embeddings to withstand harmful perturbation from un-sanitized user data in the fine-tuning phase. Our results on open source mainstream LLMs (e.g., Llama2, Opt, Vicuna) demonstrate that Vaccine can boost the robustness of alignment against harmful prompts induced embedding drift while reserving reasoning ability towards benign prompts. Our code is available at https://github.com/git-disl/Vaccine. more »

Award ID(s):: 2302720

PAR ID:: 10612915

Author(s) / Creator(s):: Huang, Tiansheng; Hu, Sihao; Liu, Ling

Publisher / Repository:: The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)

Date Published:: 2024-12-09

Subject(s) / Keyword(s):: GenAI, Safety

Format(s):: Medium: X

Location:: Vancouver, Canada.

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Conference Proceeding:
The DOI is not currently available.

More Like this