Alleviating the Fear of Losing Alignment in LLM Fine-tuning

Yang, Kang; Tao, Guanhong; Chen, Xun; Xu, Jun

doi:10.1109/SP61157.2025.00171

This content will become publicly available on May 12, 2026

Large language models (LLMs) have demonstrated revolutionary capabilities in understanding complex contexts and performing a wide range of tasks. However, LLMs can also answer questions that are unethical or harmful, raising concerns about their applications. A training strategy called alignment helps regulate LLMs' responses to such questions. Yet alignment can be unexpectedly compromised when an LLM is fine-tuned for downstream tasks. This paper focuses on recovering the alignment lost during fine-tuning. We observe that there are two distinct directions inherent in an aligned LLM: the aligned direction and the harmful direction. An LLM is inclined to answer questions in the aligned direction while refusing queries in the harmful direction. Therefore, we propose to recover the harmful direction of the fine-tuned model that has been compromised. Specifically, we restore a small subset of the fine-tuned model's weight parameters from the original aligned model using gradient descent. We also introduce a rollback mechanism to avoid aggressive recovery and maintain downstream task performance. Our evaluation on 125 fine-tuned LLMs demonstrates that our method can reduce their harmful rate (the percentage of harmful questions answered) from 33.25% to 1.74% without substantially sacrificing task performance. In contrast, existing methods either reduce the harmful rate only to a limited extent or significantly impair normal functionality. Our code is available at https://github.com/kangyangWHU/LLMAlignment
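
The abstract describes the recovery procedure only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation (their code is at the repository linked above). It assumes weights are plain dictionaries of float lists, that a hypothetical `scores` dictionary ranks how strongly each weight relates to the harmful direction (for instance a gradient magnitude on harmful prompts, which is our assumption), and that a hypothetical `task_score` callable measures downstream performance. The names `harmful_rate`, `restore_step`, `recover_with_rollback`, `fraction`, `max_rounds`, and `max_drop` are all invented for illustration.

```python
from typing import Callable, Dict, List

Weights = Dict[str, List[float]]


def harmful_rate(answered_flags: List[bool]) -> float:
    """Percentage of harmful questions that were answered rather than refused."""
    if not answered_flags:
        return 0.0
    return 100.0 * sum(answered_flags) / len(answered_flags)


def restore_step(current: Weights, aligned: Weights,
                 scores: Weights, fraction: float) -> Weights:
    """Return a copy of `current` in which the highest-scoring `fraction` of the
    entries that still differ from `aligned` are reset to their aligned values.
    `scores` is a hypothetical importance ranking, not the paper's criterion."""
    differing = [(scores[name][i], name, i)
                 for name, values in current.items()
                 for i, value in enumerate(values)
                 if value != aligned[name][i]]
    restored = {name: list(values) for name, values in current.items()}
    if not differing:
        return restored
    differing.sort(key=lambda item: item[0], reverse=True)
    k = max(1, int(len(differing) * fraction))
    for _, name, i in differing[:k]:
        restored[name][i] = aligned[name][i]
    return restored


def recover_with_rollback(finetuned: Weights, aligned: Weights, scores: Weights,
                          task_score: Callable[[Weights], float],
                          fraction: float = 0.1, max_rounds: int = 5,
                          max_drop: float = 0.02) -> Weights:
    """Restore weights round by round; if a round lowers the task score by more
    than `max_drop` relative to the fine-tuned baseline, discard that round
    (the rollback) and stop."""
    baseline = task_score(finetuned)
    current = {name: list(values) for name, values in finetuned.items()}
    for _ in range(max_rounds):
        candidate = restore_step(current, aligned, scores, fraction)
        if baseline - task_score(candidate) > max_drop:
            break  # roll back: keep the last accepted weights
        current = candidate
    return current
```

In this sketch the rollback is simply a refusal to accept a restoration round that costs too much task performance; how the paper selects weights and decides when to roll back may differ.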

Award ID(s): 2319880; 2029038

PAR ID: 10630534

Author(s) / Creator(s): Yang, Kang; Tao, Guanhong; Chen, Xun; Xu, Jun

Publisher / Repository: IEEE

Date Published: 2025-05-12

ISBN: 979-8-3315-2236-0

Page Range / eLocation ID: 2152 to 2170

Format(s): Medium: X

Location: San Francisco, CA, USA

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Conference Paper:
https://doi.org/10.1109/SP61157.2025.00171