Trojan Activation Attack: Red-Teaming Large Language Models using Steering Vectors for Safety-Alignment
- Award ID(s): 2506643
- PAR ID: 10576137
- Publisher / Repository: ACM International Conference on Information and Knowledge Management
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation

