Backdoor attacks have been shown to be a serious threat against deep learning systems such as biometric authentication and autonomous driving. An effective backdoor attack could enforce the model misbehave under certain predefined conditions, i.e., triggers, but behave normally otherwise. The triggers of existing attacks are mainly injected in the pixel space, which tend to be visually identifiable at both training and inference stages and detectable by existing defenses. In this paper, we propose a simple but effective and invisible black-box backdoor attack FTROJAN through trojaning the frequency domain. The key intuition is that triggering perturbations in the frequency domain correspond to small pixel-wise perturbations dispersed across the entire image, breaking the underlying assumptions of existing defenses and making the poisoning images visually indistinguishable from clean ones. Extensive experimental evaluations show that FTROJAN is highly effective and the poisoning images retain high perceptual quality. Moreover, we show that FTROJAN can robustly elude or significantly degenerate the performance of existing defenses.
A Synergetic Attack against Neural Network Classifiers combining Backdoor and Adversarial Examples
The pervasiveness of neural networks (NNs) in critical computer vision and image processing applications makes them very attractive for adversarial manipulation. A large body of existing research thoroughly investigates two broad categories of attacks targeting the integrity of NN models. The first category of attacks, commonly called Adversarial Examples, perturbs the model's inference by carefully adding noise into input examples. In the second category of attacks, adversaries try to manipulate the model during the training process by implanting Trojan backdoors. Researchers show that such attacks pose severe threats to the growing applications of NNs and propose several defenses against each attack type individually. However, such one-sided defense approaches leave potentially unknown risks in real-world scenarios when an adversary can unify different attacks to create new and more lethal ones bypassing existing defenses. In this work, we show how to jointly exploit adversarial perturbation and model poisoning vulnerabilities to practically launch a new stealthy attack, dubbed AdvTrojan. AdvTrojan is stealthy because it can be activated only when: 1) a carefully crafted adversarial perturbation is injected into the input examples during inference, and 2) a Trojan backdoor is implanted during the training process of the model. We leverage adversarial noise in the more »
- Publication Date:
- NSF-PAR ID:
- Journal Name:
- IEEE International Conference on Big Data
- Sponsoring Org:
- National Science Foundation
More Like this
Federated learning (FL) allows a set of agents to collaboratively train a model without sharing their potentially sensitive data. This makes FL suitable for privacy-preserving applications. At the same time, FL is susceptible to adversarial attacks due to decentralized and unvetted data. One important line of attacks against FL is the backdoor attacks. In a backdoor attack, an adversary tries to embed a backdoor functionality to the model during training that can later be activated to cause a desired misclassification. To prevent backdoor attacks, we propose a lightweight defense that requires minimal change to the FL protocol. At a high level, our defense is based on carefully adjusting the aggregation server's learning rate, per dimension and per round, based on the sign information of agents' updates. We first conjecture the necessary steps to carry a successful backdoor attack in FL setting, and then, explicitly formulate the defense based on our conjecture. Through experiments, we provide empirical evidence that supports our conjecture, and we test our defense against backdoor attacks under different settings. We observe that either backdoor is completely eliminated, or its accuracy is significantly reduced. Overall, our experiments suggest that our defense significantly outperforms some of the recently proposedmore »
Data poisoning attacks and backdoor attacks aim to corrupt a machine learning classifier via modifying, adding, and/or removing some carefully selected training examples, such that the corrupted classifier makes incorrect predictions as the attacker desires. The key idea of state-of-the-art certified defenses against data poisoning attacks and backdoor attacks is to create a majority vote mechanism to predict the label of a testing example. Moreover, each voter is a base classifier trained on a subset of the training dataset. Classical simple learning algorithms such as k nearest neighbors (kNN) and radius nearest neighbors (rNN) have intrinsic majority vote mechanisms. In this work, we show that the intrinsic majority vote mechanisms in kNN and rNN already provide certified robustness guarantees against data poisoning attacks and backdoor attacks. Moreover, our evaluation results on MNIST and CIFAR10 show that the intrinsic certified robustness guarantees of kNN and rNN outperform those provided by state-of-the-art certified defenses. Our results serve as standard baselines for future certified defenses against data poisoning attacks and backdoor attacks.
A backdoor data poisoning attack is an adversarial attack wherein the attacker injects several watermarked, mislabeled training examples into a training set. The watermark does not impact the test-time performance of the model on typical data; however, the model reliably errs on watermarked examples. To gain a better foundational understanding of backdoor data poisoning attacks, we present a formal theoretical framework within which one can discuss backdoor data poisoning attacks for classification problems. We then use this to analyze important statistical and computational issues surrounding these attacks. On the statistical front, we identify a parameter we call the memorization capacity that captures the intrinsic vulnerability of a learning problem to a backdoor attack. This allows us to argue about the robustness of several natural learning problems to backdoor attacks. Our results favoring the attacker involve presenting explicit constructions of backdoor attacks, and our robustness results show that some natural problem settings cannot yield successful backdoor attacks. From a computational standpoint, we show that under certain assumptions, adversarial training can detect the presence of backdoors in a training set. We then show that under similar assumptions, two closely related problems we call backdoor filtering and robust generalization are nearly equivalent. Thismore »
Defenses against adversarial examples, such as adversarial training, are typically tailored to a single perturbation type (e.g., small Linf-noise). For other perturbations, these defenses offer no guarantees and, at times, even increase the model’s vulnerability. Our aim is to understand the reasons underlying this robustness trade-off, and to train models that are simultaneously robust to multiple perturbation types. We prove that a trade-off in robustness to different types of Lp-bounded and spatial perturbations must exist in a natural and simple statistical setting. We corroborate our formal analysis by demonstrating similar robustness trade-offs on MNIST and CIFAR10. Building upon new multi-perturbation adversarial training schemes, and a novel efficient attack for finding L1-bounded adversarial examples, we show that no model trained against multiple attacks achieves robustness competitive with that of models trained on each attack individually. In particular, we uncover a pernicious gradient-masking phenomenon on MNIST, which causes adversarial training with first-order Linf, L1 and L2 adversaries to achieve merely 50% accuracy. Our results question the viability and computational scalability of extending adversarial robustness, and adversarial training, to multiple perturbation types.