Deep neural networks (DNNs) are widely used to handle many difficult tasks, such as image classification and malware detection, and achieve outstanding performance. However, recent studies on adversarial examples, which have maliciously undetectable perturbations added to their original samples that are indistinguishable by human eyes but mislead the machine learning approaches, show that machine learning models are vulnerable to security attacks. Though various adversarial retraining techniques have been developed in the past few years, none of them is scalable. In this paper, we propose a new iterative adversarial retraining approach to robustify the model and to reduce the effectiveness of adversarial inputs on DNN models. The proposed method retrains the model with both Gaussian noise augmentation and adversarial generation techniques for better generalization. Furthermore, the ensemble model is utilized during the testing phase in order to increase the robust test accuracy. The results from our extensive experiments demonstrate that the proposed approach increases the robustness of the DNN model against various adversarial attacks, specifically, fast gradient sign attack, Carlini and Wagner (C&W) attack, Projected Gradient Descent (PGD) attack, and DeepFool attack. To be precise, the robust classifier obtained by our proposed approach can maintain a performance accuracy of 99% on average on the standard test set. Moreover, we empirically evaluate the runtime of two of the most effective adversarial attacks, i.e., C&W attack and BIM attack, to find that the C&W attack can utilize GPU for faster adversarial example generation than the BIM attack can. For this reason, we further develop a parallel implementation of the proposed approach. This parallel implementation makes the proposed approach scalable for large datasets and complex models.
Explore the Transformation Space for Adversarial Images
Deep learning models are vulnerable to adversarial examples. Most of current adversarial attacks add pixel-wise perturbations restricted to some L^p-norm, and defense models are evaluated also on adversarial examples restricted inside L^p-norm balls. However, we wish to explore adversarial examples exist beyond L^p-norm balls and their implications for attacks and defenses. In this paper, we focus on adversarial images generated by transformations.
We start with color transformation and propose two gradient-based attacks. Since L^p-norm is inappropriate for measuring image quality in the transformation space, we use the similarity between transformations and the Structural Similarity Index. Next, we explore a larger transformation space consisting of combinations of color and affine transformations. We evaluate our transformation attacks on three data sets --- CIFAR10, SVHN, and ImageNet --- and their corresponding models. Finally, we perform retraining defenses to evaluate the strength of our attacks.
The results show that transformation attacks are powerful. They find high-quality adversarial images that have higher transferability and misclassification rates than C&W's L^p attacks, especially at high confidence levels. They are also significantly harder to defend against by retraining than C&W's L^p attacks. More importantly, exploring different attack spaces makes it more challenging to train a universally robust model.
more »
« less
- Award ID(s):
- 1801751
- NSF-PAR ID:
- 10156933
- Date Published:
- Journal Name:
- ACM Conference on Data and Application Security and Privacy
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract -
The pervasiveness of neural networks (NNs) in critical computer vision and image processing applications makes them very attractive for adversarial manipulation. A large body of existing research thoroughly investigates two broad categories of attacks targeting the integrity of NN models. The first category of attacks, commonly called Adversarial Examples, perturbs the model's inference by carefully adding noise into input examples. In the second category of attacks, adversaries try to manipulate the model during the training process by implanting Trojan backdoors. Researchers show that such attacks pose severe threats to the growing applications of NNs and propose several defenses against each attack type individually. However, such one-sided defense approaches leave potentially unknown risks in real-world scenarios when an adversary can unify different attacks to create new and more lethal ones bypassing existing defenses. In this work, we show how to jointly exploit adversarial perturbation and model poisoning vulnerabilities to practically launch a new stealthy attack, dubbed AdvTrojan. AdvTrojan is stealthy because it can be activated only when: 1) a carefully crafted adversarial perturbation is injected into the input examples during inference, and 2) a Trojan backdoor is implanted during the training process of the model. We leverage adversarial noise in the input space to move Trojan-infected examples across the model decision boundary, making it difficult to detect. The stealthiness behavior of AdvTrojan fools the users into accidentally trusting the infected model as a robust classifier against adversarial examples. AdvTrojan can be implemented by only poisoning the training data similar to conventional Trojan backdoor attacks. Our thorough analysis and extensive experiments on several benchmark datasets show that AdvTrojan can bypass existing defenses with a success rate close to 100% in most of our experimental scenarios and can be extended to attack federated learning as well as high-resolution images.more » « less
-
Defenses against adversarial examples, such as adversarial training, are typically tailored to a single perturbation type (e.g., small ℓ∞-noise). For other perturbations, these defenses offer no guarantees and, at times, even increase the model’s vulnerability. Our aim is to understand the reasons underlying this robustness trade-off, and to train models that are simultaneously robust to multiple perturbation types. We prove that a trade-off in robustness to different types of ℓp-bounded and spatial perturbations must exist in a natural and simple statistical setting. We corroborate our formal analysis by demonstrating similar robustness trade-offs on MNIST and CIFAR10. We propose new multi-perturbation adversarial training schemes, as well as an efficient attack for the ℓ1-norm, and use these to show that models trained against multiple attacks fail to achieve robustness competitive with that of models trained on each attack individually. In particular, we find that adversarial training with first-order ℓ∞, ℓ1 and ℓ2 attacks on MNIST achieves merely 50% robust accuracy, partly because of gradient-masking. Finally, we propose affine attacks that linearly interpolate between perturbation types and further degrade the accuracy of adversarially trained models.more » « less
-
Deepfake videos are getting better in quality and can be used for dangerous disinformation campaigns. The pressing need to detect these videos has motivated researchers to develop different types of detection models. Among them, the models that utilize temporal information (i.e., sequence-based models) are more effective at detection than the ones that only detect intra-frame discrepancies. Recent work has shown that the latter detection models can be fooled with adversarial examples, leveraging the rich literature on crafting adversarial (still) images. It is less clear, however, how well these attacks will work on sequence-based models that operate on information taken over multiple frames. In this paper, we explore the effectiveness of the Fast Gradient Sign Method (FGSM) and the Carlini-Wagner 𝐿2-norm attack to fool sequence-based deepfake detector models in both the white-box and black-box settings. The experimental results show that the attacks are effective with a maximum success rate of 99.72% and 67.14% in the white-box and black-box attack scenarios, respectively. This highlights the importance of developing more robust sequence-based deepfake detectors and opens up directions for future research.more » « less
-
null (Ed.)In the last couple of years, several adversarial attack methods based on different threat models have been proposed for the image classification problem. Most existing defenses consider additive threat models in which sample perturbations have bounded L_p norms. These defenses, however, can be vulnerable against adversarial attacks under non-additive threat models. An example of an attack method based on a non-additive threat model is the Wasserstein adversarial attack proposed by Wong et al. (2019), where the distance between an image and its adversarial example is determined by the Wasserstein metric ("earth-mover distance") between their normalized pixel intensities. Until now, there has been no certifiable defense against this type of attack. In this work, we propose the first defense with certified robustness against Wasserstein Adversarial attacks using randomized smoothing. We develop this certificate by considering the space of possible flows between images, and representing this space such that Wasserstein distance between images is upper-bounded by L_1 distance in this flow-space. We can then apply existing randomized smoothing certificates for the L_1 metric. In MNIST and CIFAR-10 datasets, we find that our proposed defense is also practically effective, demonstrating significantly improved accuracy under Wasserstein adversarial attack compared to unprotected models.more » « less