skip to main content


Title: Generating Natural Language Adversarial Examples
Deep neural networks (DNNs) are vulnera- ble to adversarial examples, perturbations to correctly classified examples which can cause the model to misclassify. In the image do- main, these perturbations are often virtually indistinguishable to human perception, caus- ing humans and state-of-the-art models to dis- agree. However, in the natural language do- main, small perturbations are clearly percep- tible, and the replacement of a single word can drastically alter the semantics of the doc- ument. Given these challenges, we use a black-box population-based optimization al- gorithm to generate semantically and syntac- tically similar adversarial examples that fool well-trained sentiment analysis and textual en- tailment models with success rates of 97% and 70%, respectively. We additionally demon- strate that 92.3% of the successful sentiment analysis adversarial examples are classified to their original label by 20 human annotators, and that the examples are perceptibly quite similar. Finally, we discuss an attempt to use adversarial training as a defense, but fail to yield improvement, demonstrating the strength and diversity of our adversarial examples. We hope our findings encourage researchers to pursue improving the robustness of DNNs in the natural language domain.  more » « less
Award ID(s):
1705135
NSF-PAR ID:
10112195
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Page Range / eLocation ID:
2890–2896
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Over recent years, devising classification algorithms that are robust to adversarial perturbations hasemerged as a challenging problem. In particular, deep neural nets (DNNs) seem to be susceptible tosmall imperceptible changes over test instances. However, the line of work inprovablerobustness,so far, has been focused oninformation theoreticrobustness, ruling out even theexistenceof anyadversarial examples. In this work, we study whether there is a hope to benefit fromalgorithmicnature of an attacker that searches for adversarial examples, and ask whether there isanylearning taskfor which it is possible to design classifiers that are only robust againstpolynomial-timeadversaries.Indeed, numerous cryptographic tasks (e.g. encryption of long messages) can only be secure againstcomputationally bounded adversaries, and are indeedimpossiblefor computationally unboundedattackers. Thus, it is natural to ask if the same strategy could help robust learning.We show that computational limitation of attackers can indeed be useful in robust learning bydemonstrating the possibility of a classifier for some learning task for which computational andinformation theoretic adversaries of bounded perturbations have very different power. Namely, whilecomputationally unbounded adversaries can attack successfully and find adversarial examples withsmall perturbation, polynomial time adversaries are unable to do so unless they can break standardcryptographic hardness assumptions. Our results, therefore, indicate that perhaps a similar approachto cryptography (relying on computational hardness) holds promise for achieving computationallyrobust machine learning. On the reverse directions, we also show that the existence of such learningtask in which computational robustness beats information theoretic robustness requires computationalhardness by implying (average-case) hardness o fNP. 
    more » « less
  2. Distinct scientific theories can make similar predictions. To adjudicate between theories, we must design experiments for which the theories make distinct predictions. Here we consider the problem of comparing deep neural networks as models of human visual recognition. To efficiently compare models’ ability to predict human responses, we synthesize controversial stimuli: images for which different models produce distinct responses. We applied this approach to two visual recognition tasks, handwritten digits (MNIST) and objects in small natural images (CIFAR-10). For each task, we synthesized controversial stimuli to maximize the disagreement among models which employed different architectures and recognition algorithms. Human subjects viewed hundreds of these stimuli, as well as natural examples, and judged the probability of presence of each digit/object category in each image. We quantified how accurately each model predicted the human judgments. The best-performing models were a generative analysis-by-synthesis model (based on variational autoencoders) for MNIST and a hybrid discriminative–generative joint energy model for CIFAR-10. These deep neural networks (DNNs), which model the distribution of images, performed better than purely discriminative DNNs, which learn only to map images to labels. None of the candidate models fully explained the human responses. Controversial stimuli generalize the concept of adversarial examples, obviating the need to assume a ground-truth model. Unlike natural images, controversial stimuli are not constrained to the stimulus distribution models are trained on, thus providing severe out-of-distribution tests that reveal the models’ inductive biases. Controversial stimuli therefore provide powerful probes of discrepancies between models and human perception.

     
    more » « less
  3. null (Ed.)
    Current text classification methods typically require a good number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Hu-mans can perform classification without seeing any labeled examples but only based on a small set of words describing the categories to be classified. In this paper, we explore the potential of only using the label name of each class to train classification models on un-labeled data, without using any labeled documents. We use pre-trained neural language models both as general linguistic knowledge sources for category understanding and as representation learning models for document classification. Our method (1) associates semantically related words with the label names, (2) finds category-indicative words and trains the model to predict their implied categories, and (3) generalizes the model via self-training. We show that our model achieves around 90% ac-curacy on four benchmark datasets including topic and sentiment classification without using any labeled documents but learning from unlabeled data supervised by at most 3 words (1 in most cases) per class as the label name1. 
    more » « less
  4. Abstract

    Deep neural networks (DNNs) are widely used to handle many difficult tasks, such as image classification and malware detection, and achieve outstanding performance. However, recent studies on adversarial examples, which have maliciously undetectable perturbations added to their original samples that are indistinguishable by human eyes but mislead the machine learning approaches, show that machine learning models are vulnerable to security attacks. Though various adversarial retraining techniques have been developed in the past few years, none of them is scalable. In this paper, we propose a new iterative adversarial retraining approach to robustify the model and to reduce the effectiveness of adversarial inputs on DNN models. The proposed method retrains the model with both Gaussian noise augmentation and adversarial generation techniques for better generalization. Furthermore, the ensemble model is utilized during the testing phase in order to increase the robust test accuracy. The results from our extensive experiments demonstrate that the proposed approach increases the robustness of the DNN model against various adversarial attacks, specifically, fast gradient sign attack, Carlini and Wagner (C&W) attack, Projected Gradient Descent (PGD) attack, and DeepFool attack. To be precise, the robust classifier obtained by our proposed approach can maintain a performance accuracy of 99% on average on the standard test set. Moreover, we empirically evaluate the runtime of two of the most effective adversarial attacks, i.e., C&W attack and BIM attack, to find that the C&W attack can utilize GPU for faster adversarial example generation than the BIM attack can. For this reason, we further develop a parallel implementation of the proposed approach. This parallel implementation makes the proposed approach scalable for large datasets and complex models.

     
    more » « less
  5. Defenses against adversarial examples, such as adversarial training, are typically tailored to a single perturbation type (e.g., small ℓ∞-noise). For other perturbations, these defenses offer no guarantees and, at times, even increase the model’s vulnerability. Our aim is to understand the reasons underlying this robustness trade-off, and to train models that are simultaneously robust to multiple perturbation types. We prove that a trade-off in robustness to different types of ℓp-bounded and spatial perturbations must exist in a natural and simple statistical setting. We corroborate our formal analysis by demonstrating similar robustness trade-offs on MNIST and CIFAR10. We propose new multi-perturbation adversarial training schemes, as well as an efficient attack for the ℓ1-norm, and use these to show that models trained against multiple attacks fail to achieve robustness competitive with that of models trained on each attack individually. In particular, we find that adversarial training with first-order ℓ∞, ℓ1 and ℓ2 attacks on MNIST achieves merely 50% robust accuracy, partly because of gradient-masking. Finally, we propose affine attacks that linearly interpolate between perturbation types and further degrade the accuracy of adversarially trained models. 
    more » « less