Generating Natural Language Adversarial Examples

Alzantot, Moustafa; Sharma, Yash; Elgohary, Ahmed; Ho, Bo-Jhang; Srivastava, Mani; Chang, Kai-Wei

Citation Details

Deep neural networks (DNNs) are vulnerable to adversarial examples, perturbations to correctly classified examples which can cause the model to misclassify. In the image domain, these perturbations are often virtually indistinguishable to human perception, causing humans and state-of-the-art models to disagree. However, in the natural language domain, small perturbations are clearly perceptible, and the replacement of a single word can drastically alter the semantics of the document. Given these challenges, we use a black-box population-based optimization algorithm to generate semantically and syntactically similar adversarial examples that fool well-trained sentiment analysis and textual entailment models with success rates of 97% and 70%, respectively. We additionally demonstrate that 92.3% of the successful sentiment analysis adversarial examples are classified to their original label by 20 human annotators, and that the examples are perceptibly quite similar. Finally, we discuss an attempt to use adversarial training as a defense, but fail to yield improvement, demonstrating the strength and diversity of our adversarial examples. We hope our findings encourage researchers to pursue improving the robustness of DNNs in the natural language domain.
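
The abstract names the attack only as a "black-box population-based optimization algorithm". As a rough illustration of that idea, the following is a minimal sketch of a genetic-style word-substitution search. It is not the authors' implementation: the synonym table `SYNONYMS`, the word weights `WEIGHTS`, the stand-in classifier `positive_prob`, and all parameter values are hypothetical and chosen only so the example runs on its own. The search queries the model for output probabilities only, which is what makes it black-box.

```python
import math
import random

# Hypothetical synonym table. A real attack would need a principled source
# of semantically and syntactically similar replacements.
SYNONYMS = {
    "great": ["good", "fine", "decent"],
    "loved": ["liked", "enjoyed"],
    "wonderful": ["nice", "pleasant"],
}

# Hypothetical weights for the toy stand-in classifier below.
WEIGHTS = {"great": 2.0, "loved": 2.0, "wonderful": 2.0,
           "good": 0.6, "enjoyed": 0.8, "nice": 0.4}


def positive_prob(words):
    """Toy black-box model: returns P(positive) for a list of words."""
    z = sum(WEIGHTS.get(w, 0.0) for w in words) - 1.5
    return 1.0 / (1.0 + math.exp(-z))


def perturb(words, original, rng):
    """Swap one substitutable position for a synonym of the original word."""
    slots = [i for i, w in enumerate(original) if w in SYNONYMS]
    child = list(words)
    if slots:
        i = rng.choice(slots)
        child[i] = rng.choice(SYNONYMS[original[i]])
    return child


def crossover(a, b, rng):
    """Build a child by taking each word from one of the two parents."""
    return [rng.choice(pair) for pair in zip(a, b)]


def attack(original, pop_size=8, generations=50, seed=0):
    """Search for a word-substituted input the model labels negative."""
    rng = random.Random(seed)
    population = [perturb(original, original, rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness is the model's probability of the target (negative) label.
        fitness = [1.0 - positive_prob(p) for p in population]
        best = max(range(pop_size), key=fitness.__getitem__)
        if fitness[best] > 0.5:
            return population[best]  # predicted label has flipped
        children = [population[best]]  # keep the best member unchanged
        while len(children) < pop_size:
            # Sample two parents with probability proportional to fitness.
            a, b = rng.choices(population, weights=fitness, k=2)
            children.append(perturb(crossover(a, b, rng), original, rng))
        population = children
    return None  # no adversarial example found within the budget
```

Calling `attack("i loved this wonderful movie it was great".split())` searches for a variant of the sentence in which only the words listed in `SYNONYMS` are replaced and for which `positive_prob` falls below 0.5. The toy model and word lists say nothing about the success rates reported in the abstract.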

Award ID(s): 1705135

PAR ID: 10112195

Author(s) / Creator(s): Alzantot, Moustafa; Sharma, Yash; Elgohary, Ahmed; Ho, Bo-Jhang; Srivastava, Mani; Chang, Kai-Wei

Date Published: 2018-01-01

Journal Name: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Page Range / eLocation ID: 2890–2896

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text: Accepted Manuscript 1.0
Conference Paper: The DOI is not currently available.
