Title: Gradient-based Analysis of NLP Models is Manipulable
Gradient-based analysis methods, such as saliency map visualizations and adversarial input perturbations, have found widespread use in interpreting neural NLP models due to their simplicity, flexibility, and most importantly, the fact that they directly reflect the model internals. In this paper, however, we demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses. In particular, we merge the layers of a target model with a Facade Model that overwhelms the gradients without affecting the predictions. This Facade Model can be trained to have gradients that are misleading and irrelevant to the task, such as focusing only on the stop words in the input. On a variety of NLP tasks (sentiment analysis, NLI, and QA), we show that the merged model effectively fools different analysis tools: saliency maps differ significantly from the original model’s, input reduction keeps more irrelevant input tokens, and adversarial perturbations identify unimportant tokens as being highly important.
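To make the merging idea concrete, here is a minimal PyTorch-style sketch, assuming an additive combination of the two models' logits; the names and the exact merging scheme are illustrative assumptions rather than the paper's construction. The facade contributes (nearly) class-constant logits, so predictions are unchanged, while gradient-based saliency over the input embeddings reflects the facade.

```python
import torch
import torch.nn as nn

class MergedModel(nn.Module):
    """Sketch of the merging idea: predictions come from the target model,
    while input gradients are dominated by the facade model (hypothetical
    additive combination; the paper's layer merging may differ)."""

    def __init__(self, target_model: nn.Module, facade_model: nn.Module):
        super().__init__()
        self.target = target_model
        self.facade = facade_model

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # The facade is trained so its logits are nearly constant across
        # classes (adding them leaves the argmax unchanged), but its input
        # gradients are large and task-irrelevant, e.g. focused on stop words.
        return self.target(embeddings) + self.facade(embeddings)


def token_saliency(model: nn.Module, embeddings: torch.Tensor, label: int) -> torch.Tensor:
    """Gradient-norm saliency per input token; on the merged model these
    scores reflect the facade rather than the target."""
    embeddings = embeddings.clone().detach().requires_grad_(True)
    logits = model(embeddings)
    logits[0, label].backward()
    return embeddings.grad.norm(dim=-1)
```

Running token_saliency on the merged model would then highlight whatever the facade was trained to emphasize (e.g., stop words) rather than the tokens the target model actually relies on.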
Award ID(s): 1756023
PAR ID: 10347371
Author(s) / Creator(s):
Date Published:
Journal Name: Findings of the Association for Computational Linguistics: EMNLP 2020
Page Range / eLocation ID: 247 to 258
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Explainable NLP techniques primarily explain by answering “Which tokens in the input are responsible for this prediction?”. We argue that for NLP models that make predictions by comparing two input texts, it is more useful to explain by answering “What differences between the two inputs explain this prediction?”. We introduce a technique to generate contrastive phrasal highlights that explain the predictions of a semantic divergence model via phrase alignment guided erasure. We show that the resulting highlights match human rationales of cross-lingual semantic differences better than popular post-hoc saliency techniques and that they successfully help people detect fine-grained meaning differences in human translations and critical machine translation errors. 
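A minimal sketch of the phrase-alignment-guided erasure idea, assuming a black-box divergence scorer and an external phrase aligner; every name below is illustrative and not the authors' implementation.

```python
from typing import Callable, List, Tuple

def contrastive_highlights(
    score_fn: Callable[[str, str], float],    # divergence score for a text pair
    src: str,
    tgt: str,
    aligned_phrases: List[Tuple[str, str]],   # phrase pairs from an external aligner
    top_k: int = 3,
) -> List[Tuple[str, str]]:
    """Rank aligned phrase pairs by how much erasing them reduces the
    predicted semantic divergence between the two inputs."""
    base = score_fn(src, tgt)
    drops = []
    for src_phrase, tgt_phrase in aligned_phrases:
        reduced_src = src.replace(src_phrase, "")
        reduced_tgt = tgt.replace(tgt_phrase, "")
        drop = base - score_fn(reduced_src, reduced_tgt)
        drops.append((drop, (src_phrase, tgt_phrase)))
    drops.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in drops[:top_k]]
```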
  2. Adversarial attacks against machine learning models have threatened various real-world applications such as spam filtering and sentiment analysis. In this paper, we propose a novel framework, learning to discriminate perturbations (DISP), to identify and adjust malicious perturbations, thereby blocking adversarial attacks for text classification models. To identify adversarial attacks, a perturbation discriminator estimates how likely it is that each token in the text has been perturbed and provides a set of potential perturbations. For each potential perturbation, an embedding estimator learns to restore the embedding of the original word based on the context, and a replacement token is chosen via approximate kNN search. DISP can block adversarial attacks for any NLP model without modifying the model structure or training procedure. Extensive experiments on two benchmark datasets demonstrate that DISP significantly outperforms baseline methods in blocking adversarial attacks for text classification. In addition, in-depth analysis shows the robustness of DISP across different situations.
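The restoration step of DISP can be sketched as follows; brute-force cosine similarity stands in for the approximate kNN search, and all names are illustrative assumptions.

```python
import numpy as np

def restore_token(
    estimated_embedding: np.ndarray,   # output of the embedding estimator, shape (d,)
    vocab: list,                       # vocabulary words, length |V|
    vocab_embeddings: np.ndarray,      # word embedding matrix, shape (|V|, d)
    k: int = 5,
) -> str:
    """Pick a replacement for a suspected-perturbed token by nearest-neighbour
    search in embedding space; a real system would use an ANN index."""
    sims = vocab_embeddings @ estimated_embedding
    sims = sims / (np.linalg.norm(vocab_embeddings, axis=1)
                   * np.linalg.norm(estimated_embedding) + 1e-9)
    candidates = np.argsort(-sims)[:k]   # top-k most similar vocabulary entries
    return vocab[candidates[0]]          # simplest choice: the closest candidate
```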
  3. Neural NLP models are increasingly accurate but are imperfect and opaque—they break in counterintuitive ways and leave end users puzzled at their behavior. Model interpretation methods ameliorate this opacity by providing explanations for specific model predictions. Unfortunately, existing interpretation codebases make it difficult to apply these methods to new models and tasks, which hinders adoption for practitioners and burdens interpretability researchers. We introduce AllenNLP Interpret, a flexible framework for interpreting NLP models. The toolkit provides interpretation primitives (e.g., input gradients) for any AllenNLP model and task, a suite of built-in interpretation methods, and a library of front-end visualization components. We demonstrate the toolkit’s flexibility and utility by implementing live demos for five interpretation methods (e.g., saliency maps and adversarial attacks) on a variety of models and tasks (e.g., masked language modeling using BERT and reading comprehension using BiDAF). These demos, alongside our code and tutorials, are available at https://allennlp.org/interpret. 
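The input-gradient primitive described above is, in essence, a few lines of PyTorch. The sketch below is an approximation under stated assumptions, not the AllenNLP Interpret API; in particular the model call signature is hypothetical.

```python
import torch

def input_gradient_saliency(model, token_ids: torch.Tensor,
                            embedding_layer: torch.nn.Embedding) -> torch.Tensor:
    """Gradient-times-input saliency: one score per token, computed from the
    gradient of the top predicted class score w.r.t. the token embeddings."""
    embeds = embedding_layer(token_ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds)        # hypothetical call signature
    top_class = logits.argmax(dim=-1)[0]
    logits[0, top_class].backward()
    saliency = (embeds.grad * embeds).sum(dim=-1).abs()
    return saliency / saliency.sum()            # normalized for visualization
```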
  4. Purpose: To determine whether saliency maps in radiology artificial intelligence (AI) are vulnerable to subtle perturbations of the input, which could potentially lead to misleading interpretations, using the Prediction-Saliency Correlation (PSC) to evaluate the sensitivity and robustness of saliency methods. Materials and Methods: In this retrospective study, locally trained deep learning models and a research prototype provided by a commercial vendor were systematically evaluated on 191,229 chest radiographs from the CheXpert dataset (1,2) and 7,022 MRI images from a human brain tumor classification dataset (3). Two radiologists performed a reader study on 270 chest radiograph pairs. A model-agnostic approach for computing the PSC coefficient was used to evaluate the sensitivity and robustness of seven commonly used saliency methods. Results: Leveraging locally trained model parameters, we revealed the saliency methods’ low sensitivity (maximum PSC = 0.25, 95% CI: 0.12, 0.38) and weak robustness (maximum PSC = 0.12, 95% CI: 0.0, 0.25) on the CheXpert dataset. Without model specifics, we also showed that the saliency maps from a commercial prototype could be irrelevant to the model output (the area under the receiver operating characteristic curve dropped by 8.6% without affecting the saliency map). The human observer studies confirmed that it is difficult for experts to identify the perturbed images; the readers were correct less than 44.8% of the time. Conclusion: Popular saliency methods scored low PSC values on the two datasets of perturbed chest radiographs, indicating weak sensitivity and robustness. The proposed PSC metric provides a valuable quantification tool for validating the trustworthiness of medical AI explainability. Abbreviations: AI = artificial intelligence, PSC = prediction-saliency correlation, AUC = area under the receiver operating characteristic curve, SSIM = structural similarity index measure. Summary: Systematic evaluation of saliency methods through subtle perturbations in chest radiographs and brain MRI images demonstrated low sensitivity and robustness of those methods, warranting caution when using saliency methods that may misrepresent changes in AI model prediction.
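One plausible reading of a prediction-saliency correlation, offered here as an assumption since the paper defines the exact formulation, is to correlate how much the prediction changes with how much the saliency map changes across a set of perturbed inputs.

```python
import numpy as np
from scipy.stats import pearsonr

def psc_coefficient(prediction_changes: np.ndarray,
                    saliency_changes: np.ndarray) -> float:
    """Correlate |change in predicted probability| with a saliency-map change
    measure (e.g., 1 - SSIM between original and perturbed maps) across a set
    of perturbed inputs; a faithful saliency method should score high."""
    r, _ = pearsonr(prediction_changes, saliency_changes)
    return float(r)

# Usage sketch (arrays are placeholders):
# psc = psc_coefficient(np.abs(p_perturbed - p_original), 1.0 - ssim_scores)
```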
  5. Neural networks (NNs) enable precise modeling of complicated geophysical phenomena but can be sensitive to small input changes. In this work, we present a new method for analyzing this instability in NNs. We focus our analysis on adversarial examples, test-time inputs with carefully crafted human-imperceptible perturbations that expose the worst-case instability in a model's predictions. Our stability analysis is based on a low-rank expansion of NNs at a fixed input, and we apply our analysis to an NN model for tsunami early warning that takes geodetic measurements as input and forecasts tsunami waveforms. The result is an improved description of local stability that explains adversarial examples generated by a standard gradient-based algorithm and allows the generation of other comparable examples. Our analysis can predict whether noise in the geodetic input will produce an unstable output, and it identifies a potential approach to filtering the input that enables more robust forecasting.
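Assuming the low-rank expansion amounts to analyzing the local Jacobian (the paper's exact construction may differ), the worst-case input directions can be sketched as the leading singular vectors of that Jacobian.

```python
import torch

def worst_case_directions(model, x: torch.Tensor, rank: int = 1):
    """Linearize the network at a fixed input via its Jacobian and return the
    leading singular values and right singular vectors, i.e. the input
    directions that most amplify changes in the output."""
    jacobian = torch.autograd.functional.jacobian(model, x)
    jacobian = jacobian.reshape(-1, x.numel())            # (out_dim, in_dim)
    _, singular_values, vh = torch.linalg.svd(jacobian, full_matrices=False)
    directions = vh[:rank].reshape(rank, *x.shape)
    return singular_values[:rank], directions
```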