skip to main content

Title: Explainable Models with Consistent Interpretations
Given the widespread deployment of black box deep neural networks in computer vision applications, the interpretability aspect of these black box systems has recently gained traction. Various methods have been proposed to explain the results of such deep neural networks. However, some recent works have shown that such explanation methods are biased and do not produce consistent interpretations. Hence, rather than introducing a novel explanation method, we learn models that are encouraged to be interpretable given an explanation method. We use Grad-CAM as the explanation algorithm and encourage the network to learn consistent interpretations along with maximizing the log-likelihood of the correct class. We show that our method outperforms the baseline on the pointing game evaluation on ImageNet and MS-COCO datasets respectively. We also introduce new evaluation metrics that penalize the saliency map if it lies outside the ground truth bounding box or segmentation mask, and show that our method outperforms the baseline on these metrics as well. Moreover, our model trained with interpretation consistency generalizes to other explanation algorithms on all the evaluation metrics.  more » « less
Award ID(s):
Author(s) / Creator(s):
Date Published:
Journal Name:
35th AAAI Conference on Artificial Intelligence (AAAI)
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background

    Advanced machine learning models have received wide attention in assisting medical decision making due to the greater accuracy they can achieve. However, their limited interpretability imposes barriers for practitioners to adopt them. Recent advancements in interpretable machine learning tools allow us to look inside the black box of advanced prediction methods to extract interpretable models while maintaining similar prediction accuracy, but few studies have investigated the specific hospital readmission prediction problem with this spirit.


    Our goal is to develop a machine-learning (ML) algorithm that can predict 30- and 90- day hospital readmissions as accurately as black box algorithms while providing medically interpretable insights into readmission risk factors. Leveraging a state-of-art interpretable ML model, we use a two-step Extracted Regression Tree approach to achieve this goal. In the first step, we train a black box prediction algorithm. In the second step, we extract a regression tree from the output of the black box algorithm that allows direct interpretation of medically relevant risk factors. We use data from a large teaching hospital in Asia to learn the ML model and verify our two-step approach.


    The two-step method can obtain similar prediction performance as the best black box model, such as Neural Networks, measured by three metrics: accuracy, the Area Under the Curve (AUC) and the Area Under the Precision-Recall Curve (AUPRC), while maintaining interpretability. Further, to examine whether the prediction results match the known medical insights (i.e., the model is truly interpretable and produces reasonable results), we show that key readmission risk factors extracted by the two-step approach are consistent with those found in the medical literature.


    The proposed two-step approach yields meaningful prediction results that are both accurate and interpretable. This study suggests a viable means to improve the trust of machine learning based models in clinical practice for predicting readmissions through the two-step approach.

    more » « less
  2. Testing deep neural networks (DNNs) has garnered great interest in the recent years due to their use in many applications. Black-box test adequacy measures are useful for guiding the testing process in covering the input domain. However, the absence of input specifications makes it challenging to apply black-box test adequacy measures in DNN testing. The Input Distribution Coverage (IDC) framework addresses this challenge by using a variational autoencoder to learn a low dimensional latent representation of the input distribution, and then using that latent space as a coverage domain for testing. IDC applies combinatorial interaction testing on a partitioning of the latent space to measure test adequacy. Empirical evaluation demonstrates that IDC is cost-effective, capable of detecting feature diversity in test inputs, and more sensitive than prior work to test inputs generated using different DNN test generation methods. The findings demonstrate that IDC overcomes several limitations of white-box DNN coverage approaches by discounting coverage from unrealistic inputs and enabling the calculation of test adequacy metrics that capture the feature diversity present in the input space of DNNs. 
    more » « less
  3. Neural network applications have become popular in both enterprise and personal settings. Network solutions are tuned meticulously for each task, and designs that can robustly resolve queries end up in high demand. As the commercial value of accurate and performant machine learning models increases, so too does the demand to protect neural architectures as confidential investments. We explore the vulnerability of neural networks deployed as black boxes across accelerated hardware through electromagnetic side channels. We examine the magnetic flux emanating from a graphics processing unit’s power cable, as acquired by a cheap $3 induction sensor, and find that this signal betrays the detailed topology and hyperparameters of a black-box neural network model. The attack acquires the magnetic signal for one query with unknown input values, but known input dimensions. The network reconstruction is possible due to the modular layer sequence in which deep neural networks are evaluated. We find that each layer component’s evaluation produces an identifiable magnetic signal signature, from which layer topology, width, function type, and sequence order can be inferred using a suitably trained classifier and a joint consistency optimization based on integer programming. We study the extent to which network specifications can be recovered, and consider metrics for comparing network similarity. We demonstrate the potential accuracy of this side channel attack in recovering the details for a broad range of network architectures, including random designs. We consider applications that may exploit this novel side channel exposure, such as adversarial transfer attacks. In response, we discuss countermeasures to protect against our method and other similar snooping techniques. 
    more » « less
  4. Deep neural networks are widely used for text pair classification tasks such as as adhoc retrieval. These deep neural networks are not inherently interpretable and require additional efforts to get rationale behind their decisions. Existing explanation models are not yet capable of inducing alignments between the query terms and the document terms -- which part of the document rationales are responsible for which part of the query? In this paper, we study how the input perturbations can be used to infer or evaluate alignments between the query and document spans, which best explain the black-box ranker’s relevance prediction. We use different perturbation strategies and accordingly propose a set of metrics to evaluate the faithfulness of alignment rationales to the model. Our experiments show that defined metrics based on substitution-based perturbation are more successful in preferring higher-quality alignments, compared to the deletion-based metrics. 
    more » « less
  5. Host-based Intrusion Detection Systems (HIDS) automatically detect events that indicate compromise by adversarial applications. HIDS are generally formulated as analyses of sequences of system events such as bash commands or system calls. Anomaly-based approaches to HIDS leverage models of normal (a.k.a. baseline) system behavior to detect and report abnormal events and have the advantage of being able to detect novel attacks. In this article, we develop a new method for anomaly-based HIDS using deep learning predictions of sequence-to-sequence behavior in system calls. Our proposed method, called the ALAD algorithm, aggregates predictions at the application level to detect anomalies. We investigate the use of several deep learning architectures, including WaveNet and several recurrent networks. We show that ALAD empowered with deep learning significantly outperforms previous approaches. We train and evaluate our models using an existing dataset, ADFA-LD, and a new dataset of our own construction, PLAID. As deep learning models are black box in nature, we use an alternate approach, allotaxonographs, to characterize and understand differences in baseline vs. attack sequences in HIDS datasets such as PLAID. 
    more » « less