Title: Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation
Black-box risk scoring models permeate our lives, yet are typically proprietary or opaque. We propose Distill-and-Compare, an approach to audit such models without probing the black-box model API or pre-defining features to audit. To gain insight into black-box models, we treat them as teachers, training transparent student models to mimic the risk scores assigned by the black-box models. We compare the mimic model trained with distillation to a second, un-distilled transparent model trained on ground truth outcomes, and use differences between the two models to gain insight into the black-box model. We demonstrate the approach on four data sets: COMPAS, Stop-and-Frisk, Chicago Police, and Lending Club. We also propose a statistical test to determine if a data set is missing key features used to train the black-box model. Our test finds that the ProPublica data is likely missing key feature(s) used in COMPAS.
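A minimal sketch of the Distill-and-Compare idea described in the abstract, assuming preloaded NumPy arrays X (features), risk_scores (black-box outputs), and y (ground-truth outcomes); shallow scikit-learn trees stand in for the transparent model class, so this is an illustration rather than the paper's exact setup.

```python
# Minimal sketch of the Distill-and-Compare recipe (illustrative names only).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def distill_and_compare(X, risk_scores, y, max_depth=4):
    X_tr, X_te, s_tr, s_te, y_tr, y_te = train_test_split(
        X, risk_scores, y, test_size=0.3, random_state=0)

    # 1) Mimic model: transparent student distilled from black-box risk scores.
    mimic = DecisionTreeRegressor(max_depth=max_depth).fit(X_tr, s_tr)

    # 2) Un-distilled model: same transparent class, trained on ground truth.
    outcome = DecisionTreeClassifier(max_depth=max_depth).fit(X_tr, y_tr)

    # 3) Compare what the two models learned; here, crudely, via feature
    #    importances. Large gaps flag features the black box appears to
    #    weight differently than the actual outcomes warrant.
    gap = mimic.feature_importances_ - outcome.feature_importances_
    return mimic, outcome, gap
```

Auditing then amounts to inspecting where the mimic model and the outcome model disagree, feature by feature.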
Award ID(s):
1712554
PAR ID:
10298501
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society
Page Range / eLocation ID:
303 to 310
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Deep learning models have demonstrated impressive accuracy in predicting acute kidney injury (AKI), a condition affecting up to 20% of ICU patients, yet their black-box nature prevents clinical adoption in high-stakes critical care settings. While existing interpretability methods such as SHAP, LIME, and attention mechanisms can identify important features, they fail to capture the temporal dynamics essential for clinical decision-making and cannot communicate when specific risk factors become critical in a patient's trajectory. This limitation is particularly problematic in the ICU, where the timing of interventions can significantly impact patient outcomes. We present a novel interpretable framework that brings temporal awareness to deep learning predictions for AKI. Our approach introduces three key innovations: (1) a latent convolutional concept bottleneck that learns clinically meaningful patterns from ICU time series without requiring manual concept annotation, leveraging Conv1D layers to capture localized temporal patterns such as sudden physiological changes; (2) Temporal Concept Tracing (TCT), a gradient-based method that identifies not only which risk factors matter but precisely when they become critical, addressing the fundamental question of temporal relevance missing from current XAI techniques; and (3) integration with MedAlpaca to generate structured, time-aware clinical explanations that translate model insights into actionable bedside guidance. We evaluate our framework on MIMIC-IV data, demonstrating that our approach outperforms existing explainability frameworks, Occlusion and LIME, in terms of comprehensiveness score, sufficiency score, and processing time. The proposed method also better captures risk-factor inflection points in patient timelines than conventional concept bottleneck methods, including dense-layer and attention-based variants. This work represents the first comprehensive solution for interpretable temporal deep learning in critical care that addresses both the what and the when of clinical risk factors. By making AKI predictions transparent and temporally contextualized, our framework bridges the gap between model accuracy and clinical utility, offering a path toward trustworthy AI deployment in time-sensitive healthcare settings.
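As a loose illustration of item 1's latent convolutional concept bottleneck and gradient-based temporal tracing, here is a hypothetical PyTorch sketch; the layer sizes, n_concepts, and the saliency-style tracing routine are assumptions, not the authors' architecture.

```python
# Illustrative Conv1D concept bottleneck for ICU time series (assumed shapes).
import torch
import torch.nn as nn

class ConvConceptBottleneck(nn.Module):
    def __init__(self, n_features, n_concepts=8):
        super().__init__()
        # Conv1d encoder captures localized temporal patterns
        # (e.g., sudden physiological changes) in the input series.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, n_concepts, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.head = nn.Linear(n_concepts, 1)   # AKI risk from pooled concepts

    def forward(self, x):                      # x: (batch, features, time)
        concepts = self.encoder(x)             # (batch, concepts, time)
        pooled = concepts.mean(dim=-1)         # aggregate concepts over time
        return torch.sigmoid(self.head(pooled)), concepts

def temporal_trace(model, x):
    """Gradient-based sketch of when time steps matter for the risk score."""
    x = x.clone().requires_grad_(True)
    risk, _ = model(x)
    risk.sum().backward()
    # Saliency per time step: absolute gradients summed over input features.
    return x.grad.abs().sum(dim=1)             # (batch, time)
```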
  2. Motivated by the need to audit complex, black-box models, there has been extensive research on quantifying how data features influence model predictions. Feature influence can be direct (a direct influence on model outcomes) or indirect (model outcomes are influenced via proxy features). Feature influence can also be expressed in aggregate over the training or test data, or locally with respect to a single point. Current research has typically focused on only one option along each of these dimensions. In this paper, we develop disentangled influence audits, a procedure to audit the indirect influence of features. Specifically, we show that disentangled representations provide a mechanism to identify proxy features in the dataset while allowing an explicit computation of feature influence on either individual or aggregate-level outcomes. We show through both theory and experiments that disentangled influence audits can both detect proxy features and show, for each individual or in aggregate, which of these proxy features most affects the classifier being audited. In this respect, our method is more powerful than existing methods for ascertaining feature influence.
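A highly simplified sketch of the indirect-influence idea from item 2, with a plain linear regression standing in for the paper's disentangled-representation machinery; clf is assumed to be a fitted binary classifier exposing predict_proba, and all names are illustrative.

```python
# Crude stand-in for disentangled influence audits: gauge how much of a
# feature is reconstructable from the others (proxy strength) and how much
# the audited classifier's outputs move once that reconstructable part is
# removed (indirect influence). An illustration only, not the paper's method.
import numpy as np
from sklearn.linear_model import LinearRegression

def indirect_influence(clf, X, feature_idx):
    rest = np.delete(X, feature_idx, axis=1)
    target = X[:, feature_idx]

    # 1) Proxy check: can the remaining features predict this one?
    proxy_model = LinearRegression().fit(rest, target)
    proxy_strength = proxy_model.score(rest, target)  # R^2 as a rough gauge

    # 2) Indirect influence: strip the proxy-carried part of the feature and
    #    measure the shift in the audited model's predicted probabilities.
    X_cleaned = X.copy()
    X_cleaned[:, feature_idx] = target - proxy_model.predict(rest)
    shift = np.mean(np.abs(clf.predict_proba(X)[:, 1]
                           - clf.predict_proba(X_cleaned)[:, 1]))
    return proxy_strength, shift
```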
  3. With the increasing adoption of predictive models trained using machine learning across a wide range of high-stakes applications, e.g., health care, security, criminal justice, finance, and education, there is a growing need for effective techniques for explaining such models and their predictions. We aim to address this problem in settings where the predictive model is a black box; that is, we can only observe the response of the model to various inputs, but have no knowledge of the internal structure of the predictive model, its parameters, the objective function, or the algorithm used to optimize the model. We reduce the problem of interpreting a black-box predictive model to that of estimating the causal effects of each of the model inputs on the model output, from observations of the model inputs and the corresponding outputs. We estimate the causal effects of model inputs on model output using variants of the Rubin-Neyman potential outcomes framework for estimating causal effects from observational data. We show how the resulting causal attribution of responsibility for model output to the different model inputs can be used to interpret the predictive model and to explain its predictions. We present experimental results demonstrating the effectiveness of our approach to interpreting black-box predictive models via causal attribution, using deep neural network models trained on one synthetic data set (where the input variables that impact the output variable are known by design) and two real-world data sets: handwritten digit classification and Parkinson's disease severity prediction. Because our approach does not require knowledge of the predictive model's algorithm and makes no assumptions about the black-box predictive model beyond observable input-output responses, it can be applied, in principle, to any black-box predictive model.
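To make item 3's causal framing concrete, here is a toy sketch of attributing model output to a single input via an average causal effect; f is any callable black-box scoring function and the intervention values are illustrative. This captures only the potential-outcomes intuition, not the authors' estimator.

```python
# Toy causal attribution for a black-box scorer f: intervene on one input,
# holding the other observed inputs fixed, and average the change in output.
import numpy as np

def average_causal_effect(f, X, feature_idx, low, high):
    X_low, X_high = X.copy(), X.copy()
    X_low[:, feature_idx] = low      # "control" potential outcome
    X_high[:, feature_idx] = high    # "treatment" potential outcome
    return float(np.mean(f(X_high) - f(X_low)))
```

Repeating this for each input and ranking the magnitudes of the estimated effects gives a rough causal attribution of the model's output to its inputs.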