skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 10:00 PM ET on Friday, December 8 until 2:00 AM ET on Saturday, December 9 due to maintenance. We apologize for the inconvenience.

Title: On the Discrimination Risk of Mean Aggregation Feature Imputation in Graphs
In human networks, nodes belonging to a marginalized group often have a disproportionate rate of unknown or missing features. This, in conjunction with graph structure and known feature biases, can cause graph feature imputation algorithms to predict values for unknown features that make the marginalized group's feature values more distinct from the the dominant group's feature values than they are in reality. We call this distinction the discrimination risk. We prove that a higher discrimination risk can amplify the unfairness of a machine learning model applied to the imputed data. We then formalize a general graph feature imputation framework called mean aggregation imputation and theoretically and empirically characterize graphs in which applying this framework can yield feature values with a high discrimination risk. We propose a simple algorithm to ensure mean aggregation-imputed features provably have a low discrimination risk, while minimally sacrificing reconstruction error (with respect to the imputation objective). We evaluate the fairness and accuracy of our solution on synthetic and real-world credit networks.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
36th Conference on Neural Information Processing Systems (NeurIPS 2022)
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Machine learning with missing data has been approached in two different ways, including feature imputation where missing feature values are estimated based on observed values and label prediction where downstream labels are learned directly from incomplete data. However, existing imputation models tend to have strong prior assumptions and cannot learn from downstream tasks, while models targeting label prediction often involve heuristics and can encounter scalability issues. Here we propose GRAPE, a graph-based framework for feature imputation as well as label prediction. GRAPE tackles the missing data problem using a graph representation, where the observations and features are viewed as two types of nodes in a bipartite graph, and the observed feature values as edges. Under the GRAPE framework, the feature imputation is formulated as an edge-level prediction task and the label prediction as a node-level prediction task. These tasks are then solved with Graph Neural Networks. Experimental results on nine benchmark datasets show that GRAPE yields 20% lower mean absolute error for imputation tasks and 10% lower for label prediction tasks, compared with existing state-of-the-art methods. 
    more » « less
  2. Raw datasets collected for fake news detection usually contain some noise such as missing values. In order to improve the performance of machine learning based fake news detection, a novel data preprocessing method is proposed in this paper to process the missing values. Specifically, we have successfully handled the missing values problem by using data imputation for both categorical and numerical features. For categorical features, we imputed missing values with the most frequent value in the columns. For numerical features, the mean value of the column is used to impute numerical missing values. In addition, TF-IDF vectorization is applied in feature extraction to filter out irrelevant features. Experimental results show that Multi-Layer Perceptron (MLP) classifier with the proposed data preprocessing method outperforms baselines and improves the prediction accuracy by more than 15%. 
    more » « less
  3. We investigate the fairness concerns of training a machine learning model using data with missing values. Even though there are a number of fairness intervention methods in the literature, most of them require a complete training set as input. In practice, data can have missing values, and data missing patterns can depend on group attributes (e.g. gender or race). Simply applying off-the-shelf fair learning algorithms to an imputed dataset may lead to an unfair model. In this paper, we first theoretically analyze different sources of discrimination risks when training with an imputed dataset. Then, we propose an integrated approach based on decision trees that does not require a separate process of imputation and learning. Instead, we train a tree with missing incorporated as attribute (MIA), which does not require explicit imputation, and we optimize a fairness-regularized objective function. We demonstrate that our approach outperforms existing fairness intervention methods applied to an imputed dataset, through several experiments on real-world datasets. 
    more » « less
  4. null (Ed.)
    Abstract Motivation Mapping genetic interactions (GIs) can reveal important insights into cellular function and has potential translational applications. There has been great progress in developing high-throughput experimental systems for measuring GIs (e.g. with double knockouts) as well as in defining computational methods for inferring (imputing) unknown interactions. However, existing computational methods for imputation have largely been developed for and applied in baker’s yeast, even as experimental systems have begun to allow measurements in other contexts. Importantly, existing methods face a number of limitations in requiring specific side information and with respect to computational cost. Further, few have addressed how GIs can be imputed when data are scarce. Results In this article, we address these limitations by presenting a new imputation framework, called Extensible Matrix Factorization (EMF). EMF is a framework of composable models that flexibly exploit cross-species information in the form of GI data across multiple species, and arbitrary side information in the form of kernels (e.g. from protein–protein interaction networks). We perform a rigorous set of experiments on these models in matched GI datasets from baker’s and fission yeast. These include the first such experiments on genome-scale GI datasets in multiple species in the same study. We find that EMF models that exploit side and cross-species information improve imputation, especially in data-scarce settings. Further, we show that EMF outperforms the state-of-the-art deep learning method, even when using strictly less data, and incurs orders of magnitude less computational cost. Availability Implementations of models and experiments are available at: Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  5. Simsekler, Mecit Can (Ed.)
    Missing data presents a challenge for machine learning applications specifically when utilizing electronic health records to develop clinical decision support systems. The lack of these values is due in part to the complex nature of clinical data in which the content is personalized to each patient. Several methods have been developed to handle this issue, such as imputation or complete case analysis, but their limitations restrict the solidity of findings. However, recent studies have explored how using some features as fully available privileged information can increase model performance including in SVM. Building on this insight, we propose a computationally efficient kernel SVM-based framework ( l 2 -SVMp+) that leverages partially available privileged information to guide model construction. Our experiments validated the superiority of l 2 -SVMp+ over common approaches for handling missingness and previous implementations of SVMp+ in both digit recognition, disease classification and patient readmission prediction tasks. The performance improves as the percentage of available privileged information increases. Our results showcase the capability of l 2 -SVMp+ to handle incomplete but important features in real-world medical applications, surpassing traditional SVMs that lack privileged information. Additionally, l 2 -SVMp+ achieves comparable or superior model performance compared to imputed privileged features. 
    more » « less