skip to main content

This content will become publicly available on June 27, 2024

Title: SimFair: A Unified Framework for Fairness-Aware Multi-Label Classification

Recent years have witnessed increasing concerns towards unfair decisions made by machine learning algorithms. To improve fairness in model decisions, various fairness notions have been proposed and many fairness-aware methods are developed. However, most of existing definitions and methods focus only on single-label classification. Fairness for multi-label classification, where each instance is associated with more than one labels, is still yet to establish. To fill this gap, we study fairness-aware multi-label classification in this paper. We start by extending Demographic Parity (DP) and Equalized Opportunity (EOp), two popular fairness notions, to multi-label classification scenarios. Through a systematic study, we show that on multi-label data, because of unevenly distributed labels, EOp usually fails to construct a reliable estimate on labels with few instances. We then propose a new framework named Similarity s-induced Fairness (sγ -SimFair). This new framework utilizes data that have similar labels when estimating fairness on a particular label group for better stability, and can unify DP and EOp. Theoretical analysis and experimental results on real-world datasets together demonstrate the advantage of sγ -SimFair over existing methods on multi-label classification tasks.

more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Date Published:
Journal Name:
Proceedings of the AAAI Conference on Artificial Intelligence
Page Range / eLocation ID:
14338 to 14346
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. With the widespread use of machine learning systems in our daily lives, it is important to consider fairness as a basic requirement when designing these systems, especially when the systems make life-changing decisions, e.g., \textit{COMPAS} algorithm helps judges decide whether to release an offender. For another thing, due to the cheap but imperfect data collection methods, such as crowdsourcing and web crawling, label noise is ubiquitous, which unfortunately makes fairness-aware algorithms even more prejudiced than fairness-unaware ones, and thereby harmful. To tackle these problems, we provide general frameworks for learning fair classifiers with \textit{instance-dependent label noise}. For statistical fairness notions, we rewrite the classification risk and the fairness metric in terms of noisy data and thereby build robust classifiers. For the causality-based fairness notion, we exploit the internal causal structure of data to model the label noise and \textit{counterfactual fairness} simultaneously. Experimental results demonstrate the effectiveness of the proposed methods on real-world datasets with controllable synthetic label noise. 
    more » « less
  2. Machine learning models are increasingly used in high-stakes decision-making systems. In such applications, a major concern is that these models sometimes discriminate against certain demographic groups such as individuals with certain race, gender, or age. Another major concern in these applications is the violation of the privacy of users. While fair learning algorithms have been developed to mitigate discrimination issues, these algorithms can still leak sensitive information, such as individuals’ health or financial records. Utilizing the notion of differential privacy (DP), prior works aimed at developing learning algorithms that are both private and fair. However, existing algorithms for DP fair learning are either not guaranteed to converge or require full batch of data in each iteration of the algorithm to converge. In this paper, we provide the first stochastic differentially private algorithm for fair learning that is guaranteed to converge. Here, the term “stochastic" refers to the fact that our proposed algorithm converges even when minibatches of data are used at each iteration (i.e. stochastic optimization). Our framework is flexible enough to permit different fairness notions, including demographic parity and equalized odds. In addition, our algorithm can be applied to non-binary classification tasks with multiple (non-binary) sensitive attributes. As a byproduct of our convergence analysis, we provide the first utility guarantee for a DP algorithm for solving nonconvex-strongly concave min-max problems. Our numerical experiments show that the proposed algorithm consistently offers significant performance gains over the state-of-the-art baselines, and can be applied to larger scale problems with non-binary target/sensitive attributes. 
    more » « less
  3. Abstract

    Single-cell technologies characterize complex cell populations across multiple data modalities at unprecedented scale and resolution. Multi-omic data for single cell gene expression, in situ hybridization, or single cell chromatin states are increasingly available across diverse tissue types. When isolating specific cell types from a sample of disassociated cells or performing in situ sequencing in collections of heterogeneous cells, one challenging task is to select a small set of informative markers that robustly enable the identification and discrimination of specific cell types or cell states as precisely as possible. Given single cell RNA-seq data and a set of cellular labels to discriminate, scGeneFit selects gene markers that jointly optimize cell label recovery using label-aware compressive classification methods. This results in a substantially more robust and less redundant set of markers than existing methods, most of which identify markers that separate each cell label from the rest. When applied to a data set given a hierarchy of cell types as labels, the markers found by our method improves the recovery of the cell type hierarchy with fewer markers than existing methods using a computationally efficient and principled optimization.

    more » « less
  4. null (Ed.)
    Multi-label text classification refers to the problem of assigning each given document its most relevant labels from a label set. Commonly, the metadata of the given documents and the hierarchy of the labels are available in real-world applications. However, most existing studies focus on only modeling the text information, with a few attempts to utilize either metadata or hierarchy signals, but not both of them. In this paper, we bridge the gap by formalizing the problem of metadata-aware text classification in a large label hierarchy (e.g., with tens of thousands of labels). To address this problem, we present the MATCH solution—an end-to-end framework that leverages both metadata and hierarchy information. To incorporate metadata, we pre-train the embeddings of text and metadata in the same space and also leverage the fully-connected attentions to capture the interrelations between them. To leverage the label hierarchy, we propose different ways to regularize the parameters and output probability of each child label by its parents. Extensive experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH over the state-of-the-art deep learning baselines. 
    more » « less
  5. Recurrent Classifier Chains (RCCs) are a leading approach for multi-label classification as they directly model the interdependencies between classes. Unfortunately, existing RCCs assume that every training instance is completely labeled with all its ground truth classes. In practice often only a subset of an instance's labels are annotated, while the annotations for other classes are missing. RCCs fail in this missing label scenario, predicting many false negatives and potentially missing important classes. In this work, we propose Robust-RCC, the first strategy for tackling this open problem of RCCs failing for multi-label missing-label data. Robust-RCC is a new type of deep recurrent classifier chain empowered to model inter-class relationships essential for predicting the complete label set most likely to match the ground truth. The key to Robust-RCC is the design of the Multi Incomplete Label Risk (MILR) function, which we prove to be equal in expectation to the true risk of the ground truth full label set despite being computed from incompletely labeled data. Our experimental study demonstrates that Robust-RCC consistently beats six state-of-of-the-art methods by as much as 30% in predicting the true labels. 
    more » « less