NSF PAR Search | NSF Public Access Repository

Fairness Without Harm: An Influence-Guided Active Sampling Approach

Pang, Jinlong; Wang, Jialu; Zhu, Zhaowei; Yao, Yuanshun; Qian, Chen; Liu, Yang (December 2024, Neural Information Processing Systems (NeurIPS) 2024)

Full Text Available
Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models

Zhu, Zhaowei; Wang, Jialu; Cheng, Hao; Liu, Yang (May 2024, International Conference on Learning Representations)

Language models have shown promise in various tasks but can be affected by undesired data during training, fine-tuning, or alignment. For example, if some unsafe conversations are wrongly annotated as safe ones, the model fine-tuned on these samples may be harmful. Therefore, the correctness of annotations, i.e., the credibility of the dataset, is important. This study focuses on the credibility of real-world datasets, including the popular benchmarks Jigsaw Civil Comments, Anthropic Harmless & Red Team, PKU BeaverTails & SafeRLHF, that can be used for training a harmless language model. Given the cost and difficulty of cleaning these datasets by humans, we introduce a systematic framework for evaluating the credibility of datasets, identifying label errors, and evaluating the influence of noisy labels in the curated language data, specifically focusing on unsafe comments and conversation classification. With the framework, we find and fix an average of 6.16% label errors in 11 datasets constructed from the above benchmarks. The data credibility and downstream learning performance can be remarkably improved by directly fixing label errors, indicating the significance of cleaning existing real-world datasets.
Full Text Available
Procedural Fairness Through Decoupling Objectionable Data Generating Components

Tang, Zeyu; Wang, Jialu; Liu, Yang; Spirtes, Peter; Zhang, Kun (May 2024, International Conference on Learning Representations (ICLR) 2024)

We reveal and address the frequently overlooked yet important issue of disguised procedural unfairness, namely, the potentially inadvertent alterations to the behavior of neutral (i.e., not problematic) aspects of the data generating process, and/or the lack of procedural assurance of the greatest benefit of the least advantaged individuals. Inspired by John Rawls's advocacy for pure procedural justice, we view automated decision-making as a microcosm of social institutions, and consider how the data generating process itself can satisfy the requirements of procedural fairness. We propose a framework that decouples the objectionable data generating components from the neutral ones by utilizing reference points and the associated value instantiation rule. Our findings highlight the necessity of preventing disguised procedural unfairness, drawing attention not only to the objectionable data generating components that we aim to mitigate, but also, more importantly, to the neutral components that we intend to keep unaffected.
Full Text Available
Learning and optimization under epistemic uncertainty with Bayesian hybrid models

https://doi.org/10.1016/j.compchemeng.2023.108430

Eugene, Elvis A.; Jones, Kyla D.; Gao, Xian; Wang, Jialu; Dowling, Alexander W. (November 2023, Computers & Chemical Engineering)

Hybrid (i.e., grey-box) models are a powerful and flexible paradigm for predictive science and engineering. Grey-box models use data-driven constructs to incorporate unknown or computationally intractable phenomena into glass-box mechanistic models. The pioneering work of statisticians Kennedy and O’Hagan introduced a new paradigm to quantify epistemic (i.e., model-form) uncertainty. While popular in several engineering disciplines, prior work using Kennedy–O’Hagan hybrid models focuses on prediction with accurate uncertainty estimates. This work demonstrates computational strategies to deploy Bayesian hybrid models for optimization under uncertainty. Specifically, the posterior distributions of Bayesian hybrid models provide a principled uncertainty set for stochastic programming, chance-constrained optimization, or robust optimization. Through two illustrative case studies, we demonstrate the efficacy of hybrid models, composed of a structurally inadequate glass-box model and Gaussian process bias correction term, for decision-making using limited training data. From these case studies, we develop recommended best practices and explore the trade-offs between different hybrid model architectures.
Full Text Available
Data science for thermodynamic modeling: Case study for ionic liquid and hydrofluorocarbon refrigerant mixtures

https://doi.org/10.1016/j.fluid.2023.113833

Befort, Bridgette J.; Garciadiego, Alejandro; Wang, Jialu; Wang, Ke; Franco, Gabriela; Maginn, Edward J.; Dowling, Alexander W. (September 2023, Fluid Phase Equilibria)

Full Text Available
Fairness Transferability Subject to Bounded Distribution Shift

Chen, Yatong; Raab, Reilly; Wang, Jialu; Liu, Yang (December 2022, Neural Information Processing Systems (NeurIPS) 2022)

Full Text Available
Can Less be More? When Increasing-to-Balancing Label Noise Rates Considered Beneficial

Liu, Yang; Wang, Jialu (December 2021, 35th Conference on Neural Information Processing Systems (NeurIPS 2021))

Full Text Available
Understanding Instance-Level Impact of Fairness Constraints

Wang, Jialu; Wang, Xin; Liu, Yang (January 2022, International Conference on Machine Learning (ICML))

A variety of fairness constraints have been proposed in the literature to mitigate group-level statistical bias. Their impacts have been largely evaluated for different groups of populations corresponding to a set of sensitive attributes, such as race or gender. Nonetheless, the community has not sufficiently explored how imposing fairness constraints fares at an instance level. Building on the concept of the influence function, a measure that characterizes the impact of a training example on the target model and its predictive performance, this work studies the influence of training examples when fairness constraints are imposed. We find that, under certain assumptions, the influence function with respect to fairness constraints can be decomposed into a kernelized combination of training examples. One promising application of the proposed fairness influence function is to identify suspicious training examples that may cause model discrimination by ranking their influence scores. We demonstrate with extensive experiments that training on a subset of weighty data examples leads to lower fairness violations with a trade-off in accuracy.
Full Text Available
Beyond Images: Label Noise Transition Matrix Estimation for Tasks with Lower-Quality Features

Zhu, Zhaowei; Wang, Jialu; Liu, Yang (January 2022, International Conference on Machine Learning)

The label noise transition matrix, denoting the transition probabilities from clean labels to noisy labels, is crucial for designing statistically robust solutions. Existing estimators for noise transition matrices, e.g., using either anchor points or clusterability, focus on computer vision tasks, for which high-quality representations are relatively easy to obtain. We observe that tasks with lower-quality features fail to meet the anchor-point or clusterability condition, due to the coexistence of both uninformative and informative representations. To handle this issue, we propose a generic and practical information-theoretic approach to down-weight the less informative parts of the lower-quality features. This improvement is crucial to identifying and estimating the label noise transition matrix. The salient technical challenge is to compute the relevant information-theoretic metrics using only noisy labels instead of clean ones. We prove that the celebrated f-mutual information measure can often preserve the order when calculated using noisy labels. We then build our transition matrix estimator using this distilled version of features. The necessity and effectiveness of the proposed method are also demonstrated by evaluating the estimation error on a varied set of tabular data and text classification tasks with lower-quality features.
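To make the definition in the abstract above concrete, the following is a minimal sketch of what a label noise transition matrix is, not the estimator proposed in the paper. The function names `apply_transition` and `count_transition` are hypothetical, and the counting estimate assumes clean labels are available, which is exactly what the paper's setting lacks.

```python
import numpy as np

def apply_transition(clean_prior, T):
    """Return the noisy-label distribution implied by a clean-label prior.

    T[i, j] is P(noisy label = j | clean label = i), so each row sums to 1.
    """
    clean_prior = np.asarray(clean_prior, dtype=float)
    T = np.asarray(T, dtype=float)
    if not np.allclose(T.sum(axis=1), 1.0):
        raise ValueError("each row of T must sum to 1")
    return clean_prior @ T

def count_transition(clean, noisy, num_classes):
    """Empirical transition matrix from paired clean and noisy labels.

    Illustration only: it needs clean labels, which are usually unknown.
    """
    counts = np.zeros((num_classes, num_classes))
    for c, n in zip(clean, noisy):
        counts[c, n] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Leave rows for unseen clean classes as zeros instead of dividing by 0.
    return counts / np.where(row_sums == 0, 1, row_sums)

T = [[0.9, 0.1],
     [0.2, 0.8]]
noisy_dist = apply_transition([0.5, 0.5], T)
T_hat = count_transition([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Here the first row of `T` says that a clean class-0 example keeps its label with probability 0.9 and is flipped to class 1 with probability 0.1.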
Full Text Available
Fairness Transferability Subject to Bounded Distribution Shift

Chen, Yatong; Raab, Reilly; Wang, Jialu; Liu, Yang (January 2022, Neural Information Processing Systems (NeurIPS))

Given an algorithmic predictor that is "fair" on some source distribution, will it still be fair on an unknown target distribution that differs from the source within some bound? In this paper, we study the transferability of statistical group fairness for machine learning predictors (i.e., classifiers or regressors) subject to bounded distribution shifts. Such shifts may be introduced by initial training data uncertainties, user adaptation to a deployed predictor, dynamic environments, or the use of pre-trained models in new settings. Herein, we develop a bound that characterizes such transferability, flagging potentially inappropriate deployments of machine learning for socially consequential tasks. We first develop a framework for bounding violations of statistical fairness subject to distribution shift, formulating a generic upper bound for transferred fairness violations as our primary result. We then develop bounds for specific worked examples, focusing on two commonly used fairness definitions (i.e., demographic parity and equalized odds) and two classes of distribution shift (i.e., covariate shift and label shift). Finally, we compare our theoretical bounds to deterministic models of distribution shift and against real-world data, finding that we are able to estimate fairness violation bounds in practice, even when simplifying assumptions are only approximately satisfied.
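The two fairness definitions named in the abstract above can be measured on a sample as follows. This is a minimal sketch for binary predictions and two groups, not the paper's bound: the function names are hypothetical, and taking the larger of the two per-label gaps for equalized odds is one common convention assumed here, not something the abstract specifies.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between groups 0 and 1."""
    y_pred = np.asarray(y_pred, dtype=float)
    group = np.asarray(group)
    rate0 = y_pred[group == 0].mean()
    rate1 = y_pred[group == 1].mean()
    return abs(rate0 - rate1)

def equalized_odds_gap(y_true, y_pred, group):
    """Largest between-group gap in prediction rate, conditioned on the true label.

    For true label 1 this compares true positive rates; for true label 0,
    false positive rates. Assumes every (label, group) cell is non-empty.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred, dtype=float)
    group = np.asarray(group)
    gaps = []
    for label in (0, 1):
        mask = y_true == label
        rate0 = y_pred[mask & (group == 0)].mean()
        rate1 = y_pred[mask & (group == 1)].mean()
        gaps.append(abs(rate0 - rate1))
    return max(gaps)

# Toy sample: four individuals in each group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
dp_gap = demographic_parity_gap(y_pred, group)
eo_gap = equalized_odds_gap(y_true, y_pred, group)
```

Computing the same gaps on a source sample and on a shifted target sample gives the two quantities whose difference a transferability bound would constrain.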
Full Text Available