Introduction
Machine learning (ML) algorithms have been heralded as promising solutions for assistive systems in digital healthcare, owing to their ability to detect fine-grained patterns that humans do not easily perceive. Yet ML algorithms have also been critiqued for treating individuals differently based on their demographics, thereby propagating existing disparities. This paper explores gender and race bias in speech-based ML algorithms that detect behavioral and mental health outcomes.

Methods
The paper examines potential sources of bias in the data used to train the ML models, encompassing the acoustic features extracted from speech signals and the associated labels, as well as bias in the ML decisions themselves. It further examines two approaches to reducing existing bias: using the features that are least informative of an individual's demographic information as the ML input, and adversarially transforming the feature space to diminish evidence of demographic information while retaining information about the behavioral or mental health state of interest.

Results
Results are presented in two domains: gender and race bias when estimating levels of anxiety, and gender bias when detecting depression. Findings indicate statistically significant differences in both acoustic features and labels among demographic groups, as well as differential ML performance across groups. The statistically significant differences present in the label space are partially preserved in the ML decisions. Although variations in ML performance across demographic groups were noted, results are mixed regarding the models' ability to accurately estimate healthcare outcomes for the sensitive groups.

Discussion
These findings underscore the need for careful, thoughtful design of ML models that preserve crucial aspects of the data and perform effectively across all populations in digital healthcare applications.
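As a concrete illustration of the first mitigation strategy mentioned above (using the features least informative of demographics as the ML input), the following minimal Python sketch ranks acoustic features by their estimated mutual information with a protected attribute and trains the health classifier on the least informative subset. The data, feature count, and model choice are placeholders, not the paper's actual pipeline.

```python
# Illustrative sketch: keep the acoustic features that carry the least
# information about a demographic attribute, then train the health model
# on that reduced feature set. All data here are synthetic placeholders.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))          # acoustic feature vectors (placeholder)
gender = rng.integers(0, 2, size=500)   # protected attribute (placeholder)
y = rng.integers(0, 2, size=500)        # health label, e.g., high vs. low anxiety (placeholder)

# Estimate how informative each feature is of the protected attribute.
demo_info = mutual_info_classif(X, gender, random_state=0)

# Keep the k features that reveal the least demographic information.
k = 20
least_informative = np.argsort(demo_info)[:k]

X_train, X_test, y_train, y_test = train_test_split(
    X[:, least_informative], y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```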
JupyterLab in Retrograde: Contextual Notifications That Highlight Fairness and Bias Issues for Data Scientists
Current algorithmic fairness tools focus on auditing completed models, neglecting the potential downstream impacts of iterative decisions about cleaning data and training machine learning models. In response, we developed Retrograde, a JupyterLab environment extension for Python that generates real-time, contextual notifications for data scientists about decisions they are making regarding protected classes, proxy variables, missing data, and demographic differences in model performance. Our novel framework uses automated code analysis to trace data provenance in JupyterLab, enabling these notifications. In a between-subjects online experiment, 51 data scientists constructed loan-decision models with Retrograde providing notifications continuously throughout the process, only at the end, or never. Retrograde’s notifications successfully nudged participants to account for missing data, avoid using protected classes as predictors, minimize demographic differences in model performance, and exhibit healthy skepticism about their models.
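To give a flavor of the kind of check that could back such a notification, here is a small, hypothetical Python sketch that inspects a predictor table for protected-class columns and missing data before model training. It is not Retrograde's implementation or API; the column names and protected-class list are illustrative assumptions.

```python
# Hypothetical illustration of a "protected class used as predictor" and
# "missing data" check, in the spirit of the notifications described above.
import warnings
import pandas as pd

PROTECTED_COLUMNS = {"race", "gender", "sex", "age", "religion", "national_origin"}

def check_predictors(features: pd.DataFrame) -> None:
    """Warn if predictors include protected classes or contain missing data."""
    flagged = {c for c in features.columns if c.lower() in PROTECTED_COLUMNS}
    if flagged:
        warnings.warn(
            f"Predictors include protected classes: {sorted(flagged)}. "
            "Consider removing them and checking for proxy variables.")
    missing = features.isna().mean()
    sparse = missing[missing > 0].round(3).to_dict()
    if sparse:
        warnings.warn(f"Columns with missing data (fraction missing): {sparse}")

# Example usage on a toy loan dataset.
df = pd.DataFrame({"income": [50, 60, None], "gender": ["f", "m", "f"], "loan": [1, 0, 1]})
check_predictors(df.drop(columns=["loan"]))
```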
- PAR ID: 10493968
- Publisher / Repository: Proceedings of the CHI Conference on Human Factors in Computing Systems
- Sponsoring Org: National Science Foundation
More Like this
-
A common challenge in developmental research is the amount of incomplete and missing data that arises when respondents fail to complete tasks or questionnaires, as well as when they disengage from the study (i.e., attrition). This missingness can lead to biases in parameter estimates and, hence, in the interpretation of findings. These biases can be addressed through statistical techniques that adjust for missing data, such as multiple imputation. Although multiple imputation is highly effective, it has not been widely adopted by developmental scientists, given barriers such as lack of training or misconceptions about imputation methods. Relying on default methods within statistical software, such as listwise deletion, is common but may introduce additional bias. This manuscript provides practical guidelines for developmental researchers to follow when examining their data for missingness, deciding how to handle that missingness, and reporting the extent of missing-data biases and the specific multiple imputation procedures in publications. (A minimal multiple-imputation sketch in Python appears after this list.)
-
The database community has largely focused on providing improved transaction management and query capabilities over records (and generalizations thereof). Yet such capabilities address only a small part of today's data science tasks, which are often much more focused on discovery, linking, comparative analysis, and collaboration across holistic datasets and data products. Data scientists frequently point to a strong need for data management with respect to their many datasets and data products. We propose the development of the dataset relationship management system (DRMS) to support five main classes of operations on datasets: reuse of schema, data, curation, and work across many datasets; revelation of provenance, context, and assumptions; rapid revision of data and processing steps; system-assisted retargeting of computation to alternative execution environments; and metrics to reward individuals' contributions to the broader data ecosystem. We argue that the recent adoption of computational notebooks (particularly JupyterLab and Jupyter Notebook) as a unified interface over data tools provides an ideal way of gathering detailed information about how data is being used, i.e., of transparently capturing dataset provenance and relationships, and thus such notebooks provide an attractive mechanism for integrating dataset relationship management into the data science ecosystem. We briefly outline our experiences in building towards JuNEAU, the first prototype DRMS.
-
A machine learning model may exhibit discrimination when used to make decisions involving people. One potential cause of such outcomes is that the model uses a statistical proxy for a protected demographic attribute. In this paper we formulate a definition of proxy use for the setting of linear regression and present algorithms for detecting proxies. Our definition follows recent work on proxies in classification models and characterizes a model's constituent behavior that: 1) correlates closely with a protected random variable, and 2) is causally influential in the overall behavior of the model. We show that proxies in linear regression models can be efficiently identified by solving a second-order cone program, and we further extend this result to account for situations where the use of a certain input variable is justified as a "business necessity". Finally, we present empirical results on two law enforcement datasets that exhibit varying degrees of racial disparity in prediction outcomes, demonstrating that proxies shed useful light on the causes of discriminatory behavior in models. (A toy sketch of this two-part proxy criterion appears after this list.)
-
This study examines the correlation of physics conceptual inventory pretest scores with post-instruction achievement measures (post-test scores, test averages, and course grades). The correlations for demographic groups in the minority in the physics classes studied (women, students from underrepresented racial/ethnic groups, first-generation college students, and rural students) were compared with those of their majority peers. Three conceptual inventories were examined: the Force and Motion Conceptual Evaluation (FMCE) (N = 2450), the Force Concept Inventory (FCI) (N = 2373), and the Conceptual Survey of Electromagnetism (CSEM) (N1 = 1796, N2 = 2537). While many of the correlations were similar, for some demographic groups the correlations were substantially different, and there was little consistency in the differences measured. In most cases where the correlations differed, the correlation for the group in the minority was the smaller of the two. As such, pretest scores may not predict course performance for some minority demographic groups as accurately as they predict outcomes for majority students. The pattern of correlation differences did not appear to be related to the size of the pretest score. If pretest scores are used for instructional decisions that have academic consequences, instructors should be aware of these potential inaccuracies and ensure the pretest used is equally valid for all students. (A short sketch of this group-wise correlation comparison appears after this list.)
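Following up on the multiple-imputation item above, here is a hedged Python sketch of the basic workflow: impute the dataset several times, fit the same model to each completed dataset, and pool the estimates. The data, column names, and the simplified pooling (a plain average rather than full Rubin's rules) are illustrative assumptions, not procedures from that paper.

```python
# Sketch of multiple imputation: create several completed datasets, refit the
# same model on each, and average the coefficient estimates (simplified pooling).
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["age", "vocab_t1", "vocab_t2"])
df.loc[rng.random(200) < 0.25, "vocab_t2"] = np.nan   # simulate attrition at time 2

coefs = []
for m in range(5):  # five imputed datasets
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    model = LinearRegression().fit(completed[["age", "vocab_t1"]], completed["vocab_t2"])
    coefs.append(model.coef_)

print("pooled coefficients:", np.mean(coefs, axis=0))
```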
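For the proxy-detection item above, the following toy sketch conveys the two-part criterion (association with a protected attribute plus influence on the model's output) applied per input variable of a fitted linear regression. The paper's actual method solves a second-order cone program over model components; this simplified per-variable version, with made-up variable names and data, is only meant to convey the idea.

```python
# Toy proxy check: for each input of a linear model, measure (1) how strongly its
# additive contribution correlates with the protected attribute and (2) how much
# of the model's output variance that contribution explains.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 1000
race = rng.integers(0, 2, size=n)                     # protected attribute (synthetic)
zip_density = race + rng.normal(scale=0.5, size=n)    # proxy-like variable (synthetic)
income = rng.normal(size=n)                           # legitimate variable (synthetic)
X = np.column_stack([zip_density, income])
y = 0.8 * zip_density + 1.2 * income + rng.normal(scale=0.1, size=n)

model = LinearRegression().fit(X, y)
for name, j in [("zip_density", 0), ("income", 1)]:
    contribution = model.coef_[j] * X[:, j]                       # this variable's additive term
    assoc = abs(np.corrcoef(contribution, race)[0, 1])            # association with protected attribute
    influence = np.var(contribution) / np.var(model.predict(X))   # share of output variance
    print(f"{name}: association={assoc:.2f}, influence={influence:.2f}")
```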
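For the conceptual-inventory item above, this sketch shows one way to compare pretest/post-test correlations between a minority and a majority group using a Fisher z test. The data are simulated placeholders, not the FMCE, FCI, or CSEM samples, and the test choice is an assumption for illustration.

```python
# Compare Pearson correlations between two demographic groups via Fisher z.
import numpy as np
from scipy import stats

def compare_correlations(x1, y1, x2, y2):
    r1 = np.corrcoef(x1, y1)[0, 1]
    r2 = np.corrcoef(x2, y2)[0, 1]
    z1, z2 = np.arctanh(r1), np.arctanh(r2)          # Fisher z transform
    se = np.sqrt(1 / (len(x1) - 3) + 1 / (len(x2) - 3))
    z = (z1 - z2) / se
    p = 2 * stats.norm.sf(abs(z))                    # two-sided p-value
    return r1, r2, p

rng = np.random.default_rng(3)
pre_a = rng.normal(size=300); post_a = 0.6 * pre_a + rng.normal(size=300)   # majority group
pre_b = rng.normal(size=80);  post_b = 0.3 * pre_b + rng.normal(size=80)    # minority group
print(compare_correlations(pre_a, post_a, pre_b, post_b))
```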