NSF PAR Search | NSF Public Access Repository

Chameleon: Foundation Models for Fairness-Aware Multi-Modal Data Augmentation to Enhance Coverage of Minorities

https://doi.org/10.14778/3681954.3682014

Erfanian, Mahdi; Jagadish, H V; Asudeh, Abolfazl (July 2024, Proceedings of the VLDB Endowment)

Potential harms from the under-representation of minorities in data, particularly in multi-modal settings, are a well-recognized concern. While there has been extensive effort in detecting such under-representation, resolution has remained a challenge. With recent generative AI advancements, large language and foundation models have emerged as versatile tools across various domains. In this paper, we propose Chameleon, a system that efficiently utilizes these tools to augment a dataset with minimal addition of synthetically generated tuples to enhance the coverage of the under-represented groups. Our system applies quality and outlier-detection tests to ensure the quality and semantic integrity of the generated tuples. In order to minimize the rejection chance of the generated tuples, we propose multiple strategies to provide a guide for the foundation model. Our experimental results, in addition to confirming the efficiency of our proposed algorithms, illustrate our approach's effectiveness, as the model's unfairness in a downstream task significantly dropped after data repair using Chameleon.
Full Text Available
CohortNet: Empowering Cohort Discovery for Interpretable Healthcare Analytics

https://doi.org/10.14778/3675034.3675041

Cai, Qingpeng; Zheng, Kaiping; Jagadish, H V; Ooi, Beng Chin; Yip, James (June 2024, Proceedings of the VLDB Endowment)

Cohort studies are of significant importance in the field of healthcare analytics. However, existing methods typically involve manual, labor-intensive, and expert-driven pattern definitions or rely on simplistic clustering techniques that lack medical relevance. Automating cohort studies with interpretable patterns has great potential to facilitate healthcare analytics and data management but remains an unmet need in prior research efforts. In this paper, we present a cohort auto-discovery framework for interpretable healthcare analytics. It focuses on the effective identification, representation, and exploitation of cohorts characterized by medically meaningful patterns. In the framework, we propose CohortNet, a core model that can learn fine-grained patient representations by separately processing each feature, considering both individual feature trends and feature interactions at each time step. Subsequently, it employs K-Means in an adaptive manner to classify each feature into distinct states and a heuristic cohort exploration strategy to effectively discover substantial cohorts with concrete patterns. For each identified cohort, it learns comprehensive cohort representations with credible evidence through associated patient retrieval. Ultimately, given a new patient, CohortNet can leverage relevant cohorts with distinguished importance which can provide a more holistic understanding of the patient's conditions. Extensive experiments on three real-world datasets demonstrate that it consistently outperforms state-of-the-art approaches, resulting in improvements in AUC-PR scores ranging from 2.8% to 4.1%, and offers interpretable insights from diverse perspectives in a top-down fashion.
Full Text Available
ARTS: A System for Aggregate Related Table Search

https://doi.org/10.1109/ICDE60146.2024.00428

Xing, Junjie; Jagadish, H V (May 2024, IEEE)

Existing table search techniques define table relatedness with unionability and/or joinability. While these are valuable, they do not suffice for most data analysis tasks that involve numerical data, which is often aggregated over geographical, temporal, or other groups. In this demonstration, we showcase ARTS, a novel table search system centered on the unique concept of aggregate relatedness. By leveraging pre-trained language models, ARTS offers a superior column semantics understanding capability, with good labels created for both textual and numerical columns. This demonstration will offer attendees hands-on interaction with our system, revealing its potential in effectively addressing real-world data analysis challenges.
Full Text Available
Reverse Regret Query

https://doi.org/10.1109/ICDE60146.2024.00314

Wang, Weicheng; Wong, Raymond Chi-Wing; Jagadish, H V; Xie, Min (May 2024, IEEE)

Full Text Available
Mitigating Subgroup Unfairness in Machine Learning Classifiers: A Data-Driven Approach

https://doi.org/10.1109/ICDE60146.2024.00171

Lin, Yin; Gupta, Samika; Jagadish, H V (May 2024, IEEE)

Full Text Available
Data-Driven Insight Synthesis for Multi-Dimensional Data

https://doi.org/10.14778/3641204.3641211

Xing, Junjie; Wang, Xinyu; Jagadish, H V (January 2024, Proceedings of the VLDB Endowment)

Exploratory data analysis can uncover interesting insights from data. Current methods utilize interestingness measures designed based on system designers' perspectives, thus inherently restricting the insights to their defined scope. These systems, consequently, may not adequately represent a broader range of user interests. Furthermore, most existing approaches that formulate interestingness measures are rule-based, which makes them brittle and often requires holistic re-design when new user needs are discovered. This paper presents a data-driven technique for deriving an interestingness measure that learns from annotated data. We further develop an innovative annotation algorithm that significantly reduces the annotation cost, and an insight synthesis algorithm based on the Markov Chain Monte Carlo method for efficient discovery of interesting insights. We consolidate these ideas into a system, DAISY. Our experimental outcomes and user studies demonstrate that DAISY can effectively discover a broad range of interesting insights, thereby substantially advancing the current state-of-the-art.
Full Text Available
Representation Bias in Data: A Survey on Identification and Resolution Techniques

https://doi.org/10.1145/3588433

Shahbazi, Nima; Lin, Yin; Asudeh, Abolfazl; Jagadish, H. V. (December 2023, ACM Computing Surveys)

Data-driven algorithms are only as good as the data they work with, while datasets, especially social data, often fail to represent minorities adequately. Representation bias in data can happen due to various reasons, ranging from historical discrimination to selection and sampling biases in the data acquisition and preparation methods. Given that “bias in, bias out,” one cannot expect AI-based solutions to have equitable outcomes for societal applications without addressing issues such as representation bias. While there has been extensive study of fairness in machine learning models, including several review papers, bias in the data has been less studied. This article reviews the literature on identifying and resolving representation bias as a feature of a dataset, independent of how it is consumed later. The scope of this survey is limited to structured (tabular) and unstructured (e.g., image, text, graph) data. It presents taxonomies to categorize the studied techniques based on multiple design dimensions and provides a side-by-side comparison of their properties. There is still a long way to go to fully address representation bias issues in data. The authors hope that this survey motivates researchers to approach these challenges in the future by observing existing work within their respective domains.
Full Text Available
Query Refinement for Diversity Constraint Satisfaction

https://doi.org/10.14778/3626292.3626295

Li, Jinyang; Moskovitch, Yuval; Stoyanovich, Julia; Jagadish, H. V. (October 2023, Proceedings of the VLDB Endowment)

Diversity, group representation, and similar needs often apply to query results, which in turn require constraints on the sizes of various subgroups in the result set. Traditional relational queries only specify conditions as part of the query predicate(s), and do not support such restrictions on the output. In this paper, we study the problem of modifying queries to have the result satisfy constraints on the sizes of multiple subgroups in it. This problem, in the worst case, cannot be solved in polynomial time. Yet, with the help of provenance annotation, we are able to develop a query refinement method that works quite efficiently, as we demonstrate through extensive experiments.
Full Text Available
Erica: Query Refinement for Diversity Constraint Satisfaction

https://doi.org/10.14778/3611540.3611623

Li, Jinyang; Silberstein, Alon; Moskovitch, Yuval; Stoyanovich, Julia; Jagadish, H. V. (August 2023, Proceedings of the VLDB Endowment)

Relational queries are commonly used to support decision making in critical domains like hiring and college admissions. For example, a college admissions officer may need to select a subset of the applicants for in-person interviews, who individually meet the qualification requirements (e.g., have a sufficiently high GPA) and are collectively demographically diverse (e.g., include a sufficient number of candidates of each gender and of each race). However, traditional relational queries only support selection conditions checked against each input tuple, and they do not support diversity conditions checked against multiple, possibly overlapping, groups of output tuples. To address this shortcoming, we present Erica, an interactive system that proposes minimal modifications for selection queries to have them satisfy constraints on the cardinalities of multiple groups in the result. We demonstrate the effectiveness of Erica using several real-life datasets and diversity requirements.
Full Text Available
Dexer: Detecting and Explaining Biased Representation in Ranking

https://doi.org/10.1145/3555041.3589725

Moskovitch, Yuval; Li, Jinyang; Jagadish, H. V. (June 2023, ACM)

Full Text Available
