Search for: All records

Creators/Authors contains: "Gerych, Walter"

Note: Clicking on a Digital Object Identifier (DOI) number will take you to an external site maintained by the publisher. Some full-text articles may not yet be available free of charge during the embargo period (an administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Multi-label classification (MLC), which assigns multiple labels to each instance, is crucial to domains from computer vision to text mining. Conventional methods for MLC require huge amounts of labeled data to capture complex dependencies between labels. However, such labeled datasets are expensive, or even impossible, to acquire. Worse yet, these pre-trained MLC models can only be used for the particular label set covered in the training data. Despite this severe limitation, few methods exist for expanding the set of labels predicted by pre-trained models; instead, vast amounts of new labeled data must be acquired and a new model retrained from scratch. Here, we propose combining the knowledge from multiple pre-trained models (teachers) to train a new student model that covers the union of the labels predicted by this set of teachers. This student supports a broader label set than any one of its teachers without using labeled data. We call this new problem knowledge amalgamation for multi-label classification. Our new method, Adaptive KNowledge Transfer (ANT), trains a student by learning from each teacher's partial knowledge of label dependencies to infer the global dependencies between all labels across the teachers. We show that ANT succeeds in unifying label dependencies among teachers, outperforming five state-of-the-art methods on eight real-world datasets.
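The abstract above does not detail ANT's adaptive transfer mechanism, but the basic knowledge-amalgamation setup it builds on, a student matching each frozen teacher's soft predictions on that teacher's own label subset using only unlabeled inputs, can be sketched in PyTorch. The simple per-teacher averaging and all names below are illustrative assumptions, not the authors' method.

```python
import torch
import torch.nn.functional as F

def amalgamation_loss(student_logits, teachers, teacher_label_idx, x):
    """Distill several multi-label teachers into one student (sketch).

    student_logits:    (batch, n_union_labels) raw student outputs on x.
    teachers:          list of frozen teacher models, each covering a
                       subset of the union label set.
    teacher_label_idx: list of index tensors mapping each teacher's
                       outputs into the student's union label space.
    x:                 an unlabeled input batch seen by all models.
    """
    loss = 0.0
    for teacher, idx in zip(teachers, teacher_label_idx):
        with torch.no_grad():                    # teachers stay frozen
            soft_targets = torch.sigmoid(teacher(x))
        # The student is supervised only on the labels this teacher knows.
        loss = loss + F.binary_cross_entropy_with_logits(
            student_logits[:, idx], soft_targets)
    return loss / len(teachers)
```

Because each teacher supervises only its own slice of the student's union label space, the student never needs ground-truth labels, which matches the setting the abstract describes.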
  2. Human context recognition (HCR) using sensor data is a crucial task in Context-Aware (CA) applications in domains such as healthcare and security. Supervised machine learning HCR models are trained on smartphone HCR datasets that are either scripted or gathered in-the-wild. Scripted datasets are the most accurately labeled because of their consistent visit patterns, so supervised HCR models perform well on them but poorly on realistic data. In-the-wild datasets are more realistic, but they degrade HCR model performance due to data imbalance, missing or incorrect labels, and a wide variety of phone placements and device types. Lab-to-field approaches learn a robust data representation from a scripted, high-fidelity dataset, which is then used to enhance performance on a noisy, in-the-wild dataset with similar labels. This research introduces Triplet-based Domain Adaptation for Context REcognition (Triple-DARE), a lab-to-field neural network method that combines three loss functions to enhance intra-class compactness and inter-class separation within the embedding space of multi-labeled datasets: (1) a domain-alignment loss to learn domain-invariant embeddings; (2) a classification loss to preserve task-discriminative features; and (3) a joint fusion triplet loss. Rigorous evaluations showed that Triple-DARE achieved a 6.3% higher F1-score and 4.5% higher classification accuracy than state-of-the-art HCR baselines, and outperformed non-adaptive HCR models by 44.6% and 10.7% on the same metrics.
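The abstract names Triple-DARE's three loss terms but not their exact forms or weights. A minimal sketch of combining them, with a mean-embedding distance standing in for the domain-alignment term, a standard triplet margin loss standing in for the joint fusion triplet loss, and made-up weights, might look like this:

```python
import torch
import torch.nn.functional as F

triplet = torch.nn.TripletMarginLoss(margin=1.0)

def combined_lab_to_field_loss(anchor, positive, negative,
                               class_logits, labels,
                               source_emb, target_emb,
                               w_domain=1.0, w_cls=1.0, w_trip=1.0):
    """Illustrative combination of the three loss terms named in the
    abstract; the weights and specific forms are assumptions."""
    # (1) Domain alignment: pull scripted (source) and in-the-wild
    #     (target) embedding distributions toward each other.
    l_domain = F.mse_loss(source_emb.mean(dim=0), target_emb.mean(dim=0))
    # (2) Classification loss keeps task-discriminative features
    #     (multi-label setting, hence BCE with logits).
    l_cls = F.binary_cross_entropy_with_logits(class_logits, labels)
    # (3) Triplet loss encourages intra-class compactness and
    #     inter-class separation in the embedding space.
    l_trip = triplet(anchor, positive, negative)
    return w_domain * l_domain + w_cls * l_cls + w_trip * l_trip
```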
  3. Human activity recognition (HAR) is the process of using mobile sensor data to determine the physical activities performed by individuals. HAR is the backbone of many mobile healthcare applications, such as passive health monitoring systems, early diagnosis systems, and fall detection systems. Effective HAR models rely on deep learning architectures and big data in order to accurately classify activities. Unfortunately, HAR datasets are expensive to collect, are often mislabeled, and have large class imbalances. State-of-the-art approaches to address these challenges utilize Generative Adversarial Networks (GANs) to generate additional synthetic data along with their labels. Problematically, these HAR GANs only synthesize continuous features (features represented by real numbers) recorded from gyroscopes, accelerometers, and other sensors that produce continuous data. This is limiting, since mobile sensor data commonly has discrete features that provide additional context, such as device location and time-of-day, which have been shown to substantially improve HAR classification. Hence, we studied Conditional Tabular Generative Adversarial Networks (CTGANs) for data generation to synthesize mobile sensor data containing both continuous and discrete features, a task not previously addressed by state-of-the-art approaches. We show that HAR-CTGANs generate more realistic data, enabling better downstream performance in HAR models, and that when state-of-the-art models are modified with HAR-CTGAN characteristics, downstream performance also improves.
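The HAR-CTGAN modifications themselves are not given in the abstract, but the baseline workflow the paper builds on, fitting a CTGAN to a table that mixes continuous sensor features with discrete context features, looks roughly like this with the open-source ctgan package (the file and column names are invented for illustration):

```python
import pandas as pd
from ctgan import CTGAN  # pip install ctgan

# Illustrative mobile-sensor table: continuous accelerometer features
# alongside discrete context columns (file and column names invented).
df = pd.read_csv("har_windows.csv")
discrete_columns = ["device_location", "time_of_day", "activity"]

model = CTGAN(epochs=300)
model.fit(df, discrete_columns)   # models the mixed-type table jointly

# Labeled synthetic windows to augment the real training set.
synthetic = model.sample(10_000)
```

Declaring the discrete columns explicitly is what lets CTGAN model categorical context such as device location alongside real-valued sensor readings.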
  4. Recurrent Classifier Chains (RCCs) are a leading approach for multi-label classification as they directly model the interdependencies between classes. Unfortunately, existing RCCs assume that every training instance is completely labeled with all its ground truth classes. In practice, often only a subset of an instance's labels is annotated, while the annotations for other classes are missing. RCCs fail in this missing-label scenario, predicting many false negatives and potentially missing important classes. In this work, we propose Robust-RCC, the first strategy for tackling this open problem of RCCs failing on multi-label missing-label data. Robust-RCC is a new type of deep recurrent classifier chain empowered to model the inter-class relationships essential for predicting the complete label set most likely to match the ground truth. The key to Robust-RCC is the design of the Multi Incomplete Label Risk (MILR) function, which we prove to be equal in expectation to the true risk of the ground truth full label set despite being computed from incompletely labeled data. Our experimental study demonstrates that Robust-RCC consistently beats six state-of-the-art methods by as much as 30% in predicting the true labels.
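The MILR function is defined in the paper and is not reproduced here. As a generic illustration of the underlying idea, computing risk only over the labels that were actually annotated so that missing labels are not treated as negatives, consider this simplified masked multi-label loss (an assumption-laden sketch, not the authors' MILR):

```python
import torch
import torch.nn.functional as F

def masked_multilabel_risk(logits, labels, observed_mask):
    """Risk computed over observed labels only (illustrative, not MILR).

    logits:        (batch, n_classes) raw model outputs.
    labels:        (batch, n_classes) 0/1 annotations; entries where
                   observed_mask == 0 are missing, not true negatives.
    observed_mask: (batch, n_classes) float, 1 where a label was annotated.
    """
    per_label = F.binary_cross_entropy_with_logits(
        logits, labels, reduction="none")
    # Averaging only over annotated entries avoids the failure mode the
    # abstract describes: counting missing labels as false negatives.
    return (per_label * observed_mask).sum() / observed_mask.sum().clamp(min=1)
```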
  5. Corpora of unstructured textual data, such as text messages between individuals, are often predictive of medical issues such as depression. The text data typically used in healthcare applications has high value and great variety, but is usually small in volume. Generating labeled unstructured text data is important both to improve models by augmenting these small datasets and to facilitate anonymization. While methods for labeled data generation exist, not all of them generalize well to small datasets. In this work, we thus perform a much-needed systematic comparison of conditional text generation models that are promising for small datasets due to their unified architectures. We identify and implement a family of nine conditional sequence generative adversarial networks for text generation, which we collectively refer to as cSeqGAN models. These models are characterized along two orthogonal design dimensions: weighting strategies and feedback mechanisms. We conduct a comparative study evaluating the generation ability of the nine cSeqGAN models on three diverse text datasets with depression and sentiment labels. To assess the quality and realism of the generated text, we use standard machine learning metrics as well as human assessment via a user study. While the unconditioned models produced predictive text, the cSeqGAN models produced more realistic text. Our comparative study lays a solid foundation and provides important insights for further text generation research, particularly for the small datasets common within the healthcare domain.
  6. Ranking evaluation metrics play an important role in information retrieval, providing optimization objectives during development and means of assessing deployed performance. Recently, fairness of rankings has been recognized as crucial, especially as automated systems are increasingly used for high-impact decisions. While numerous fairness metrics have been proposed, a comparative analysis to understand their interrelationships is lacking. Even for fundamental statistical parity metrics, which measure group advantage, it remains unclear whether metrics measure the same phenomena, or when one metric may produce different results than another. To address these open questions, we formulate a conceptual framework for analytical comparison of metrics. We prove that under reasonable assumptions, popular metrics in the literature exhibit the same behavior and that optimizing for one optimizes for all. However, our analysis also shows that the metrics vary in the degree of unfairness measured, in particular when one group has a strong majority. Based on this analysis, we design a practical statistical test to identify whether observed data is likely to exhibit predictable group bias. We provide a set of recommendations for practitioners to guide the choice of an appropriate fairness metric.
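The paper's metrics and statistical test are not spelled out in the abstract. As a minimal illustration of the kind of group-advantage quantity that statistical parity metrics for rankings compare, the hypothetical function below contrasts each group's share of the top-k positions with its share of the whole candidate pool; the specific form is an assumption for illustration only.

```python
from collections import Counter

def topk_share_vs_pool_share(ranking, groups, k):
    """Compare each group's representation in the top-k positions of a
    ranking with its representation in the whole candidate pool.

    ranking: list of item ids, best first.
    groups:  dict mapping item id -> group label.
    Returns {group: top-k share minus pool share}; 0 means parity and
    positive values mean the group is over-represented near the top.
    """
    pool = Counter(groups[i] for i in ranking)
    top = Counter(groups[i] for i in ranking[:k])
    n = len(ranking)
    return {g: top.get(g, 0) / k - pool[g] / n for g in pool}

# Toy example: group B holds a strong majority of the pool and also
# fills every top-3 slot, the regime where the abstract notes metrics
# disagree most about the degree of unfairness.
ranking = ["b1", "b2", "b3", "a1", "b4", "a2"]
groups = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B", "b4": "B"}
print(topk_share_vs_pool_share(ranking, groups, k=3))
# -> {'B': 0.333..., 'A': -0.333...}
```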