skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Improving medical machine learning models with generative balancing for equity and excellence
Abstract Applying machine learning to clinical outcome prediction is challenging due to imbalanced datasets and sensitive tasks that contain rare yet critical outcomes and where equitable treatment across diverse patient groups is essential. Despite attempts, biases in predictions persist, driven by disparities in representation and exacerbated by the scarcity of positive labels, perpetuating health inequities. This paper introduces , a synthetic data generation approach leveraging large language models, to address these issues. enhances algorithmic performance and reduces bias by creating realistic, anonymous synthetic patient data that improves representation and augments dataset patterns while preserving privacy. Through experiments on multiple datasets, we demonstrate that boosts mortality prediction performance across diverse subgroups, achieving up to a 21% improvement in F1 Score without requiring additional data or altering downstream training pipelines. Furthermore, consistently reduces subgroup performance gaps, as shown by universal improvements in performance and fairness metrics across four experimental setups.  more » « less
Award ID(s):
2205289
PAR ID:
10571913
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Nature Publishing Group
Date Published:
Journal Name:
npj Digital Medicine
Volume:
8
Issue:
1
ISSN:
2398-6352
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract MotivationRecent advancements in natural language processing have highlighted the effectiveness of global contextualized representations from protein language models (pLMs) in numerous downstream tasks. Nonetheless, strategies to encode the site-of-interest leveraging pLMs for per-residue prediction tasks, such as crotonylation (Kcr) prediction, remain largely uncharted. ResultsHerein, we adopt a range of approaches for utilizing pLMs by experimenting with different input sequence types (full-length protein sequence versus window sequence), assessing the implications of utilizing per-residue embedding of the site-of-interest as well as embeddings of window residues centered around it. Building upon these insights, we developed a novel residual ConvBiLSTM network designed to process window-level embeddings of the site-of-interest generated by the ProtT5-XL-UniRef50 pLM using full-length sequences as input. This model, termed T5ResConvBiLSTM, surpasses existing state-of-the-art Kcr predictors in performance across three diverse datasets. To validate our approach of utilizing full sequence-based window-level embeddings, we also delved into the interpretability of ProtT5-derived embedding tensors in two ways: firstly, by scrutinizing the attention weights obtained from the transformer’s encoder block; and secondly, by computing SHAP values for these tensors, providing a model-agnostic interpretation of the prediction results. Additionally, we enhance the latent representation of ProtT5 by incorporating two additional local representations, one derived from amino acid properties and the other from supervised embedding layer, through an intermediate fusion stacked generalization approach, using an n-mer window sequence (or, peptide/fragment). The resultant stacked model, dubbed LMCrot, exhibits a more pronounced improvement in predictive performance across the tested datasets. Availability and implementationLMCrot is publicly available at https://github.com/KCLabMTU/LMCrot. 
    more » « less
  2. The continuous evolution of the IoT paradigm has been extensively applied across various application domains, including air traffic control, education, healthcare, agriculture, transportation, smart home appliances, and others. Our primary focus revolves around exploring the applications of IoT, particularly within healthcare, where it assumes a pivotal role in facilitating secure and real-time remote patient-monitoring systems. This innovation aims to enhance the quality of service and ultimately improve people’s lives. A key component in this ecosystem is the Healthcare Monitoring System (HMS), a technology-based framework designed to continuously monitor and manage patient and healthcare provider data in real time. This system integrates various components, such as software, medical devices, and processes, aimed at improvi1g patient care and supporting healthcare providers in making well-informed decisions. This fosters proactive healthcare management and enables timely interventions when needed. However, data transmission in these systems poses significant security threats during the transfer process, as malicious actors may attempt to breach security protocols.This jeopardizes the integrity of the Internet of Medical Things (IoMT) and ultimately endangers patient safety. Two feature sets—biometric and network flow metric—have been incorporated to enhance detection in healthcare systems. Another major challenge lies in the scarcity of publicly available balanced datasets for analyzing diverse IoMT attack patterns. To address this, the Auxiliary Classifier Generative Adversarial Network (ACGAN) was employed to generate synthetic samples that resemble minority class samples. ACGAN operates with two objectives: the discriminator differentiates between real and synthetic samples while also predicting the correct class labels. This dual functionality ensures that the discriminator learns detailed features for both tasks. Meanwhile, the generator produces high-quality samples that are classified as real by the discriminator and correctly labeled by the auxiliary classifier. The performance of this approach, evaluated using the IoMT dataset, consistently outperforms the existing baseline model across key metrics, including accuracy, precision, recall, F1-score, area under curve (AUC), and confusion matrix results. 
    more » « less
  3. Abstract INTRODUCTIONIdentifying mild cognitive impairment (MCI) patients at risk for dementia could facilitate early interventions. Using electronic health records (EHRs), we developed a model to predict MCI to all‐cause dementia (ACD) conversion at 5 years. METHODSCox proportional hazards model was used to identify predictors of ACD conversion from EHR data in veterans with MCI. Model performance (area under the receiver operating characteristic curve [AUC] and Brier score) was evaluated on a held‐out data subset. RESULTSOf 59,782 MCI patients, 15,420 (25.8%) converted to ACD. The model had good discriminative performance (AUC 0.73 [95% confidence interval (CI) 0.72–0.74]), and calibration (Brier score 0.18 [95% CI 0.17–0.18]). Age, stroke, cerebrovascular disease, myocardial infarction, hypertension, and diabetes were risk factors, while body mass index, alcohol abuse, and sleep apnea were protective factors. DISCUSSIONEHR‐based prediction model had good performance in identifying 5‐year MCI to ACD conversion and has potential to assist triaging of at‐risk patients. HighlightsOf 59,782 veterans with mild cognitive impairment (MCI), 15,420 (25.8%) converted to all‐cause dementia within 5 years.Electronic health record prediction models demonstrated good performance (area under the receiver operating characteristic curve 0.73; Brier 0.18).Age and vascular‐related morbidities were predictors of dementia conversion.Synthetic data was comparable to real data in modeling MCI to dementia conversion. Key PointsAn electronic health record–based model using demographic and co‐morbidity data had good performance in identifying veterans who convert from mild cognitive impairment (MCI) to all‐cause dementia (ACD) within 5 years.Increased age, stroke, cerebrovascular disease, myocardial infarction, hypertension, and diabetes were risk factors for 5‐year conversion from MCI to ACD.High body mass index, alcohol abuse, and sleep apnea were protective factors for 5‐year conversion from MCI to ACD.Models using synthetic data, analogs of real patient data that retain the distribution, density, and covariance between variables of real patient data but are not attributable to any specific patient, performed just as well as models using real patient data. This could have significant implications in facilitating widely distributed computing of health‐care data with minimized patient privacy concern that could accelerate scientific discoveries. 
    more » « less
  4. Abstract Conformal prediction provides machine learning models with prediction sets that offer theoretical guarantees, but the underlying assumption of exchangeability limits its applicability to time series data. Furthermore, existing approaches struggle to handle multi-step ahead prediction tasks, where uncertainty estimates across multiple future time points are crucial. We propose JANET (JointAdaptive predictioN-regionEstimation forTime-series), a novel framework for constructing conformal prediction regions that are valid for both univariate and multivariate time series. JANET generalises the inductive conformal framework and efficiently produces joint prediction regions with controlledK-familywise error rates, enabling flexible adaptation to specific application needs. Our empirical evaluation demonstrates JANET’s superior performance in multi-step prediction tasks across diverse time series datasets, highlighting its potential for reliable and interpretable uncertainty quantification in sequential data. 
    more » « less
  5. Abstract Designing for manufacturing poses significant challenges in part due to the computation bottleneck of Computer-Aided Manufacturing (CAM) simulations. Although deep learning as an alternative offers fast inference, its performance is dependently bounded by the need for abundant training data. Representation learning, particularly through pre-training, offers promise for few-shot learning, aiding in manufacturability tasks where data can be limited. This work introduces VIRL, a Volume-Informed Representation Learning approach to pre-train a 3D geometric encoder. The pretrained model is evaluated across four manufacturability indicators obtained from CAM simulations: subtractive machining (SM) time, additive manufacturing (AM) time, residual von Mises stress, and blade collisions during Laser Power Bed Fusion process. Across all case studies, the model pre-trained by VIRL shows substantial enhancements in generalizability, as measured by R2 regression results, with improved performance on limited data and superior predictive accuracy with larger datasets. Regarding deployment strategy, case-specific phenomenon exists where finetuning VIRL-pretrained models adversely affects AM tasks with limited data but benefits SM time prediction. Moreover, the efficacy of Low-rank adaptation (LoRA), which balances between probing and finetuning, is explored. LoRA shows stable performance akin to probing with limited data, while achieving a higher upper bound than probing as data size increases, without the computational costs of finetuning. Furthermore, static normalization of manufacturing indicators consistently performs well across tasks, while dynamic normalization enhances performance when a reliable task dependent input is available. 
    more » « less