Abstract ObjectiveEarly identification of chronic diseases is a pillar of precision medicine as it can lead to improved outcomes, reduction of disease burden, and lower healthcare costs. Predictions of a patient’s health trajectory have been improved through the application of machine learning approaches to electronic health records (EHRs). However, these methods have traditionally relied on “black box” algorithms that can process large amounts of data but are unable to incorporate domain knowledge, thus limiting their predictive and explanatory power. Here, we present a method for incorporating domain knowledge into clinical classifications by embedding individual patient data into a biomedical knowledge graph. Materials and MethodsA modified version of the Page rank algorithm was implemented to embed millions of deidentified EHRs into a biomedical knowledge graph (SPOKE). This resulted in high-dimensional, knowledge-guided patient health signatures (ie, SPOKEsigs) that were subsequently used as features in a random forest environment to classify patients at risk of developing a chronic disease. ResultsOur model predicted disease status of 5752 subjects 3 years before being diagnosed with multiple sclerosis (MS) (AUC = 0.83). SPOKEsigs outperformed predictions using EHRs alone, and the biological drivers of the classifiers provided insight into the underpinnings of prodromal MS. ConclusionUsing data from EHR as input, SPOKEsigs describe patients at both the clinical and biological levels. We provide a clinical use case for detecting MS up to 5 years prior to their documented diagnosis in the clinic and illustrate the biological features that distinguish the prodromal MS state. 
                        more » 
                        « less   
                    
                            
                            Clinical risk prediction using language models: benefits and considerations
                        
                    
    
            Abstract ObjectiveThe use of electronic health records (EHRs) for clinical risk prediction is on the rise. However, in many practical settings, the limited availability of task-specific EHR data can restrict the application of standard machine learning pipelines. In this study, we investigate the potential of leveraging language models (LMs) as a means to incorporate supplementary domain knowledge for improving the performance of various EHR-based risk prediction tasks. MethodsWe propose two novel LM-based methods, namely “LLaMA2-EHR” and “Sent-e-Med.” Our focus is on utilizing the textual descriptions within structured EHRs to make risk predictions about future diagnoses. We conduct a comprehensive comparison with previous approaches across various data types and sizes. ResultsExperiments across 6 different methods and 3 separate risk prediction tasks reveal that employing LMs to represent structured EHRs, such as diagnostic histories, results in significant performance improvements when evaluated using standard metrics such as area under the receiver operating characteristic (ROC) curve and precision-recall (PR) curve. Additionally, they offer benefits such as few-shot learning, the ability to handle previously unseen medical concepts, and adaptability to various medical vocabularies. However, it is noteworthy that outcomes may exhibit sensitivity to a specific prompt. ConclusionLMs encompass extensive embedded knowledge, making them valuable for the analysis of EHRs in the context of risk prediction. Nevertheless, it is important to exercise caution in their application, as ongoing safety concerns related to LMs persist and require continuous consideration. 
        more » 
        « less   
        
    
    
                            - PAR ID:
- 10492744
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Journal of the American Medical Informatics Association
- Volume:
- 31
- Issue:
- 9
- ISSN:
- 1067-5027
- Format(s):
- Medium: X Size: p. 1856-1864
- Size(s):
- p. 1856-1864
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Abstract IntroductionStudies investigating the relationship between blood pressure (BP) measurements from electronic health records (EHRs) and Alzheimer's disease (AD) rely on summary statistics, like BP variability, and have only been validated at a single institution. We hypothesize that leveraging BP trajectories can accurately estimate AD risk across different populations. MethodsIn a retrospective cohort study, EHR data from Veterans Affairs (VA) patients were used to train and internally validate a machine learning model to predict AD onset within 5 years. External validation was conducted on patients from Michigan Medicine (MM). ResultsThe VA and MM cohorts included 6860 and 1201 patients, respectively. Model performance using BP trajectories was modest but comparable (area under the receiver operating characteristic curve [AUROC] = 0.64 [95% confidence interval (CI) = 0.54–0.73] for VA vs. AUROC = 0.66 [95% CI = 0.55–0.76] for MM). ConclusionApproaches that directly leverage BP trajectories from EHR data could aid in AD risk stratification across institutions.more » « less
- 
            Electronic health records (EHRs) have been heavily used in modern healthcare systems for recording patients' admission information to health facilities. Many data-driven approaches employ temporal features in EHR for predicting specific diseases, readmission times, and diagnoses of patients. However, most existing predictive models cannot fully utilize EHR data, due to an inherent lack of labels in supervised training for some temporal events. Moreover, it is hard for the existing methods to simultaneously provide generic and personalized interpretability. To address these challenges, we propose Sherbet, a self-supervised graph learning framework with hyperbolic embeddings for temporal health event prediction. We first propose a hyperbolic embedding method with information flow to pretrain medical code representations in a hierarchical structure. We incorporate these pretrained representations into a graph neural network (GNN) to detect disease complications and design a multilevel attention method to compute the contributions of particular diseases and admissions, thus enhancing personalized interpretability. We present a new hierarchy-enhanced historical prediction proxy task in our self-supervised learning framework to fully utilize EHR data and exploit medical domain knowledge. We conduct a comprehensive set of experiments on widely used publicly available EHR datasets to verify the effectiveness of our model. Our results demonstrate the proposed model's strengths in both predictive tasks and interpretable abilities.more » « less
- 
            Abstract INTRODUCTIONIdentifying mild cognitive impairment (MCI) patients at risk for dementia could facilitate early interventions. Using electronic health records (EHRs), we developed a model to predict MCI to all‐cause dementia (ACD) conversion at 5 years. METHODSCox proportional hazards model was used to identify predictors of ACD conversion from EHR data in veterans with MCI. Model performance (area under the receiver operating characteristic curve [AUC] and Brier score) was evaluated on a held‐out data subset. RESULTSOf 59,782 MCI patients, 15,420 (25.8%) converted to ACD. The model had good discriminative performance (AUC 0.73 [95% confidence interval (CI) 0.72–0.74]), and calibration (Brier score 0.18 [95% CI 0.17–0.18]). Age, stroke, cerebrovascular disease, myocardial infarction, hypertension, and diabetes were risk factors, while body mass index, alcohol abuse, and sleep apnea were protective factors. DISCUSSIONEHR‐based prediction model had good performance in identifying 5‐year MCI to ACD conversion and has potential to assist triaging of at‐risk patients. HighlightsOf 59,782 veterans with mild cognitive impairment (MCI), 15,420 (25.8%) converted to all‐cause dementia within 5 years.Electronic health record prediction models demonstrated good performance (area under the receiver operating characteristic curve 0.73; Brier 0.18).Age and vascular‐related morbidities were predictors of dementia conversion.Synthetic data was comparable to real data in modeling MCI to dementia conversion. Key PointsAn electronic health record–based model using demographic and co‐morbidity data had good performance in identifying veterans who convert from mild cognitive impairment (MCI) to all‐cause dementia (ACD) within 5 years.Increased age, stroke, cerebrovascular disease, myocardial infarction, hypertension, and diabetes were risk factors for 5‐year conversion from MCI to ACD.High body mass index, alcohol abuse, and sleep apnea were protective factors for 5‐year conversion from MCI to ACD.Models using synthetic data, analogs of real patient data that retain the distribution, density, and covariance between variables of real patient data but are not attributable to any specific patient, performed just as well as models using real patient data. This could have significant implications in facilitating widely distributed computing of health‐care data with minimized patient privacy concern that could accelerate scientific discoveries.more » « less
- 
            Deep neural networks, including the Transformer architecture, have achieved remarkable performance in various time series tasks. However, their effectiveness in handling clinical time series data is hindered by specific challenges: 1) Sparse event sequences collected asynchronously with multivariate time series, and 2) Limited availability of labeled data. To address these challenges, we propose Our code is available at https://github.com/SigmaTsing/TransEHR.git . , a self-supervised Transformer model designed to encode multi-sourced asynchronous sequential data, such as structured Electronic Health Records (EHRs), efficiently. We introduce three pretext tasks for pre-training the Transformer model, utilizing large amounts of unlabeled structured EHR data, followed by fine-tuning on downstream prediction tasks using the limited labeled data. Through extensive experiments on three real-world health datasets, we demonstrate that our model achieves state-of-the-art performance on benchmark clinical tasks, including in-hospital mortality classification, phenotyping, and length-of-stay prediction. Our findings highlight the efficacy of in effectively addressing the challenges associated with clinical time series data, thus contributing to advancements in healthcare analytics.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
