Title: A platform for phenotyping disease progression and associated longitudinal risk factors in large-scale EHRs, with application to incident diabetes complications in the UK Biobank
Abstract Objective Modern healthcare data reflect massive multi-level and multi-scale information collected over many years. The majority of the existing phenotyping algorithms use case–control definitions of disease. This paper aims to study the time to disease onset and progression and identify the time-varying risk factors that drive them. Materials and Methods We developed an algorithmic approach to phenotyping the incidence of diseases by consolidating data sources from the UK Biobank (UKB), including primary care electronic health records (EHRs). We focused on defining events, event dates, and their censoring time, including relevant terms and existing phenotypes, excluding generic, rare, or semantically distant terms, forward-mapping terminology terms, and expert review. We applied our approach to phenotyping diabetes complications, including a composite cardiovascular disease (CVD) outcome, diabetic kidney disease (DKD), and diabetic retinopathy (DR), in the UKB study. Results We identified 49 049 participants with diabetes. Among them, 1023 had type 1 diabetes (T1D), and 40 193 had type 2 diabetes (T2D). A total of 23 833 diabetes subjects had linked primary care records. There were 3237, 3113, and 4922 patients with CVD, DKD, and DR events, respectively. The risk prediction performance for each outcome was assessed, and our results are consistent with the prediction area under the ROC (receiver operating characteristic) curve (AUC) of standard risk prediction models using cohort studies. Discussion and Conclusion Our publicly available pipeline and platform enable streamlined curation of incidence events, identification of time-varying risk factors underlying disease progression, and the definition of a relevant cohort for time-to-event analyses. These important steps need to be considered simultaneously to study disease progression.
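As a rough illustration of the event, event-date, and censoring-time definitions described in the abstract, the sketch below derives a time-to-event record for one participant. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name, field names, and example dates are hypothetical. The rule used is a generic survival-analysis convention: take the earliest qualifying event across data sources, exclude participants whose event predates baseline, and otherwise censor at the end of follow-up.

```python
from datetime import date
from typing import Iterable, NamedTuple, Optional


class TimeToEvent(NamedTuple):
    days: int       # follow-up time in days, measured from baseline
    observed: bool  # True if the event occurred, False if censored


def time_to_event(
    baseline: date,
    event_dates: Iterable[date],
    censor_date: date,
) -> Optional[TimeToEvent]:
    """Hypothetical helper: build one incidence record for one participant.

    event_dates pools candidate event dates from all sources (assumed to be
    already filtered to relevant codes). Returns None for a prevalent case,
    i.e. an event on or before baseline, which an incidence analysis excludes.
    """
    dates = sorted(event_dates)
    if dates and dates[0] <= baseline:
        return None  # prevalent at baseline, not an incident event
    if dates and dates[0] <= censor_date:
        # Earliest recorded event inside the follow-up window.
        return TimeToEvent((dates[0] - baseline).days, True)
    # No event recorded before the end of follow-up: censored observation.
    return TimeToEvent((censor_date - baseline).days, False)


if __name__ == "__main__":
    record = time_to_event(
        baseline=date(2010, 1, 1),
        event_dates=[date(2014, 6, 1), date(2012, 1, 1)],
        censor_date=date(2020, 1, 1),
    )
    print(record)
```

Records of this shape (a duration plus an event indicator) are the usual input to time-to-event models such as Cox regression.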
Award ID(s):
2054253 2205441
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Geographically-based screening policies for diabetic retinopathy (DR) can be effective in developing teleretinal imaging (TRI) guidelines while identifying patients with limited geographic access to eye care. This study conducts cost-effectiveness analysis of different screening policies for urban and rural diabetic patients in Western Pennsylvania. A Monte Carlo simulation model was used to evaluate the cost-effectiveness of 2 standardized screening policies (annual clinic-based screening (ACS) and annual TRI-based screening (ATRI)) and a personalized TRI-based screening policy (PTRI) for both urban and rural cohorts. PTRI was generated by a previously developed mathematical model that autonomously makes semi-annual screening recommendations based on each patient’s disease progression and compliance (Dorali et al. IOVS 2022; 63(7)). For each policy, hypothetical urban and rural cohorts of 50,000 patients were simulated and lifetime QALYs and costs were collected for each patient. TRI compliance rates were derived from electronic medical records. Compliance with clinic-based screening was selected from literature-based values (12-45% for rural patients and 50-65% for urban patients). For a base case urban cohort with an A1C level of 7% and entering age of 40, costs per QALY gain (CPQ) for ACS, ATRI, and PTRI were $744.93±1.57, $792.38±1.64, and $714.60±1.56, respectively; PTRI produced more cost saving than ACS with the same QALY gain (See Fig 1). For a base case rural cohort, CPQ for ACS, ATRI, and PTRI were $869.15±1.80, $819.24±1.88, and $761.51±1.42, respectively; both ATRI and PTRI dominated ACS in QALY gains and cost saving (Fig 1). PTRI recommended TRI more to rural patients (94.13±0.01%) than to urban patients (87.20±0.02%). For the rural cohort, the minimum average TRI compliance rate such that ATRI is more cost-effective than ACS was 56% (Fig 2). TRI-based screening was found more beneficial for rural patients. 
PTRI was found dominant in QALY gain and cost saving for both urban and rural cohorts against standardized policies. These findings suggest that TRI is best utilized when location-specific factors such as geographic access to care or TRI compliance are considered. 
  2. Glaucoma is a multifactorial disease and a leading cause of irreversible blindness worldwide. Current data has demonstrated the approximate distribution of primary open-angle glaucoma (POAG) in patients of European, African, Hispanic, and Eastern Asian descent. However, a significant gap in the literature exists regarding the prevalence of POAG in Middle Eastern (ME) populations. Current studies estimate ME POAG prevalence based on a European model. Herein we screened 65 total publications on ME prevalence of POAG and specific risk factors using keywords: “glaucoma”, “prevalence”, “incidence”, “risk factor”, “Middle East”, “Mideast”, “Persian”, “Far East”, as well as searching by individual ME countries through PubMed, Embase, Ovid, Scopus, and Trip searches with additional reference list searches from relevant articles published up to and including March 1, 2021. Fifty qualifying records were included after 15 studies identified with low statistical power, confounding co-morbid ophthalmic diseases, and funding bias were excluded. Studies of ME glaucoma risk factors that identify chromosomes, familial trend, age/gender, socioeconomic status, lifestyle, intraocular pressure, vascular influences, optic disc hemorrhage, cup-to-disc ratio, blood pressure, obstructive sleep apnea, and diabetes mellitus were included in this systematic review. We conclude that the prevalence of POAG in the ME is likely higher than the prevalence rate that European models suggest, with ME-specific risk factors likely playing a role. However, these findings are severely limited by the paucity of population-level data in the ME. Well-designed, longitudinal population-based studies with rigorous inclusion and exclusion criteria are ultimately needed to accurately assess the epidemiology and specific mechanistic risk factors of glaucoma in ME populations.
  3. Abstract STUDY QUESTION

    To what extent is preconception maternal or paternal coronavirus disease 2019 (COVID-19) vaccination associated with miscarriage incidence?


    COVID-19 vaccination in either partner at any time before conception is not associated with an increased rate of miscarriage.


    Several observational studies have evaluated the safety of COVID-19 vaccination during pregnancy and found no association with miscarriage, though no study prospectively evaluated the risk of early miscarriage (gestational weeks [GW] <8) in relation to COVID-19 vaccination. Moreover, no study has evaluated the role of preconception vaccination in both male and female partners.


    An Internet-based, prospective preconception cohort study of couples residing in the USA and Canada. We analyzed data from 1815 female participants who conceived during December 2020–November 2022, including 1570 couples with data on male partner vaccination.


    Eligible female participants were aged 21–45 years and were trying to conceive without use of fertility treatment at enrollment. Female participants completed questionnaires at baseline, every 8 weeks until pregnancy, and during early and late pregnancy; they could also invite their male partners to complete a baseline questionnaire. We collected data on COVID-19 vaccination (brand and date of doses), history of SARS-CoV-2 infection (yes/no and date of positive test), potential confounders (demographic, reproductive, and lifestyle characteristics), and pregnancy status on all questionnaires. Vaccination status was categorized as never (0 doses before conception), ever (≥1 dose before conception), having a full primary sequence before conception, and completing the full primary sequence ≤3 months before conception. These categories were not mutually exclusive. Participants were followed up from their first positive pregnancy test until miscarriage or a censoring event (induced abortion, ectopic pregnancy, loss to follow-up, 20 weeks’ gestation), whichever occurred first. We estimated incidence rate ratios (IRRs) for miscarriage and corresponding 95% CIs using Cox proportional hazards models with GW as the time scale. We used propensity score fine stratification weights to adjust for confounding.


    Among 1815 eligible female participants, 75% had received at least one dose of a COVID-19 vaccine by the time of conception. Almost one-quarter of pregnancies resulted in miscarriage, and 75% of miscarriages occurred <8 weeks’ gestation. The propensity score-weighted IRR comparing female participants who received at least one dose any time before conception versus those who had not been vaccinated was 0.85 (95% CI: 0.63, 1.14). COVID-19 vaccination was not associated with increased risk of either early miscarriage (GW: <8) or late miscarriage (GW: 8–19). There was no indication of an increased risk of miscarriage associated with male partner vaccination (IRR = 0.90; 95% CI: 0.56, 1.44).


    The present study relied on self-reported vaccination status and infection history. Thus, there may be some non-differential misclassification of exposure status. While misclassification of miscarriage is also possible, the preconception cohort design and high prevalence of home pregnancy testing in this cohort reduced the potential for under-ascertainment of miscarriage. As in all observational studies, residual or unmeasured confounding is possible.


    This is the first study to evaluate prospectively the relation between preconception COVID-19 vaccination in both partners and miscarriage, with more complete ascertainment of early miscarriages than earlier studies of vaccination. The findings are informative for individuals planning a pregnancy and their healthcare providers.


    This work was supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development, the National Institute of Health [R01-HD086742 (PI: L.A.W.); R01-HD105863S1 (PI: L.A.W. and M.L.E.)], the National Institute of Allergy and Infectious Diseases (R03-AI154544; PI: A.K.R.), and the National Science Foundation (NSF-1914792; PI: L.A.W.). The funders had no role in the study design, data collection, analysis and interpretation of data, writing of the report, or the decision to submit the paper for publication. L.A.W. is a fibroid consultant for AbbVie, Inc. She also receives in-kind donations from Swiss Precision Diagnostics (Clearblue home pregnancy tests) and (fertility apps). M.L.E. received consulting fees from Ro, Hannah, Dadi, VSeat, and Underdog, holds stock in Ro, Hannah, Dadi, and Underdog, is a past president of SSMR, and is a board member of SMRU. K.F.H. reports being an investigator on grants to her institution from UCB and Takeda, unrelated to this study. S.H.-D. reports being an investigator on grants to her institution from Takeda, unrelated to this study, and a methods consultant for UCB and Roche for unrelated drugs. The authors report no other relationships or activities that could appear to have influenced the submitted work.



  4. Abstract INTRODUCTION

    Identifying mild cognitive impairment (MCI) patients at risk for dementia could facilitate early interventions. Using electronic health records (EHRs), we developed a model to predict MCI to all‐cause dementia (ACD) conversion at 5 years.


    Cox proportional hazards model was used to identify predictors of ACD conversion from EHR data in veterans with MCI. Model performance (area under the receiver operating characteristic curve [AUC] and Brier score) was evaluated on a held‐out data subset.


    Of 59,782 MCI patients, 15,420 (25.8%) converted to ACD. The model had good discriminative performance (AUC 0.73 [95% confidence interval (CI) 0.72–0.74]), and calibration (Brier score 0.18 [95% CI 0.17–0.18]). Age, stroke, cerebrovascular disease, myocardial infarction, hypertension, and diabetes were risk factors, while body mass index, alcohol abuse, and sleep apnea were protective factors.


    EHR‐based prediction model had good performance in identifying 5‐year MCI to ACD conversion and has potential to assist triaging of at‐risk patients.


    Of 59,782 veterans with mild cognitive impairment (MCI), 15,420 (25.8%) converted to all‐cause dementia within 5 years.

    Electronic health record prediction models demonstrated good performance (area under the receiver operating characteristic curve 0.73; Brier 0.18).

    Age and vascular‐related morbidities were predictors of dementia conversion.

    Synthetic data was comparable to real data in modeling MCI to dementia conversion.

    Key Points

    An electronic health record–based model using demographic and co‐morbidity data had good performance in identifying veterans who convert from mild cognitive impairment (MCI) to all‐cause dementia (ACD) within 5 years.

    Increased age, stroke, cerebrovascular disease, myocardial infarction, hypertension, and diabetes were risk factors for 5‐year conversion from MCI to ACD.

    High body mass index, alcohol abuse, and sleep apnea were protective factors for 5‐year conversion from MCI to ACD.

    Models using synthetic data, analogs of real patient data that retain the distribution, density, and covariance between variables of real patient data but are not attributable to any specific patient, performed just as well as models using real patient data. This could have significant implications in facilitating widely distributed computing of health‐care data with minimized patient privacy concern that could accelerate scientific discoveries.
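The two performance measures reported in this abstract, AUC and the Brier score, can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the study's evaluation code: the function names and toy data are hypothetical, the outcome is treated as a plain binary label (converted within 5 years or not), and censoring is ignored, which a full time-to-event evaluation would need to handle.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome.

    Lower is better; 0.0 is a perfect forecast.
    """
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives a
    higher predicted risk than a randomly chosen negative case; ties count
    as one half. Quadratic in sample size, so suitable only for small data.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: 1 = converted to dementia, 0 = did not convert.
    outcomes = [1, 0, 1, 0]
    risks = [0.9, 0.1, 0.4, 0.6]
    print(auc(outcomes, risks))          # discrimination
    print(brier_score(outcomes, risks))  # overall accuracy of probabilities
```

AUC measures only how well the model ranks patients, while the Brier score also penalizes predicted probabilities that are far from the observed outcomes, which is why the two are usually reported together.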

  5.
    Background Cardiovascular disease (CVD) disparities are a particularly devastating manifestation of health inequity. Despite advancements in prevention and treatment, CVD is still the leading cause of death in the United States. Additionally, research indicates that African American (AA) and other ethnic-minority populations are affected by CVD at earlier ages than white Americans. Given that AAs are the fastest-growing population of smartphone owners and users, mobile health (mHealth) technologies offer the unparalleled potential to prevent or improve self-management of chronic disease among this population. Objective To address the unmet need for culturally tailored primordial prevention CVD–focused mHealth interventions, the MOYO app was cocreated with the involvement of young people from this priority community. The overall project aims to develop and evaluate the effectiveness of a novel smartphone app designed to reduce CVD risk factors among urban-AAs, 18-29 years of age. Methods The theoretical underpinning will combine the principles of community-based participatory research and the agile software development framework. The primary outcome goals of the study will be to determine the usability, acceptability, and functionality of the MOYO app, and to build a cloud-based data collection infrastructure suitable for digital epidemiology in a disparity population. Changes in health-related parameters over a 24-week period as determined by both passive (eg, physical activity levels, sleep duration, social networking) and active (eg, use of mood measures, surveys, uploading pictures of meals and blood pressure readings) measures will be the secondary outcome. Participants will be recruited from a majority AA “large city” school district, 2 historically black colleges or universities, and 1 urban undergraduate college. Following baseline screening for inclusion (administered in person), participants will receive the beta version of the MOYO app. 
Participants will be monitored during a 24-week pilot period. Analyses of varying data including social network dynamics, standard metrics of activity, percentage of time away from a given radius of home, circadian rhythm metrics, and proxies for sleep will be performed. Together with external variables (eg, weather, pollution, and socioeconomic indicators such as food access), these metrics will be used to train machine-learning frameworks to regress them on the self-reported quality of life indicators. Results This 5-year study (2015-2020) is currently in the implementation phase. We believe that MOYO can build upon findings of classical epidemiology and longitudinal studies like the Jackson Heart Study by adding greater granularity to our knowledge of the exposures and behaviors that affect health and disease, and creating a channel for outreach capable of launching interventions, clinical trials, and enhancements of health literacy. Conclusions The results of this pilot will provide valuable information about community cocreation of mHealth programs, efficacious design features, and essential infrastructure for digital epidemiology among young AA adults. International Registered Report Identifier (IRRID) DERR1-10.2196/16699 