

Title: Case studies in bias reduction and inference for electronic health record data with selection bias and phenotype misclassification

Electronic health records (EHR) are not designed for population‐based research, but they provide easy and quick access to longitudinal health information for a large number of individuals. Many statistical methods have been proposed to account for selection bias, missing data, phenotyping errors, or other problems that arise in EHR data analysis. However, addressing multiple sources of bias simultaneously is challenging. We developed a methodological framework (R package, SAMBA) for jointly handling both selection bias and phenotype misclassification in the EHR setting that leverages external data sources. These methods assume factors related to selection and misclassification are fully observed, but these factors may be poorly understood and partially observed in practice. As a follow‐up to the methodological work, we demonstrate how to apply these methods in two real‐world case studies, and we evaluate their performance. In both examples, we use individual patient‐level data collected through the University of Michigan Health System and various external population‐based data sources. In case study (a), we explore the impact of these methods on estimated associations between gender and cancer diagnosis. In case study (b), we compare corrected associations between previously identified genetic loci and age‐related macular degeneration with gold standard external summary estimates. These case studies illustrate how to utilize diverse auxiliary information to achieve less biased inference in EHR‐based research.

 
NSF-PAR ID:
10372092
Author(s) / Creator(s):
 ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Statistics in Medicine
Volume:
41
Issue:
28
ISSN:
0277-6715
Page Range / eLocation ID:
p. 5501-5516
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract STUDY QUESTION

    To what extent is male fatty acid intake associated with fecundability among couples planning pregnancy?

    SUMMARY ANSWER

    We observed weak positive associations of male dietary intakes of total and saturated fatty acids with fecundability; no other fatty acid subtypes were appreciably associated with fecundability.

    WHAT IS KNOWN ALREADY

    Male fatty acid intake has been associated with semen quality in previous studies. However, little is known about the extent to which male fatty acid intake is associated with fecundability among couples attempting spontaneous conception.

    STUDY DESIGN, SIZE, DURATION

    We conducted an internet-based preconception prospective cohort study of 697 couples who enrolled during 2015–2022. During 12 cycles of observation, 53 couples (7.6%) were lost to follow-up.

    PARTICIPANTS/MATERIALS, SETTING, METHODS

    Participants were residents of the USA or Canada, aged 21–45 years, and not using fertility treatment at enrollment. At baseline, male participants completed a food frequency questionnaire from which we estimated intakes of total fat and fatty acid subtypes. We ascertained time to pregnancy using questionnaires completed every 8 weeks by female participants until conception or up to 12 months. We used proportional probabilities regression models to estimate fecundability ratios (FRs) and 95% CIs for the associations of fat intakes with fecundability, adjusting for male and female partner characteristics. We used the multivariate nutrient density method to account for energy intake, allowing for interpretation of results as fat intake replacing carbohydrate intake. We conducted several sensitivity analyses to assess the potential for confounding, selection bias, and reverse causation.

    MAIN RESULTS AND THE ROLE OF CHANCE

    Among 697 couples, we observed 465 pregnancies during 2970 menstrual cycles of follow-up. The cumulative incidence of pregnancy during 12 cycles of follow-up after accounting for censoring was 76%. Intakes of total and saturated fatty acids were weakly, positively associated with fecundability. Fully adjusted FRs for quartiles of total fat intake were 1.32 (95% CI 1.01–1.71), 1.16 (95% CI 0.88–1.51), and 1.43 (95% CI 1.09–1.88) for the second, third, and fourth vs the first quartile, respectively. Fully adjusted FRs for saturated fatty acid intake were 1.21 (95% CI 0.94–1.55), 1.16 (95% CI 0.89–1.51), and 1.23 (95% CI 0.94–1.62) for the second, third, and fourth vs the first quartile, respectively. Intakes of monounsaturated, polyunsaturated, trans-, omega-3, and omega-6 fatty acids were not strongly associated with fecundability. Results were similar after adjustment for the female partner’s intakes of trans- and omega-3 fats.

    LIMITATIONS, REASONS FOR CAUTION

    Dietary intakes estimated from the food frequency questionnaire may be subject to non-differential misclassification, which is expected to bias results toward the null in the extreme categories when exposures are modeled as quartiles. There may be residual confounding by unmeasured dietary, lifestyle, or environmental factors. Sample size was limited, especially in subgroup analyses.

    WIDER IMPLICATIONS OF THE FINDINGS

    Our results do not support a strong causal effect of male fatty acid intakes on fecundability among couples attempting to conceive spontaneously. The weak positive associations we observed between male dietary fat intakes and fecundability may reflect a combination of causal associations, measurement error, chance, and residual confounding.

    STUDY FUNDING/COMPETING INTEREST(S)

    The study was funded by the National Institutes of Health, grant numbers R01HD086742 and R01HD105863. In the last 3 years, PRESTO has received in-kind donations from Swiss Precision Diagnostics (home pregnancy tests) and Kindara.com (fertility app). L.A.W. is a consultant for AbbVie, Inc. M.L.E. is an advisor to Sandstone, Ro, Underdog, Dadi, Hannah, Doveras, and VSeat. The other authors have no competing interests to report.

    TRIAL REGISTRATION NUMBER

    N/A.

     
  2. Abstract Objectives

    The World Health Organization estimates that almost 300 million people suffer from depression worldwide. African Americans are understudied for depression‐related phenotypes despite widespread racial disparities. In our study of African Americans, we integrated information on psychosocial stressors with genetic variation in order to better understand how these factors associated with depressive symptoms.

    Methods

    Our research strategy combined information on financial strain and social networks with genetic data to investigate variation in symptoms of depression (CES‐D scores). We collected self‐report data on depressive symptoms, financial strain (difficulty paying bills) and personal social networks (a model of an individual's social environment), and we genotyped genetic variants in five genes previously implicated in depressive disorders (HTR1a, BDNF, GNB3, SLC6A4, and FKBP5) in 128 African Americans residing in Tallahassee, Florida. We tested for direct and gene–environment interactive effects of the psychosocial stressors and genetic variants on depressive symptoms.

    Results

    Significant associations were identified between high CES‐D scores and a stressful social environment (i.e., a high percentage of people in participants' social network who were a source of stress) and high financial strain. Only one genetic variant (rs1360780 in FKBP5) was significantly associated with CES‐D scores and only when psychosocial stressors were included in the model; the T allele had an additive effect on depressive symptoms. Sex was also significantly associated with CES‐D score in the model with psychosocial stressors and genetic variants; males had higher CES‐D scores. No significant interactive effects were detected.

    Conclusions

    A stressful social environment and material disadvantage increase depressive symptoms in the study population. Additional associations with FKBP5 and male sex were revealed in models that included both psychosocial and genetic data. Our results suggest that incorporating psychosocial stressors may empower future genetic association studies and help clarify the biological consequences of social and financial stress.

     
  3. Abstract STUDY QUESTION

    To what extent is exposure to cellular telephones associated with male fertility?

    SUMMARY ANSWER

    Overall, we found little association between carrying a cell phone in the front pants pocket and male fertility, although among leaner men (BMI <25 kg/m2), carrying a cell phone in the front pants pocket was associated with lower fecundability.

    WHAT IS KNOWN ALREADY

    Some studies have indicated that cell phone use is associated with poor semen quality, but the results are conflicting.

    STUDY DESIGN, SIZE, DURATION

    Two prospective preconception cohort studies were conducted with men in Denmark (n = 751) and in North America (n = 2349), enrolled and followed via the internet from 2012 to 2020.

    PARTICIPANTS/MATERIALS, SETTING, METHODS

    On the baseline questionnaire, males reported their hours/day of carrying a cell phone in different body locations. We ascertained time to pregnancy via bi-monthly follow-up questionnaires completed by the female partner for up to 12 months or until reported conception. We used proportional probabilities regression models to estimate fecundability ratios (FRs) and 95% confidence intervals (CIs) for the association between male cell phone habits and fecundability, focusing on front pants pocket exposure, within each cohort separately and pooling across the cohorts using a fixed-effect meta-analysis. In a subset of participants, we examined selected semen parameters (semen volume, sperm concentration and sperm motility) using a home-based semen testing kit.

    MAIN RESULTS AND THE ROLE OF CHANCE

    There was little overall association between carrying a cell phone in a front pants pocket and fecundability: the FR for any front pants pocket exposure versus none was 0.94 (95% CI: 0.83–1.05). We observed an inverse association between any front pants pocket exposure and fecundability among men whose BMI was <25 kg/m2 (FR = 0.72, 95% CI: 0.59–0.88) but little association among men whose BMI was ≥25 kg/m2 (FR = 1.05, 95% CI: 0.90–1.22). There were few consistent associations between cell phone exposure and semen volume, sperm concentration, or sperm motility.

    LIMITATIONS, REASONS FOR CAUTION

    Exposure to radiofrequency radiation from cell phones is subject to considerable non-differential misclassification, which would tend to attenuate the estimates for dichotomous comparisons and extreme exposure categories (e.g. exposure 8 vs. 0 h/day). Residual confounding by occupation or other unknown or poorly measured factors may also have affected the results.

    WIDER IMPLICATIONS OF THE FINDINGS

    Overall, there was little association between carrying one’s phone in the front pants pocket and fecundability. There was a moderate inverse association between front pants pocket cell phone exposure and fecundability among men with BMI <25 kg/m2, but not among men with BMI ≥25 kg/m2. Although several previous studies have indicated associations between cell phone exposure and lower sperm motility, we found few consistent associations with any semen quality parameters.

    STUDY FUNDING/COMPETING INTEREST(S)

    The study was funded by the National Institutes of Health, grant number R03HD090315. In the last 3 years, PRESTO has received in-kind donations from Sandstone Diagnostics (for semen kits), Swiss Precision Diagnostics (home pregnancy tests), Kindara.com (fertility app), and FertilityFriend.com (fertility app). Dr. L.A.W. is a fibroid consultant for AbbVie, Inc. Dr. H.T.S. reports that the Department of Clinical Epidemiology is involved in studies with funding from various companies as research grants to and administered by Aarhus University. None of these studies are related to the current study. Dr. M.L.E. is an advisor to Sandstone Diagnostics, Ro, Dadi, Hannah, and Underdog. Dr. G.J.S. holds ownership in Sandstone Diagnostics Inc., developers of the Trak Male Fertility Testing System. In addition, Dr. G.J.S. has a patent pending related to Trak Male Fertility Testing System issued.

    TRIAL REGISTRATION NUMBER

    N/A
  4. Abstract This project is funded by the US National Science Foundation (NSF) through their NSF RAPID program under the title “Modeling Corona Spread Using Big Data Analytics.” The project is a joint effort between the Department of Computer & Electrical Engineering and Computer Science at FAU and a research group from LexisNexis Risk Solutions. The novel coronavirus Covid-19 originated in China in early December 2019 and has rapidly spread to many countries around the globe, with the number of confirmed cases increasing every day. Covid-19 is officially a pandemic. It is a novel infection with serious clinical manifestations, including death, and it has reached at least 124 countries and territories. Although the ultimate course and impact of Covid-19 are uncertain, it is not merely possible but likely that the disease will produce enough severe illness to overwhelm the worldwide health care infrastructure. Emerging viral pandemics can place extraordinary and sustained demands on public health and health systems and on providers of essential community services. Modeling the Covid-19 pandemic spread is challenging, but there are data that can be used to project resource demands. Estimates of the reproductive number (R) of SARS-CoV-2 show that at the beginning of the epidemic, each infected person spreads the virus to at least two others, on average (Emanuel et al. in N Engl J Med. 2020, Livingston and Bucher in JAMA 323(14):1335, 2020). A conservatively low estimate is that 5% of the population could become infected within 3 months. Preliminary data from China and Italy regarding the distribution of case severity and fatality vary widely (Wu and McGoogan in JAMA 323(13):1239–42, 2020). A recent large-scale analysis from China suggests that 80% of those infected either are asymptomatic or have mild symptoms, a finding that implies that demand for advanced medical services might apply to only 20% of the total infected. Of patients infected with Covid-19, about 15% have severe illness and 5% have critical illness (Emanuel et al. in N Engl J Med. 2020). Overall, mortality ranges from 0.25% to as high as 3.0% (Emanuel et al. in N Engl J Med. 2020, Wilson et al. in Emerg Infect Dis 26(6):1339, 2020). Case fatality rates are much higher for vulnerable populations, such as persons over the age of 80 years (>14%) and those with coexisting conditions (10% for those with cardiovascular disease and 7% for those with diabetes) (Emanuel et al. in N Engl J Med. 2020). Overall, Covid-19 is substantially deadlier than seasonal influenza, which has a mortality of roughly 0.1%. Public health efforts depend heavily on predicting how diseases such as Covid-19 spread across the globe. During the early days of a new outbreak, when reliable data are still scarce, researchers turn to mathematical models that can predict where people who could be infected are going and how likely they are to bring the disease with them. These computational methods use known statistical equations that calculate the probability of individuals transmitting the illness. Modern computational power allows these models to quickly incorporate multiple inputs, such as a given disease’s ability to pass from person to person and the movement patterns of potentially infected people traveling by air and land. This process sometimes involves making assumptions about unknown factors, such as an individual’s exact travel pattern. By plugging in different possible versions of each input, however, researchers can update the models as new information becomes available and compare their results to observed patterns for the illness. In this paper we describe the development of a model of Corona spread using innovative big data analytics techniques and tools. We leveraged our experience from research in modeling Ebola spread (Shaw et al. Modeling Ebola Spread and Using HPCC/KEL System. In: Big Data Technologies and Applications 2016 (pp. 347-385). Springer, Cham) to model Corona spread, with the aim of obtaining new results and helping to reduce the number of Corona patients. We closely collaborated with LexisNexis, which is a leading US data analytics company and a member of our NSF I/UCRC for Advanced Knowledge Enablement. The lack of a comprehensive view and informative analysis of the status of the pandemic can also cause panic and instability within society. Our work proposes the HPCC Systems Covid-19 tracker, which provides a timely, multi-level view of the pandemic with informative virus spreading indicators. The system embeds a classical epidemiological model known as SIR and spreading indicators based on a causal model. The data solution of the tracker is built on top of the Big Data processing platform HPCC Systems, from ingesting and tracking of various data sources to fast delivery of the data to the public. The HPCC Systems Covid-19 tracker presents the Covid-19 data on a daily, weekly, and cumulative basis, from the global level down to the county level. It also provides statistical analysis for each level, such as new cases per 100,000 population. The primary analysis, such as Contagion Risk and Infection State, is based on a causal model with a seven-day sliding window. Our work has been released as a publicly available website and has attracted a great volume of traffic. The project is open-sourced and available on GitHub. The system was developed on the LexisNexis HPCC Systems, which is briefly described in the paper.
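The SIR model and the per-100,000 and seven-day-window statistics mentioned in this abstract can be illustrated with a minimal Python sketch. This is not the tracker's code (the abstract says the tracker is built on HPCC Systems); the function names, the daily discrete-time update, and the parameter values `beta` and `gamma` are all assumptions chosen for illustration only.

```python
def simulate_sir(population, initial_infected, beta, gamma, days):
    """Discrete-time (daily step) SIR simulation.

    beta  : assumed transmission rate per day (hypothetical value)
    gamma : assumed recovery rate per day (hypothetical value)
    Returns a list of (susceptible, infected, recovered) tuples, one per day.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


def new_cases_per_100k(new_cases, population):
    """Scale a case count to a rate per 100,000 population."""
    return new_cases * 100_000 / population


def sliding_mean(values, window=7):
    """Mean over a trailing window; the first output covers values[0:window]."""
    return [
        sum(values[k - window + 1:k + 1]) / window
        for k in range(window - 1, len(values))
    ]


if __name__ == "__main__":
    # Illustrative parameters only; beta / gamma = 2 loosely matches
    # "each infected person spreads the virus to at least two others".
    history = simulate_sir(population=1_000_000, initial_infected=10,
                           beta=0.4, gamma=0.2, days=60)
    daily_new = [history[d][0] - history[d + 1][0] for d in range(60)]
    smoothed = sliding_mean(daily_new, window=7)
    print(new_cases_per_100k(smoothed[-1], 1_000_000))
```

The ratio `beta / gamma` plays the role of the reproductive number at the start of the outbreak, when almost everyone is susceptible.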
  5. Large‐scale association analyses based on observational health care databases such as electronic health records have been a topic of increasing interest in the scientific community. However, challenges due to nonprobability sampling and phenotype misclassification associated with the use of these data sources are often ignored in standard analyses. The extent of the bias introduced by ignoring these factors is not well‐characterized. In this paper, we develop an analytic framework for characterizing the bias expected in disease‐gene association studies based on electronic health records when disease status misclassification and the sampling mechanism are ignored. Through a sensitivity analysis approach, this framework can be used to obtain plausible values for parameters of interest given summary results from standard analysis. We develop an online tool for performing this sensitivity analysis. Simulations demonstrate promising properties of the proposed method. We apply our approach to study bias in disease‐gene association studies using electronic health record data from the Michigan Genomics Initiative, a longitudinal biorepository effort within the University of Michigan health system.
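To give a feel for how phenotype misclassification alone can distort an association, the sketch below applies the standard identity for an imperfectly measured binary outcome: observed prevalence = sensitivity × true prevalence + (1 − specificity) × (1 − true prevalence). It is a simplified illustration, not the framework from the abstract: it ignores the sampling mechanism entirely, assumes sensitivity and specificity do not depend on exposure, and all function names and numbers are hypothetical.

```python
def observed_prevalence(true_prev, sensitivity, specificity):
    """Probability of being *recorded* as a case, given the true prevalence."""
    return sensitivity * true_prev + (1.0 - specificity) * (1.0 - true_prev)


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def odds_ratio(p_exposed, p_unexposed):
    """Odds ratio comparing exposed with unexposed."""
    return odds(p_exposed) / odds(p_unexposed)


def observed_odds_ratio(p_exposed, p_unexposed, sensitivity, specificity):
    """Odds ratio computed from misclassified disease status.

    Assumes non-differential misclassification: the same sensitivity and
    specificity apply to exposed and unexposed individuals.
    """
    return odds_ratio(
        observed_prevalence(p_exposed, sensitivity, specificity),
        observed_prevalence(p_unexposed, sensitivity, specificity),
    )


if __name__ == "__main__":
    # Hypothetical inputs: 20% vs 10% true disease prevalence,
    # 60% sensitivity and 95% specificity of the EHR-derived phenotype.
    true_or = odds_ratio(0.20, 0.10)
    seen_or = observed_odds_ratio(0.20, 0.10, sensitivity=0.60, specificity=0.95)
    print(f"true OR = {true_or:.2f}, observed OR = {seen_or:.2f}")
```

Running the sensitivity and specificity arguments over a grid of plausible values is one simple way to see how far a naive estimate might sit from the underlying association.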

     