
Title: Ensuring electronic medical record simulation through better training, modeling, and evaluation
Abstract Objective

Electronic medical records (EMRs) can support medical research and discovery, but privacy risks limit the sharing of such data on a wide scale. Various approaches have been developed to mitigate risk, including record simulation via generative adversarial networks (GANs). While showing promise in certain application domains, GANs lack a principled approach for EMR data, which leads to subpar simulation. In this article, we improve EMR simulation through a novel pipeline that (1) enhances the learning model, (2) incorporates evaluation criteria for data utility that inform learning, and (3) refines the training process.

Materials and Methods

We propose a new electronic health record generator using a GAN with a Wasserstein divergence and layer normalization techniques. We designed 2 utility measures to characterize similarity in the structural properties of real and simulated EMRs in the original and latent space, respectively. We applied a filtering strategy to enhance GAN training for low-prevalence clinical concepts. We evaluated the new and existing GANs with utility and privacy measures (membership and disclosure attacks) using billing codes from over 1 million EMRs at Vanderbilt University Medical Center.

Results


The proposed model outperformed the state-of-the-art approaches with significant improvement in retaining the nature of real records, including prediction performance and structural properties, without sacrificing privacy. Additionally, the filtering strategy achieved higher utility when the EMR training dataset was small.

Conclusions


These findings illustrate that EMR simulation through GANs can be substantially improved through more appropriate training, modeling, and evaluation criteria.

Publication Date:
Journal Name:
Journal of the American Medical Informatics Association
Page Range or eLocation-ID:
p. 99-108
Oxford University Press
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background

    Many studies have investigated disparities in access to care by payer status. However, most are either disease-specific or cohort-specific. Studies that quantify the disparity at the facility level through a large controlled design are rare. This study aims to examine how payer status affects patient hospitalization from the perspective of a facility.


    We extracted all patients with a visit record at a medical center between 5/1/2009 and 4/30/2014, and then linked each patient's outpatient and inpatient records from the three years before the target admission time. We conducted a retrospective observational study using conditional logistic regression. To control for patients' illness across different diseases when training the model, we constructed a three-dimensional variable using data stratification. The model was validated on a dataset distinct from the one used for training.


    Patients who are privately insured or uninsured are less likely to be hospitalized than patients insured by the government. For uninsured patients, inequity in access to hospitalization is observed. The values of the standardized coefficients indicate that government-sponsored insurance has the greatest impact on improving patients’ hospitalization.


    Attention is needed to improve access to care for uninsured patients. Also, basic preventive care services should be enhanced, especially for people insured by the government. The findings can serve as a baseline from which to measure the anticipated effect of measures to reduce payer-status disparity in hospitalization.

  2. Abstract Background

    A considerable amount of various types of data have been collected during the COVID-19 pandemic, the analysis and understanding of which have been indispensable for curbing the spread of the disease. As the pandemic moves to an endemic state, the data collected during the pandemic will continue to be rich sources for further studying and understanding the impacts of the pandemic on various aspects of our society. On the other hand, naïve release and sharing of the information can be associated with serious privacy concerns.


    We use three common but distinct data types collected during the pandemic (case surveillance tabular data, case location data, and contact tracing networks) to illustrate the publication and sharing of granular information and individual-level pandemic data in a privacy-preserving manner. We leverage and build upon the concept of differential privacy to generate and release privacy-preserving data for each data type. We investigate the inferential utility of privacy-preserving information through simulation studies at different levels of privacy guarantees and demonstrate the approaches in real-life data. All the approaches employed in the study are straightforward to apply.


    The empirical studies in all three data cases suggest that privacy-preserving results based on the differentially privately sanitized data can be similar to the original results at a reasonably small privacy loss ($$\epsilon \approx 1$$). Statistical inferences based on sanitized data using the multiple synthesis technique also appear valid, with nominal coverage of 95% confidence intervals when there is no noticeable bias in point estimation. When $$\epsilon < 1$$ and the sample size is not large enough, some privacy-preserving results are subject to bias, partially due to the bounding applied to sanitized data as a post-processing step to satisfy practical data constraints.


    Our study generates statistical evidence on the practical feasibility of sharing pandemic data with privacy guarantees and on how to balance the statistical utility of released information during this process.

  3. Abstract

    The strain on healthcare resources brought forth by the recent COVID-19 pandemic has highlighted the need for efficient resource planning and allocation through the prediction of future consumption. Machine learning can predict resource utilization such as the need for hospitalization based on past medical data stored in electronic medical records (EMR). We conducted this study on 3194 patients (46% male with mean age 56.7 (±16.8), 56% African American, 7% Hispanic) flagged as COVID-19 positive cases in 12 centers under Emory Healthcare network from February 2020 to September 2020, to assess whether a COVID-19 positive patient’s need for hospitalization can be predicted at the time of RT-PCR test using the EMR data prior to the test. Five main modalities of EMR, i.e., demographics, medication, past medical procedures, comorbidities, and laboratory results, were used as features for predictive modeling, both individually and fused together using late, middle, and early fusion. Models were evaluated in terms of precision, recall, F1-score (within 95% confidence interval). The early fusion model is the most effective predictor with 84% overall F1-score [CI 82.1–86.1]. The predictive performance of the model drops by 6% when using recent clinical data while omitting the long-term medical history. Feature importance analysis indicates that history of cardiovascular disease, emergency room visits in the past year prior to testing, and demographic factors are predictive of the disease trajectory. We conclude that fusion modeling using medical history and current treatment data can forecast the need for hospitalization for patients infected with COVID-19 at the time of the RT-PCR test.

  4. Abstract Background

    Hypertension is a prevalent cardiovascular disease with severe longer-term implications. Conventional management based on clinical guidelines does not facilitate personalized treatment that accounts for a richer set of patient characteristics.


    Records from 1/1/2012 to 1/1/2020 at the Boston Medical Center were used, selecting patients with either a hypertension diagnosis or meeting diagnostic criteria (≥ 130 mmHg systolic or ≥ 90 mmHg diastolic, n = 42,752). Models were developed to recommend a class of antihypertensive medications for each patient based on their characteristics. Regression immunized against outliers was combined with a nearest neighbor approach to associate with each patient an affinity group of other patients. This group was then used to make predictions of future Systolic Blood Pressure (SBP) under each prescription type. For each patient, we leveraged these predictions to select the class of medication that minimized their future predicted SBP.


    The proposed model, built with a distributionally robust learning procedure, leads to a reduction of 14.28 mmHg in SBP, on average. This reduction is 70.30% larger than the reduction achieved by the standard-of-care and 7.08% better than the corresponding reduction achieved by the 2nd best model which uses ordinary least squares regression. All derived models outperform following the previous prescription or the current ground truth prescription in the record. We randomly sampled and manually reviewed 350 patient records; 87.71% of these model-generated prescription recommendations passed a sanity check by clinicians.


    Our data-driven approach for personalized hypertension treatment yielded significant improvement compared to the standard-of-care. The model implied potential benefits of computationally deprescribing and can support situations with clinical equipoise.

  5. Abstract Motivation

    Cryo-Electron Tomography (cryo-ET) is a 3D imaging technology that enables the visualization of subcellular structures in situ at near-atomic resolution. Cellular cryo-ET images help in resolving the structures of macromolecules and determining their spatial relationship in a single cell, which has broad significance in cell and structural biology. Subtomogram classification and recognition constitute a primary step in the systematic recovery of these macromolecular structures. Supervised deep learning methods have been proven to be highly accurate and efficient for subtomogram classification, but suffer from limited applicability due to scarcity of annotated data. While generating simulated data for training supervised models is a potential solution, a sizeable difference in the image intensity distribution in generated data as compared with real experimental data will cause the trained models to perform poorly in predicting classes on real subtomograms.


    In this work, we present Cryo-Shift, a fully unsupervised domain adaptation and randomization framework for deep learning-based cross-domain subtomogram classification. We use unsupervised multi-adversarial domain adaption to reduce the domain shift between features of simulated and experimental data. We develop a network-driven domain randomization procedure with ‘warp’ modules to alter the simulated data and help the classifier generalize better on experimental data. We do not use any labeled experimental data to train our model, whereas some of the existing alternative approaches require labeled experimental samples for cross-domain classification. Nevertheless, Cryo-Shift outperforms the existing alternative approaches in cross-domain subtomogram classification in extensive evaluation studies demonstrated herein using both simulated and experimental data.

    Availability and implementation

    Supplementary information

    Supplementary data are available at Bioinformatics online.
