In the analysis of electronic health record (EHR) data, nonparametric regression and classification using International Classification of Diseases (ICD) codes as covariates remain understudied. Automated methods have been developed over the years for predicting biomedical responses using EHRs, but relatively little attention has been paid to developing patient similarity measures that use ICD codes and chronic conditions, where a chronic condition is defined as a set of ICD codes. We address this problem by first developing a string kernel function for measuring the similarity between a pair of primary chronic conditions, represented as subsets of ICD codes. Second, we extend this similarity measure to a family of covariance functions on subsets of chronic conditions. This family is used in developing Gaussian process (GP) priors for Bayesian nonparametric regression and classification using diagnoses and other demographic information as covariates. Markov chain Monte Carlo (MCMC) algorithms are used for posterior inference and predictions. The proposed methods are tuning-free, so they are well suited to automated prediction of biomedical responses that depend on chronic conditions. We evaluate the practical performance of our method on EHR data collected from 1660 patients at the University of Iowa Hospitals and Clinics (UIHC) with six different primary cancer sites. Our method provides better sensitivity and specificity than its competitors in classifying different primary cancer sites and estimates the marginal associations between chronic conditions and primary cancer sites.
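The idea of a kernel on chronic conditions, each represented as a set of ICD codes, can be illustrated with a small sketch. The abstract does not specify the form of the string kernel, so the code below swaps in a simple normalized common-prefix kernel between codes and averages it over pairs of codes to compare two conditions. The function names (`common_prefix_len`, `code_kernel`, `condition_kernel`) and the example codes are hypothetical choices made for illustration, not the authors' method.

```python
from itertools import product
from math import sqrt


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters on which two strings agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def code_kernel(a: str, b: str) -> float:
    """Normalized common-prefix kernel between two non-empty ICD codes.

    Assumption: this is an illustrative stand-in, not the kernel from the paper.
    The unnormalized value is the number of shared prefixes, which is an inner
    product of prefix-indicator features and therefore positive semidefinite.
    """
    a, b = a.replace(".", ""), b.replace(".", "")
    return common_prefix_len(a, b) / sqrt(len(a) * len(b))


def _mean_pairwise(codes_a, codes_b) -> float:
    """Average code_kernel value over all pairs drawn from two code lists."""
    total = sum(code_kernel(a, b) for a, b in product(codes_a, codes_b))
    return total / (len(codes_a) * len(codes_b))


def condition_kernel(cond_a, cond_b) -> float:
    """Similarity of two chronic conditions, each a non-empty set of ICD codes.

    Averages the code-level kernel over pairs, then rescales so that every
    condition has similarity 1 with itself.
    """
    a, b = sorted(set(cond_a)), sorted(set(cond_b))
    return _mean_pairwise(a, b) / sqrt(_mean_pairwise(a, a) * _mean_pairwise(b, b))


if __name__ == "__main__":
    print(code_kernel("C50.1", "C50.9"))  # 0.75: three of four characters shared
    print(condition_kernel({"C50.1", "C50.9"}, {"C50.9"}))
    print(condition_kernel({"C50.1", "C50.9"}, {"E11.9"}))
```

A matrix of `condition_kernel` values over all patients could then serve as a covariance matrix for a GP prior. The pairwise-average construction is chosen here only because it preserves positive semidefiniteness in a way that is easy to verify.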
- Award ID(s): 1920920
- NSF-PAR ID: 10228537
- Date Published:
- Journal Name: Applied Clinical Informatics
- Volume: 12
- Issue: 01
- ISSN: 1869-0327
- Page Range / eLocation ID: 010 to 016
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract
Background Aspirin‐exacerbated respiratory disease (AERD) is the triad of asthma, nasal polyposis, and sensitivity to cyclooxygenase‐1 inhibitors. Treatment options include medical management, surgical intervention, and aspirin desensitization (AsaD).
Methods AERD patients were identified using the MarketScan Database from 2009 to 2015. Patients were included using International Classification of Diseases, 9th edition (ICD‐9) codes for asthma, nasal polyposis, and drug allergy. Treatments were determined by Current Procedural Terminology (CPT) codes for drug desensitization and endonasal procedures. Geographic trends and timing of interventions between those exposed and not exposed to desensitization were explored.
Results A total of 5628 patients met inclusion criteria for AERD, with mean age 46 years, 60% female; 395 (7%) underwent AsaD and 2171 (39%) underwent sinus surgery. Among patients who were desensitized, 229 (58%) underwent surgery, of whom 201 (88%) had surgery prior to AsaD (median [quartile 1, quartile 3]; 61 days [30, 208] prior to desensitization). For patients undergoing surgery following AsaD (n = 46), surgery was performed a median of 302 (163, 758) days after AsaD. Nineteen patients had multiple surgeries post‐AsaD with median time between surgeries being 734 days (312, 1484); 261 patients were not desensitized to aspirin but did undergo multiple surgeries, with the median of the median time between surgeries being 287 days (15, 617), which is shorter than for patients post‐AsaD (p < 0.001).
Conclusion A very small percentage of AERD patients undergo AsaD. Patients who had AsaD underwent surgery approximately 2 months prior to AsaD. Patients who underwent AsaD experienced an increased time between surgeries compared to patients who did not undergo AsaD.
-
Abstract The wide‐scale adoption of electronic health records (EHRs) provides extensive information to support precision medicine and personalized health care. In addition to structured EHRs, we leverage free‐text clinical information extraction (IE) techniques to estimate optimal dynamic treatment regimes (DTRs), a sequence of decision rules that dictate how to individualize treatments to patients based on treatment and covariate history. The proposed IE of patient characteristics closely resembles “The clinical Text Analysis and Knowledge Extraction System” and employs named entity recognition, boundary detection, and negation annotation. It also utilizes regular expressions to extract numerical information. Combining the proposed IE with optimal DTR estimation, we extract derived patient characteristics and use tree‐based reinforcement learning (T‐RL) to estimate multistage optimal DTRs. IE significantly improved the estimation in counterfactual outcome models compared to using structured EHR data alone, which often include incomplete data, data entry errors, and other potentially unobserved risk factors. Moreover, including IE in optimal DTR estimation provides larger study cohorts and a broader pool of candidate tailoring variables. We demonstrate the performance of our proposed method via simulations and an application using clinical records to guide blood pressure control treatments among critically ill patients with severe acute hypertension. This joint estimation approach improves the accuracy of identifying the optimal treatment sequence by 14–24% compared to traditional inference without using IE, based on our simulations over various scenarios. In the blood pressure control application, we successfully extracted significant blood pressure predictors that are unobserved or partially missing from structured EHR.
-
Doshi-Velez, Finale ; Fackler, Jim ; Jung, Ken ; Kale, David ; Ranganath, Rajesh ; Wallace, Byron ; Wiens, Jenna (Ed.)
Electronic Health Records (EHRs) provide vital contextual information to radiologists and other physicians when making a diagnosis. Unfortunately, because a given patient’s record may contain hundreds of notes and reports, identifying relevant information within these in the short time typically allotted to a case is very difficult. We propose and evaluate models that extract relevant text snippets from patient records to provide a rough case summary intended to aid physicians considering one or more diagnoses. This is hard because direct supervision (i.e., physician annotations of snippets relevant to specific diagnoses in medical records) is prohibitively expensive to collect at scale. We propose a distantly supervised strategy in which we use groups of International Classification of Diseases (ICD) codes observed in ‘future’ records as noisy proxies for ‘downstream’ diagnoses. Using this we train a transformer-based neural model to perform extractive summarization conditioned on potential diagnoses. This model defines an attention mechanism that is conditioned on potential diagnoses (queries) provided by the diagnosing physician. We train (via distant supervision) and evaluate variants of this model on EHR data from Brigham and Women’s Hospital in Boston and MIMIC-III (the latter to facilitate reproducibility). Evaluations performed by radiologists demonstrate that these distantly supervised models yield better extractive summaries than do unsupervised approaches. Such models may aid diagnosis by identifying sentences in past patient reports that are clinically relevant to a potential diagnosis. Code is available at https://github.com/dmcinerney/ehr-extraction-models.
-
Predicting polycystic ovary syndrome with machine learning algorithms from electronic health records
Introduction Predictive models have been used to aid early diagnosis of PCOS, though existing models are based on small sample sizes and limited to fertility clinic populations. We built a predictive model using machine learning algorithms based on an outpatient population at risk for PCOS to predict risk and facilitate earlier diagnosis, particularly among those who meet diagnostic criteria but have not received a diagnosis.
Methods This is a retrospective cohort study using a safety-net hospital’s electronic health records (EHR) from 2003 to 2016. The study population included 30,601 women aged 18-45 years without concurrent endocrinopathy who had any visit to Boston Medical Center for primary care, obstetrics and gynecology, endocrinology, family medicine, or general internal medicine. Four prediction outcomes were assessed for PCOS. The first outcome was PCOS ICD-9 diagnosis with additional model outcomes of algorithm-defined PCOS. The latter was based on Rotterdam criteria and merging laboratory values, radiographic imaging, and ICD data from the EHR to define irregular menstruation, hyperandrogenism, and polycystic ovarian morphology on ultrasound.
Results We developed predictive models using four machine learning methods: logistic regression, support vector machine, gradient boosted trees, and random forests. Hormone values (follicle-stimulating hormone, luteinizing hormone, estradiol, and sex hormone binding globulin) were combined to create a multilayer perceptron score using a neural network classifier. Prediction of PCOS prior to clinical diagnosis in an out-of-sample test set of patients achieved an average AUC of 85%, 81%, 80%, and 82% in Models I, II, III, and IV, respectively. Significant positive predictors of PCOS diagnosis across models included hormone levels and obesity; negative predictors included gravidity and positive bHCG.
Conclusion Machine learning algorithms were used to predict PCOS based on a large at-risk population. This approach may guide early detection of PCOS within EHR-interfaced populations to facilitate counseling and interventions that may reduce long-term health consequences. Our model illustrates the potential benefits of an artificial intelligence-enabled provider assistance tool that can be integrated into the EHR to reduce delays in diagnosis. However, model validation in other hospital-based populations is necessary.