This content will become publicly available on May 27, 2026

Title: Preprocessing of Physician Notes by LLMs Improves Clinical Concept Extraction Without Information Loss
Clinician notes are a rich source of patient information, but often contain inconsistencies due to varied writing styles, abbreviations, medical jargon, grammatical errors, and non-standard formatting. These inconsistencies hinder their direct use in patient care and degrade the performance of downstream computational applications that rely on these notes as input, such as quality improvement, population health analytics, precision medicine, clinical decision support, and research. We present a large-language-model (LLM) approach to the preprocessing of 1618 neurology notes. The LLM corrected spelling and grammatical errors, expanded acronyms, and standardized terminology and formatting, without altering clinical content. Expert review of randomly sampled notes confirmed that no significant information was lost. To evaluate downstream impact, we applied an ontology-based NLP pipeline (Doc2Hpo) to extract biomedical concepts from the notes before and after editing. F1 scores for Human Phenotype Ontology extraction improved from 0.40 to 0.61, confirming our hypothesis that better inputs yielded better outputs. We conclude that LLM-based preprocessing is an effective error correction strategy that improves data quality at the level of free text in clinical notes. This approach may enhance the performance of a broad class of downstream applications that derive their input from unstructured clinical documentation.
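The pipeline the abstract describes has two stages: an LLM editing pass that normalizes each note without changing its clinical content, and ontology-based concept extraction whose output is scored against expert annotations. A minimal Python sketch of that shape follows; the prompt wording and the llm_complete and extract_hpo callables are illustrative placeholders rather than the authors' implementation (which uses Doc2Hpo for extraction).

```python
"""Illustrative sketch (not the authors' code): an LLM editing pass over a note,
followed by HPO concept extraction and before/after F1 scoring."""

EDIT_PROMPT = (
    "Correct spelling and grammar, expand acronyms, and standardize terminology "
    "and formatting in the clinical note below. Do not add, remove, or alter "
    "any clinical content.\n\nNOTE:\n{note}"
)

def preprocess_note(note: str, llm_complete) -> str:
    """Run one LLM editing pass; `llm_complete` is any text-in/text-out client."""
    return llm_complete(EDIT_PROMPT.format(note=note))

def f1(predicted: set, gold: set) -> float:
    """Set-level F1 over extracted HPO identifiers (e.g. 'HP:0001250')."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def compare(notes, gold_sets, llm_complete, extract_hpo):
    """Mean F1 before vs. after editing; `extract_hpo` stands in for an
    ontology-based extractor such as Doc2Hpo."""
    before = [f1(extract_hpo(n), g) for n, g in zip(notes, gold_sets)]
    after = [f1(extract_hpo(preprocess_note(n, llm_complete)), g)
             for n, g in zip(notes, gold_sets)]
    return sum(before) / len(before), sum(after) / len(after)
```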
Award ID(s):
2423235
PAR ID:
10634297
Author(s) / Creator(s):
Editor(s):
Luo, Ling; Recupero, Diego Reforgiato
Publisher / Repository:
MDPI
Date Published:
Journal Name:
Information
ISSN:
2078-2489
Subject(s) / Keyword(s):
electronic health records; physician notes; human phenotype ontology; Doc2Hpo; large language models; data interoperability; concept extraction
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Sepsis is a dysregulated host response to infection with high mortality and morbidity. Early detection and intervention have been shown to improve patient outcomes, but existing computational models relying on structured electronic health record data often miss contextual information from unstructured clinical notes. This study introduces COMPOSER-LLM, an open-source large language model (LLM) integrated with the COMPOSER model to enhance early sepsis prediction. For high-uncertainty predictions, the LLM extracts additional context to assess sepsis-mimics, improving accuracy. Evaluated on 2500 patient encounters, COMPOSER-LLM achieved a sensitivity of 72.1%, positive predictive value of 52.9%, F-1 score of 61.0%, and 0.0087 false alarms per patient hour, outperforming the standalone COMPOSER model. Prospective validation yielded similar results. Manual chart review found 62% of false positives had bacterial infections, demonstrating potential clinical utility. Our findings suggest that integrating LLMs with traditional models can enhance predictive performance by leveraging unstructured data, representing a significant advance in healthcare analytics. 
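A rough sketch of the uncertainty-gated design summarized above; the threshold values, the llm_is_mimic callable, and the gating logic are assumptions for illustration, not the published COMPOSER-LLM code.

```python
def sepsis_alert(risk_score: float, note_text: str, llm_is_mimic,
                 threshold: float = 0.8, uncertainty_margin: float = 0.1) -> bool:
    """Illustrative gate: alerts near the decision threshold are deferred to an
    LLM review of the clinical note for sepsis-mimics before firing."""
    if risk_score < threshold:
        return False
    if risk_score < threshold + uncertainty_margin:
        # Borderline alert: suppress it if the note points to a sepsis-mimic
        # (e.g. another documented explanation for the abnormal physiology).
        return not llm_is_mimic(note_text)
    return True
```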
  2. We conduct a thorough study to diagnose the behaviors of pre-trained language encoders (ELMo, BERT, and RoBERTa) when confronted with natural grammatical errors. Specifically, we collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data. We use this approach to facilitate debugging models on downstream applications. Results confirm that the performance of all tested models is affected but the degree of impact varies. To interpret model behaviors, we further design a linguistic acceptability task to reveal their abilities in identifying ungrammatical sentences and the position of errors. We find that fixed contextual encoders with a simple classifier trained on the prediction of sentence correctness are able to locate error positions. We also design a cloze test for BERT and discover that BERT captures the interaction between errors and specific tokens in context. Our results shed light on understanding the robustness and behaviors of language encoders against grammatical errors. 
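A minimal sketch of the two ingredients this study combines: perturbing clean text with simulated grammatical errors, and probing a masked language model with a cloze query. The toy substitution rules are illustrative only; the study derives its error patterns from real non-native speaker data.

```python
import random

# Toy substitution rules standing in for learner-style grammatical errors;
# the study itself draws error patterns from real non-native speaker corpora.
ERROR_RULES = {"their": "there", "went": "go", "an": "a"}

def perturb(sentence: str, rate: float = 0.3, seed: int = 0) -> str:
    """Randomly inject substitution errors into a clean sentence."""
    rng = random.Random(seed)
    return " ".join(
        ERROR_RULES.get(tok.lower(), tok) if rng.random() < rate else tok
        for tok in sentence.split()
    )

# Cloze-style probe (needs the `transformers` package and a model download):
# from transformers import pipeline
# fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# fill_mask("She go to [MASK] store yesterday.")  # see how errors shift predictions
```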
  3. Abstract Background Due to intrinsic differences in data formatting, data structure, and underlying semantic information, the integration of imaging data with clinical data can be non‐trivial. Optimal integration requires robust data fusion, that is, the process of integrating multiple data sources to produce more useful information than captured by individual data sources. Here, we introduce the concept of fusion quality for deep learning problems involving imaging and clinical data. We first provide a general theoretical framework and numerical validation of our technique. To demonstrate real‐world applicability, we then apply our technique to optimize the fusion of CT imaging and hepatic blood markers to estimate portal venous hypertension, which is linked to prognosis in patients with cirrhosis of the liver. Purpose To develop a measurement method of optimal data fusion quality for deep learning problems utilizing both imaging data and clinical data. Methods Our approach is based on modeling the fully connected layer (FCL) of a convolutional neural network (CNN) as a potential function, whose distribution takes the form of the classical Gibbs measure. The features of the FCL are then modeled as random variables governed by state functions, which are interpreted as the different data sources to be fused. The probability density of each source, relative to the probability density of the FCL, represents a quantitative measure of source‐bias. To minimize this source‐bias and optimize CNN performance, we implement a vector‐growing encoding scheme called positional encoding, where low‐dimensional clinical data are transcribed into a rich feature space that complements high‐dimensional imaging features. We first provide a numerical validation of our approach based on simulated Gaussian processes. We then applied our approach to patient data, where we optimized the fusion of CT images with blood markers to predict portal venous hypertension in patients with cirrhosis of the liver. This patient study was based on a modified ResNet‐152 model that incorporates both images and blood markers as input. These two data sources were processed in parallel, fused into a single FCL, and optimized based on our fusion quality framework. Results Numerical validation of our approach confirmed that the probability density function of a fused feature space converges to a source‐specific probability density function when source data are improperly fused. Our numerical results demonstrate that this phenomenon can be quantified as a measure of fusion quality. On patient data, the fused model consisting of both imaging data and positionally encoded blood markers at the theoretically optimal fusion quality metric achieved an AUC of 0.74 and an accuracy of 0.71. This model was statistically better than the imaging‐only model (AUC = 0.60; accuracy = 0.62), the blood marker‐only model (AUC = 0.58; accuracy = 0.60), and a variety of purposely sub‐optimized fusion models (AUC = 0.61–0.70; accuracy = 0.58–0.69). Conclusions We introduced the concept of data fusion quality for multi‐source deep learning problems involving both imaging and clinical data. We provided a theoretical framework, numerical validation, and real‐world application in abdominal radiology. Our data suggest that CT imaging and hepatic blood markers provide complementary diagnostic information when appropriately fused.
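A simplified sketch of the positional-encoding step described above, in which low-dimensional clinical values are expanded into a richer feature vector and concatenated with CNN image features ahead of the fully connected layer. The dimensions, frequencies, and stand-in values are assumptions, not the authors' configuration.

```python
import numpy as np

def positional_encode(values: np.ndarray, dim: int = 32) -> np.ndarray:
    """Expand scalar clinical measurements (e.g. blood markers) into a sinusoidal
    feature vector; the dimension and frequencies here are arbitrary choices."""
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = values[:, None] * freqs[None, :]            # (n_markers, dim / 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()

# Encoded markers are concatenated with CNN image features ahead of the fully
# connected layer that the fusion-quality analysis examines.
image_features = np.random.rand(2048)                    # stand-in for ResNet output
blood_markers = np.array([1.2, 0.8, 35.0])               # stand-in lab values
fused = np.concatenate([image_features, positional_encode(blood_markers)])
```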
  4. Uncovering and fixing errors in biomedical terminologies is essential so that they provide accurate knowledge to downstream applications that rely on them. Non-lattice-based methods have been applied to identify various kinds of inconsistencies in different biomedical terminologies. In previous work, we have introduced two inference-based approaches that were applied in an exhaustive manner to audit hierarchical relations in the Gene Ontology: (1) Lexical-based inference framework, and (2) Subsumption-based sub-term inference framework. However, it is unclear how well these exhaustive approaches perform compared with their corresponding non-lattice-based approaches. Therefore, in this paper, we implement the non-lattice versions of these two exhaustive approaches, and perform a comprehensive comparison between non-lattice-based and exhaustive approaches to audit the Gene Ontology. The domain expert evaluations performed for the two exhaustive approaches are leveraged to evaluate the non-lattice versions. The results indicate that the non-lattice versions have higher precision than their exhaustive counterparts even though they do not capture some of the potential inconsistencies that the exhaustive approaches identify.
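A deliberately simplified sketch of the lexical sub-term idea behind this kind of auditing: when one term's name lexically contains another's but no is-a path is recorded between them, the pair is flagged for expert review. The toy hierarchy is illustrative; the actual frameworks operate over the full Gene Ontology graph.

```python
# Toy is-a hierarchy: term name -> set of ancestor term names (names stand in for GO IDs).
HIERARCHY = {
    "regulation of cell growth": {"regulation of growth", "biological regulation"},
    "cell growth": {"growth"},
}

def word_set(name: str) -> frozenset:
    return frozenset(name.split())

def flag_candidates(hierarchy: dict) -> list:
    """Flag term pairs where one name lexically contains the other (a sub-term
    relation) but no is-a path is recorded between them."""
    flags = []
    for broad in hierarchy:
        for narrow in hierarchy:
            if broad != narrow and word_set(broad) < word_set(narrow) \
                    and broad not in hierarchy[narrow]:
                flags.append((narrow, broad))   # candidate inconsistency to review
    return flags

print(flag_candidates(HIERARCHY))
```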
  5. Abstract Background Social and behavioral determinants of health (SBDH) are environmental and behavioral factors that often impede disease management and result in sexually transmitted infections. Despite their importance, SBDH are inconsistently documented in electronic health records (EHRs) and typically collected only in an unstructured format. Evidence suggests that structured data elements present in EHRs can contribute further to identify SBDH in the patient record. Objective Explore the automated inference of both the presence of SBDH documentation and individual SBDH risk factors in patient records. Compare the relative ability of clinical notes and structured EHR data, such as laboratory measurements and diagnoses, to support inference. Methods We attempt to infer the presence of SBDH documentation in patient records, as well as patient status of 11 SBDH, including alcohol abuse, homelessness, and sexual orientation. We compare classification performance when considering clinical notes only, structured data only, and notes and structured data together. We perform an error analysis across several SBDH risk factors. Results Classification models inferring the presence of SBDH documentation achieved good performance (F1 score: 92.7–78.7; F1 considered as the primary evaluation metric). Performance was variable for models inferring patient SBDH risk status; results ranged from F1 = 82.7 for LGBT (lesbian, gay, bisexual, and transgender) status to F1 = 28.5 for intravenous drug use. Error analysis demonstrated that lexical diversity and documentation of historical SBDH status challenge inference of patient SBDH status. Three of five classifiers inferring topic-specific SBDH documentation and 10 of 11 patient SBDH status classifiers achieved highest performance when trained using both clinical notes and structured data. Conclusion Our findings suggest that combining clinical free-text notes and structured data provide the best approach in classifying patient SBDH status. Inferring patient SBDH status is most challenging among SBDH with low prevalence and high lexical diversity. 
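A small sketch of the notes-plus-structured-data setup this study evaluates, combining TF-IDF features from free text with a structured EHR field in a single scikit-learn classifier scored by F1. The column names, toy records, and single SBDH label are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

# Toy records; the column names and single "unstable housing" label are illustrative.
data = pd.DataFrame({
    "note_text": ["pt reports unstable housing, staying in shelter",
                  "no social concerns documented, lives with family",
                  "homeless, sleeping in car, requests social work consult",
                  "housing stable, employed full time"],
    "n_ed_visits": [4, 0, 6, 1],     # stand-in structured EHR feature
    "label": [1, 0, 1, 0],
})

features = ColumnTransformer([
    ("notes", TfidfVectorizer(), "note_text"),        # unstructured free text
    ("structured", "passthrough", ["n_ed_visits"]),   # structured data
])
model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(data, data["label"])
print(f1_score(data["label"], model.predict(data)))   # F1, as in the study's evaluation
```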