Preprocessing of Physician Notes by LLMs Improves Clinical Concept Extraction Without Information Loss

Hier, Daniel B; Carrithers, Michael A; Platt, Steven K; Nguyen, Anh; Giannopolos, Ioannis; Obafemi-Ajay, Tayo

This content will become publicly available on May 27, 2026

Clinician notes are a rich source of patient information, but they often contain inconsistencies due to varied writing styles, abbreviations, medical jargon, grammatical errors, and non-standard formatting. These inconsistencies hinder their direct use in patient care and degrade the performance of downstream computational applications that rely on these notes as input, such as quality improvement, population health analytics, precision medicine, clinical decision support, and research. We present a large language model (LLM) approach to the preprocessing of 1618 neurology notes. The LLM corrected spelling and grammatical errors, expanded acronyms, and standardized terminology and formatting, without altering clinical content. Expert review of randomly sampled notes confirmed that no significant information was lost. To evaluate downstream impact, we applied an ontology-based NLP pipeline (Doc2Hpo) to extract biomedical concepts from the notes before and after editing. F1 scores for Human Phenotype Ontology extraction improved from 0.40 to 0.61, confirming our hypothesis that better inputs yield better outputs. We conclude that LLM-based preprocessing is an effective error correction strategy that improves data quality at the level of free text in clinical notes. This approach may enhance the performance of a broad class of downstream applications that derive their input from unstructured clinical documentation.

Award ID(s): 2423235

PAR ID: 10634297

Author(s) / Creator(s): Hier, Daniel B; Carrithers, Michael A; Platt, Steven K; Nguyen, Anh; Giannopolos, Ioannis; Obafemi-Ajay, Tayo

Editor(s): Luo, Ling; Recupero, Diego Reforgiato

Publisher / Repository: MDPI

Date Published: 2025-05-27

Journal Name: Information

ISSN: 2078-2489

Subject(s) / Keyword(s): electronic health records; physician notes; human phenotype ontology; Doc2Hpo; large language models; data interoperability; concept extraction

Format(s): Medium: X

Sponsoring Org: National Science Foundation

The DOI is not currently available.
