

Title: Towards Semi-Structured Automatic ICD Coding via Tree-based Contrastive Learning
Automatic coding of the International Classification of Diseases (ICD) is a multi-label text categorization task that involves extracting disease or procedure codes from clinical notes. Despite the application of state-of-the-art natural language processing (NLP) techniques, challenges remain, including the limited availability of data due to privacy constraints and the high variability of clinical notes caused by the different writing habits of medical professionals and the various pathological features of patients. In this work, we investigate the semi-structured nature of clinical notes and propose an automatic algorithm to segment them into sections. To address the variability issues in existing ICD coding models with limited data, we introduce a contrastive pre-training approach on sections using a soft multi-label similarity metric based on tree edit distance. Additionally, we design a masked section training strategy to enable ICD coding models to locate sections related to ICD codes. Extensive experimental results demonstrate that our proposed training strategies effectively enhance the performance of existing ICD coding methods.
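The soft multi-label similarity is only described at a high level here. As an illustration, the sketch below shows one way a soft similarity over the ICD hierarchy could be computed, using normalized shared-prefix depth of code paths as a stand-in for tree edit distance; the function names and the prefix-based approximation are illustrative assumptions, not the paper's actual metric.

```python
def code_path(code):
    # Treat an ICD-9-style code such as "250.01" as a path in the code tree:
    # the category before the dot, then one node per digit after it.
    head, _, tail = code.partition(".")
    return [head] + list(tail)

def code_similarity(a, b):
    # Normalized shared-prefix depth: a cheap proxy for tree distance
    # between two codes in the ICD hierarchy.
    pa, pb = code_path(a), code_path(b)
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    return shared / max(len(pa), len(pb))

def soft_set_similarity(codes_a, codes_b):
    # Soft multi-label similarity between two code sets: average best-match
    # similarity in both directions (a symmetric, set-level aggregation).
    if not codes_a or not codes_b:
        return 0.0
    fwd = sum(max(code_similarity(a, b) for b in codes_b) for a in codes_a) / len(codes_a)
    bwd = sum(max(code_similarity(b, a) for a in codes_a) for b in codes_b) / len(codes_b)
    return (fwd + bwd) / 2
```

Such a metric yields graded similarity targets for contrastive pre-training: sections whose code sets share deep hierarchy prefixes are pulled closer than sections with disjoint codes.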
Award ID(s):
2047843 1948432
PAR ID:
10556071
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
NeurIPS
Date Published:
Journal Name:
Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
ISSN:
1049-5258
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Objective: The impact of social determinants of health (SDoH) on patients' healthcare quality and on disparities is well known. Many SDoH items are not coded in structured form in electronic health records. These items are often captured in free-text clinical notes, but there are limited methods for automatically extracting them. We explore a multi-stage pipeline involving named entity recognition (NER), relation classification (RC), and text classification methods to automatically extract SDoH information from clinical notes. Materials and Methods: The study uses the N2C2 Shared Task data, which were collected from 2 sources of clinical notes: MIMIC-III and University of Washington Harborview Medical Centers. It contains 4480 social history sections with full annotation for 12 SDoHs. To handle the issue of overlapping entities, we developed a novel marker-based NER model and used it in a multi-stage pipeline to extract SDoH information from clinical notes. Results: Our marker-based system outperformed state-of-the-art span-based models at handling overlapping entities, based on overall Micro-F1 score, and achieved state-of-the-art performance compared with the shared task methods. Our approach achieved an F1 of 0.9101, 0.8053, and 0.9025 for Subtasks A, B, and C, respectively. Conclusions: The major finding of this study is that the multi-stage pipeline effectively extracts SDoH information from clinical notes. This approach can improve the understanding and tracking of SDoH in clinical settings. However, error propagation may be an issue, and further research is needed to improve the extraction of entities with complex semantic meanings and low-frequency entities. We have made the source code available at https://github.com/Zephyr1022/SDOH-N2C2-UTSA.
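A hypothetical sketch of the marker idea referenced above: each candidate span is wrapped in typed marker tokens before classification, so overlapping spans are scored independently rather than competing for a single tag sequence. The function below is a simplification for illustration; the actual model operates on transformer inputs and is described in the cited repository.

```python
def insert_markers(tokens, start, end, etype):
    # Wrap a candidate span [start, end) in typed marker tokens, producing one
    # marked copy per span; overlapping spans each get their own copy, so they
    # never conflict the way they would in a single BIO tagging sequence.
    return tokens[:start] + [f"<{etype}>"] + tokens[start:end] + [f"</{etype}>"] + tokens[end:]

# Two overlapping candidates over the same sentence become two independent inputs.
sentence = ["quit", "smoking", "in", "2010"]
candidates = [(0, 2, "Tobacco"), (1, 4, "StatusTime")]
marked_inputs = [insert_markers(sentence, s, e, t) for s, e, t in candidates]
```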
  2. As an integral part of qualitative research inquiry, field notes provide important data from researchers embedded in research sites. However, field notes can vary significantly, influenced by the researchers' immersion in the field, prior knowledge, beliefs, interests, and perspectives. As a consequence, their interpretation presents significant challenges. This study offers a preliminary investigation into the potential of using large language models to assist researchers with the analysis and interpretation of field-note data. Our methodology consisted of two phases. First, a researcher deductively coded field notes of six classroom implementations of a novel elementary-level mathematics curriculum. In the second phase, we prompted ChatGPT-4 to code the same field notes, using the codebook, definitions, examples, and deductive coding approach employed by the researcher. We also prompted ChatGPT to provide justifications for its coding decisions. We then calculated agreements and disagreements between ChatGPT and the researcher, organized the data in a contingency table, computed Cohen's Kappa, and structured the data into a confusion matrix; using the researcher's coding as the "gold standard", we calculated performance measures, specifically accuracy, precision, recall, and F1 score. Our findings revealed that while the researcher and ChatGPT appeared to generally agree on the frequency of applying the different codes, overall agreement, as measured by Cohen's Kappa, was low. In contrast, measures from information science at the code level revealed more nuanced results. Moreover, coupled with ChatGPT's justifications of its coding decisions, these findings provide insights that can help support the iterative improvement of codebooks.
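The agreement and retrieval measures named above are standard. As a self-contained illustration on toy labels (not the study's data), Cohen's Kappa and a per-code F1 against a human "gold standard" can be computed as follows:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    # Kappa = (observed agreement - chance agreement) / (1 - chance agreement),
    # where chance agreement comes from each coder's marginal code frequencies.
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    fa, fb = Counter(coder_a), Counter(coder_b)
    chance = sum(fa[c] * fb[c] for c in fa) / (n * n)
    return (observed - chance) / (1 - chance)

def per_code_f1(gold, pred, code):
    # Score one code as a binary label, treating the human coder as gold.
    tp = sum(g == code and p == code for g, p in zip(gold, pred))
    fp = sum(g != code and p == code for g, p in zip(gold, pred))
    fn = sum(g == code and p != code for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

The code-level F1 view can disagree with overall Kappa, which is exactly the nuance the study reports: coders may match on frequencies while Kappa penalizes where the matches fall.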
  3. Background: In the US, obesity is an epidemiologic challenge, and the population fails to comprehend this complex public health issue. To evaluate underlying obesity-impact patterns on mortality rates, we data-mined vital records in the 1999-2016 Centers for Disease Control WONDER database. Methods: Using SAS programming, we scrutinized the mortality and population counts. Using ICD-10 diagnosis codes connected to overweight and obesity, we obtained the obesity-related crude and age-adjusted causes of death. To understand divergence and prevalence trends, we compared the tabulated obesity-influenced mortality rates with demographic, gender, and age-related data. Key Results: From 1999 to 2016, the obesity-related age-adjusted mortality rate increased by 142%. The ICD-10 overweight- and obesity-related death-certificate coding showed clear evidence that obesity factored into a 173% increase in the male age-adjusted mortality rate and a 117% increase in the corresponding female rate. It also disproportionately affected death rates in the nationwide minority population. Furthermore, excess weight distributions are coded as contributing features in the crude death rates for all decennial age groups. Conclusions: The 1999-2016 data from ICD-10 death-certificate coding for obesity-related conditions indicate that obesity is affecting all segments of the US population.
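For readers unfamiliar with the rate arithmetic behind figures like the 142% increase, a crude mortality rate and a percent change can be sketched as below; the numbers are illustrative, not the WONDER figures.

```python
def crude_rate(deaths, population, per=100_000):
    # Crude mortality rate: deaths per `per` population (conventionally 100,000).
    return deaths / population * per

def percent_change(rate_start, rate_end):
    # Relative change between two rates, expressed as a percentage.
    return (rate_end - rate_start) / rate_start * 100
```

Age-adjusted rates additionally weight each age group's crude rate by a standard population, so trends are comparable across years despite population aging.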
  4. In clinical operations, teamwork can be the crucial factor that determines the final outcome, and prior studies have shown that sufficient collaboration is key to an operation's success. To understand how teams practice teamwork during an operation, we collected CliniDial from simulations of medical operations. CliniDial includes the audio data and its transcriptions, the simulated physiology signals of the patient manikins, and recordings of how the team operates from two camera angles. We annotate behavior codes following an existing framework to understand the teamwork process in CliniDial. We pinpoint three main characteristics of our dataset, namely its label imbalance, rich and natural interactions, and multiple modalities, and conduct experiments to test existing LLMs' capabilities at handling data with these characteristics. Experimental results show that CliniDial poses significant challenges to existing models, inviting future work on developing methods that can deal with real-world clinical data. We open-source the codebase at https://github.com/MichiganNLP/CliniDial.
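One common response to the label imbalance highlighted above is inverse-frequency class weighting during training. The sketch below is a generic illustration of that technique, not the authors' method:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    # Weight each behavior code inversely to its frequency, normalized so the
    # average per-sample weight is 1; rare codes then contribute more to the loss.
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {code: n / (k * counts[code]) for code in counts}
```

Such weights can be passed to a weighted cross-entropy loss so that frequent codes do not dominate gradient updates.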
  5. Stereology-based methods provide the current state-of-the-art approaches for accurate quantification of numbers and other morphometric parameters of biological objects in stained tissue sections. The advent of artificial intelligence (AI)-based deep learning (DL) offers the possibility of improving throughput by automating the collection of stereology data. We have recently shown that DL can achieve accuracy comparable to manual stereology but with higher repeatability, improved throughput, and less variation due to human factors, by quantifying the total number of immunostained cells at their maximal profile of focus in extended depth of field (EDF) images. In the first of two novel contributions in this work, we propose a semi-automatic approach using a handcrafted Adaptive Segmentation Algorithm (ASA) to automatically generate ground truth on EDF images for training our DL models to automatically count cells using unbiased stereology methods. This update increases the amount of training data, thereby improving the accuracy and efficiency of automatic cell counting methods, without requiring extra expert time. The second contribution of this work is a Multi-channel Input and Multi-channel Output (MIMO) method using a U-Net deep learning architecture for automatic cell counting in a stack of z-axis images (also known as disector stacks). This DL-based digital automation of the ordinary optical fractionator ensures accurate counts through spatial separation of stained cells in the z-plane, thereby avoiding the under-counting errors (masking) that arise from overlapping cells in EDF images, without the shortcomings of 3D and recurrent DL models. We demonstrate the practical applications of these advances with automatic disector-based estimates of the total number of NeuN-immunostained neurons in a mouse neocortex.
In summary, this work provides the first demonstration of automatic estimation of a total cell number in tissue sections using a combination of deep learning and the disector-based optical fractionator method. 
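The optical fractionator estimate mentioned in item 5 follows the standard stereology formula N = ΣQ⁻ · (1/ssf) · (1/asf) · (1/tsf), scaling the cells counted in sampled disectors by the reciprocals of the section, area, and thickness sampling fractions. A minimal sketch with illustrative sampling fractions (not the study's actual parameters):

```python
def fractionator_estimate(counted, ssf, asf, tsf):
    # Optical fractionator: total number N = sum of disector counts (Q-)
    # times 1/ssf (section sampling fraction), 1/asf (area sampling fraction),
    # and 1/tsf (thickness sampling fraction).
    return counted * (1 / ssf) * (1 / asf) * (1 / tsf)

# E.g., 120 cells counted, sampling every 10th section, 1/25 of each
# section's area, and half of each section's thickness.
estimate = fractionator_estimate(120, ssf=1 / 10, asf=1 / 25, tsf=1 / 2)
```

Because each fraction is an unbiased sampling step, the product remains an unbiased estimator of the total, which is what makes automating the per-disector counts with DL attractive.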