skip to main content


This content will become publicly available on May 1, 2025

Title: Leveraging logical definitions and lexical features to detect missing IS-A relations in biomedical terminologies
Abstract

Biomedical terminologies play a vital role in managing biomedical data. Missing IS-A relations in a biomedical terminology could be detrimental to its downstream usages. In this paper, we investigate an approach combining logical definitions and lexical features to discover missing IS-A relations in two biomedical terminologies: SNOMED CT and the National Cancer Institute (NCI) thesaurus. The method is applied to unrelated concept-pairs within non-lattice subgraphs: graph fragments within a terminology likely to contain various inconsistencies. Our approach first compares whether the logical definition of a concept is more general than  that of the other concept. Then, we check whether the lexical features of the concept are contained in those of the other concept. If both constraints are satisfied, we suggest a potentially missing IS-A relation between the two concepts. The method identified 982 potential missing IS-A relations for SNOMED CT and 100 for NCI thesaurus. In order to assess the efficacy of our approach, a random sample of results belonging to the “Clinical Findings” and “Procedure” subhierarchies of SNOMED CT and results belonging to the “Drug, Food, Chemical or Biomedical Material” subhierarchy of the NCI thesaurus were evaluated by domain experts. The evaluation results revealed that 118 out of 150 suggestions are valid for SNOMED CT and 17 out of 20 are valid for NCI thesaurus.

 
more » « less
Award ID(s):
2047001
PAR ID:
10524223
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Springer Nature
Date Published:
Journal Name:
Journal of Biomedical Semantics
Volume:
15
Issue:
1
ISSN:
2041-1480
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background

    As biomedical knowledge is rapidly evolving, concept enrichment of biomedical terminologies is an active research area involving automatic identification of missing or new concepts. Previously, we prototyped a lexical-based formal concept analysis (FCA) approach in which concepts were derived by intersecting bags of words, to identify potentially missing concepts in the National Cancer Institute (NCI) Thesaurus. However, this prototype did not handle concept naming and positioning. In this paper, we introduce a sequenced-based FCA approach to identify potentially missing concepts, supporting concept naming and positioning.

    Methods

    We consider the concept name sequences as FCA attributes to construct the formal context. The concept-forming process is performed by computing the longest common substrings of concept name sequences. After new concepts are formalized, we further predict their potential positions in the original hierarchy by identifying their supertypes and subtypes from original concepts. Automated validation via external terminologies in the Unified Medical Language System (UMLS) and biomedical literature in PubMed is performed to evaluate the effectiveness of our approach.

    Results

    We applied our sequenced-based FCA approach to all the sub-hierarchies underDisease or Disorderin the NCI Thesaurus (19.08d version) and five sub-hierarchies underClinical FindingandProcedurein the SNOMED CT (US Edition, March 2020 release). In total, 1397 potentially missing concepts were identified in the NCI Thesaurus and 7223 in the SNOMED CT. For NCI Thesaurus, 85 potentially missing concepts were found in external terminologies and 315 of the remaining 1312 appeared in biomedical literature. For SNOMED CT, 576 were found in external terminologies and 1159 out of the remaining 6647 were found in biomedical literature.

    Conclusion

    Our sequence-based FCA approach has shown the promise for identifying potentially missing concepts in biomedical terminologies.

     
    more » « less
  2. null (Ed.)
    Abstract Background The National Cancer Institute (NCI) Thesaurus provides reference terminology for NCI and other systems. Previously, we proposed a hybrid prototype utilizing lexical features and role definitions of concepts in non-lattice subgraphs to identify missing IS-A relations in the NCI Thesaurus. However, no domain expert evaluation was provided in our previous work. In this paper, we further enhance the hybrid approach by leveraging a novel lexical feature—roots of noun chunks within concept names. Formal evaluation of our enhanced approach is also performed. Method We first compute all the non-lattice subgraphs in the NCI Thesaurus. We model each concept using its role definitions, words and roots of noun chunks within its concept name and its ancestor’s names. Then we perform subsumption testing for candidate concept pairs in the non-lattice subgraphs to automatically detect potentially missing IS-A relations. Domain experts evaluated the validity of these relations. Results We applied our approach to 19.08d version of the NCI Thesaurus. A total of 55 potentially missing IS-A relations were identified by our approach and reviewed by domain experts. 29 out of 55 were confirmed as valid by domain experts and have been incorporated in the newer versions of the NCI Thesaurus. 7 out of 55 further revealed incorrect existing IS-A relations in the NCI Thesaurus. Conclusions The results showed that our hybrid approach by leveraging lexical features and role definitions is effective in identifying potentially missing IS-A relations in the NCI Thesaurus. 
    more » « less
  3. Abstract Objective

    SNOMED CT is the largest clinical terminology worldwide. Quality assurance of SNOMED CT is of utmost importance to ensure that it provides accurate domain knowledge to various SNOMED CT-based applications. In this work, we introduce a deep learning-based approach to uncover missing is-a relations in SNOMED CT.

    Materials and Methods

    Our focus is to identify missing is-a relations between concept-pairs exhibiting a containment pattern (ie, the set of words of one concept being a proper subset of that of the other concept). We use hierarchically related containment concept-pairs as positive instances and hierarchically unrelated containment concept-pairs as negative instances to train a model predicting whether an is-a relation exists between 2 concepts with containment pattern. The model is a binary classifier leveraging concept name features, hierarchical features, enriched lexical attribute features, and logical definition features. We introduce a cross-validation inspired approach to identify missing is-a relations among all hierarchically unrelated containment concept-pairs.

    Results

    We trained and applied our model on the Clinical finding subhierarchy of SNOMED CT (September 2019 US edition). Our model (based on the validation sets) achieved a precision of 0.8164, recall of 0.8397, and F1 score of 0.8279. Applying the model to predict actual missing is-a relations, we obtained a total of 1661 potential candidates. Domain experts performed evaluation on randomly selected 230 samples and verified that 192 (83.48%) are valid.

    Conclusions

    The results showed that our deep learning approach is effective in uncovering missing is-a relations between containment concept-pairs in SNOMED CT.

     
    more » « less
  4. null (Ed.)
    Biomedical terminologies have been increasingly used in modern biomedical research and applications to facilitate data management and ensure semantic interoperability. As part of the evolution process, new concepts are regularly added to biomedical terminologies in response to the evolving domain knowledge and emerging applications. Most existing concept enrichment methods suggest new concepts via directly importing knowledge from external sources. In this paper, we introduced a lexical method based on formal concept analysis (FCA) to identify potentially missing concepts in a given terminology by leveraging its intrinsic knowledge - concept names. We first construct the FCA formal context based on the lexical features of concepts. Then we perform multistage intersection to formalize new concepts and detect potentially missing concepts. We applied our method to the Disease or Disorder sub-hierarchy in the National Cancer Institute (NCI) Thesaurus (19.08d version) and identified a total of 8,983 potentially missing concepts. As a preliminary evaluation of our method to validate the potentially missing concepts, we further checked whether they were included in any external source terminology in the Unified Medical Language System (UMLS). The result showed that 592 out of 8,937 potentially missing concepts were found in the UMLS. 
    more » « less
  5. Missing hierarchical is-a relations and missing concepts are common quality issues in biomedical ontologies. Non-lattice subgraphs have been extensively studied for automatically identifying missing is-a relations in biomedical ontologies like SNOMED CT. However, little is known about non-lattice subgraphs’ capability to uncover new or missing concepts in biomedical ontologies. In this work, we investigate a lexical-based intersection approach based on non-lattice subgraphs to identify potential missing concepts in SNOMED CT. We first construct lexical features of concepts using their fully specified names. Then we generate hierarchically unrelated concept pairs in non-lattice subgraphs as the candidates to derive new concepts. For each candidate pair of concepts, we conduct an order-preserving intersection based on the two concepts’ lexical features, with the intersection result serving as the potential new concept name suggested. We further perform automatic validation through terminologies in the Unified Medical Language System (UMLS) and literature in PubMed. Applying this approach to the March 2021 release of SNOMED CT US Edition, we obtained 7,702 potential missing concepts, among which 1,288 were validated through UMLS and 1,309 were validated through PubMed. The results showed that non-lattice subgraphs have the potential to facilitate suggestion of new concepts for SNOMED CT. 
    more » « less