- 
            Abstract Objective: SNOMED CT provides a standardized terminology for clinical concepts, allowing cohort queries over heterogeneous clinical data including Electronic Health Records (EHRs). While it is intuitive that missing and inaccurate subtype (or is-a) relations in SNOMED CT reduce the recall and precision of cohort queries, the extent of these impacts has not been formally assessed. This study fills this gap by developing quantitative metrics to measure these impacts and performing statistical analysis on their significance. Materials and Methods: We used the Optum de-identified COVID-19 Electronic Health Record dataset. We defined micro-averaged and macro-averaged recall and precision metrics to assess the impact of missing and inaccurate is-a relations on cohort queries. Both practical and simulated analyses were performed. Practical analyses involved 407 missing and 48 inaccurate is-a relations confirmed by domain experts, with statistical testing using Wilcoxon signed-rank tests. Simulated analyses used two random sets of 400 is-a relations to simulate missing and inaccurate is-a relations. Results: Wilcoxon signed-rank tests from both practical and simulated analyses (P-values < .001) showed that missing is-a relations significantly reduced the micro- and macro-averaged recall, and inaccurate is-a relations significantly reduced the micro- and macro-averaged precision. Discussion: The introduced impact metrics can assist SNOMED CT maintainers in prioritizing critical hierarchical defects for quality enhancement. These metrics are generally applicable for assessing the quality impact of a terminology’s subtype hierarchy on its cohort query applications. Conclusion: Our results indicate a significant impact of missing and inaccurate is-a relations in SNOMED CT on the recall and precision of cohort queries.
Our work highlights the importance of a high-quality terminology hierarchy for cohort queries over EHR data and provides valuable insights for prioritizing quality improvements of SNOMED CT's hierarchy.
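
The effect of a missing is-a relation on micro- and macro-averaged recall can be illustrated with a minimal sketch. It assumes that a cohort query for a concept retrieves every patient coded with that concept or any of its descendants, and that the corrected hierarchy serves as the reference. The concept names, patient identifiers, and function names are hypothetical toy values, not the study's data or its exact metric definitions.

```python
from typing import Dict, List, Set, Tuple

def descendants(children: Dict[str, List[str]], root: str) -> Set[str]:
    """Return root plus every concept reachable through child (is-a) edges."""
    seen, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def cohort(children: Dict[str, List[str]],
           patients: Dict[str, Set[str]], root: str) -> Set[str]:
    """Patients coded with root or with any of its descendants."""
    result: Set[str] = set()
    for concept in descendants(children, root):
        result |= patients.get(concept, set())
    return result

def micro_macro_recall(retrieved: List[Set[str]],
                       relevant: List[Set[str]]) -> Tuple[float, float]:
    """Micro: pool hits over all queries. Macro: mean of per-query recall."""
    hits = [len(r & g) for r, g in zip(retrieved, relevant)]
    micro = sum(hits) / sum(len(g) for g in relevant)
    macro = sum(h / len(g) for h, g in zip(hits, relevant)) / len(relevant)
    return micro, macro

# Hypothetical toy hierarchy: parent -> list of children.
corrected = {"A": ["B", "C"], "D": ["E"]}
defective = {"A": ["B"], "D": ["E"]}          # the relation "C is-a A" is missing
patients = {"A": {"p1"}, "B": {"p2"}, "C": {"p3", "p4"},
            "D": {"p5"}, "E": {"p6"}}

queries = ["A", "D"]
relevant = [cohort(corrected, patients, q) for q in queries]
retrieved = [cohort(defective, patients, q) for q in queries]
micro, macro = micro_macro_recall(retrieved, relevant)
print(f"micro recall = {micro:.3f}, macro recall = {macro:.3f}")
```

In this toy case the query on A loses two of its four patients while the query on D is unaffected, so the micro-averaged recall is 4/6 and the macro-averaged recall is (0.5 + 1.0) / 2 = 0.75. Precision under an inaccurate is-a relation could be sketched the same way by dividing by the retrieved-set sizes instead.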
- 
            Abstract Background: Biomedical ontologies are representations of biomedical knowledge that provide terms with precisely defined meanings. They play a vital role in facilitating biomedical research in a cross-disciplinary manner. Quality issues in biomedical ontologies hinder their effective usage. One such quality issue is missing concepts. In this study, we introduce a logical definition-based approach to identify potential missing concepts in SNOMED CT. A unique contribution of our approach is that it is capable of obtaining both logical definitions and fully specified names for potential missing concepts. Method: The logical definitions of unrelated pairs of fully defined concepts in non-lattice subgraphs that indicate quality issues are intersected to generate the logical definitions of potential missing concepts. A text summarization model (called PEGASUS) is fine-tuned to predict the fully specified names of the potential missing concepts from their generated logical definitions. Furthermore, the identified potential missing concepts are validated using external resources including the Unified Medical Language System (UMLS), biomedical literature in PubMed, and a newer version of SNOMED CT. Results: From the March 2021 US Edition of SNOMED CT, we obtained a total of 30,313 unique logical definitions for potential missing concepts through the intersecting process. We fine-tuned a PEGASUS summarization model with 289,169 training instances and tested it on 36,146 instances. The model achieved a ROUGE-1 of 72.83, a ROUGE-2 of 51.06, and a ROUGE-L of 71.76 on the test dataset. The model correctly predicted 11,549 out of 36,146 fully specified names in the test dataset. Applying the fine-tuned model on the 30,313 unique logical definitions, 23,031 total potential missing concepts were identified. Out of these, a total of 2,312 (10.04%) were automatically validated by at least one of the three resources.
Conclusions: The results showed that our logical definition-based approach for identifying potential missing concepts in SNOMED CT is promising. Nevertheless, there is still room for improving the performance of naming concepts based on logical definitions.
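
The intersecting step can be pictured with a minimal sketch, assuming a logical definition is simplified to a flat set of (attribute, value) pairs. Real SNOMED CT definitions also involve parent concepts, role groups, and subsumption between values, which this sketch ignores. The example definitions and helper names are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

# Simplifying assumption: a logical definition is a flat set of
# (attribute, value) pairs.
Definition = FrozenSet[Tuple[str, str]]

def intersect_definitions(a: Definition, b: Definition) -> Definition:
    """Keep only the attribute-value pairs shared by both definitions."""
    return a & b

def is_candidate(definition: Definition, existing: Set[Definition]) -> bool:
    """A non-empty intersection not already defined by any existing concept."""
    return bool(definition) and definition not in existing

# Hypothetical, hierarchically unrelated, fully defined concepts.
concept_x: Definition = frozenset({
    ("Finding site", "Kidney structure"),
    ("Associated morphology", "Cyst"),
    ("Occurrence", "Congenital"),
})
concept_y: Definition = frozenset({
    ("Finding site", "Kidney structure"),
    ("Associated morphology", "Cyst"),
    ("Clinical course", "Chronic"),
})

existing_definitions: Set[Definition] = {concept_x, concept_y}
shared = intersect_definitions(concept_x, concept_y)
print(sorted(shared), is_candidate(shared, existing_definitions))
```

Here the shared pairs describe a more general concept (a cyst located in the kidney) that no existing definition covers, so it would be flagged as a potential missing concept and passed on for naming.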
- 
            Abstract Biomedical terminologies play a vital role in managing biomedical data. Missing IS-A relations in a biomedical terminology could be detrimental to its downstream usages. In this paper, we investigate an approach combining logical definitions and lexical features to discover missing IS-A relations in two biomedical terminologies: SNOMED CT and the National Cancer Institute (NCI) thesaurus. The method is applied to unrelated concept-pairs within non-lattice subgraphs: graph fragments within a terminology likely to contain various inconsistencies. Our approach first checks whether the logical definition of one concept is more general than that of the other concept. Then, we check whether the lexical features of the former concept are contained in those of the other concept. If both constraints are satisfied, we suggest a potentially missing IS-A relation between the two concepts. The method identified 982 potential missing IS-A relations for SNOMED CT and 100 for NCI thesaurus. To assess the efficacy of our approach, a random sample of results belonging to the “Clinical Findings” and “Procedure” subhierarchies of SNOMED CT and results belonging to the “Drug, Food, Chemical or Biomedical Material” subhierarchy of the NCI thesaurus were evaluated by domain experts. The evaluation results revealed that 118 out of 150 suggestions are valid for SNOMED CT and 17 out of 20 are valid for NCI thesaurus.
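
The two constraints can be sketched as a pair of subset tests. This is a simplified illustration, assuming a logical definition is a flat set of (attribute, value) pairs and lexical features are the lowercased words of a concept name; the abstract does not spell out either representation, and the concepts below are hypothetical.

```python
from typing import FrozenSet, Tuple

Definition = FrozenSet[Tuple[str, str]]

def words(name: str) -> FrozenSet[str]:
    """Lexical features, assumed here to be the lowercased words of a name."""
    return frozenset(name.lower().split())

def suggests_is_a(child_name: str, child_def: Definition,
                  parent_name: str, parent_def: Definition) -> bool:
    """Suggest 'child IS-A parent' when both constraints hold.

    1. The parent's definition is more general: its attribute-value pairs
       form a subset of the child's (simplifying assumption).
    2. The parent's lexical features are contained in the child's.
    """
    more_general = parent_def <= child_def
    lexically_contained = words(parent_name) <= words(child_name)
    return more_general and lexically_contained

# Hypothetical unrelated concept-pair.
parent = ("Fracture of femur", frozenset({
    ("Associated morphology", "Fracture"),
    ("Finding site", "Bone structure of femur"),
}))
child = ("Open fracture of femur", frozenset({
    ("Associated morphology", "Fracture"),
    ("Finding site", "Bone structure of femur"),
    ("Associated morphology", "Open wound"),
}))

print(suggests_is_a(child[0], child[1], parent[0], parent[1]))
print(suggests_is_a(parent[0], parent[1], child[0], child[1]))
```

The check is directional: swapping the arguments fails both constraints, so at most one direction of a pair is suggested unless the two concepts have identical definitions and word sets.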
- 
            Abstract Objective: SNOMED CT is the largest clinical terminology worldwide. Quality assurance of SNOMED CT is of utmost importance to ensure that it provides accurate domain knowledge to various SNOMED CT-based applications. In this work, we introduce a deep learning-based approach to uncover missing is-a relations in SNOMED CT. Materials and Methods: Our focus is to identify missing is-a relations between concept-pairs exhibiting a containment pattern (i.e., the set of words of one concept being a proper subset of that of the other concept). We use hierarchically related containment concept-pairs as positive instances and hierarchically unrelated containment concept-pairs as negative instances to train a model predicting whether an is-a relation exists between 2 concepts with a containment pattern. The model is a binary classifier leveraging concept name features, hierarchical features, enriched lexical attribute features, and logical definition features. We introduce a cross-validation-inspired approach to identify missing is-a relations among all hierarchically unrelated containment concept-pairs. Results: We trained and applied our model on the Clinical finding subhierarchy of SNOMED CT (September 2019 US edition). Our model (based on the validation sets) achieved a precision of 0.8164, recall of 0.8397, and F1 score of 0.8279. Applying the model to predict actual missing is-a relations, we obtained a total of 1661 potential candidates. Domain experts evaluated 230 randomly selected samples and verified that 192 (83.48%) are valid. Conclusions: The results showed that our deep learning approach is effective in uncovering missing is-a relations between containment concept-pairs in SNOMED CT.
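
The containment pattern and the reported F1 score can both be checked with a short sketch. It assumes concept names are tokenized by whitespace after lowercasing, which the abstract does not specify; the concept names are hypothetical, while the precision and recall values are the ones reported above.

```python
from typing import FrozenSet

def word_set(name: str) -> FrozenSet[str]:
    """Set of lowercased words in a concept name (assumed tokenization)."""
    return frozenset(name.lower().split())

def has_containment_pattern(name_a: str, name_b: str) -> bool:
    """True if one name's word set is a proper subset of the other's."""
    a, b = word_set(name_a), word_set(name_b)
    return a < b or b < a   # '<' on sets tests for a proper subset

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical concept names.
print(has_containment_pattern("Chronic kidney disease", "Kidney disease"))
print(has_containment_pattern("Kidney disease", "Liver disease"))

# Precision and recall reported in the abstract.
print(round(f1_score(0.8164, 0.8397), 4))
```

The harmonic mean of the reported precision and recall agrees with the reported F1 score of 0.8279 to four decimal places.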
- 
            Abstract Background: The Vaccine Ontology (VO) is a biomedical ontology that standardizes vaccine annotation. Errors in VO will affect the many applications that use it. Quality assurance of VO is imperative to ensure that it provides accurate domain knowledge to these downstream tasks. Manual review to identify and fix quality issues (such as missing hierarchical is-a relations) is challenging given the complexity of the ontology. Automated approaches are highly desirable to facilitate the quality assurance of VO. Methods: We developed an automated lexical approach that identifies potentially missing is-a relations in VO. First, we construct two types of VO concept-pairs: (1) linked; and (2) unlinked. Each concept-pair further derives an Acquired Term Pair (ATP) based on their lexical features. If the same ATP is obtained by a linked concept-pair and an unlinked concept-pair, this is considered to indicate a potentially missing is-a relation between the unlinked pair of concepts. Results: Applying this approach on the 1.1.192 version of VO, we were able to identify 232 potentially missing is-a relations. A manual review by a VO domain expert on a random sample of 70 potentially missing is-a relations revealed that 65 of the cases were valid missing is-a relations in VO (a precision of 92.86%). Conclusions: The results indicate that our approach is highly effective in identifying missing is-a relations in VO.
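
The matching of Acquired Term Pairs between linked and unlinked concept-pairs can be sketched as follows. The abstract does not define how an ATP is derived, so this sketch assumes, purely for illustration, that an ATP is the pair of word sets left after removing the words the two concept names share. The vaccine names are hypothetical and not taken from VO.

```python
from typing import FrozenSet, List, Tuple

Pair = Tuple[str, str]                       # (candidate child, candidate parent)
ATP = Tuple[FrozenSet[str], FrozenSet[str]]

def words(name: str) -> FrozenSet[str]:
    return frozenset(name.lower().split())

def acquired_term_pair(child: str, parent: str) -> ATP:
    """ASSUMED derivation: words unique to each name once shared words are removed."""
    c, p = words(child), words(parent)
    return (c - p, p - c)

def potential_missing_is_a(linked: List[Pair], unlinked: List[Pair]) -> List[Pair]:
    """Unlinked pairs whose ATP also arises from at least one linked pair."""
    known = {acquired_term_pair(c, p) for c, p in linked}
    return [(c, p) for c, p in unlinked if acquired_term_pair(c, p) in known]

# Hypothetical concept-pairs.
linked = [("inactivated polio vaccine", "polio vaccine")]
unlinked = [
    ("inactivated rabies vaccine", "rabies vaccine"),
    ("rabies vaccine", "polio vaccine"),
]

suggestions = potential_missing_is_a(linked, unlinked)
print(suggestions)
```

Under this assumption the first unlinked pair shares the ATP ({"inactivated"}, {}) with the linked pair and is suggested, while the second pair yields a different ATP and is not.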
- 
            Free, publicly-accessible full text available November 29, 2025
 An official website of the United States government