Title: Ontology extension by online clustering with large language model agents
An ontology is a structured framework that categorizes entities, concepts, and relationships within a domain to facilitate shared understanding, and it is important in computational linguistics and knowledge representation. In this paper, we propose a novel framework to automatically extend an existing ontology from streaming data in a zero-shot manner. Specifically, the zero-shot ontology extension framework uses online and hierarchical clustering to integrate new knowledge into existing ontologies without substantial annotated data or domain-specific expertise. Focusing on the medical field, this approach leverages Large Language Models (LLMs) for two key tasks: Symptom Typing and Symptom Taxonomy among breast and bladder cancer survivors. Symptom Typing involves identifying and classifying medical symptoms from unstructured online patient forum data, while Symptom Taxonomy organizes and integrates these symptoms into an existing ontology. The combined use of online and hierarchical clustering enables real-time and structured categorization and integration of symptoms. The dual-phase model employs multiple LLMs to ensure accurate classification and seamless integration of new symptoms with minimal human oversight. The paper details the framework's development, experiments, quantitative analyses, and data visualizations, demonstrating its effectiveness in enhancing medical ontologies and advancing knowledge-based systems in healthcare.
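To make the two-phase pipeline concrete, here is a minimal sketch, not the authors' released code, of how Symptom Typing and online/hierarchical taxonomy integration could be wired together. The sentence encoder, the similarity threshold, and the pass-through typing stub are illustrative assumptions; in the paper the typing step is performed by LLM agents.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

# Any sentence encoder works; this particular model is an assumption.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def llm_type_symptom(post: str) -> str:
    # Symptom Typing: the paper uses LLMs to turn a raw forum post into a
    # canonical symptom phrase; this stub just passes the text through.
    return post.strip().lower()

def extend_ontology(stream, ontology_labels, threshold=0.75):
    """Symptom Taxonomy: online assignment to existing nodes, then a
    hierarchical pass that groups the leftovers into candidate new nodes."""
    node_vecs = encoder.encode(ontology_labels)
    assignments, unmatched = [], []
    for post in stream:
        symptom = llm_type_symptom(post)
        sims = cosine_similarity(encoder.encode([symptom]), node_vecs)[0]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            assignments.append((symptom, ontology_labels[best]))  # fits an existing node
        else:
            unmatched.append(symptom)                             # defer to clustering
    if len(unmatched) > 1:
        # Hierarchical step: group leftover symptoms into candidate new branches.
        clusters = AgglomerativeClustering(
            n_clusters=None, distance_threshold=1 - threshold,
            metric="cosine", linkage="average",
        ).fit_predict(encoder.encode(unmatched))
        assignments += [(s, f"candidate_node_{c}") for s, c in zip(unmatched, clusters)]
    return assignments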
Award ID(s): 2403312, 2318831
PAR ID: 10588490
Publisher / Repository: Frontiers
Journal Name: Frontiers in Big Data
Volume: 7
ISSN: 2624-909X
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Baeza-Yates, Ricardo; Bonchi, Francesco (Ed.)
    Massive amounts of unstructured text data are generated daily, ranging from news articles to scientific papers, and how to mine structured knowledge from them remains a crucial research question. Recently, large language models (LLMs) have shed light on the text mining field with their superior text understanding and instruction-following ability. There are typically two ways of utilizing LLMs: fine-tuning them with human-annotated training data, which is labor-intensive and hard to scale, or prompting them in a zero-shot or few-shot way, which cannot take advantage of the useful information in the massive text data. Automated mining of structured knowledge from massive text data therefore remains a challenge in the era of large language models. In this tutorial, we cover recent advancements in mining structured knowledge using language models with very weak supervision. We introduce the following topics: (1) introduction to large language models, which serve as the foundation for recent text mining tasks; (2) ontology construction, which automatically enriches an ontology from a massive corpus; (3) weakly supervised text classification in flat and hierarchical label spaces; and (4) weakly supervised information extraction, which extracts entity and relation structures.
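    As a purely illustrative example of the zero-shot prompting route contrasted with fine-tuning above (not taken from the tutorial itself), the sketch below classifies a document by placing the label set directly in the prompt; the call_llm stub stands in for whatever chat-completion client is available.

def call_llm(prompt: str) -> str:
    # Placeholder: wire up whichever LLM API you use.
    raise NotImplementedError

def zero_shot_classify(document: str, labels: list[str]) -> str:
    # The label set goes straight into the prompt; no training data is needed.
    prompt = (
        "Classify the document into exactly one of these labels: "
        + ", ".join(labels)
        + f"\n\nDocument: {document}\nLabel:"
    )
    answer = call_llm(prompt).strip().lower()
    # Fall back to the first label if the model answers off-list.
    return next((label for label in labels if label in answer), labels[0])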
    Extracting structured metadata from unstructured text in different domains is gaining strong attention from multiple research communities. In Political Science, such metadata play a significant role in studying intra- and inter-state interactions between political entities. The process of extracting such metadata usually relies on domain-specific ontologies and knowledge-based repositories. In particular, Political Scientists regularly use the well-defined CAMEO ontology, which is designed for capturing conflict and mediation relations. Since CAMEO repositories are currently human-maintained, the high cost and extensive human effort associated with updating them make it difficult to include new entries on a regular basis. This paper introduces HANKE: an innovative framework for automatically extracting knowledge representations from unstructured sources in order to extend the CAMEO ontology both within its own domain and toward other related domains in political science. HANKE combines Hierarchical Attention Networks, as the engine for identifying relevant structures in raw text, with the novel Frequency-Based Ranker approach to obtain a collection of candidate entries for CAMEO's repositories. To show the efficiency of the proposed framework, we evaluate its performance on capturing existing CAMEO representations in a soft-labelled dataset. We also empirically demonstrate the versatility and superiority of the HANKE method by applying it to two case studies related to CAMEO extension in its original domain and toward the organized-crime domain.
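    The abstract does not spell out the Frequency-Based Ranker's scoring function, so the sketch below only illustrates the general idea it names: rank the candidate entries an extractor proposes by how often, and in how many documents, they occur, and keep the top candidates for the ontology's repositories. The scoring formula and example phrases are assumptions, not the authors' method.

from collections import Counter

def rank_candidates(extracted_per_doc, top_k=10):
    """extracted_per_doc: list of candidate-phrase lists, one list per document."""
    doc_freq = Counter()     # number of documents a phrase appears in
    total_freq = Counter()   # total number of mentions across the corpus
    for phrases in extracted_per_doc:
        total_freq.update(phrases)
        doc_freq.update(set(phrases))
    # Illustrative score: favor phrases that are both frequent and widespread.
    scores = {p: total_freq[p] * doc_freq[p] for p in total_freq}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(rank_candidates([
    ["impose sanctions", "hold talks"],
    ["impose sanctions", "sign treaty"],
    ["hold talks", "impose sanctions"],
]))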
  3. Baeza-Yates, Ricardo; Bonchi, Francesco (Ed.)
    Fine-grained entity typing (FET) is the task of identifying specific entity types at a fine-grained level for entity mentions based on their contextual information. Conventional methods for FET require extensive human annotation, which is time-consuming and costly given the massive scale of data. Recent studies have been developing weakly supervised or zero-shot approaches. We study the setting of zero-shot FET where only an ontology is provided. However, most existing ontology structures lack rich supporting information and even contain ambiguous relations, making them ineffective in guiding FET. Recently developed language models, though promising in various few-shot and zero-shot NLP tasks, may face challenges in zero-shot FET due to their lack of interaction with the task-specific ontology. In this study, we propose OnEFET, where we (1) enrich each node in the ontology structure with two categories of extra information: instance information for training sample augmentation and topic information to relate types with contexts, and (2) develop a coarse-to-fine typing algorithm that exploits the enriched information by training an entailment model with contrasting topics and instance-based augmented training samples. Our experiments show that OnEFET achieves high-quality fine-grained entity typing without human annotation, outperforming existing zero-shot methods by a large margin and rivaling supervised methods. OnEFET also enjoys strong transferability to unseen and finer-grained types. Code is available at https://github.com/ozyyshr/OnEFET.
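    A minimal sketch of the coarse-to-fine entailment idea described above, using a generic off-the-shelf NLI model through Hugging Face's zero-shot classification pipeline as a stand-in for OnEFET's trained entailment model; the toy two-level ontology and hypothesis template are illustrative assumptions, and the instance/topic enrichment the paper describes is omitted.

from transformers import pipeline

# Generic NLI model standing in for the trained entailment model (assumption).
nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Toy two-level ontology: coarse type -> fine-grained children (assumption).
ontology = {
    "person": ["athlete", "politician", "artist"],
    "organization": ["company", "government agency", "sports team"],
}

def type_mention(sentence: str, mention: str):
    hypothesis = f"In this sentence, {mention} is a {{}}."
    # Coarse pass over the top-level types, then a fine pass over the winner's children.
    coarse = nli(sentence, list(ontology), hypothesis_template=hypothesis)["labels"][0]
    fine = nli(sentence, ontology[coarse], hypothesis_template=hypothesis)["labels"][0]
    return coarse, fine

print(type_mention("Messi joined Inter Miami last year.", "Messi"))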
  4. Baeza-Yates, Ricardo; Bonchi, Francesco (Ed.)
    Fine-grained entity typing (FET), which assigns context-sensitive, fine-grained semantic types to entities in text, is a basic but important task for knowledge extraction from unstructured text. FET has been studied extensively in natural language processing and typically relies on human-annotated corpora for training, which is costly and difficult to scale. Recent studies explore the use of pre-trained language models (PLMs) as a knowledge base to generate rich and context-aware weak supervision for FET. However, a PLM still requires direction and guidance to serve as a knowledge base, because it often generates a mixture of coarse and fine-grained types, or tokens unsuitable for typing. In this study, we envision that an ontology provides a semantics-rich, hierarchical structure that helps select the best results generated by multiple PLMs and head words. Specifically, we propose a novel annotation-free, ontology-guided FET method, ONTOTYPE, which follows the type ontology from coarse to fine, ensembles multiple PLM prompting results to generate a set of type candidates, and refines its type resolution under the local context with a natural language inference model. Our experiments on the OntoNotes, FIGER, and NYT datasets using their associated ontological structures demonstrate that our method outperforms state-of-the-art zero-shot fine-grained entity typing methods as well as a typical LLM method, ChatGPT. Our error analysis shows that refinement of the existing ontology structures will further improve fine-grained entity typing.
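    To illustrate the kind of PLM prompting that gets ensembled here (this is not the authors' code), the sketch below has several masked language models each guess a head word for a mention and pools their votes; the ontology-guided refinement with an NLI model described above is omitted, and the model choices and prompt template are assumptions.

from collections import Counter
from transformers import pipeline

PLMS = ["bert-base-cased", "roberta-base"]  # any masked LMs; the choice is an assumption

def candidate_types(sentence: str, mention: str, top_k: int = 5):
    votes = Counter()
    for name in PLMS:
        fill = pipeline("fill-mask", model=name)
        # Prompt the PLM for a head word: "<sentence> <mention> is a [MASK]."
        prompt = f"{sentence} {mention} is a {fill.tokenizer.mask_token}."
        for pred in fill(prompt, top_k=top_k):
            votes[pred["token_str"].strip().lower()] += pred["score"]
    # The pooled candidates would next be resolved against the type ontology.
    return votes.most_common(top_k)

print(candidate_types("Messi joined Inter Miami last year.", "Messi"))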
    Large Language Models (LLMs) are pre-trained on large-scale corpora and excel at numerous general natural language processing (NLP) tasks, such as question answering (QA). Despite their advanced language capabilities, when it comes to domain-specific and knowledge-intensive tasks, LLMs suffer from hallucinations, knowledge cut-offs, and a lack of knowledge attribution. Additionally, fine-tuning LLMs' intrinsic knowledge for highly specific domains is an expensive and time-consuming process. Retrieval-augmented generation (RAG) has recently emerged as a method for optimizing LLM responses by grounding them in a predetermined ontology. It has been shown that using a Knowledge Graph (KG) ontology for RAG improves QA accuracy by taking into account relevant sub-graphs that preserve the information in a structured manner. In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework that integrates RAG with a KG and a vector store (VS) storing factual, domain-specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, but via NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific (i) KG (containing structured information) and (ii) VS (containing unstructured information) enables the development of domain-specific chatbots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel at highly domain-specific question answering tasks. We pair SMART-SLIC with chain-of-thought prompting agents. The framework is designed to be generalizable to any specific or specialized domain. We demonstrate the question answering capabilities of our framework on a corpus of scientific publications on malware analysis and anomaly detection.
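    A minimal, runnable sketch of the retrieval step such a framework relies on, not the SMART-SLIC implementation: a small TF-IDF "vector store" of unstructured passages stands in for the real embeddings, a few triples stand in for the knowledge graph, and the hits are assembled into a source-attributed prompt for whatever LLM sits downstream. The example corpus and the naive lexical KG lookup are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_prompt(question, passages, kg_triples, k=2):
    # Vector-store side: top-k passages by similarity to the question.
    vec = TfidfVectorizer().fit(passages + [question])
    sims = cosine_similarity(vec.transform([question]), vec.transform(passages))[0]
    hits = [passages[i] for i in sims.argsort()[::-1][:k]]
    # KG side: naive lexical lookup; the real system retrieves relevant sub-graphs.
    facts = [f"{s} --{r}--> {o}" for s, r, o in kg_triples
             if s.lower() in question.lower() or o.lower() in question.lower()]
    sources = [f"[passage] {h}" for h in hits] + [f"[kg] {f}" for f in facts]
    return ("Answer using only the sources below and cite which one you used.\n"
            + "\n".join(sources) + f"\n\nQuestion: {question}\nAnswer:")

print(build_prompt(
    "Which malware family uses DGA domains?",
    ["Conficker generates pseudo-random DGA domains daily.",
     "Stuxnet targeted industrial control systems."],
    [("Conficker", "uses", "DGA domains")]))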