skip to main content


Title: Enumerating consistent sub-graphs of directed acyclic graphs: an insight into biomedical ontologies
Abstract Motivation

Modern problems of concept annotation associate an object of interest (gene, individual, text document) with a set of interrelated textual descriptors (functions, diseases, topics), often organized in concept hierarchies or ontologies. Most ontology can be seen as directed acyclic graphs (DAGs), where nodes represent concepts and edges represent relational ties between these concepts. Given an ontology graph, each object can only be annotated by a consistent sub-graph; that is, a sub-graph such that if an object is annotated by a particular concept, it must also be annotated by all other concepts that generalize it. Ontologies therefore provide a compact representation of a large space of possible consistent sub-graphs; however, until now we have not been aware of a practical algorithm that can enumerate such annotation spaces for a given ontology.

Results

We propose an algorithm for enumerating consistent sub-graphs of DAGs. The algorithm recursively partitions the graph into strictly smaller graphs until the resulting graph becomes a rooted tree (forest), for which a linear-time solution is computed. It then combines the tallies from graphs created in the recursion to obtain the final count. We prove the correctness of this algorithm, propose several practical accelerations, evaluate it on random graphs and then apply it to characterize four major biomedical ontologies. We believe this work provides valuable insights into the complexity of concept annotation spaces and its potential influence on the predictability of ontological annotation.

Availability and implementation

https://github.com/shawn-peng/counting-consistent-sub-DAG

Supplementary information

Supplementary data are available at Bioinformatics online.

 
more » « less
NSF-PAR ID:
10413615
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
34
Issue:
13
ISSN:
1367-4803
Page Range / eLocation ID:
p. i313-i322
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Abstract

    Extracting an individual’s knowledge structure is a challenging task as it requires formalization of many concepts and their interrelationships. While there has been significant research on how to represent knowledge to support computational design tasks, there is limited understanding of the knowledge structures of human designers. This understanding is necessary for comprehension of cognitive tasks such as decision making and reasoning, and for improving educational programs. In this paper, we focus on quantifying theory-based causal knowledge, which is a specific type of knowledge held by human designers. We develop a probabilistic graph-based model for representing individuals’ concept-specific causal knowledge for a given theory. We propose a methodology based on probabilistic directed acyclic graphs (DAGs) that uses logistic likelihood function for calculating the probability of a correct response. The approach involves a set of questions for gathering responses from 205 engineering students, and a hierarchical Bayesian approach for inferring individuals’ DAGs from the observed responses. We compare the proposed model to a baseline three-parameter logistic (3PL) model from the item response theory. The results suggest that the graph-based logistic model can estimate individual students’ knowledge graphs. Comparisons with the 3PL model indicate that knowledge assessment is more accurate when quantifying knowledge at the level of causal relations than quantifying it using a scalar ability parameter. The proposed model allows identification of parts of the curriculum that a student struggles with and parts they have already mastered which is essential for remediation.

     
    more » « less
  2. Extracting an individual’s knowledge structure is a challenging task as it requires formalization of many concepts and their interrelationships. While there has been significant research on how to represent knowledge to support computational design tasks, there is limited understanding of the knowledge structures of human designers. This understanding is necessary for comprehension of cognitive tasks such as decision making and reasoning, and for improving educational programs. In this paper, we focus on quantifying theory-based causal knowledge, which is a specific type of knowledge held by human designers. We develop a probabilistic graph-based model for representing individuals’ concept-specific causal knowledge for a given theory. We propose a methodology based on probabilistic directed acyclic graphs (DAGs) that uses logistic likelihood function for calculating the probability of a correct response. The approach involves a set of questions for gathering responses from 205 engineering students, and a hierarchical Bayesian approach for inferring individuals’ DAGs from the observed responses. We compare the proposed model to a baseline three-parameter logistic (3PL) model from the item response theory. The results suggest that the graph-based logistic model can estimate individual students’ knowledge graphs. Comparisons with the 3PL model indicate that knowledge assessment is more accurate when quantifying knowledge at the level of causal relations than quantifying it using a scalar ability parameter. The proposed model allows identification of parts of the curriculum that a student struggles with and parts they have already mastered which is essential for remediation. 
    more » « less
  3. Abstract Motivation

    The size of a genome graph—the space required to store the nodes, node labels and edges—affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. This raises the need for approaches to construct space-efficient genome graphs.

    Results

    We point out similarities in the string encoding mechanisms of genome graphs and the external pointer macro (EPM) compression model. We present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. The algorithms result in an upper bound on the size of the genome graph constructed in terms of an optimal EPM compression. To further reduce the size of the genome graph, we propose the source assignment problem that optimizes over the equivalent choices during compression and introduce an ILP formulation that solves that problem optimally. As a proof-of-concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel–Ziv algorithm. Using RLZ-Graph, across all human chromosomes, we are able to reduce the disk space to store a genome graph on average by 40.7% compared to colored compacted de Bruijn graphs constructed by Bifrost under the default settings. The RLZ-Graph scales well in terms of running time and graph sizes with an increasing number of human genome sequences compared to Bifrost and variation graphs produced by VGtoolkit.

    Availability

    The RLZ-Graph software is available at: https://github.com/Kingsford-Group/rlzgraph.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  4. Background: When phenotypic characters are described in the literature, they may be constrained or clarified with additional information such as the location or degree of expression, these terms are called “modifiers”. With effort underway to convert narrative character descriptions to computable data, ontologies for such modifiers are needed. Such ontologies can also be used to guide term usage in future publications. Spatial and method modifiers are the subjects of ontologies that already have been developed or are under development. In this work, frequency (e.g., rarely, usually), certainty (e.g., probably, definitely), degree (e.g., slightly, extremely), and coverage modifiers (e.g., sparsely, entirely) are collected, reviewed, and used to create two modifier ontologies with different design considerations. The basic goal is to express the sequential relationships within a type of modifiers, for example, usually is more frequent than rarely, in order to allow data annotated with ontology terms to be classified accordingly. Method: Two designs are proposed for the ontology, both using the list pattern: a closed ordered list (i.e., five-bin design) and an open ordered list design. The five-bin design puts the modifier terms into a set of 5 fixed bins with interval object properties, for example, one_level_more/less_frequently_than, where new terms can only be added as synonyms to existing classes. The open list approach starts with 5 bins, but supports the extensibility of the list via ordinal properties, for example, more/less_frequently_than, allowing new terms to be inserted as a new class anywhere in the list. The consequences of the different design decisions are discussed in the paper. CharaParser was used to extract modifiers from plant, ant, and other taxonomic descriptions. After a manual screening, 130 modifier words were selected as the candidate terms for the modifier ontologies. Four curators/experts (three biologists and one information scientist specialized in biosemantics) reviewed and categorized the terms into 20 bins using the Ontology Term Organizer (OTO) (http://biosemantics.arizona.edu/OTO). Inter-curator variations were reviewed and expressed in the final ontologies. Results: Frequency, certainty, degree, and coverage terms with complete agreement among all curators were used as class labels or exact synonyms. Terms with different interpretations were either excluded or included using “broader synonym” or “not recommended” annotation properties. These annotations explicitly allow for the user to be aware of the semantic ambiguity associated with the terms and whether they should be used with caution or avoided. Expert categorization results showed that 16 out of 20 bins contained terms with full agreements, suggesting differentiating the modifiers into 5 levels/bins balances the need to differentiate modifiers and the need for the ontology to reflect user consensus. Two ontologies, developed using the Protege ontology editor, are made available as OWL files and can be downloaded from https://github.com/biosemantics/ontologies. Contribution: We built the first two modifier ontologies following a consensus-based approach with terms commonly used in taxonomic literature. The five-bin ontology has been used in the Explorer of Taxon Concepts web toolkit to compute the similarity between characters extracted from literature to facilitate taxon concepts alignments. The two ontologies will also be used in an ontology-informed authoring tool for taxonomists to facilitate consistency in modifier term usage. 
    more » « less
  5. Agrawal, Garima (Ed.)
    Cybersecurity education is exceptionally challenging as it involves learning the complex attacks; tools and developing critical problem-solving skills to defend the systems. For a student or novice researcher in the cybersecurity domain, there is a need to design an adaptive learning strategy that can break complex tasks and concepts into simple representations. An AI-enabled automated cybersecurity education system can improve cognitive engagement and active learning. Knowledge graphs (KG) provide a visual representation in a graph that can reason and interpret from the underlying data, making them suitable for use in education and interactive learning. However, there are no publicly available datasets for the cybersecurity education domain to build such systems. The data is present as unstructured educational course material, Wiki pages, capture the flag (CTF) writeups, etc. Creating knowledge graphs from unstructured text is challenging without an ontology or annotated dataset. However, data annotation for cybersecurity needs domain experts. To address these gaps, we made three contributions in this paper. First, we propose an ontology for the cybersecurity education domain for students and novice learners. Second, we develop AISecKG, a triple dataset with cybersecurity-related entities and relations as defined by the ontology. This dataset can be used to construct knowledge graphs to teach cybersecurity and promote cognitive learning. It can also be used to build downstream applications like recommendation systems or self-learning question-answering systems for students. The dataset would also help identify malicious named entities and their probable impact. Third, using this dataset, we show a downstream application to extract custom-named entities from texts and educational material on cybersecurity. 
    more » « less