Title: Towards evaluating complex ontology alignments
Abstract The development of semi-automated and automated ontology alignment techniques is an important part of realizing the potential of the Semantic Web. Until very recently, most existing work in this area was focused on finding simple (1:1) equivalence correspondences between two ontologies. However, many real-world ontology pairs involve correspondences that contain multiple entities from each ontology. These ‘complex’ alignments pose a challenge for existing evaluation approaches, which hinders the development of new systems capable of finding such correspondences. This position paper surveys and analyzes the requirements for effective evaluation of complex ontology alignments and assesses the degree to which these requirements are met by existing approaches. It also provides a roadmap for future work on this topic taking into consideration emerging community initiatives and major challenges that need to be addressed.
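To make the distinction in the abstract concrete, the following is a minimal sketch of how simple (1:1) and complex correspondences might be represented. The class, the entity names (`o1:Paper`, `o2:hasDecision`, and so on) and the rule that a correspondence counts as complex when either side names more than one entity are illustrative assumptions, not definitions taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correspondence:
    """A correspondence between entity sets of two ontologies (hypothetical model)."""
    source: frozenset  # entities drawn from the first ontology
    target: frozenset  # entities drawn from the second ontology
    relation: str = "equivalence"

    def is_complex(self) -> bool:
        # Assumption: anything beyond one entity on each side is "complex".
        return len(self.source) > 1 or len(self.target) > 1


# A simple 1:1 equivalence between two classes (hypothetical entity names).
simple = Correspondence(frozenset({"o1:Paper"}), frozenset({"o2:Article"}))

# A complex correspondence: one class on the left is expressed on the right
# by a class restricted through a property and a second class.
complex_ = Correspondence(
    frozenset({"o1:AcceptedPaper"}),
    frozenset({"o2:Paper", "o2:hasDecision", "o2:Acceptance"}),
)

print(simple.is_complex(), complex_.is_complex())
```

Under this representation, two complex correspondences can overlap without being identical, which hints at why evaluation by exact matching against a reference alignment becomes harder than in the 1:1 case.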
Award ID(s):
1936677
NSF-PAR ID:
10191791
Journal Name:
The Knowledge Engineering Review
Volume:
35
ISSN:
0269-8889
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. It takes great effort to manually or semi-automatically convert free-text phenotype narratives (e.g., morphological descriptions in taxonomic works) to a computable format before they can be used in large-scale analyses. We argue that neither a manual curation approach nor an information extraction approach based on machine learning is a sustainable solution to produce computable phenotypic data that are FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016). This is because these approaches do not scale to all biodiversity, and they do not stop the publication of free-text phenotypes that would need post-publication curation. In addition, both manual and machine learning approaches face great challenges: the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other) in manual curation, and keywords to ontology concept translation in automated information extraction, make it difficult for either approach to produce data that are truly FAIR. Our empirical studies show that inter-curator variation in translating phenotype characters to Entity-Quality statements (Mabee et al. 2007) is as high as 40% even within a single project. With this level of variation, curated data integrated from multiple curation projects may still not be FAIR. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardized vocabularies (ontologies). We argue that the authors describing characters are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of the descriptions from the moment of publication. 
In this presentation, we will introduce the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists, which consists of three components: a web-based, ontology-aware software application called 'Character Recorder,' which features a spreadsheet as the data entry platform and provides authors with the flexibility of using their preferred terminology in recording characters for a set of specimens (this application also facilitates semantic clarity and consistency across species descriptions); a set of services that produce RDF graph data, collect terms added by authors, detect potential conflicts between terms, dispatch conflicts to the third component, and update the ontology with resolutions; and an Android mobile application, 'Conflict Resolver,' which displays ontological conflicts and accepts solutions proposed by multiple experts. Fig. 1 shows the system diagram of the platform. 
The presentation will consist of: a report on the findings from a recent survey of 90+ participants on the need for a tool like Character Recorder; a methods section that describes how we provide semantics to an existing vocabulary of quantitative characters through a set of properties that explain where and how a measurement (e.g., length of perigynium beak) is taken, and how a custom color palette of RGB values obtained from real specimens or high-quality specimen images can be used to help authors choose standardized color descriptions for plant specimens; and a software demonstration, where we show how Character Recorder and Conflict Resolver can work together to construct both human-readable descriptions and RDF graphs using morphological data derived from species in the plant genus Carex (sedges). 
The key difference of this system from other ontology-aware systems is that authors can directly add needed terms to the ontology as they wish and can update their data according to ontology updates. The software modules currently incorporated in Character Recorder and Conflict Resolver have undergone formal usability studies. We are actively recruiting Carex experts to participate in a 3-day usability study of the entire system of the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists. Participants will use the platform to record 100 characters about one Carex species. In addition to usability data, we will collect the terms that participants submit to the underlying ontology and the data related to conflict resolution. Such data allow us to examine the types and the quantities of logical conflicts that may result from the terms added by the users and to use Discrete Event Simulation models to understand if and how term additions and conflict resolutions converge. We look forward to a discussion on how the tools (Character Recorder is online at http://shark.sbs.arizona.edu/chrecorder/public) described in our presentation can contribute to producing and publishing FAIR data in taxonomic studies. 
  2. Tang, P.; Grau, D.; El Asmar, M. (Ed.)
    Existing automated code checking (ACC) systems require the extraction of requirements from regulatory textual documents into computer-processable rule representations. The information extraction processes in those ACC systems are based on human interpretation, manual annotation, or predefined automated information extraction rules. Despite the high performance they have shown, rule-based information extraction approaches, by nature, lack sufficient scalability: the rules typically need some level of adaptation if the characteristics of the text change. Machine learning-based methods, instead of relying on hand-crafted rules, automatically capture the underlying patterns of the existing training text and have a great capability of generalizing to a variety of texts. A more scalable, machine learning-based approach is thus needed to achieve more robust performance across different types of codes/documents when automatically generating semantically-enriched building-code sentences for the purpose of ACC. To address this need, this paper proposes a machine learning-based approach for generating semantically-enriched building-code sentences, which are annotated syntactically and semantically, to support information extraction (IE). For improved robustness and scalability, the proposed approach uses transfer learning strategies to train deep neural network models on both general-domain and domain-specific data. The proposed approach consists of four steps: (1) data preparation and preprocessing; (2) development of a base deep neural network model for generating semantically-enriched building-code sentences; (3) model training using transfer learning strategies; and (4) model evaluation. The proposed approach was evaluated on a corpus of sentences from the 2009 International Building Code (IBC) and the Champaign 2015 IBC Amendments. The preliminary results show that the proposed approach achieved an optimal precision of 88%, recall of 86%, and F1-measure of 87%, indicating good performance. 
  3. Most of the existing automated code compliance checking (ACC) methods are unable to fully automatically convert complex building-code requirements into computer-processable forms. Such complex requirements usually have hierarchically complex clause and sentence structures. There is, thus, a need to decompose such complex requirements into hierarchies of much smaller, manageable requirement units that would be processable using most of the existing ACC methods. Rule-based methods have been used to deal with such complex requirements and have achieved high performance. However, they lack scalability, because the rules are developed manually and need to be updated and/or adapted when applied to a different type of building code. More research is, thus, needed to develop a scalable method to automatically convert the complex requirements into hierarchies of requirement units to facilitate the succeeding steps of ACC such as information extraction and compliance reasoning. To address this need, this paper proposes a new, machine learning-based method to automatically extract requirement hierarchies from building codes. The proposed method consists of five main steps: (1) data preparation and preprocessing; (2) data adaptation; (3) deep neural network model training for dependency parsing; (4) automated requirement segmentation and restriction interpretation based on the extracted dependencies; and (5) evaluation. The proposed method was trained using the English Treebank data and was tested on sentences from the 2009 International Building Code (IBC) and the Champaign 2015 IBC Amendments. The preliminary results show that the proposed method achieved an average normalized edit distance of 0.32, a precision of 89%, a recall of 76%, and an F1-measure of 82%, which indicates good requirement hierarchy extraction performance. 
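    As a quick consistency check on the figures reported in the two building-code abstracts above, the sketch below recomputes the F1-measure as the harmonic mean of precision and recall. The function name is hypothetical, and the abstracts do not state how their scores were averaged, so this only checks that the rounded values agree with the standard formula.

```python
def f1_measure(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the standard F1)."""
    if precision + recall == 0:
        return 0.0  # convention for the degenerate case
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the two abstracts above.
reported = {
    "semantically-enriched sentence generation": (0.88, 0.86),
    "requirement hierarchy extraction": (0.89, 0.76),
}

for task, (p, r) in reported.items():
    print(f"{task}: F1 = {f1_measure(p, r):.2f}")
```

    Because the harmonic mean is pulled toward the smaller of its two inputs, the second result sits closer to its 76% recall than a simple average of 89% and 76% would.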
  4. Why the new findings matter

    The process of teaching and learning is complex, multifaceted and dynamic. This paper contributes a seminal resource to highlight the digitisation of the educational sciences by demonstrating how new machine learning methods can be effectively and reliably used in research, education and practical application.

    Implications for educational researchers and policy makers

    The progressing digitisation of societies around the globe and the impact of the SARS‐CoV‐2 pandemic have highlighted the vulnerabilities and shortcomings of educational systems. These developments have put the need for effective educational processes, which can support sometimes overwhelmed teachers in imparting knowledge digitally, on the agenda of many governments and policy makers. Educational scientists, corporate partners and stakeholders can make use of machine learning techniques to develop advanced, scalable educational processes that account for individual needs of learners and that can complement and support existing learning infrastructure. The proper use of machine learning methods can contribute essential applications to the educational sciences, such as (semi‐)automated assessments, algorithmic‐grading, personalised feedback and adaptive learning approaches. However, these promises are strongly tied to at least a basic understanding of the concepts of machine learning and a degree of data literacy, which has to become the standard in education and the educational sciences.

    Demonstrating both the promises and the challenges that are inherent to the collection and the analysis of large educational data with machine learning, this paper covers the essential topics that their application requires and provides easy‐to‐follow resources and code to facilitate the process of adoption.

     
  5. In the past decade, academia and industry have embraced machine learning (ML) for database management system (DBMS) automation. These efforts have focused on designing ML models that predict DBMS behavior to support picking actions (e.g., building indexes) that improve the system's performance. Recent developments in ML have created automated methods for finding good models. Such advances shift the bottleneck from DBMS model design to obtaining the training data necessary for building these models. But generating good training data is challenging and requires encoding subject matter expertise into DBMS instrumentation. Existing methods for training data collection are bespoke to individual DBMS components and do not account for (1) how workload trends affect the system and (2) the subtle interactions between internal system components. Consequently, the models created from this data do not support holistic tuning across subsystems and require frequent retraining to boost their accuracy. This paper presents the architecture of a database gym, an integrated environment that provides a unified API of pluggable components for obtaining high-quality training data. The goal of a database gym is to simplify ML model training and evaluation to accelerate autonomous DBMS research. But unlike gyms in other domains that rely on custom simulators, a database gym uses the DBMS itself to create simulation environments for ML training. Thus, we discuss and prescribe methods for overcoming challenges in DBMS simulation, which include demanding requirements for performance, simulation fidelity, and DBMS-generated hints for guiding training processes. 