

Title: A Decentralized Environment for Biomedical Semantic Content Authoring and Publishing
The portable document format (PDF) is currently one of the most popular formats for sharing biomedical information offline. Recently, HTML-based formats for web-first biomedical information sharing have gained popularity. However, literature search engines such as Google Scholar require machine-interpretable information to index articles in a context-aware manner for accurate biomedical literature searches. The lack of technological infrastructure for adding machine-interpretable metadata to the expanding body of biomedical information, however, renders it unreachable to search engines. We therefore developed a portable technical infrastructure ("goSemantically") and packaged it as a Google Docs add-on. goSemantically assists authors in adding machine-interpretable metadata at the terminology and document-structure levels while authoring biomedical content. It leverages NCBO BioPortal resources and introduces a mechanism to annotate biomedical information with relevant machine-interpretable metadata (semantic vocabularies). It also acquires schema.org meta tags designed for search engine optimization and tailored to biomedical information. Individual authors can thus conveniently author and publish biomedical content in a truly decentralized fashion. Users can also export and host content, together with the relevant machine-interpretable metadata (semantic vocabularies), in interoperable formats such as HTML and JSON-LD. To experience the described features, run the add-on within Google Docs.
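The add-on's internals are not reproduced here, but the pipeline the abstract describes (BioPortal annotation followed by schema.org JSON-LD export) can be sketched. The Python below is a minimal approximation: the helper names are invented, and it assumes a free NCBO BioPortal API key; the Annotator endpoint and the `annotatedClass` response field belong to BioPortal's public REST API.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a free API key obtained from bioportal.bioontology.org
BIOPORTAL_API_KEY = "YOUR_API_KEY"

def annotate(text: str) -> list:
    """Ask the NCBO BioPortal Annotator for ontology classes found in `text`."""
    query = urllib.parse.urlencode({"text": text, "apikey": BIOPORTAL_API_KEY})
    with urllib.request.urlopen(f"https://data.bioontology.org/annotator?{query}") as resp:
        return json.load(resp)

def to_jsonld(title: str, abstract: str, term_iris: list) -> str:
    """Wrap annotated content in schema.org ScholarlyArticle JSON-LD,
    the kind of interoperable export format the paper describes."""
    doc = {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "headline": title,
        "abstract": abstract,
        # each BioPortal class IRI becomes a DefinedTerm search engines can index
        "about": [{"@type": "DefinedTerm", "@id": iri} for iri in term_iris],
    }
    return json.dumps(doc, indent=2)

if __name__ == "__main__":
    text = "Aspirin reduces the risk of myocardial infarction."
    iris = [hit["annotatedClass"]["@id"] for hit in annotate(text)]
    print(to_jsonld("Sample biomedical note", text, iris))
```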
Award ID(s):
2101350
NSF-PAR ID:
10467329
Author(s) / Creator(s):
Editor(s):
Agapito, G.
Publisher / Repository:
ICWE 2022. Communications in Computer and Information Science, vol 1668. Springer, Cham.
Date Published:
Format(s):
Medium: X
Location:
https://link.springer.com/chapter/10.1007/978-3-031-25380-5_6
Sponsoring Org:
National Science Foundation
More Like this
  1. An abundance of biomedical data is generated in the form of clinical notes, reports, and research articles available online. This data holds valuable information that requires extraction, retrieval, and transformation into actionable knowledge. However, access to this information is challenging because search engines require precise machine-interpretable semantic metadata. Despite search engines' efforts to interpret semantic information, they still struggle to index, search, and retrieve relevant information accurately. To address these challenges, we propose a novel graph-based semantic knowledge-sharing approach that enhances the quality of biomedical semantic annotation by engaging biomedical domain experts. In this approach, entities in the knowledge-sharing environment are interlinked and play critical roles. Authorial queries can be posted on the "Knowledge Cafe," and community experts can provide recommendations for semantic annotations. The community can further validate and evaluate the expert responses through a voting scheme, transforming the "Knowledge Cafe" into a knowledge graph with semantically linked entities. We evaluated the proposed approach through a series of scenarios using precision, recall, F1-score, and accuracy metrics. Our results showed an acceptable level of accuracy at approximately 90%. The source code for "Semantically" is freely available at: https://github.com/bukharilab/Semantically
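As a rough illustration of the voting scheme and the reported evaluation, here is a minimal Python sketch; the function names, the acceptance threshold, and the UMLS-style concept identifiers are illustrative assumptions, not details taken from the paper.

```python
def accept_by_vote(votes: dict[str, list[bool]], threshold: float = 0.5) -> set[str]:
    """Keep an expert-recommended annotation when the share of community
    up-votes exceeds `threshold` (illustrative acceptance rule)."""
    return {term for term, ballot in votes.items()
            if ballot and sum(ballot) / len(ballot) > threshold}

def evaluate(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 of accepted annotations against a gold standard."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative vote ledger keyed by UMLS-style concept identifiers
votes = {"C0004057": [True, True, False],    # aspirin: accepted (2/3 up-votes)
         "C0027051": [True, False, False]}   # myocardial infarction: rejected (1/3)
accepted = accept_by_vote(votes)
print(evaluate(accepted, gold={"C0004057", "C0027051"}))
# {'precision': 1.0, 'recall': 0.5, 'f1': 0.666...}
```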
  2. Villazón-Terrazas, B. (Ed.)
    Each day a vast amount of unstructured content is generated in the biomedical domain from sources such as clinical notes, research articles, and medical reports. Such content contains meaningful information that needs to be converted into actionable knowledge for secondary use. However, accessing precise biomedical content is challenging because of content heterogeneity, missing and imprecise metadata, and the unavailability of the associated semantic tags required for search engine optimization. We have introduced a socio-technical semantic annotation optimization approach that enhances the semantic search of biomedical content. The proposed approach consists of a layered architecture. The first layer (Preliminary Semantic Enrichment) annotates biomedical content with ontological concepts from NCBO BioPortal. With growing biomedical information, the semantic annotations suggested by NCBO BioPortal are not always correct. Therefore, in the second layer (Optimizing the Enriched Semantic Information), we introduce a knowledge-sharing scheme through which authors/users can request recommendations from other users to optimize the semantic enrichment process. To gauge the credibility of a human recommender, our system records the recommender's confidence score, collects community votes on previous recommendations, and stores the percentage of correctly suggested annotations, translating these into an index used later to connect authors with the right users for suggestions. At the preliminary annotation layer, we analyzed the n-gram strategy for biomedical word-boundary identification and found that NCBO recognizes biomedical terms far more often for n-gram-1 than for n-gram-2 through n-gram-5. A statistical analysis of significant features was conducted using the Wilson score and data normalization. Overall, the proposed methodology achieves a suitable accuracy of ≈90% for the semantic optimization approach.
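A minimal sketch of the two measures this abstract names: n-gram candidate extraction (n = 1..5) and the Wilson score, used here as a lower-bound credibility index over a recommender's vote history. The function names and the specific choice of the Wilson lower bound are assumptions; the abstract does not spell out the exact formulation.

```python
import math

def ngrams(tokens: list[str], n: int) -> list[str]:
    """Candidate biomedical terms as sliding windows of n tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def wilson_lower_bound(upvotes: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval (95% confidence by default):
    a vote-count-aware credibility index for a recommender."""
    if total == 0:
        return 0.0
    p = upvotes / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * total)) / total)
    return (centre - margin) / (1 + z * z / total)

tokens = "chronic obstructive pulmonary disease".split()
for n in range(1, 5):
    print(f"n-gram-{n}:", ngrams(tokens, n))
print(round(wilson_lower_bound(upvotes=18, total=20), 3))  # 0.699
```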
  3. It takes great effort to manually or semi-automatically convert free-text phenotype narratives (e.g., morphological descriptions in taxonomic works) to a computable format before they can be used in large-scale analyses. We argue that neither a manual curation approach nor an information extraction approach based on machine learning is a sustainable solution to produce computable phenotypic data that are FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016). This is because these approaches do not scale to all biodiversity, and they do not stop the publication of free-text phenotypes that would need post-publication curation. In addition, both manual and machine learning approaches face great challenges: the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other) in manual curation, and keyword-to-ontology-concept translation in automated information extraction, make it difficult for either approach to produce data that are truly FAIR. Our empirical studies show that inter-curator variation in translating phenotype characters to Entity-Quality statements (Mabee et al. 2007) is as high as 40% even within a single project. With this level of variation, curated data integrated from multiple curation projects may still not be FAIR. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardized vocabularies (ontologies). We argue that the authors describing characters are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of the descriptions from the moment of publication. In this presentation, we will introduce the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists, which consists of three components: a web-based, ontology-aware software application called 'Character Recorder,' which features a spreadsheet as the data entry platform and provides authors with the flexibility of using their preferred terminology in recording characters for a set of specimens (this application also facilitates semantic clarity and consistency across species descriptions); a set of services that produces RDF graph data, collects terms added by authors, detects potential conflicts between terms, dispatches conflicts to the third component, and updates the ontology with resolutions; and an Android mobile application, 'Conflict Resolver,' which displays ontological conflicts and accepts solutions proposed by multiple experts. Fig. 1 shows the system diagram of the platform.
The presentation will consist of: a report on the findings from a recent survey of 90+ participants on the need for a tool like Character Recorder; a methods section that describes how we provide semantics to an existing vocabulary of quantitative characters through a set of properties that explain where and how a measurement (e.g., length of perigynium beak) is taken, and how a custom color palette of RGB values obtained from real specimens or high-quality specimen images can be used to help authors choose standardized color descriptions for plant specimens; and a software demonstration, where we show how Character Recorder and Conflict Resolver can work together to construct both human-readable descriptions and RDF graphs using morphological data derived from species in the plant genus Carex (sedges). The key difference of this system from other ontology-aware systems is that authors can directly add needed terms to the ontology as they wish and can update their data according to ontology updates. The software modules currently incorporated in Character Recorder and Conflict Resolver have undergone formal usability studies. We are actively recruiting Carex experts to participate in a 3-day usability study of the entire system of the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists. Participants will use the platform to record 100 characters about one Carex species. In addition to usability data, we will collect the terms that participants submit to the underlying ontology and the data related to conflict resolution. Such data allow us to examine the types and the quantities of logical conflicts that may result from the terms added by the users and to use Discrete Event Simulation models to understand if and how term additions and conflict resolutions converge. We look forward to a discussion on how the tools (Character Recorder is online at http://shark.sbs.arizona.edu/chrecorder/public) described in our presentation can contribute to producing and publishing FAIR data in taxonomic studies.
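To make the "services that produce RDF graph data" concrete, here is a minimal rdflib sketch under an invented example.org namespace; the class and property names (Specimen, measuredFrom, measuredTo) are hypothetical stand-ins for whatever vocabulary the platform actually emits.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, XSD

CR = Namespace("http://example.org/character-recorder/")  # invented namespace

g = Graph()
g.bind("cr", CR)

specimen = CR["specimen/carex-001"]
character = CR["character/perigynium-beak-length"]

g.add((specimen, RDF.type, CR.Specimen))
g.add((character, RDF.type, CR.QuantitativeCharacter))
g.add((character, RDFS.label, Literal("length of perigynium beak", lang="en")))
# hypothetical properties recording where/how the measurement is taken
g.add((character, CR.measuredFrom, Literal("base of beak")))
g.add((character, CR.measuredTo, Literal("tip of beak")))
# the recorded value for this specimen, in millimetres
g.add((specimen, character, Literal(0.9, datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```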
  4. This study analyzes and compares how the digital semantic infrastructure of U.S. based digital news varies according to certain characteristics of the media outlet, including the community it serves, the content management system (CMS) it uses, and its institutional affiliation (or lack thereof). Through a multi-stage analysis of the actual markup found on news outlets’ online text articles, we reveal how multiple factors may be limiting the discoverability and reach of online media organizations focused on serving specific communities. Conceptually, we identify markup and metadata as aspects of the semantic infrastructure underpinning platforms’ mechanisms of distributing online news. Given the significant role that these platforms play in shaping the broader visibility of news content, we further contend that this markup therefore constitutes a kind of infrastructure of visibility by which news sources and voices are rendered accessible or, conversely, invisible in the wider platform economy of journalism. We accomplish our analysis by first identifying key forms of digital markup whose structured data is designed to make online news articles more readily discoverable by search engines and social media platforms. We then analyze 2,226 digital news stories gathered from the main pages of 742 national, local, Black, and other identity-based news organizations in mid-2021, examining each for the presence of specific tags reflecting the Schema.org, OpenGraph, and Twitter metadata structures. We then evaluate the relationship between audience focus and the robustness of this digital semantic infrastructure. While we find only a weak relationship between the markup and the community served, additional analysis revealed a much stronger association between these metadata tags and the content management system (CMS): 80% of the attributes appearing on an article were the same for a given CMS, regardless of publisher, market, or audience focus. Based on this finding, we identify the organizational characteristics that may influence the specific CMS used for digital publishing, and, therefore, the robustness of the digital semantic infrastructure deployed by the organization. Finally, we reflect on the potential implications of the highly disparate tag use we observe, particularly with respect to the broader visibility of online news designed to serve particular US communities.
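A sketch of the kind of markup audit this study performs, using requests and BeautifulSoup: it counts schema.org JSON-LD blocks, OpenGraph og:* meta tags, and Twitter twitter:* meta tags on a page. The tag inventory the authors actually coded for is richer (e.g., schema.org microdata and specific attributes), so this is only an approximation.

```python
import requests
from bs4 import BeautifulSoup

def markup_profile(url: str) -> dict[str, int]:
    """Count the three families of semantic markup examined in the study:
    schema.org JSON-LD blocks, OpenGraph meta tags, Twitter card meta tags."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return {
        "schema_org_jsonld": len(soup.find_all("script", type="application/ld+json")),
        "opengraph": len(soup.find_all(
            "meta", property=lambda p: p and p.startswith("og:"))),
        "twitter": len(soup.find_all(
            "meta", attrs={"name": lambda n: n and n.startswith("twitter:")})),
    }

print(markup_profile("https://example.com/news-article"))
```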