skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Search for: All records

Creators/Authors contains: "Mohanty, Somya D"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Ontologies are critical for organizing and interpreting complex domain-specific knowledge, with applications in data integration, functional prediction, and knowledge discovery. As the manual curation of ontology annotations becomes increasingly infeasible due to the exponential growth of biomedical and genomic data, natural language processing (NLP)-based systems have emerged as scalable alternatives. Evaluating these systems requires robust semantic similarity metrics that account for hierarchical and partially correct relationships often present in ontology annotations. This study explores the integration of graph-based and language-based embeddings to enhance the performance of semantic similarity metrics. Combining embeddings generated via Node2Vec and large language models (LLMs) with traditional semantic similarity metrics, we demonstrate that hybrid approaches effectively capture both structural and semantic relationships within ontologies. Our results show that combined similarity metrics outperform individual metrics, achieving high accuracy in distinguishing child–parent pairs from random pairs. This work underscores the importance of robust semantic similarity metrics for evaluating and optimizing NLP-based ontology annotation systems. Future research should explore the real-time integration of these metrics and advanced neural architectures to further enhance scalability and accuracy, advancing ontology-driven analyses in biomedical research and beyond. 
    more » « less
    Free, publicly-accessible full text available March 1, 2026
  2. Abstract There is a growing interest in using social media content for Natural Language Processing applications. However, it is not easy to computationally identify the most relevant set of tweets related to any specific event. Challenging semantics coupled with different ways for using natural language in social media make it difficult for retrieving the most relevant set of data from any social media outlet. This paper seeks to demonstrate a way to present the changing semantics of Twitter within the context of a crisis event, specifically tweets during Hurricane Irma. These methods can be used to identify the most relevant corpus of text for analysis in relevance to a specific incident such as a hurricane. Using an implementation of the Word2Vec method of Neural Network training mechanisms to create Word Embeddings, this paper will: discuss how the relative meaning of words changes as events unfold; present a mechanism for scoring tweets based upon dynamic, relative context relatedness; and show that similarity between words is not necessarily static. We present different methods for training the vector model in Word2Vec for identification of the most relevant tweets for any search query. The impact of tuning parameters such as Word Window Size, Minimum Word Frequency, Hidden Layer Dimensionality, and Negative Sampling on model performance was explored. The window containing the local maximum for AU_ROC for each parameter serves as a guide for other studies using the methods presented here for social media data analysis. 
    more » « less
  3. Abstract BackgroundAnnotating scientific literature with ontology concepts is a critical task in biology and several other domains for knowledge discovery. Ontology based annotations can power large-scale comparative analyses in a wide range of applications ranging from evolutionary phenotypes to rare human diseases to the study of protein functions. Computational methods that can tag scientific text with ontology terms have included lexical/syntactic methods, traditional machine learning, and most recently, deep learning. ResultsHere, we present state of the art deep learning architectures based on Gated Recurrent Units for annotating text with ontology concepts. We use the Colorado Richly Annotated Full Text Corpus (CRAFT) as a gold standard for training and testing. We explore a number of additional information sources including NCBI’s BioThesauraus and Unified Medical Language System (UMLS) to augment information from CRAFT for increasing prediction accuracy. Our best model results in a 0.84 F1 and semantic similarity. ConclusionThe results shown here underscore the impact for using deep learning architectures for automatically recognizing ontology concepts from literature. The augmentation of the models with biological information beyond that present in the gold standard corpus shows a distinct improvement in prediction accuracy. 
    more » « less
  4. null (Ed.)
  5. null (Ed.)
  6. Abstract Classifying images using supervised machine learning (ML) relies on labeled training data—classes or text descriptions, for example, associated with each image. Data‐driven models are only as good as the data used for training, and this points to the importance of high‐quality labeled data for developing a ML model that has predictive skill. Labeling data is typically a time‐consuming, manual process. Here, we investigate the process of labeling data, with a specific focus on coastal aerial imagery captured in the wake of hurricanes that affected the Atlantic and Gulf Coasts of the United States. The imagery data set is a rich observational record of storm impacts and coastal change, but the imagery requires labeling to render that information accessible. We created an online interface that served labelers a stream of images and a fixed set of questions. A total of 1,600 images were labeled by at least two or as many as seven coastal scientists. We used the resulting data set to investigate interrater agreement: the extent to which labelers labeled each image similarly. Interrater agreement scores, assessed with percent agreement and Krippendorff's alpha, are higher when the questions posed to labelers are relatively simple, when the labelers are provided with a user manual, and when images are smaller. Experiments in interrater agreement point toward the benefit of multiple labelers for understanding the uncertainty in labeling data for machine learning research. 
    more » « less