skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: A Gated Recurrent Unit based architecture for recognizing ontology concepts from biological literature
Abstract BackgroundAnnotating scientific literature with ontology concepts is a critical task in biology and several other domains for knowledge discovery. Ontology based annotations can power large-scale comparative analyses in a wide range of applications ranging from evolutionary phenotypes to rare human diseases to the study of protein functions. Computational methods that can tag scientific text with ontology terms have included lexical/syntactic methods, traditional machine learning, and most recently, deep learning. ResultsHere, we present state of the art deep learning architectures based on Gated Recurrent Units for annotating text with ontology concepts. We use the Colorado Richly Annotated Full Text Corpus (CRAFT) as a gold standard for training and testing. We explore a number of additional information sources including NCBI’s BioThesauraus and Unified Medical Language System (UMLS) to augment information from CRAFT for increasing prediction accuracy. Our best model results in a 0.84 F1 and semantic similarity. ConclusionThe results shown here underscore the impact for using deep learning architectures for automatically recognizing ontology concepts from literature. The augmentation of the models with biological information beyond that present in the gold standard corpus shows a distinct improvement in prediction accuracy.  more » « less
Award ID(s):
1942727
PAR ID:
10372515
Author(s) / Creator(s):
; ;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
BioData Mining
Volume:
15
Issue:
1
ISSN:
1756-0381
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract ObjectivesThe predictive intensive care unit (ICU) scoring system is crucial for predicting patient outcomes, particularly mortality. Traditional scoring systems rely mainly on structured clinical data from electronic health records, which can overlook important clinical information in narratives and images. Materials and MethodsIn this work, we build a deep learning-based survival prediction model that utilizes multimodality data for ICU mortality prediction. Four sets of features are investigated: (1) physiological measurements of Simplified Acute Physiology Score (SAPS) II, (2) common thorax diseases predefined by radiologists, (3) bidirectional encoder representations from transformers-based text representations, and (4) chest X-ray image features. The model was evaluated using the Medical Information Mart for Intensive Care IV dataset. ResultsOur model achieves an average C-index of 0.7829 (95% CI, 0.7620-0.8038), surpassing the baseline using only SAPS-II features, which had a C-index of 0.7470 (95% CI: 0.7263-0.7676). Ablation studies further demonstrate the contributions of incorporating predefined labels (2.00% improvement), text features (2.44% improvement), and image features (2.82% improvement). Discussion and ConclusionThe deep learning model demonstrated superior performance to traditional machine learning methods under the same feature fusion setting for ICU mortality prediction. This study highlights the potential of integrating multimodal data into deep learning models to enhance the accuracy of ICU mortality prediction. 
    more » « less
  2. We present SciDMT, an enhanced and expanded corpus for scientific mention detection, offering a significant advancement over existing related resources. SciDMT contains annotated scientific documents for datasets (D), methods (M), and tasks (T). The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mention annotations in the format of in-text span, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes. To the best of our knowledge, SciDMT is the largest corpus for scientific entity mention detection. The corpus’s scale and diversity are instrumental in developing and refining models for tasks such as indexing scientific papers, enhancing information retrieval, and improving the accessibility of scientific knowledge. We demonstrate the corpus’s utility through experiments with advanced deep learning architectures like SciBERT and GPT-3.5. Our findings establish performance baselines and highlight unresolved challenges in scientific mention detection. SciDMT serves as a robust benchmark for the research community, encouraging the development of innovative models to further the field of scientific information extraction 
    more » « less
  3. Abstract Exploring new techniques to improve the prediction of tropical cyclone (TC) formation is essential for operational practice. Using convolutional neural networks, this study shows that deep learning can provide a promising capability for predicting TC formation from a given set of large-scale environments at certain forecast lead times. Specifically, two common deep-learning architectures including the residual net (ResNet) and UNet are used to examine TC formation in the Pacific Ocean. With a set of large-scale environments extracted from the NCEP–NCAR reanalysis during 2008–21 as input and the TC labels obtained from the best track data, we show that both ResNet and UNet reach their maximum forecast skill at the 12–18-h forecast lead time. Moreover, both architectures perform best when using a large domain covering most of the Pacific Ocean for input data, as compared to a smaller subdomain in the western Pacific. Given its ability to provide additional information about TC formation location, UNet performs generally worse than ResNet across the accuracy metrics. The deep learning approach in this study presents an alternative way to predict TC formation beyond the traditional vortex-tracking methods in the current numerical weather prediction. Significance StatementThis study presents a new approach for predicting tropical cyclone (TC) formation based on deep learning (DL). Using two common DL architectures in visualization research and a set of large-scale environments in the Pacific Ocean extracted from the reanalysis data, we show that DL has an optimal capability of predicting TC formation at the 12–18-h lead time. Examining the DL performance for different domain sizes shows that the use of a large domain size for input data can help capture some far-field information needed for predicting TCG. The DL approach in this study demonstrates an alternative way to predict or detect TC formation beyond the traditional vortex-tracking methods used in the current numerical weather prediction. 
    more » « less
  4. Entity linking is the task of linking mentions of named entities in natural language text, to entities in a curated knowledge-base. This is of significant importance in the biomedical domain, where it could be used to semantically annotate a large volume of clinical records and biomedical literature, to standardized concepts described in an ontology such as Unified Medical Language System (UMLS). We observe that with precise type information, entity disambiguation becomes a straightforward task. However, fine-grained type information is usually not available in biomedical domain. Thus, we propose LATTE, a LATent Type Entity Linking model, that improves entity linking by modeling the latent fine-grained type information about mentions and entities. Unlike previous methods that perform entity linking directly between the mentions and the entities, LATTE jointly does entity disambiguation, and latent fine-grained type learning, without direct supervision. We evaluate our model on two biomedical datasets: MedMentions, a large scale public dataset annotated with UMLS concepts, and a de-identified corpus of dictated doctor’s notes that has been annotated with ICD concepts. Extensive experimental evaluation shows our model achieves significant performance improvements over several state-of-the-art techniques. 
    more » « less
  5. Abstract BackgroundThe number of applications of deep learning algorithms in bioinformatics is increasing as they usually achieve superior performance over classical approaches, especially, when bigger training datasets are available. In deep learning applications, discrete data, e.g. words or n-grams in language, or amino acids or nucleotides in bioinformatics, are generally represented as a continuous vector through an embedding matrix. Recently, learning this embedding matrix directly from the data as part of the continuous iteration of the model to optimize the target prediction – a process called ‘end-to-end learning’ – has led to state-of-the-art results in many fields. Although usage of embeddings is well described in the bioinformatics literature, the potential of end-to-end learning for single amino acids, as compared to more classical manually-curated encoding strategies, has not been systematically addressed. To this end, we compared classical encoding matrices, namely one-hot, VHSE8 and BLOSUM62, to end-to-end learning of amino acid embeddings for two different prediction tasks using three widely used architectures, namely recurrent neural networks (RNN), convolutional neural networks (CNN), and the hybrid CNN-RNN. ResultsBy using different deep learning architectures, we show that end-to-end learning is on par with classical encodings for embeddings of the same dimension even when limited training data is available, and might allow for a reduction in the embedding dimension without performance loss, which is critical when deploying the models to devices with limited computational capacities. We found that the embedding dimension is a major factor in controlling the model performance. Surprisingly, we observed that deep learning models are capable of learning from random vectors of appropriate dimension. ConclusionOur study shows that end-to-end learning is a flexible and powerful method for amino acid encoding. Further, due to the flexibility of deep learning systems, amino acid encoding schemes should be benchmarked against random vectors of the same dimension to disentangle the information content provided by the encoding scheme from the distinguishability effect provided by the scheme. 
    more » « less