skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: KinDER: A Biocuration Tool for Extracting Kinase Knowledge from Biomedical Literature
Kinases are enzymes that mediate phosphate transfer. Extracting information on kinases from biomedical literature is an important task which has direct implications for applications such as drug design. In this work, we develop KinDER, Kinase Document Extractor and Ranker, a biomedical natural language processing tool for extracting functional and disease related information on kinases. This tool combines information retrieval and machine learning techniques to automatically extract information about protein kinases. First, it uses several bio-ontologies to retrieve documents related to kinases and then uses a supervised classification model to rank them according to their relevance. This was developed to participate in the Text-mining services for Human Kinome Curation Track of the BioCreative VI challenge. According to the official BioCreative evaluation results, KinDER provides stateof- the-art performance for extracting functional information on kinases from abstracts.  more » « less
Award ID(s):
1658971
PAR ID:
10046547
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the BioCreative VI Workshop
Page Range / eLocation ID:
51-55
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Kinases are responsible for phosphorylation of proteins and are involved in many biological processes, including cell signaling. Identifying the kinases that phosphorylate specific phosphoproteins is critical to augment the current understanding of cellular events. Herein, we report a general protocol to study the kinases of a target substrate phosphoprotein using kinase‐catalyzed crosslinking and immunoprecipitation (K‐CLIP). K‐CLIP uses a photocrosslinking γ‐phosphoryl‐modified ATP analog, such as ATP‐arylazide, to covalently crosslink substrates to kinases with UV irradiation. Crosslinked kinase‐substrate complexes can then be enriched by immunoprecipitating the target substrate phosphoprotein, with bound kinase(s) identified using Western blot or mass spectrometry analysis. K‐CLIP is an adaptable chemical tool to investigate and discover kinase‐substrate pairs, which will promote characterization of complex phosphorylation‐mediated cell biology. 
    more » « less
  2. A significant part of our knowledge is relationships between two terms. However, most of these information is documented as unstructured text in various forms, like books, online articles and webpages. Extract those information and store them in a structured database could help people utilize these information more conveniently. In this study, we proposed a novel approach to extract the relationships information based on Nature Language Processing (NLP) and graph theoretic algorithm. Our method, Grammatical Relationship Graph for Triplets (GRGT), extracts three layers of information: the pairs of terms that have certain relationship, exactly what type of the relationship is, and what direct this relationship is. GRGT works on a grammatical graph obtained by parsed the sentence using Natural Language Processing. Patterns were extracted from the graph by shortest path among the words of interests. We have designed a decision tree to make the pattern matching. GRGT was applied to extract the protein-protein-interactions (PPIs) from biomedical literature, and obtained better precision than the best performing method in literature. Beyond extracting PPIs, our method could be easily extended to extracting relationship information between other bioentities. 
    more » « less
  3. Social media data have been increasingly used to study biomedical and health-related phenomena. From cohort-level discussions of a condition to population-level analyses of sentiment, social media have provided scientists with unprecedented amounts of data to study human behavior associated with a variety of health conditions and medical treatments. Here we review recent work in mining social media for biomedical, epidemiological, and social phenomena information relevant to the multilevel complexity of human health. We pay particular attention to topics where social media data analysis has shown the most progress, including pharmacovigilance and sentiment analysis, especially for mental health. We also discuss a variety of innovative uses of social media data for health-related applications as well as important limitations of social media data access and use. 
    more » « less
  4. Background: Relationships between bio-entities (genes, proteins, diseases, etc.) constitute a significant part of our knowledge. Most of this information is documented as unstructured text in different forms, such as books, articles and on-line pages. Automatic extraction of such information and storing it in structured form could help researchers more easily access such information and also make it possible to incorporate it in advanced integrative analysis. In this study, we developed a novel approach to extract bio-entity relationships information using Nature Language Processing (NLP) and a graph-theoretic algorithm. Methods: Our method, called GRGT (Grammatical Relationship Graph for Triplets), not only extracts the pairs of terms that have certain relationships, but also extracts the type of relationship (the word describing the relationships). In addition, the directionality of the relationship can also be extracted. Our method is based on the assumption that a triplet exists for a pair of interactions. A triplet is defined as two terms (entities) and an interaction word describing the relationship of the two terms in a sentence. We first use a sentence parsing tool to obtain the sentence structure represented as a dependency graph where words are nodes and edges are typed dependencies. The shortest paths among the pairs of words in the triplet are then extracted, which form the basis for our information extraction method. Flexible pattern matching scheme was then used to match a triplet graph with unknown relationship to those triplet graphs with labels (True or False) in the database. Results: We applied the method on three benchmark datasets to extract the protein-protein-interactions (PPIs), and obtained better precision than the top performing methods in literature. Conclusions: We have developed a method to extract the protein-protein interactions from biomedical literature. PPIs extracted by our method have higher precision among other methods, suggesting that our method can be used to effectively extract PPIs and deposit them into databases. Beyond extracting PPIs, our method could be easily extended to extracting relationship information between other bio-entities. 
    more » « less
  5. Abstract Reusing massive collections of publicly available biomedical data can significantly impact knowledge discovery. However, these public samples and studies are typically described using unstructured plain text, hindering the findability and further reuse of the data. To combat this problem, we propose txt2onto 2.0, a general-purpose method based on natural language processing and machine learning for annotating biomedical unstructured metadata to controlled vocabularies of diseases and tissues. Compared to the previous version (txt2onto 1.0), which uses numerical embeddings as features, this new version uses words as features, resulting in improved interpretability and performance, especially when few positive training instances are available. Txt2onto 2.0 uses embeddings from a large language model during prediction to deal with unseen-yet-relevant words related to each disease and tissue term being predicted from the input text, thereby explaining the basis of every annotation. We demonstrate the generalizability of txt2onto 2.0 by accurately predicting disease annotations for studies from independent datasets, using proteomics and clinical trials as examples. Overall, our approach can annotate biomedical text regardless of experimental types or sources. Code, data, and trained models are available at https://github.com/krishnanlab/txt2onto2.0. 
    more » « less