Title: Automatic Extraction of Protein-protein Interactions using Grammatical Relationship Graph
Background: Relationships between bio-entities (genes, proteins, diseases, etc.) constitute a significant part of our knowledge. Most of this information is documented as unstructured text in different forms, such as books, articles, and online pages. Automatically extracting this information and storing it in structured form could help researchers access it more easily and make it possible to incorporate it into advanced integrative analyses. In this study, we developed a novel approach to extracting bio-entity relationship information using Natural Language Processing (NLP) and a graph-theoretic algorithm. Methods: Our method, called GRGT (Grammatical Relationship Graph for Triplets), extracts not only the pairs of terms that have a certain relationship but also the type of relationship (the word describing it). The directionality of the relationship can also be extracted. Our method is based on the assumption that a triplet exists for each interacting pair of terms. A triplet is defined as two terms (entities) and an interaction word describing the relationship between the two terms in a sentence. We first use a sentence parsing tool to obtain the sentence structure, represented as a dependency graph in which words are nodes and edges are typed dependencies. The shortest paths among the pairs of words in the triplet are then extracted; these form the basis of our information extraction method. A flexible pattern-matching scheme is then used to match a triplet graph with an unknown relationship against the labeled (True or False) triplet graphs in the database. Results: We applied the method to three benchmark datasets to extract protein-protein interactions (PPIs) and obtained better precision than the top-performing methods in the literature. Conclusions: We have developed a method to extract protein-protein interactions from biomedical literature. PPIs extracted by our method have higher precision than those extracted by other methods, suggesting that our method can be used to effectively extract PPIs and deposit them into databases. Beyond extracting PPIs, our method could be easily extended to extracting relationship information between other bio-entities.
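The shortest-path step described in the abstract can be illustrated with a minimal sketch. This is not the GRGT implementation: the toy sentence, the hand-written dependency list, the relation labels, the function names, and the choice to treat dependency edges as undirected when searching are all assumptions made for illustration. A real pipeline would obtain the typed dependencies from a sentence parser.

```python
from collections import deque

# Hypothetical typed dependencies for the toy sentence
# "ProteinA strongly activates ProteinB", written by hand as
# (head, relation, dependent). A parser would normally produce these.
DEPENDENCIES = [
    ("activates", "nsubj", "ProteinA"),
    ("activates", "dobj", "ProteinB"),
    ("activates", "advmod", "strongly"),
]


def build_graph(dependencies):
    """Build an adjacency list with words as nodes and typed edges.

    Assumption: edges are stored in both directions so that paths can
    be found regardless of the head/dependent orientation.
    """
    graph = {}
    for head, relation, dependent in dependencies:
        graph.setdefault(head, []).append((dependent, relation))
        graph.setdefault(dependent, []).append((head, relation))
    return graph


def shortest_path(graph, source, target):
    """Breadth-first search; returns the word sequence or None."""
    if source not in graph or target not in graph:
        return None
    previous = {source: None}
    queue = deque([source])
    while queue:
        word = queue.popleft()
        if word == target:
            path = []
            while word is not None:
                path.append(word)
                word = previous[word]
            return path[::-1]
        for neighbor, _relation in graph[word]:
            if neighbor not in previous:
                previous[neighbor] = word
                queue.append(neighbor)
    return None


def triplet_paths(graph, entity_a, interaction_word, entity_b):
    """Collect the shortest paths among the three words of a triplet."""
    return {
        "a_to_b": shortest_path(graph, entity_a, entity_b),
        "a_to_word": shortest_path(graph, entity_a, interaction_word),
        "word_to_b": shortest_path(graph, interaction_word, entity_b),
    }


GRAPH = build_graph(DEPENDENCIES)
PATHS = triplet_paths(GRAPH, "ProteinA", "activates", "ProteinB")
print(PATHS)
```

In this toy case the path between the two entities passes through the interaction word. Paths of this kind are what a later pattern-matching step would compare against labeled examples; that step is not sketched here because the abstract does not specify it.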
Award ID(s):
1743142
PAR ID:
10095244
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
BMC medical informatics and decision making
Volume:
18
Issue:
2
ISSN:
1472-6947
Page Range / eLocation ID:
35-43
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A significant part of our knowledge consists of relationships between two terms. However, most of this information is documented as unstructured text in various forms, such as books, online articles, and webpages. Extracting this information and storing it in a structured database could help people use it more conveniently. In this study, we proposed a novel approach to extracting relationship information based on Natural Language Processing (NLP) and a graph-theoretic algorithm. Our method, Grammatical Relationship Graph for Triplets (GRGT), extracts three layers of information: the pairs of terms that have a certain relationship, the exact type of the relationship, and the direction of the relationship. GRGT works on a grammatical graph obtained by parsing the sentence using Natural Language Processing. Patterns are extracted from the graph as the shortest paths among the words of interest. We designed a decision tree to perform the pattern matching. GRGT was applied to extract protein-protein interactions (PPIs) from biomedical literature and obtained better precision than the best-performing method in the literature. Beyond extracting PPIs, our method could be easily extended to extracting relationship information between other bio-entities.
  2. Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani; Xue, Nianwen (Ed.)
    This paper reports the first release of the UMR (Uniform Meaning Representation) data set. UMR is a graph-based meaning representation formalism consisting of a sentence-level graph and a document-level graph. The sentence-level graph represents predicate-argument structures, named entities, word senses, aspectuality of events, as well as person and number information for entities. The document-level graph represents coreferential, temporal, and modal relations that go beyond sentence boundaries. UMR is designed to capture the commonalities and variations across languages, and this is done through the use of a common set of abstract concepts, relations, and attributes as well as concrete concepts derived from words from individual languages. This UMR release includes annotations for six languages (Arapaho, Chinese, English, Kukama, Navajo, Sanapana) that vary greatly in terms of their linguistic properties and resource availability. We also describe ongoing efforts to enlarge this data set and extend it to other genres and modalities, and we briefly describe the available infrastructure (UMR annotation guidelines and tools) that others can use to create similar data sets.
  3. Extractive text summarization aims at extracting the most representative sentences from a given document as its summary. To extract a good summary from a long text document, sentence embedding plays an important role. Recent studies have leveraged graph neural networks to capture the inter-sentential relationship (e.g., the discourse graph) to learn contextual sentence embedding. However, those approaches neither consider multiple types of inter-sentential relationships (e.g., semantic similarity & natural connection), nor model intra-sentential relationships (e.g., semantic & syntactic relationship among words). To address these problems, we propose a novel Multiplex Graph Convolutional Network (Multi-GCN) to jointly model different types of relationships among sentences and words. Based on Multi-GCN, we propose a Multiplex Graph Summarization (Multi-GraS) model for extractive text summarization. Finally, we evaluate the proposed models on the CNN/DailyMail benchmark dataset to demonstrate the effectiveness of our method.
  4. The COVID-19 pandemic caused by SARS-CoV-2 has emphasized the importance of studying virus-host protein-protein interactions (PPIs) and drug-target interactions (DTIs) to discover effective antiviral drugs. While several computational algorithms have been developed for this purpose, most of them overlook the interplay pathways during infection along PPIs and DTIs. In this paper, we present a novel multipartite graph learning approach to uncover hidden binding affinities in PPIs and DTIs. Our method leverages a comprehensive biomolecular mechanism network that integrates protein-protein, genetic, and virus-host interactions, enabling us to learn a new graph that accurately captures the underlying connected components. Notably, our method identifies clustering structures directly from the new graph, eliminating the need for post-processing steps. To mitigate the detrimental effects of noisy or outlier data in sparse networks, we propose a robust objective function that incorporates the L2,p-norm and a constraint based on the pth-order Ky-Fan norm applied to the graph Laplacian matrix. Additionally, we present an efficient optimization method tailored to our framework. Experimental results demonstrate the superiority of our approach over existing state-of-the-art techniques, as it successfully identifies potential repurposable drugs for SARS-CoV-2, offering promising therapeutic options for COVID-19 treatment.