Title: Optimizations for Computing Relatedness in Biomedical Heterogeneous Information Networks: SemNet 2.0
Literature-based discovery (LBD) summarizes information and generates insight from large text corpora. The SemNet framework utilizes a large heterogeneous information network or “knowledge graph” of nodes and edges to compute relatedness and rank concepts pertinent to a user-specified target. SemNet provides a way to perform multi-factorial and multi-scalar analysis of complex disease etiology and therapeutic identification using the 33+ million articles in PubMed. The present work improves the efficacy and efficiency of LBD for end users by augmenting SemNet to create SemNet 2.0. A custom Python data structure replaced reliance on Neo4j to improve knowledge graph query times by several orders of magnitude. Additionally, two randomized algorithms were built to optimize the HeteSim metric calculation for computing metapath similarity. The unsupervised learning algorithm for rank aggregation (ULARA), which ranks concepts with respect to the user-specified target, was reconstructed using derived mathematical proofs of correctness and probabilistic performance guarantees for optimization. The upgraded ULARA is generalizable to other rank aggregation problems outside of SemNet. In summary, SemNet 2.0 is comprehensive open-source software that provides a significantly faster, more effective, and more user-friendly means of automated biomedical LBD. An example case ranks relationships between Alzheimer’s disease and metabolic co-morbidities.
Award ID(s):
1944247
NSF-PAR ID:
10330419
Journal Name:
Big Data and Cognitive Computing
Volume:
6
Issue:
1
ISSN:
2504-2289
Page Range / eLocation ID:
27
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Large networks are quintessential to bioinformatics, knowledge graphs, social network analysis, and graph-based learning. CompositeView is a Python-based open-source application that improves interactive complex network visualization and extraction of actionable insight. CompositeView utilizes specifically formatted input data to calculate composite scores and display them using the Cytoscape component of Dash. Composite scores are defined representations of smaller sets of conceptually similar data that, when combined, generate a single score to reduce information overload. Visualized interactive results are user-refined via filtering elements such as node value and edge weight sliders and graph manipulation options (e.g., node color and layout spread). The primary difference between CompositeView and other network visualization tools is its ability to auto-calculate and auto-update composite scores as the user interactively filters or aggregates data. CompositeView was developed to visualize network relevance rankings, but it performs well with non-network data. Three disparate CompositeView use cases are shown: relevance rankings from SemNet 2.0, an open-source knowledge graph relationship ranking software for biomedical literature-based discovery; Human Development Index (HDI) data; and the Framingham cardiovascular study. CompositeView was stress tested to construct reference benchmarks that define breadth and size of data effectively visualized. Finally, CompositeView is compared to Excel, Tableau, Cytoscape, neo4j, NodeXL, and Gephi. 
  2. Viral marketing on social networks, also known as Influence Maximization (IM), aims to select k users for the promotion of a target item by maximizing the total spread of their influence. However, most previous works on IM do not explore the dynamic user perception of promoted items in the process. In this paper, by exploiting the knowledge graph (KG) to capture dynamic user perception, we formulate the problem of Influence Maximization based on Dynamic Personal Perception (IMDPP) that considers user preferences and social influence reflecting the impact of relevant item adoptions. We prove the hardness of IMDPP and design an approximation algorithm, named Dynamic perception for seeding in target markets (Dysim), by exploring the concepts of dynamic reachability, target markets, and substantial influence to select and promote a sequence of relevant items. We evaluate the performance of Dysim in comparison with the state-of-the-art approaches using real social networks with real KGs. The experimental results show that, on large datasets, Dysim achieves at least 6 times the influence spread of the state-of-the-art approaches.
  3. Knowledge Tracing (KT), which aims to model student knowledge level and predict their performance, is one of the most important applications of user modeling. Modern KT approaches model and maintain an up-to-date state of student knowledge over a set of course concepts according to students’ historical performance in attempting the problems. However, KT approaches were designed to model knowledge by observing relatively small problem-solving steps in Intelligent Tutoring Systems. While these approaches were applied successfully to model student knowledge by observing student solutions for simple problems, such as multiple-choice questions, they do not perform well for modeling students’ complex problem solving. Most importantly, current models assume that all problem attempts are equally valuable in quantifying current student knowledge. However, for complex problems that involve many concepts at the same time, this assumption is deficient. It results in inaccurate knowledge states and unnecessary fluctuations in estimated student knowledge, especially if students guess the correct answer to a problem whose concepts they have not all mastered, or slip in answering a problem whose concepts they have already mastered. In this paper, we argue that not all attempts are equally important in discovering students’ knowledge state, and some attempts can be summarized together to better represent student performance. We propose a novel student knowledge tracing approach, Granular RAnk based TEnsor factorization (GRATE), that dynamically selects student attempts that can be aggregated while predicting students’ performance in problems and discovering the concepts presented in them. Our experiments on three real-world datasets demonstrate the improved performance of GRATE, compared to the state-of-the-art baselines, in the task of student performance prediction. Our further analysis shows that attempt aggregation eliminates the unnecessary fluctuations from students’ discovered knowledge states and helps in discovering complex latent concepts in the problems.
  4. Ad hoc table retrieval is the problem of identifying the most relevant datasets to a user's query. We present an approach to the problem that builds a knowledge graph by combining information about the collection of tables with external sources such as WordNet and pretrained GloVe embeddings. We apply multi-relational graph convolutional networks to learn embeddings for the knowledge graph nodes and utilize three different methods to create vectors representing the tables and queries from these embeddings. We create a novel learning-to-rank neural architecture that incorporates the multiple embeddings in order to improve table retrieval results. We evaluate our approach using two large collections of tables from public WikiTables and Web tables data, demonstrating substantial improvements over state-of-the-art methods in table retrieval.
  5. Large-scale multiuser scientific facilities, such as geographically distributed observatories, remote instruments, and experimental platforms, represent some of the largest national investments and can enable dramatic advances across many areas of science. Recent examples of such advances include the detection of gravitational waves and the imaging of a black hole’s event horizon. However, as the number of such facilities and their users grows, along with the complexity, diversity, and volumes of their data products, finding and accessing relevant data is becoming increasingly challenging, limiting the potential impact of facilities. These challenges are further amplified as scientists and application workflows increasingly try to integrate facilities’ data from diverse domains. In this paper, we leverage concepts underlying recommender systems, which are extremely effective in e-commerce, to address these data-discovery and data-access challenges for large-scale distributed scientific facilities. We first analyze data from facilities and identify and model user-query patterns in terms of facility location and spatial localities, domain-specific data models, and user associations. We then use this analysis to generate a knowledge graph and develop the collaborative knowledge-aware graph attention network (CKAT) recommendation model, which leverages graph neural networks (GNNs) to explicitly encode the collaborative signals through propagation and combine them with knowledge associations. Moreover, we integrate a knowledge-aware neural attention mechanism to enable the CKAT to pay more attention to key information while reducing irrelevant noise, thereby increasing the accuracy of the recommendations. We apply the proposed model to two real-world facility datasets and empirically demonstrate that the CKAT can effectively facilitate data discovery, significantly outperforming several compelling state-of-the-art baseline models.