Title: Heterogeneous information networks: the past, the present, and the future
In 2011, we proposed PathSim to systematically define and compute similarity between nodes in a heterogeneous information network (HIN), where nodes and links are of different types. In the PathSim paper, we introduced, for the first time, HINs with a general network schema and proposed the concept of meta-paths to systematically define new relation types between nodes. In this paper, we summarize the impact of the PathSim paper in both academia and industry. We start with algorithms based on meta-path-based feature engineering, then move on to recent developments in heterogeneous network representation learning, including both shallow network embedding and heterogeneous graph neural networks. We then draw the connection between knowledge graphs and HINs and discuss the implications of meta-paths in the symbolic reasoning scenario. Finally, we point out several future directions.
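As an illustration of the meta-path-based similarity the abstract refers to, here is a minimal sketch of the PathSim score for a symmetric meta-path such as Author-Paper-Venue-Paper-Author: twice the number of path instances between x and y, divided by the sum of the path instances from x to itself and from y to itself. The toy data, the author names, and the helper names `path_count` and `pathsim` are hypothetical and chosen only for this example; they do not come from the paper.

```python
# Hypothetical toy data: number of papers each author published in each venue.
# This plays the role of the commuting matrix for the half meta-path A-P-V.
author_venue_counts = {
    "alice": {"VLDB": 2, "KDD": 1},
    "bob": {"VLDB": 4, "KDD": 2},
    "carol": {"KDD": 3},
}


def path_count(x, y, counts):
    """Count A-P-V-P-A path instances between authors x and y.

    Each shared venue v contributes counts[x][v] * counts[y][v] instances.
    """
    return sum(c * counts[y].get(venue, 0) for venue, c in counts[x].items())


def pathsim(x, y, counts):
    """PathSim under a symmetric meta-path: 2*M[x][y] / (M[x][x] + M[y][y])."""
    denom = path_count(x, x, counts) + path_count(y, y, counts)
    if denom == 0:
        return 0.0  # neither node has any path instance
    return 2 * path_count(x, y, counts) / denom


if __name__ == "__main__":
    for a, b in [("alice", "bob"), ("alice", "carol"), ("alice", "alice")]:
        print(a, b, round(pathsim(a, b, author_venue_counts), 3))
```

In this toy data, alice and bob have the same venue distribution at different volumes, so they score higher with each other than either does with carol; the normalization by self-path counts is what keeps a highly prolific author from dominating every ranking.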
Award ID(s):
1705169 1937599 2211557
NSF-PAR ID:
10379398
Journal Name:
Proceedings of the VLDB Endowment
Volume:
15
Issue:
12
ISSN:
2150-8097
Page Range / eLocation ID:
3807 to 3811
Sponsoring Org:
National Science Foundation
More Like this
  1. As heterogeneous networks have become increasingly ubiquitous, Heterogeneous Information Network (HIN) embedding, which aims to project nodes into a low-dimensional space while preserving the heterogeneous structure, has drawn increasing attention in recent years. Many existing HIN embedding methods adopt meta-path guided random walks to retain both the semantics and the structural correlations between different types of nodes. However, the selection of meta-paths is still an open problem: it either depends on domain knowledge or is learned from label information. As a uniform blueprint of an HIN, the network schema comprehensively embraces the high-order structure and contains rich semantics. In this paper, we make the first attempt to study network schema preserving HIN embedding, and propose a novel model named NSHE. In NSHE, a network schema sampling method is first proposed to generate sub-graphs (i.e., schema instances), and then a multi-task learning task is built to preserve the heterogeneous structure of each schema instance. Besides preserving pairwise structure information, NSHE is able to retain high-order structure (i.e., network schema). Extensive experiments on three real-world datasets demonstrate that our proposed model NSHE significantly outperforms the state-of-the-art methods.
  2. In this paper, we propose a novel representation learning framework, namely HIN2Vec, for heterogeneous information networks (HINs). The core of the proposed framework is a neural network model, also called HIN2Vec, designed to capture the rich semantics embedded in HINs by exploiting different types of relationships among nodes. Given a set of relationships specified in the form of meta-paths in an HIN, HIN2Vec carries out multiple prediction training tasks jointly based on a target set of relationships to learn latent vectors of nodes and meta-paths in the HIN. In addition to model design, several issues unique to HIN2Vec, including regularization of meta-path vectors, node type selection in negative sampling, and cycles in random walks, are examined. To validate our ideas, we learn latent vectors of nodes using four large-scale real HIN datasets, including Blogcatalog, Yelp, DBLP and U.S. Patents, and use them as features for multi-label node classification and link prediction applications on those networks. Empirical results show that HIN2Vec soundly outperforms the state-of-the-art representation learning models for network data, including DeepWalk, LINE, node2vec, PTE, HINE and ESim, by 6.6% to 23.8% of micro-F1 in multi-label node classification and 5% to 70.8% of MAP in link prediction.
  3. Heterogeneous Information Networks (HINs), where nodes and their attributes denote real-world entities and links encode relationships between entities, are ubiquitous in many applications. The presence of multiple types of nodes and links poses significant challenges to the state-of-the-art methods for learning node embeddings from heterogeneous graphs. To address these challenges, we consider three variants of graph variational autoencoder models for heterogeneous networks that avoid the computationally expensive sampling of meta-paths. The proposed methods also maintain uncertainty estimates of node embeddings that help improve generalization performance. We report the results of experiments on link prediction using three different real-world heterogeneous network benchmark data sets, which show that the proposed methods significantly outperform state-of-the-art baselines.
  4. Scientific literature, as one of the major knowledge resources, provides abundant textual evidence that has great potential to support high-quality scientific hypothesis validation. In this paper, we study the problem of textual evidence mining in scientific literature: given a scientific hypothesis as a query triplet, find the textual evidence sentences in scientific literature that support the input query. A critical challenge for textual evidence mining in scientific literature is to retrieve high-quality textual evidence without human supervision, because it is non-trivial to obtain a large set of human-annotated articles containing evidence sentences in scientific literature. To tackle this challenge, we propose EVIDENCEMINER, a high-quality textual evidence retrieval method for scientific literature without human-annotated training examples. To achieve high-quality textual evidence retrieval, we leverage heterogeneous information from both existing knowledge bases and massive unstructured text. We propose to construct a large heterogeneous information network (HIN) to build connections between the user-input queries and the candidate evidence sentences. Based on the constructed HIN, we propose a novel HIN embedding method that directly embeds the nodes onto a spherical space to improve the retrieval performance. Quantitative experiments on a huge biomedical literature corpus (over 4 million sentences) demonstrate that EVIDENCEMINER significantly outperforms baseline methods for unsupervised textual evidence retrieval. Case studies also demonstrate that our HIN construction and embedding greatly benefit many downstream applications such as textual evidence interpretation and synonym meta-pattern discovery.
  5. Online underground forums have been widely used by cybercriminals to trade illicit products, resources and services, and have played a central role in the cybercriminal ecosystem. Unfortunately, due to the number of forums, their size, and the expertise required, it is infeasible to explore them manually to understand their behavioral processes. In this paper, we propose a novel framework named iDetector to automate the analysis of underground forums for the detection of cybercrime-suspected threads. In iDetector, to detect whether given threads are cybercrime-suspected threads, we not only analyze the content of the threads, but also utilize the relations among threads, users, replies, and topics. To model these rich semantic relationships (i.e., thread-user, thread-reply, thread-topic, reply-user and reply-topic relations), we introduce a structured heterogeneous information network (HIN) for representation, which can be composed of different types of entities and relations. To capture the complex relationships (e.g., two threads are relevant if they were posted by the same user and discussed the same topic), we use a meta-structure based approach to characterize the semantic relatedness over threads. As different meta-structures depict the relatedness over threads from different views, we then build a classifier using Laplacian scores to aggregate the different similarities formulated by different meta-structures to make predictions. To the best of our knowledge, this is the first work to use a structural HIN to automate underground forum analysis. Comprehensive experiments on real data collections from underground forums (e.g., Hack Forums) are conducted to validate the effectiveness of our developed system iDetector in cybercrime-suspected thread detection through comparisons with alternative methods.