skip to main content


Title: TaxoGen: Unsupervised Topic Taxonomy Construction by Adaptive Term Embedding and Clustering
Taxonomy construction is not only a fundamental task for semantic analysis of text corpora, but also an important step for applications such as information filtering, recommendation, and Web search. Existing pattern-based methods extract hypernym-hyponym term pairs and then organize these pairs into a taxonomy. However, by considering each term as an independent concept node, they overlook the topical proximity and the semantic correlations among terms. In this paper, we propose a method for constructing topic taxonomies, wherein every node represents a conceptual topic and is defined as a cluster of semantically coherent concept terms. Our method, TaxoGen, uses term embeddings and hierarchical clustering to construct a topic taxonomy in a recursive fashion. To ensure the quality of the recursive process, it consists of: (1) an adaptive spherical clustering module for allocating terms to proper levels when splitting a coarse topic into fine-grained ones; (2) a local embedding module for learning term embeddings that maintain strong discriminative power at different levels of the taxonomy. Our experiments on two real datasets demonstrate the effectiveness of TaxoGen compared with baseline methods.  more » « less
Award ID(s):
1741317 1704532 1618481
NSF-PAR ID:
10079174
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018
Volume:
2018
Issue:
1
Page Range / eLocation ID:
2701 to 2709
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The automated construction of topic taxonomies can benefit numerous applications, including web search, recommendation, and knowledge discovery. One of the major advantages of automatic taxonomy construction is the ability to capture corpus-specific information and adapt to different scenarios. To better reflect the characteristics of a corpus, we take the meta-data of documents into consideration and view the corpus as a text-rich network. In this paper, we propose NetTaxo, a novel automatic topic taxonomy construction framework, which goes beyond the existing paradigm and allows text data to collaborate with network structure. Specifically, we learn term embeddings from both text and network as contexts. Network motifs are adopted to capture appropriate network contexts. We conduct an instance-level selection for motifs, which further refines term embedding according to the granularity and semantics of each taxonomy node. Clustering is then applied to obtain sub-topics under a taxonomy node. Extensive experiments on two real-world datasets demonstrate the superiority of our method over the state-of-the-art, and further verify the effectiveness and importance of instance-level motif selection. 
    more » « less
  2. Abstract Motivation Graph embedding learning that aims to automatically learn low-dimensional node representations, has drawn increasing attention in recent years. To date, most recent graph embedding methods are evaluated on social and information networks and are not comprehensively studied on biomedical networks under systematic experiments and analyses. On the other hand, for a variety of biomedical network analysis tasks, traditional techniques such as matrix factorization (which can be seen as a type of graph embedding methods) have shown promising results, and hence there is a need to systematically evaluate the more recent graph embedding methods (e.g. random walk-based and neural network-based) in terms of their usability and potential to further the state-of-the-art. Results We select 11 representative graph embedding methods and conduct a systematic comparison on 3 important biomedical link prediction tasks: drug-disease association (DDA) prediction, drug–drug interaction (DDI) prediction, protein–protein interaction (PPI) prediction; and 2 node classification tasks: medical term semantic type classification, protein function prediction. Our experimental results demonstrate that the recent graph embedding methods achieve promising results and deserve more attention in the future biomedical graph analysis. Compared with three state-of-the-art methods for DDAs, DDIs and protein function predictions, the recent graph embedding methods achieve competitive performance without using any biological features and the learned embeddings can be treated as complementary representations for the biological features. By summarizing the experimental results, we provide general guidelines for properly selecting graph embedding methods and setting their hyper-parameters for different biomedical tasks. Availability and implementation As part of our contributions in the paper, we develop an easy-to-use Python package with detailed instructions, BioNEV, available at: https://github.com/xiangyue9607/BioNEV, including all source code and datasets, to facilitate studying various graph embedding methods on biomedical tasks. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  3. null (Ed.)
    Network representation learning (NRL) is crucial in the area of graph learning. Recently, graph autoencoders and its variants have gained much attention and popularity among various types of node embedding approaches. Most existing graph autoencoder-based methods aim to minimize the reconstruction errors of the input network while not explicitly considering the semantic relatedness between nodes. In this paper, we propose a novel network embedding method which models the consistency across different views of networks. More specifically, we create a second view from the input network which captures the relation between nodes based on node content and enforce the latent representations from the two views to be consistent by incorporating a multiview adversarial regularization module. The experimental studies on benchmark datasets prove the effectiveness of this method, and demonstrate that our method compares favorably with the state-of-the-art algorithms on challenging tasks such as link prediction and node clustering. We also evaluate our method on a real-world application, i.e., 30-day unplanned ICU readmission prediction, and achieve promising results compared with several baseline methods. 
    more » « less
  4. Graphs are powerful representations for relations among objects, which have attracted plenty of attention in both academia and industry. A fundamental challenge for graph learning is how to train an effective Graph Neural Network (GNN) encoder without labels, which are expensive and time consuming to obtain. Contrastive Learning (CL) is one of the most popular paradigms to address this challenge, which trains GNNs by discriminating positive and negative node pairs. Despite the success of recent CL methods, there are still two under-explored problems. Firstly, how to reduce the semantic error introduced by random topology based data augmentations. Traditional CL defines positive and negative node pairs via the node-level topological proximity, which is solely based on the graph topology regardless of the semantic information of node attributes, and thus some semantically similar nodes could be wrongly treated as negative pairs. Secondly, how to effectively model the multiplexity of the real-world graphs, where nodes are connected by various relations and each relation could form a homogeneous graph layer. To solve these problems, we propose a novel multiplex heterogeneous graph prototypical contrastive leaning (X-GOAL) framework to extract node embeddings. X-GOAL is comprised of two components: the GOAL framework, which learns node embeddings for each homogeneous graph layer, and an alignment regularization, which jointly models different layers by aligning layer-specific node embeddings. Specifically, the GOAL framework captures the node-level information by a succinct graph transformation technique, and captures the cluster-level information by pulling nodes within the same semantic cluster closer in the embedding space. The alignment regularization aligns embeddings across layers at both node level and cluster level. We evaluate the proposed X-GOAL on a variety of real-world datasets and downstream tasks to demonstrate the effectiveness of the X-GOAL framework. 
    more » « less
  5. Networked data involve complex information from multifaceted channels, including topology structures, node content, and/or node labels etc., where structure and content are often correlated but are not always consistent. A typical scenario is the citation relationships in scholarly publications where a paper is cited by others not because they have the same content, but because they share one or multiple subject matters. To date, while many network embedding methods exist to take the node content into consideration, they all consider node content as simple flat word/attribute set and nodes sharing connections are assumed to have dependency with respect to all words or attributes. In this paper, we argue that considering topic-level semantic interactions between nodes is crucial to learn discriminative node embedding vectors. In order to model pairwise topic relevance between linked text nodes, we propose topical network embedding, where interactions between nodes are built on the shared latent topics. Accordingly, we propose a unified optimization framework to simultaneously learn topic and node representations from the network text contents and structures, respectively. Meanwhile, the structure modeling takes the learned topic representations as conditional context under the principle that two nodes can infer each other contingent on the shared latent topics. Experiments on three real-world datasets demonstrate that our approach can learn significantly better network representations, i.e., 4.1% improvement over the state-of-the-art methods in terms of Micro-F1 on Cora dataset. (The source code of the proposed method is available through the github link: https:// github.com/codeshareabc/TopicalNE.) 
    more » « less