skip to main content


Title: Hypothesis Testing for Automated Community Detection in Networks
Summary

Community detection in networks is a key exploratory tool with applications in a diverse set of areas, ranging from finding communities in social and biological networks to identifying link farms in the World Wide Web. The problem of finding communities or clusters in a network has received much attention from statistics, physics and computer science. However, most clustering algorithms assume knowledge of the number of clusters k. We propose to determine k automatically in a graph generated from a stochastic block model by using a hypothesis test of independent interest. Our main contribution is twofold; first, we theoretically establish the limiting distribution of the principal eigenvalue of the suitably centred and scaled adjacency matrix and use that distribution for our test of the hypothesis that a random graph is of Erdős–Rényi (noise) type. Secondly, we use this test to design a recursive bipartitioning algorithm, which naturally uncovers nested community structure. Using simulations and quantifiable classification tasks on real world networks with ground truth, we show that our algorithm outperforms state of the art methods.

 
more » « less
NSF-PAR ID:
10397454
Author(s) / Creator(s):
;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Volume:
78
Issue:
1
ISSN:
1369-7412
Page Range / eLocation ID:
p. 253-273
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Label Propagation is not only a well-known machine learning algorithm for classification, but it is also an effective method for discovering communities and connected components in networks. We propose a new Direction-Optimizing Label Propagation Algorithm (DOLPA) framework that enhances the performance of the standard Label Propagation Algorithm (LPA), increases its scalability, and extends its versatility and application scope. As a central feature, the DOLPA framework relies on the use of frontiers and alternates between label push and label pull operations to attain high performance. It is formulated in such a way that the same basic algorithm can be used for finding communities or connected components in graphs by only changing the objective function used. Additionally, DOLPA has parameters for tuning the processing order of vertices in a graph to reduce the number of edges visited and improve the quality of solution obtained. We present the design and implementation of the enhanced algorithm as well as our shared-memory parallelization of it using OpenMP. We also present an extensive experimental evaluation of our implementations using the LFR benchmark and real-world networks drawn from various domains. Compared with an implementation of LPA for community detection available in a widely used network analysis software, we achieve at most five times the F-Score while maintaining similar runtime for graphs with overlapping communities. We also compare DOLPA against an implementation of the Louvain method for community detection using the same LFR-graphs and show that DOLPA achieves about three times the F-Score at just 10% of the runtime. For connected component decomposition, our algorithm achieves orders of magnitude speedups over the basic LP-based algorithm on large diameter graphs, up to 13.2 × speedup over the Shiloach-Vishkin algorithm, and up to 1.6 × speedup over Afforest on an Intel Xeon processor using 40 threads. 
    more » « less
  2. Finding the network biomarkers of cancers and the analysis of cancer driving genes that are involved in these biomarkers are essential for understanding the dynamics of cancer. Clusters of genes in co-expression networks are commonly known as functional units. This work is based on the hypothesis that the dense clusters or communities in the gene co-expression networks of cancer patients may represent functional units regarding cancer initiation and progression. In this study, RNA-seq gene expression data of three cancers - Breast Invasive Carcinoma (BRCA), Colorectal Adenocarcinoma (COAD) and Glioblastoma Multiforme (GBM) - from The Cancer Genome Atlas (TCGA) are used to construct gene co-expression networks using Pearson Correlation. Six well-known community detection algorithms are applied on these networks to identify communities with five or more genes. A permutation test is performed to further mine the communities that are conserved in other cancers, thus calling them conserved communities. Then survival analysis is performed on clinical data of three cancers using the conserved community genes as prognostic co-variates. The communities that could distinguish the cancer patients between high- and low-risk groups are considered as cancer biomarkers. In the present study, 16 such network biomarkers are discovered. 
    more » « less
  3. Graphs/Networks are common in real-world applications where data have rich content and complex relationships. The increasing popularity also motivates many network learning algorithms, such as community detection, clustering, classification, and embedding learning, etc.. In reality, the large network volumes often hider a direct use of learning algorithms to the graphs. As a result, it is desirable to have the flexibility to condense a network to an arbitrary size, with well-preserved network topology and node content information. In this paper, we propose a graph compression network (GEN) to achieve network compression and embedding at the same time. Our theme is to leverage the network topology to find node mappings, such that densely connected nodes, including their node content, are compressed as a new node, with a latent vector (i.e. embedding) being learned to represent the compressed node. In addition to compression learning, we also develop a novel encoding-decoding framework, using feature diffusion process, to "decompress" the condensed network. Different from traditional graph convolution which uses direct-neighbor message passing, our decompression advocates high-order message passing within compressed nodes to learning feature representation for all nodes in the network. A unique strength of GEN is that it leverages the graph neural network principle to learn mapping automatically, so one can compress a network to an arbitrary size, and also decompress it to the original node space with minimum information loss. Experiments and comparisons confirm that GEN can automatically find clusters and communities, and compress them as new nodes. Results also show that GEN achieves improved performance for numerous tasks, including graph classification and node clustering. 
    more » « less
  4. Abstract Background

    Microbes play vital roles across coral reefs both in the environment and inside and upon macrobes (holobionts), where they support critical functions such as nutrition and immune system modulation. These roles highlight the potential ecosystem-level importance of microbes, yet most knowledge of microbial functions on reefs is derived from a small set of holobionts such as corals and sponges. Declining seawater pH — an important global coral reef stressor — can cause ecosystem-level change on coral reefs, providing an opportunity to study the role of microbes at this scale. We use an in situ experimental approach to test the hypothesis that under such ocean acidification (OA), known shifts among macrobe trophic and functional groups may drive a general ecosystem-level response extending across macrobes and microbes, leading to reduced distinctness between the benthic holobiont community microbiome and the environmental microbiome.

    Results

    We test this hypothesis using genetic and chemical data from benthic coral reef community holobionts sampled across a pH gradient from CO2seeps in Papua New Guinea. We find support for our hypothesis; under OA, the microbiome and metabolome of the benthic holobiont community become less compositionally distinct from the sediment microbiome and metabolome, suggesting that benthic macrobe communities are colonised by environmental microbes to a higher degree under OA conditions. We also find a simplification and homogenisation of the benthic photosynthetic community, and an increased abundance of fleshy macroalgae, consistent with previously observed reef microbialisation.

    Conclusions

    We demonstrate a novel structural shift in coral reefs involving macrobes and microbes: that the microbiome of the benthic holobiont community becomes less distinct from the sediment microbiome under OA. Our findings suggest that microbialisation and the disruption of macrobe trophic networks are interwoven general responses to environmental stress, pointing towards a universal, undesirable, and measurable form of ecosystem change.

     
    more » « less
  5. We study a new connection between a technical measure called $\mu$-conductance that arises in the study of Markov chains for sampling convex bodies and the network community profile that characterizes size-resolved properties of clusters and communities in social and information networks. The idea of $\mu$-conductance is similar to the traditional graph conductance, but disregards sets with small volume. We derive a sequence of optimization problems including a low-rank semi-definite program from which we can derive a lower bound on the optimal $\mu$-conductance value. These ideas give the first theoretically sound bound on the behavior of the network community profile for a wide range of cluster sizes. The algorithm scales up to graphs with hundreds of thousands of nodes and we demonstrate how our framework validates the predicted structures of real-world graphs. 
    more » « less