Title: Network global testing by counting graphlets
Consider a large social network with possibly severe degree heterogeneity and mixed memberships. We are interested in testing whether the network has only one community or more than one. The problem is known to be non-trivial, partly because of the severe degree heterogeneity. We construct a class of test statistics using the numbers of short paths and short cycles; the key to our approach is a general framework for canceling the effects of degree heterogeneity. The tests compare favorably with existing methods. We support our methods with careful analysis and a numerical study using simulated data and a real data example.
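As a rough illustration of the raw ingredients mentioned above, the sketch below counts two small graphlets in an undirected graph: 2-edge paths and 3-cycles (triangles). It is not the authors' test statistic and does not include their correction for degree heterogeneity; the function name `count_graphlets` and the edge-list input format are assumptions made only for this example.

```python
from itertools import combinations


def count_graphlets(edges):
    """Count 2-edge paths and triangles in a simple undirected graph.

    `edges` is an iterable of (u, v) pairs; this input format is an
    assumption for illustration, not something taken from the paper.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # A 2-edge path is a centre node plus an unordered pair of its
    # neighbours. This count includes paths whose end points are also
    # adjacent (i.e. paths that lie inside a triangle).
    paths2 = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())

    # Every triangle is found once from each of its three vertices,
    # as a pair of neighbours that are themselves adjacent.
    closed = sum(
        1
        for nb in adj.values()
        for a, b in combinations(nb, 2)
        if b in adj[a]
    )
    triangles = closed // 3
    return paths2, triangles
```

For a triangle on nodes 0, 1, 2 with one extra edge (2, 3), `count_graphlets([(0, 1), (1, 2), (0, 2), (2, 3)])` returns `(5, 1)`: five 2-edge paths and one triangle.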
Award ID(s):
1925845 1712958
NSF-PAR ID:
10289815
Journal Name:
Proceedings of Machine Learning Research
Volume:
80
ISSN:
2640-3498
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A network may have weak signals and severe degree heterogeneity, and may be very sparse in one occurrence but very dense in another. SCORE (Ann. Statist. 43, 57–89, 2015) is a recent approach to network community detection. It accommodates severe degree heterogeneity and is adaptive to different levels of sparsity, but its performance for networks with weak signals is unclear. In this paper, we show that in a broad class of network settings where we allow for weak signals, severe degree heterogeneity, and a wide range of network sparsity, SCORE achieves perfect clustering and has the so-called “exponential rate” in Hamming clustering errors. The proof uses recent advances in entry-wise bounds for the leading eigenvectors of the network adjacency matrix. The theoretical analysis assures us that SCORE continues to work well in weak-signal settings, but it does not rule out the possibility that SCORE may be further improved to perform better in real applications, especially for networks with weak signals. As a second contribution of the paper, we propose SCORE+ as an improved version of SCORE. We investigate SCORE+ with 8 network data sets and find that it outperforms several representative approaches. In particular, for the 6 data sets with relatively strong signals, SCORE+ performs similarly to SCORE, but for the 2 data sets (Simmons, Caltech) with possibly weak signals, SCORE+ has much lower error rates. SCORE+ makes several changes to SCORE. We carefully explain the rationale underlying each of these changes, using a mixture of theoretical and numerical study.
  2. SCORE was introduced as a spectral approach to network community detection. Since many networks have severe degree heterogeneity, the ordinary spectral clustering (OSC) approach to community detection may perform unsatisfactorily. SCORE alleviates the effect of degree heterogeneity by introducing a new normalization idea in the spectral domain and makes OSC more effective. SCORE is easy to use and computationally fast. It adapts easily to new directions and is seeing increasing interest in practice. In this paper, we review the basics of SCORE, the adaptation of SCORE to network mixed membership estimation and topic modeling, and the application of SCORE to real data, including two datasets on the publications of statisticians. We also review the theoretical “ideology” underlying SCORE. We show that in the spectral domain, SCORE converts a simplicial cone to a simplex and provides a simple and direct link between the simplex and network memberships. SCORE attains an exponential rate and a sharp phase transition in community detection, and achieves optimal rates in mixed membership estimation and topic modeling.
  3. Kovács, Ákos T. (Ed.)
    ABSTRACT In Bacillus subtilis, the master regulator Spo0A controls several cell-differentiation pathways. Under moderate starvation, phosphorylated Spo0A (Spo0A~P) induces biofilm formation by indirectly activating genes controlling matrix production in a subpopulation of cells via a SinI-SinR-SlrR network. Under severe starvation, Spo0A~P induces sporulation by directly and indirectly regulating sporulation gene expression. However, what determines the heterogeneity of individual cell fates is not fully understood. In particular, it is still unclear why, despite being controlled by a single master regulator, biofilm matrix production and sporulation seem mutually exclusive at the single-cell level. In this work, using mathematical modeling, we show that fluctuations in the growth rate and the intrinsic noise amplified by the bistability in the SinI-SinR-SlrR network could explain the single-cell distribution of matrix production. Moreover, we predict an incoherent feed-forward loop: the decrease in the cellular growth rate first activates matrix production by increasing the Spo0A phosphorylation level but then represses it by changing the relative concentrations of SinR and SlrR. Experimental data provide evidence to support the model predictions. In particular, we demonstrate how the degree to which matrix production and sporulation appear mutually exclusive is affected by genetic perturbations.
    IMPORTANCE The mechanisms of cell-fate decisions are fundamental to our understanding of multicellular organisms and bacterial communities. However, even for the best-studied model systems we still lack a complete picture of how phenotypic heterogeneity of genetically identical cells is controlled. Here, using B. subtilis as a model system, we employ a combination of mathematical modeling and experiments to explain the population-level dynamics and single-cell-level heterogeneity of matrix gene expression. The results demonstrate how the two cell fates, biofilm matrix production and sporulation, can appear mutually exclusive without explicitly inhibiting one another. Such a mechanism could be used in a wide range of other biological systems.
  4. Federated learning (FL) is a promising strategy for performing privacy-preserving, distributed learning with a network of clients (i.e., edge devices). However, the data distribution among clients is often non-IID in nature, making efficient optimization difficult. To alleviate this issue, many FL algorithms focus on mitigating the effects of data heterogeneity across clients by introducing a variety of proximal terms, some incurring considerable compute and/or memory overheads, to restrain local updates with respect to the global model. Instead, we consider rethinking solutions to data heterogeneity in FL with a focus on local learning generality rather than proximal restriction. To this end, we first present a systematic study informed by second-order indicators to better understand algorithm effectiveness in FL. Interestingly, we find that standard regularization methods are surprisingly strong performers in mitigating data heterogeneity effects. Based on our findings, we further propose a simple and effective method, FedAlign, to overcome data heterogeneity and the pitfalls of previous methods. FedAlign achieves competitive accuracy with state-of-the-art FL methods across a variety of settings while minimizing computation and memory overhead. Code is available at https://github.com/mmendiet/FedAlign. 
  5. Abstract

    Background

    Targeted surveillance allows public health authorities to implement testing and isolation strategies when diagnostic resources are limited, and can be implemented via the consideration of social network topologies. However, it remains unclear how to implement such surveillance and control when network data are unavailable.

    Methods

    We evaluated the ability of sociodemographic proxies of degree centrality to guide prioritized testing of infected individuals compared to known degree centrality. Proxies were estimated via readily available sociodemographic variables (age, gender, marital status, educational attainment, household size). We simulated severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) epidemics via a susceptible-exposed-infected-recovered individual-based model on 2 contact networks from rural Madagascar to test applicability of these findings to low-resource contexts.

    Results

    Targeted testing using sociodemographic proxies performed similarly to targeted testing using known degree centralities. At low testing capacity, using proxies reduced infection burden by 22%–33% while using 20% fewer tests, compared to random testing. By comparison, using known degree centrality reduced the infection burden by 31%–44% while using 26%–29% fewer tests.

    Conclusions

    We demonstrate that incorporating social network information into epidemic control strategies is an effective countermeasure to low testing capacity and can be implemented via sociodemographic proxies when social network data are unavailable.
