Title: Gene regulation network inference using k-nearest neighbor-based mutual information estimation: revisiting an old DREAM
Abstract Background

A cell exhibits a variety of responses to internal and external cues. These responses are possible, in part, due to the presence of an elaborate gene regulatory network (GRN) in every single cell. In the past 20 years, many groups worked on reconstructing the topological structure of GRNs from large-scale gene expression data using a variety of inference algorithms. Insights gained about participating players in GRNs may ultimately lead to therapeutic benefits. Mutual information (MI) is a widely used metric within this inference/reconstruction pipeline as it can detect any correlation (linear and non-linear) between any number of variables (n-dimensions). However, the use of MI with continuous data (for example, normalized fluorescence intensity measurement of gene expression levels) is sensitive to data size, correlation strength and underlying distributions, and often requires laborious and, at times, ad hoc optimization.


In this work, we first show that estimating MI of a bi- and tri-variate Gaussian distribution using k-nearest neighbor (kNN) MI estimation results in significant error reduction as compared to commonly used methods based on fixed binning. Second, we demonstrate that implementing the MI-based kNN Kraskov–Stögbauer–Grassberger (KSG) algorithm leads to a significant improvement in GRN reconstruction for popular inference algorithms, such as Context Likelihood of Relatedness (CLR). Finally, through extensive in-silico benchmarking we show that a new inference algorithm CMIA (Conditional Mutual Information Augmentation), inspired by CLR, in combination with the KSG-MI estimator, outperforms commonly used methods.
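The comparison between kNN-based and fixed-binning MI estimation can be illustrated with a minimal sketch. It is not the authors' code. It assumes the first KSG estimator, I(X;Y) = ψ(k) + ψ(N) − ⟨ψ(n_x + 1) + ψ(n_y + 1)⟩, a plug-in histogram estimator with equal-width bins, and a bivariate Gaussian whose exact MI is −½ ln(1 − ρ²). The function names `ksg_mi` and `binned_mi` and the parameter choices (k = 3, 10 bins, ρ = 0.8, 2000 samples) are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma


def ksg_mi(x, y, k=3):
    """KSG (first estimator) value of I(X;Y) in nats for 1-D samples x, y."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    n = x.shape[0]
    joint = np.hstack([x, y])
    # Max-norm distance from each point to its k-th neighbor in the joint
    # space (k + 1 because the closest match returned is the point itself).
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
    # Count marginal neighbors strictly closer than eps, excluding the point.
    radius = np.nextafter(eps, 0.0)
    nx = cKDTree(x).query_ball_point(x, radius, p=np.inf, return_length=True) - 1
    ny = cKDTree(y).query_ball_point(y, radius, p=np.inf, return_length=True) - 1
    return float(digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1)))


def binned_mi(x, y, bins=10):
    """Plug-in MI estimate in nats from a 2-D histogram with equal-width bins."""
    pxy = np.histogram2d(x, y, bins=bins)[0]
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


# Illustrative test case: correlated bivariate Gaussian with known MI.
rng = np.random.default_rng(0)
rho, n_samples = 0.8, 2000
cov = [[1.0, rho], [rho, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples).T

true_mi = -0.5 * np.log(1.0 - rho**2)     # analytic value for a Gaussian pair
est_ksg = ksg_mi(x, y, k=3)
est_bin = binned_mi(x, y, bins=10)
print(f"analytic: {true_mi:.3f}  KSG: {est_ksg:.3f}  binned: {est_bin:.3f}")
```

The binned estimate depends on the number of bins and the sample size, which is the kind of tuning the abstract calls laborious. The KSG estimate has a single parameter k.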


Using three canonical datasets containing 15 synthetic networks, the newly developed method for GRN reconstruction, which combines CMIA and the KSG-MI estimator, achieves an improvement of 20–35% in precision-recall measures over the current gold standard in the field. This new method will enable researchers to discover new gene interactions or better choose gene candidates for experimental validations.
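The abstract does not say which precision-recall measure is reported. As a hedged illustration, the sketch below computes average precision, one common summary of the precision-recall curve, for a ranked list of predicted edges against a known reference network. The function name `average_precision`, the edge scores, and the labels are hypothetical and are not taken from the paper.

```python
import numpy as np


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each true edge."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    if not labels.any():
        raise ValueError("at least one true edge is required")
    # Rank candidate edges from highest to lowest score.
    order = np.argsort(-scores, kind="stable")
    hits = labels[order]
    # Precision after each rank = true edges so far / edges predicted so far.
    precision_at_rank = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision_at_rank[hits].mean())


# Hypothetical example: four candidate edges, two of them in the reference network.
edge_scores = [0.9, 0.8, 0.7, 0.6]
edge_is_true = [1, 0, 1, 0]
ap = average_precision(edge_scores, edge_is_true)
print(f"average precision: {ap:.4f}")
```

Here the true edges sit at ranks 1 and 3, so the precisions at those ranks are 1/1 and 2/3, and the average precision is their mean.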

Publisher / Repository:
Springer Science + Business Media
Journal Name:
BMC Bioinformatics
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    Gene regulatory networks (GRNs) in a cell provide the tight feedback needed to synchronize cell actions. However, genes in a cell also take input from, and provide signals to other neighboring cells. These cell–cell interactions (CCIs) and the GRNs deeply influence each other. Many computational methods have been developed for GRN inference in cells. More recently, methods were proposed to infer CCIs using single cell gene expression data with or without cell spatial location information. However, in reality, the two processes do not exist in isolation and are subject to spatial constraints. Despite this rationale, no methods currently exist to infer GRNs and CCIs using the same model.


    We propose CLARIFY, a tool that takes GRNs as input, uses them and spatially resolved gene expression data to infer CCIs, while simultaneously outputting refined cell-specific GRNs. CLARIFY uses a novel multi-level graph autoencoder, which mimics cellular networks at a higher level and cell-specific GRNs at a deeper level. We applied CLARIFY to two real spatial transcriptomic datasets, one using seqFISH and the other using MERFISH, and also tested on simulated datasets from scMultiSim. We compared the quality of predicted GRNs and CCIs with state-of-the-art baseline methods that inferred either only GRNs or only CCIs. The results show that CLARIFY consistently outperforms the baseline in terms of commonly used evaluation metrics. Our results point to the importance of co-inference of CCIs and GRNs and to the use of layered graph neural networks as an inference tool for biological networks.

    Availability and implementation

    The source code and data is available at

  2. Abstract Motivation

    Gene regulatory networks (GRNs) of the same organism can be different under different conditions, although the overall network structure may be similar. Understanding the difference in GRNs under different conditions is important to understand condition-specific gene regulation. When gene expression and other relevant data under two different conditions are available, they can be used by an existing network inference algorithm to estimate two GRNs separately, and then to identify the difference between the two GRNs. However, such an approach does not exploit the similarity in two GRNs, and may sacrifice inference accuracy.


    In this paper, we model GRNs with the structural equation model (SEM) that can integrate gene expression and genetic perturbation data, and develop an algorithm named fused sparse SEM (FSSEM), to jointly infer GRNs under two conditions, and then to identify difference of the two GRNs. Computer simulations demonstrate that the FSSEM algorithm outperforms the approaches that estimate two GRNs separately. Analysis of a dataset of lung cancer and another dataset of gastric cancer with FSSEM inferred differential GRNs in cancer versus normal tissues, whose genes with largest network degrees have been reported to be implicated in tumorigenesis. The FSSEM algorithm provides a valuable tool for joint inference of two GRNs and identification of the differential GRN under two conditions.

    Availability and implementation

    The R package fssemR implementing the FSSEM algorithm is available at It is also available on CRAN.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  3. Abstract Motivation

    Cell function is regulated by gene regulatory networks (GRNs) defined by protein-mediated interaction between constituent genes. Despite advances in experimental techniques, we can still measure only a fraction of the processes that govern GRN dynamics. To infer the properties of GRNs using partial observation, unobserved sequential processes can be replaced with distributed time delays, yielding non-Markovian models. Inference methods based on the resulting model suffer from the curse of dimensionality.


    We develop a simulation-based Bayesian MCMC method employing an approximate likelihood for the efficient and accurate inference of GRN parameters when only some of their products are observed. We illustrate our approach using a two-step activation model: an activation signal leads to the accumulation of an unobserved regulatory protein, which triggers the expression of observed fluorescent proteins. With prior information about observed fluorescent protein synthesis, our method successfully infers the dynamics of the unobserved regulatory protein. We can estimate the delay and kinetic parameters characterizing target regulation including transcription, translation, and target searching of an unobserved protein from experimental measurements of the products of its target gene. Our method is scalable and can be used to analyze non-Markovian models with hidden components.

    Availability and implementation

    Our code is implemented in R and is freely available with a simple example data at

  4. Summary

    Predicting gene regulatory networks (GRNs) from expression profiles is a common approach for identifying important biological regulators. Despite the increased use of inference methods, existing computational approaches often do not integrate RNA‐sequencing data analysis, are not automated or are restricted to users with bioinformatics backgrounds. To address these limitations, we developed tuxnet, a user‐friendly platform that can process raw RNA‐sequencing data from any organism with an existing reference genome using a modified tuxedo pipeline (hisat 2 + cufflinks package) and infer GRNs from these processed data. tuxnet is implemented as a graphical user interface and can mine gene regulations, either by applying a dynamic Bayesian network (DBN) inference algorithm, genist, or a regression tree‐based pipeline, rtp‐star. We obtained time‐course expression data of a PERIANTHIA (PAN) inducible line and inferred a GRN using genist to illustrate the use of tuxnet while gaining insight into the regulations downstream of the Arabidopsis root stem cell regulator PAN. Using rtp‐star, we inferred the network of ATHB13, a downstream gene of PAN, for which we obtained wild‐type and mutant expression profiles. Additionally, we generated two networks using temporal data from developmental leaf data and spatial data from root cell‐type data to highlight the use of tuxnet to form new testable hypotheses from previously explored data. Our case studies feature the versatility of tuxnet when using different types of gene expression data to infer networks and its accessibility as a pipeline for non‐bioinformaticians to analyze transcriptome data, predict causal regulations, assess network topology and identify key regulators.

  5. Inferring gene regulatory networks (GRNs) from single-cell gene expression datasets is a challenging task. Existing methods are often designed heuristically for specific datasets and lack the flexibility to incorporate additional information or compare against other algorithms. Further, current GRN inference methods do not provide uncertainty estimates with respect to the interactions that they predict, making inferred networks challenging to interpret. To overcome these challenges, we introduce Probabilistic Matrix Factorization for Gene Regulatory Network inference (PMF-GRN). PMF-GRN uses single-cell gene expression data to learn latent factors representing transcription factor activity as well as regulatory relationships between transcription factors and their target genes. This approach incorporates available experimental evidence into prior distributions over latent factors and scales well to single-cell gene expression datasets. By utilizing variational inference, we facilitate hyperparameter search for principled model selection and direct comparison to other generative models. To assess the accuracy of our method, we evaluate PMF-GRN using the model organisms Saccharomyces cerevisiae and Bacillus subtilis, benchmarking against database-derived gold standard interactions. We discover that, on average, PMF-GRN infers GRNs more accurately than current state-of-the-art single-cell GRN inference methods. Moreover, our PMF-GRN approach offers well-calibrated uncertainty estimates, as it performs gene regulatory network (GRN) inference in a probabilistic setting. These estimates are valuable for validation purposes, particularly when validated interactions are limited or a gold standard is incomplete. 