Creators/Authors contains: "Kumar, Sajal"

  1. Countering prior beliefs that epistasis is rare, genomics advancements suggest the opposite. Current practice often filters out genomic loci with low variant counts before detecting epistasis. We argue that this practice is far from optimal because it can throw away strong epistatic patterns. Instead, we present the compensated Sharma–Song test to infer genetic epistasis in genome-wide association studies by differential departure from independence. The test does not require a minimum number of replicates for each variant. We also introduce algorithms to simulate epistatic patterns that differentially depart from independence. Using two simulators, the test performed comparably to the original Sharma–Song test when variant frequencies at a locus are marginally uniform; encouragingly, it has a marked advantage over alternatives when variant frequencies are marginally nonuniform. The test further revealed uniquely clean epistatic variants associated with chicken abdominal fat content that are not prioritized by other methods. Genes involved in the largest numbers of inferred epistatic interactions between single nucleotide polymorphisms (SNPs) belong to pathways known for obesity regulation; many top SNPs are located on chromosome 20 and in intergenic regions. Measuring differential departure from independence, the compensated Sharma–Song test offers a practical choice for studying epistasis robust to nonuniform genetic variant frequencies.
  2. Abstract Motivation

    Computer inference of biological mechanisms is increasingly approachable due to dynamically rich data sources such as single-cell genomics. Inferred molecular interactions can prioritize hypotheses for wet-lab experiments to expedite biological discovery. However, complex data often come with unwanted biological or technical variations, exposing biases over marginal distribution and sample size in current methods to favor spurious causal relationships.


    Considering function direction and strength as evidence for causality, we present an adapted functional chi-squared test (AdpFunChisq) that rewards functional patterns over non-functional or independent patterns. On synthetic and three biology datasets, we demonstrate the advantages of AdpFunChisq over 10 methods on overcoming biases that give rise to wide fluctuations in the performance of alternative approaches. On single-cell multiomics data of multiple phenotype acute leukemia, we found that the T-cell surface glycoprotein CD3 delta chain may causally mediate specific genes in the viral carcinogenesis pathway. Using the causality-by-functionality principle, AdpFunChisq offers a viable option for robust causal inference in dynamical systems.

    Availability and implementation

    The AdpFunChisq test is implemented in the R package ‘FunChisq’ (2.5.2 or above) at All other source code along with pre-processed data is available at Code Ocean

    Supplementary information

    Supplementary materials are available at Bioinformatics online.

  3. Abstract Motivation

    Genetic or epigenetic events can rewire molecular networks to induce extraordinary phenotypical divergences. Among the many network rewiring approaches, no model-free statistical methods can differentiate gene-gene pattern changes not attributed to marginal changes. This may obscure fundamental rewiring from superficial changes.

    Results

    Here we introduce a model-free Sharma-Song test to determine if patterns differ in the second order, meaning that the deviation of the joint distribution from the product of marginal distributions is unequal across conditions. We prove an asymptotic chi-squared null distribution for the test statistic. Simulation studies demonstrate its advantage over alternative methods in detecting second-order differential patterns. Applying the test on three independent mammalian developmental transcriptome datasets, we report a lower frequency of co-expression network rewiring between human and mouse for the same tissue group than the frequency of rewiring between tissue groups within the same species. We also find second-order differential patterns between microRNA promoters and genes contrasting cerebellum and liver development in mice. These patterns are enriched in the spliceosome pathway regulating tissue specificity. Complementary to previous mammalian comparative studies mostly driven by first-order effects, our findings contribute an understanding of system-wide second-order gene network rewiring within and across mammalian systems. Second-order differential patterns constitute evidence for fundamentally rewired biological circuitry due to evolution, environment, or disease.

    Availability

    The generic Sharma-Song test is available from the R package ‘DiffXTables’ at Other code and data are described in Methods.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
  4. The complexity, dynamics, and scale of data acquired by modern biotechnology increasingly favor model-free computational methods that make minimal assumptions about underlying biological mechanisms. For example, single-cell transcriptome and proteome data have a throughput several orders of magnitude higher than that of bulk methods. Many model-free statistical methods for pattern discovery, such as mutual information and chi-squared tests, however, require discrete data. Most discretization methods minimize squared errors for each variable independently, not necessarily retaining joint patterns. To address this issue, we present a joint grid discretization algorithm that preserves clusters in the original data. We evaluated this algorithm on simulated data to show its advantage over other methods in maintaining clusters as measured by the adjusted Rand index. We also show it promotes global functional patterns over independent patterns. On single-cell proteome and transcriptome of leukemia and healthy blood, joint grid discretization captured known protein-to-RNA regulatory relationships, while revealing previously unknown interactions. As such, the joint grid discretization is applicable as a data transformation step in associative, functional, and causal inference of molecular interactions fundamental to systems biology. The developed software is publicly available at
  5. Patterns of two molecules across biological systems are often labeled as conserved or differential. We argue that this classification is insufficient. Here, we introduce three types of relationships across systems. Upon stimuli, a type-0 pattern arises from conserved circuitry with active conserved trajectory; a type-1 pattern is conserved circuitry with active differential trajectory; a type-2 pattern is rewired circuitry with active trajectory. We present a 1st-order marginal change test, prove its optimality, and establish its asymptotic chi-squared distribution under the null hypothesis of identical marginals across conditions. The test outperformed other methods in detecting 1st-order difference in simulation studies. We also introduce a zeroth-order strength test to assess association of two variables across systems. We compared gene co-expression networks of planktonic microbial communities in cold California coastal water against the warm water of North Pacific Subtropical Gyre. The frequency of type-1 patterns is much higher than those of type-2 and type-0 patterns, revealing that the microbial communities are mostly conserved in molecular circuitry but responded differentially to ocean habitats. Type-1 and type-2 patterns are enriched with genes known to respond to environmental changes or stress; type-0 patterns involve genes having essential functions such as photosynthesis and general transcription. Our work provides a deep understanding of the effects of the environment on gene regulation in microbial communities. The method is generally applicable to other biological systems. All tests are provided in the R package 'DiffXTables' at Other source code and lists of significant gene patterns are available at
  6. Inference of a combinatorial function from multiple independent variables (parents) to a dependent variable (child) in a discrete space can be useful in detecting nonlinear relationships in biological systems. Popular conditional independence measures, heavily used in combinatorial inference, are often insensitive to the direction of functional dependency. To address this issue, we define multivariate and conditional functional chi-squared statistics. We also present an algorithm called CFDF for bivariate discrete function inference via an exclusive-effect strategy, in order to identify a best parent set for a given child. It requires each parent to make a sufficient contribution beyond any marginal effect. Simulation studies suggest a marked advantage of our framework over alternatives. Applying the method to transcriptome data in genetically perturbed biological systems, we reproduced combinatorial gene interactions known in the literature. Most importantly, we identified combinatorial patterns from joint RNA and protein data to rebut a dispute on the founding principle of molecular biology.
  7. As severe dropout in single-cell RNA sequencing (scRNA-seq) degrades data quality, current methods for network inference face increased uncertainty from such data. To examine how dropout influences directional dependency inference from scRNA-seq data, we studied four model-free methods that operate on discrete data without parametric model assumptions. They include two established methods: conditional entropy and Kruskal-Wallis test, and two recent methods: causal inference by stochastic complexity and function index. We also included three non-directional methods for contrast. On simulated data, function index performed most favorably at varying dropout rates, sample sizes, and discrete levels. On an scRNA-seq dataset from developing mouse cerebella, function index and Kruskal-Wallis test performed favorably over other methods in detecting expression of developmental genes as a function of time. Overall, among the four methods, function index is most resistant to dropout for both directional and dependency inference. The next best choice, Kruskal-Wallis test, carries a directional bias towards a uniformly distributed variable. We conclude that a method robust to marginal distributions with a sufficiently large sample size can reap benefits of single-cell over bulk RNA sequencing in understanding molecular mechanisms at the cellular resolution.
  8. RNA-binding proteins (RBPs) participate in all stages of the RNA life cycle, from transcription and splicing to translation. Under the ENCODE project, a large number of RBPs were knocked down in human cancer cell lines, offering an excellent opportunity to infer targets of RBPs. Taking both RBP binding sites and RNA-seq profiles of RBP knockdown samples as input, we present a pipeline to identify causal RBP RNA interactions. The pipeline employs a recent functional chi-square test (FunChisq) that deciphers directional association, and utilizes a novel functional index that measures the effect size of functional dependency. We examined ∼45 million RBP RNA pairs in leukemia (K562) and liver cancer (HepG2) cell lines for functional patterns as causal interaction candidates. Here, we report a total of 936,707 RBP RNA pairs in the two cell lines that show statistically significant linear or nonlinear functional patterns. About 31% of these pairs have supportive biological evidence from other sources, suggesting the effectiveness of the pipeline. The interactions constitute RBP specific regulatory networks that may potentially represent core mechanisms in the two cancers. The pipeline is implemented through an R interface with pre-computed results and data libraries for users to query specific networks and visualize RBP RNA interactions. Such networks serve as a useful resource for studying RNA dysregulation in cancer.
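
Several abstracts above (items 2 and 8 in particular) rely on a functional chi-squared statistic that is sensitive to the direction of a dependency. As a rough illustration only, the Python sketch below computes the commonly cited form of that statistic for a contingency table whose rows index X and whose columns index Y, testing whether Y is a function of X. The function name `fun_chisq_statistic` and the toy table are hypothetical, and the formula and degrees of freedom are assumptions drawn from general knowledge of the original functional chi-squared test; this is not the 'FunChisq' R package code and not the adapted AdpFunChisq variant.

```python
def fun_chisq_statistic(table):
    """Functional chi-squared statistic for Y = f(X) (illustrative sketch).

    table[i][j] is the number of samples with X = i and Y = j.
    Returns (statistic, degrees_of_freedom).
    """
    n_rows = len(table)
    n_cols = len(table[0])
    row_sums = [sum(row) for row in table]
    col_sums = [sum(row[j] for row in table) for j in range(n_cols)]
    n = sum(row_sums)
    if n == 0:
        raise ValueError("table must contain at least one observation")

    # Row term: departure of each row from a uniform spread over Y levels.
    stat = 0.0
    for row, row_sum in zip(table, row_sums):
        if row_sum == 0:
            continue  # an unobserved X level carries no information
        expected = row_sum / n_cols
        stat += sum((count - expected) ** 2 for count in row) / expected

    # Column term: subtract the departure of Y's marginal from uniformity.
    expected_col = n / n_cols
    stat -= sum((c - expected_col) ** 2 for c in col_sums) / expected_col

    # Assumed degrees of freedom, as for a Pearson chi-squared test.
    df = (n_rows - 1) * (n_cols - 1)
    return stat, df


# Toy table: Y is a many-to-one function of X (X = 0 and X = 2 both map to Y = 0),
# so X cannot be recovered as a function of Y.
table = [
    [5, 0, 0],
    [0, 5, 0],
    [5, 0, 0],
]
transposed = [list(col) for col in zip(*table)]

forward_stat, forward_df = fun_chisq_statistic(table)       # tests Y = f(X)
reverse_stat, reverse_df = fun_chisq_statistic(transposed)  # tests X = f(Y)
print(forward_stat, reverse_stat)
```

On this toy table the statistic is larger in the X-to-Y direction than in the reverse direction, which is the kind of asymmetry a symmetric measure such as the Pearson chi-squared statistic would not show. Turning the statistic into a p-value or an effect-size index requires additional steps not sketched here.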