Title: Statistical inference of discrete combinatorial functional dependency in biological systems
Inference of a combinatorial function from multiple independent variables (parents) to a dependent variable (child) in a discrete space can be useful in detecting nonlinear relationships in biological systems. Popular conditional independency measures, heavily used in combinatorial inference, are often insensitive to the direction of functional dependency. To address this issue, we define multivariate and conditional functional chi-squared statistics. We also present an algorithm called CFDF for bivariate discrete function inference via an exclusive-effect strategy, in order to identify a best parent set for a given child. It requires each parent to make a sufficient contribution beyond any marginal effect. Simulation studies suggest a marked advantage of our framework over alternatives. Applying the method to transcriptome data in genetically perturbed biological systems, we reproduced combinatorial gene interactions known in the literature. Most importantly, we identified combinatorial patterns from joint RNA and protein data to rebut a dispute on the founding principle of molecular biology.
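To make the two ideas in the abstract concrete (a child determined only by a combination of parents, and measures that cannot tell the direction of a functional dependency), here is a minimal Python sketch. It does not implement the functional chi-squared statistics or the CFDF algorithm, which this record does not specify. As stand-ins it assumes mutual information as an example of a symmetric dependency measure and conditional entropy as an example of a direction-sensitive one. The XOR toy data and all function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of a sequence."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(child, parent):
    """H(child | parent) = H(parent, child) - H(parent)."""
    return entropy(list(zip(parent, child))) - entropy(parent)


def mutual_information(a, b):
    """I(a; b) = H(a) + H(b) - H(a, b); symmetric in its two arguments."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


# Hypothetical toy data: the child is the XOR of two binary parents,
# so neither parent alone carries any marginal information about it.
x1 = [0, 0, 1, 1] * 5
x2 = [0, 1, 0, 1] * 5
y = [a ^ b for a, b in zip(x1, x2)]
parents = list(zip(x1, x2))  # joint parent variable

# Symmetric measure: the same value whichever variable is called the child.
mi_forward = mutual_information(parents, y)
mi_backward = mutual_information(y, parents)

# Direction-sensitive measure: zero only when the child is a function
# of the parents, positive in the reverse direction.
h_child_given_parents = conditional_entropy(y, parents)
h_parents_given_child = conditional_entropy(parents, y)

# A single parent shows no marginal association with the XOR child.
mi_single_parent = mutual_information(x1, y)

print(mi_forward, mi_backward)
print(h_child_given_parents, h_parents_given_child)
print(mi_single_parent)
```

In this toy case the mutual information is identical in both directions, while the conditional entropy is zero only from parents to child. The single-parent mutual information is zero, which is why a purely marginal screen would miss a combinatorial dependency of this kind.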
Award ID(s):
1661331
NSF-PAR ID:
10168084
Journal Name:
Proceedings of the 14th Machine Learning in Computational Biology (MLCB) Meeting
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Because the average treatment effect (ATE) measures the change in social welfare, even if positive, there is a risk of negative effect on, say, some 10% of the population. Assessing such risk is difficult, however, because any one individual treatment effect (ITE) is never observed, so the 10% worst-affected cannot be identified, whereas distributional treatment effects only compare the first deciles within each treatment group, which does not correspond to any 10% subpopulation. In this paper, we consider how to nonetheless assess this important risk measure, formalized as the conditional value at risk (CVaR) of the ITE distribution. We leverage the availability of pretreatment covariates and characterize the tightest possible upper and lower bounds on ITE-CVaR given by the covariate-conditional average treatment effect (CATE) function. We then proceed to study how to estimate these bounds efficiently from data and construct confidence intervals. This is challenging even in randomized experiments as it requires understanding the distribution of the unknown CATE function, which can be very complex if we use rich covariates to best control for heterogeneity. We develop a debiasing method that overcomes this and prove it enjoys favorable statistical properties even when CATE and other nuisances are estimated by black box machine learning or even inconsistently. Studying a hypothetical change to French job search counseling services, our bounds and inference demonstrate a small social benefit entails a negative impact on a substantial subpopulation. This paper was accepted by J. George Shanthikumar, data science. Funding: This work was supported by the Division of Information and Intelligent Systems [Grant 1939704]. Supplemental Material: The data files and online appendices are available at https://doi.org/10.1287/mnsc.2023.4819 . 
  2. Motivation

    High-throughput sequencing technologies, in particular RNA sequencing (RNA-seq), have become the basic practice for genomic studies in biomedical research. In addition to studying genes individually, for example, through differential expression analysis, investigating co-ordinated expression variations of genes may help reveal the underlying cellular mechanisms to derive better understanding and more effective prognosis and intervention strategies. Although there exists a variety of co-expression network based methods to analyze microarray data for this purpose, instead of blindly extending these methods for microarray data that may introduce unnecessary bias, it is crucial to develop methods well adapted to RNA-seq data to identify the functional modules of genes with similar expression patterns.

    Results

    We have developed a fully Bayesian covariate-dependent negative binomial factor analysis (dNBFA) method for RNA-seq count data, to capture coordinated gene expression changes, while considering effects from covariates reflecting different influencing factors. Unlike existing co-expression network based methods, our proposed model does not require multiple ad-hoc choices on data processing, transformation, as well as co-expression measures and can be directly applied to RNA-seq data. Furthermore, being capable of incorporating covariate information, the proposed method can tackle setups with complex confounding factors in different experiment designs. Finally, the natural model parameterization removes the need for a normalization preprocessing step, as commonly adopted to compensate for the effect of sequencing-depth variations. Efficient Bayesian inference of model parameters is derived by exploiting conditional conjugacy via novel data augmentation techniques. Experimental results on several real-world RNA-seq datasets on complex diseases suggest dNBFA as a powerful tool for discovering the gene modules with significant differential expression and meaningful biological insight.

    Availability and implementation

    dNBFA is implemented in R language and is available at https://github.com/siamakz/dNBFA.

  3. As severe dropout in single-cell RNA sequencing (scRNA-seq) degrades data quality, current methods for network inference face increased uncertainty from such data. To examine how dropout influences directional dependency inference from scRNA-seq data, we studied four model-free methods based on discrete data that make no parametric model assumptions. They include two established methods: conditional entropy and Kruskal-Wallis test, and two recent methods: causal inference by stochastic complexity and function index. We also included three non-directional methods for contrast. On simulated data, function index performed most favorably at varying dropout rates, sample sizes, and discrete levels. On an scRNA-seq dataset from developing mouse cerebella, function index and Kruskal-Wallis test performed favorably over other methods in detecting expression of developmental genes as a function of time. Overall among the four methods, function index is most resistant to dropout for both directional and dependency inference. The next best choice, Kruskal-Wallis test, carries a directional bias towards a uniformly distributed variable. We conclude that a method robust to marginal distributions with a sufficiently large sample size can reap benefits of single-cell over bulk RNA sequencing in understanding molecular mechanisms at the cellular resolution.
  4. Network embedding aims at transferring node proximity in networks into distributed vectors, which can be leveraged in various downstream applications. Recent research has shown that nodes in a network can often be organized in latent hierarchical structures, but without a particular underlying taxonomy, the learned node embedding is neither as useful nor as interpretable. In this work, we aim to improve network embedding by modeling the conditional node proximity in networks indicated by node labels residing in real taxonomies. In the meantime, we also aim to model the hierarchical label proximity in the given taxonomies, which is too coarse by solely looking at the hierarchical topologies. To this end, we propose TAXOGAN to co-embed network nodes and hierarchical labels, through a hierarchical network generation process. Particularly, TAXOGAN models the child labels and network nodes of each parent label in an individual embedding space while learning to transfer network proximity among the spaces of hierarchical labels through stacked network generators and embedding encoders. To enable robust and efficient model inference, we further develop a hierarchical adversarial training process. Comprehensive experiments and case studies on four real-world datasets of networks with hierarchical labels demonstrate the utility of TAXOGAN in improving network embedding on traditional tasks of node classification and link prediction, as well as novel tasks like conditional proximity search and fine-grained taxonomy layout.
  5. Background

    Research to date has largely conceptualized irritability in terms of intraindividual differences. However, the role of interpersonal dyadic processes has received little consideration. Nevertheless, difficulties in how parent–child dyads synchronize during interactions may be an important correlate of irritability in early childhood. Innovations in developmentally sensitive neuroimaging methods now enable the use of measures of neural synchrony to quantify synchronous responses in parent–child dyads and can help clarify the neural underpinnings of these difficulties. We introduce the Disruptive Behavior Diagnostic Observation Schedule: Biological Synchrony (DB‐DOS:BioSync) as a paradigm for exploring parent–child neural synchrony as a potential biological mechanism for interpersonal difficulties in preschool psychopathology.

    Methods

    Four‐ to 5‐year‐olds (N = 116) and their mothers completed the DB‐DOS:BioSync while neural synchrony was assessed with functional near‐infrared spectroscopy (fNIRS) during mild frustration and recovery. Child irritability was measured using a latent irritability factor that was calculated from four developmentally sensitive indicators.

    Results

    Both the mild frustration and the recovery contexts resulted in neural synchrony. However, less neural synchrony during the recovery context only was associated with more child irritability.

    Conclusions

    Our results suggest that recovering after a frustrating period might be particularly challenging for children high in irritability and offer support for the use of the DB‐DOS:BioSync task to elucidate interpersonal neural mechanisms of developmental psychopathology.
