
Title: Statistical inference of discrete combinatorial functional dependency in biological systems
Inference of a combinatorial function from multiple independent variables (parents) to a dependent variable (child) in a discrete space can be useful in detecting nonlinear relationships in biological systems. Popular conditional independence measures, heavily used in combinatorial inference, are often insensitive to the direction of functional dependency. To address this issue, we define multivariate and conditional functional chi-squared statistics. We also present an algorithm called CFDF for bivariate discrete function inference via an exclusive-effect strategy, in order to identify a best parent set for a given child. It requires each parent to make a sufficient contribution beyond any marginal effect. Simulation studies suggest a marked advantage of our framework over alternatives. Applying the method to transcriptome data in genetically perturbed biological systems, we reproduced combinatorial gene interactions known in the literature. Most importantly, we identified combinatorial patterns from joint RNA and protein data to rebut a dispute on the founding principle of molecular biology.
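The direction sensitivity described in the abstract can be illustrated with a minimal sketch. It does not implement the multivariate or conditional statistics of this paper, nor the CFDF algorithm, none of which are specified on this page. It assumes the single-parent functional chi-squared statistic as commonly given in the FunChisq literature, and a function index obtained by normalizing that statistic by n(s - 1), where s is the number of child levels. The function names and the toy table are hypothetical.

```python
import math
from typing import List, Sequence

Table = Sequence[Sequence[float]]


def pearson_chisq(table: Table) -> float:
    """Pearson chi-squared; identical for a table and its transpose."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for row, r in zip(table, row_sums):
        for value, c in zip(row, col_sums):
            expected = r * c / n
            if expected > 0:
                stat += (value - expected) ** 2 / expected
    return stat


def functional_chisq(table: Table) -> float:
    """Assumed single-parent form: rows are parent levels, columns are child levels."""
    n_cols = len(table[0])
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    # Deviation of each row from a uniform child distribution within that row.
    for row, r in zip(table, row_sums):
        if r == 0:
            continue  # an unobserved parent level carries no information
        expected = r / n_cols
        stat += sum((value - expected) ** 2 for value in row) / expected
    # Subtract the marginal non-uniformity of the child.
    expected_col = n / n_cols
    stat -= sum((c - expected_col) ** 2 for c in col_sums) / expected_col
    return stat


def function_index(table: Table) -> float:
    """Assumed normalization: sqrt(statistic / (n * (s - 1))), in [0, 1]."""
    n = sum(sum(row) for row in table)
    n_cols = len(table[0])
    return math.sqrt(max(functional_chisq(table), 0.0) / (n * (n_cols - 1)))


def transpose(table: Table) -> List[List[float]]:
    """Swap the roles of parent and child."""
    return [list(col) for col in zip(*table)]


# Toy counts: the child Y is a non-monotone function of the parent X
# (X=0 -> Y=0, X=1 -> Y=1, X=2 -> Y=0), but X is not a function of Y.
x_to_y = [[10, 0], [0, 10], [10, 0]]
y_to_x = transpose(x_to_y)

forward = function_index(x_to_y)
reverse = function_index(y_to_x)
```

On this toy table the Pearson statistic is the same in both directions, while the assumed function index is higher for X to Y (about 0.94) than for Y to X (about 0.71). The raw functional statistics are not compared directly here because the two directions have different numbers of child levels.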
Award ID(s):
1661331
NSF-PAR ID:
10168084
Journal Name:
Proceedings of the 14th Machine Learning in Computational Biology (MLCB) Meeting
Sponsoring Org:
National Science Foundation
More Like this
  1. As severe dropout in single-cell RNA sequencing (scRNA-seq) degrades data quality, current methods for network inference face increased uncertainty from such data. To examine how dropout influences directional dependency inference from scRNA-seq data, we studied four methods based on discrete data that are model-free, making no parametric model assumptions. They include two established methods, conditional entropy and the Kruskal-Wallis test, and two recent methods, causal inference by stochastic complexity and function index. We also included three non-directional methods for contrast. On simulated data, function index performed most favorably at varying dropout rates, sample sizes, and discrete levels. On an scRNA-seq dataset from developing mouse cerebella, function index and the Kruskal-Wallis test performed favorably over other methods in detecting expression of developmental genes as a function of time. Overall, among the four methods, function index is most resistant to dropout for both directional and dependency inference. The next best choice, the Kruskal-Wallis test, carries a directional bias towards a uniformly distributed variable. We conclude that a method robust to marginal distributions, given a sufficiently large sample size, can reap the benefits of single-cell over bulk RNA sequencing in understanding molecular mechanisms at cellular resolution.
  2. Network embedding aims at transferring node proximity in networks into distributed vectors, which can be leveraged in various downstream applications. Recent research has shown that nodes in a network can often be organized in latent hierarchical structures, but without a particular underlying taxonomy, the learned node embedding is less useful and less interpretable. In this work, we aim to improve network embedding by modeling the conditional node proximity in networks indicated by node labels residing in real taxonomies. At the same time, we also aim to model the hierarchical label proximity in the given taxonomies, which is too coarse when judged solely from the hierarchical topologies. To this end, we propose TAXOGAN to co-embed network nodes and hierarchical labels through a hierarchical network generation process. In particular, TAXOGAN models the child labels and network nodes of each parent label in an individual embedding space while learning to transfer network proximity among the spaces of hierarchical labels through stacked network generators and embedding encoders. To enable robust and efficient model inference, we further develop a hierarchical adversarial training process. Comprehensive experiments and case studies on four real-world datasets of networks with hierarchical labels demonstrate the utility of TAXOGAN in improving network embedding on traditional tasks of node classification and link prediction, as well as novel tasks like conditional proximity search and fine-grained taxonomy layout.
  3. Abstract
    The COVID-19 pandemic has dramatically altered family life in the United States. Over the long duration of the pandemic, parents had to adapt to shifting work conditions, virtual schooling, the closure of daycare facilities, and the stress of not only managing households without domestic and care supports but also worrying that family members may contract the novel coronavirus. Reports early in the pandemic suggest that these burdens have fallen disproportionately on mothers, creating concerns about the long-term implications of the pandemic for gender inequality and mothers’ well-being. Nevertheless, less is known about how parents’ engagement in domestic labor and paid work has changed throughout the pandemic, what factors may be driving these changes, and what the long-term consequences of the pandemic may be for the gendered division of labor and gender inequality more generally.

    The Study on U.S. Parents’ Divisions of Labor During COVID-19 (SPDLC) collects longitudinal survey data from partnered U.S. parents that can be used to assess changes in parents’ divisions of domestic labor, divisions of paid labor, and well-being throughout and after the COVID-19 pandemic. The goal of SPDLC is to understand both the short- and long-term impacts of the pandemic for the gendered…
  4. Growing economic disparities and the increased sorting of families into economically segregated communities have heightened the need to clearly delineate pathways through which family income promotes children’s development. Combining hypotheses from investment and stress theories, we developed and tested a multi-context and cross-domain conceptual model assessing how community and family contexts mediate links between family income and children’s cognitive and behavioral skills at kindergarten entry. We drew data on family income, parenting processes, and child functioning from the Early Childhood Longitudinal Study–Birth Cohort (ECLS-B; N ≈ 10,650), following children from infancy through age 5. We used Geographic Information Systems technology to create and validate community measures using administrative data from the Economic Census, Decennial Census, National Center for Education Statistics, Federal Bureau of Investigation, and Environmental Protection Agency, which were then linked to each child in the ECLS-B. Using structural equation modeling, our analyses revealed three primary lessons. First, lower-income children have limited access to community educational and cultural resources and heightened exposure to community stressors including concentrated disadvantage and violent crime. Second, these community features are associated with parenting processes, such that parent-child interactions tend to be less stimulating and supportive and more punitive in communities with fewer resources and heightened stressors. And third, community and family contexts together mediate connections between family income and children’s cognitive and behavioral functioning. Results, albeit showing small effect sizes, provide a more complex, multi-contextual view than prior research, delineating the role of both resources and stressors at community and family levels in explaining income disparities in young children’s developmental success.
  5. The complexity, dynamics, and scale of data acquired by modern biotechnology increasingly favor model-free computational methods that make minimal assumptions about underlying biological mechanisms. For example, single-cell transcriptome and proteome data have a throughput several orders of magnitude higher than that of bulk methods. Many model-free statistical methods for pattern discovery, such as mutual information and chi-squared tests, however, require discrete data. Most discretization methods minimize squared errors for each variable independently, not necessarily retaining joint patterns. To address this issue, we present a joint grid discretization algorithm that preserves clusters in the original data. We evaluated this algorithm on simulated data to show its advantage over other methods in maintaining clusters, as measured by the adjusted Rand index. We also show that it promotes global functional patterns over independent patterns. On single-cell proteome and transcriptome data from leukemia and healthy blood, joint grid discretization captured known protein-to-RNA regulatory relationships while revealing previously unknown interactions. As such, joint grid discretization is applicable as a data transformation step in associative, functional, and causal inference of molecular interactions fundamental to systems biology. The developed software is publicly available at https://cran.r-project.org/package=GridOnClusters