skip to main content


Title: Joint Grid Discretization for Biological Pattern Discovery
The complexity, dynamics, and scale of data acquired by modern biotechnology increasingly favor model-free computational methods that make minimal assumptions about underlying biological mechanisms. For example, single-cell transcriptome and proteome data have a throughput several orders more than bulk methods. Many model-free statistical methods for pattern discovery such as mutual information and chi-squared tests, however, require discrete data. Most discretization methods minimize squared errors for each variable independently, not necessarily retaining joint patterns. To address this issue, we present a joint grid discretization algorithm that preserves clusters in the original data. We evaluated this algorithm on simulated data to show its advantage over other methods in maintaining clusters as measured by the adjusted Rand index. We also show it promotes global functional patterns over independent patterns. On single-cell proteome and transcriptome of leukemia and healthy blood, joint grid discretization captured known protein-to-RNA regulatory relationships, while revealing previously unknown interactions. As such, the joint grid discretization is applicable as a data transformation step in associative, functional, and causal inference of molecular interactions fundamental to systems biology. The developed software is publicly available at https://cran.r-project.org/package=GridOnClusters  more » « less
Award ID(s):
1661331
NSF-PAR ID:
10236467
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings. The 11th ACM Int'l Conf on Bioinform, Comput Biol and Health Inform
Page Range / eLocation ID:
Article No.: 57
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Inference of a combinatorial function from multiple independent variables (parents) to a dependent variable (child) in a discrete space can be useful in detecting nonlinear relationships in biological systems. Popular conditional independency measures, heavily used in combinatorial inference, are often insensitive to the direction of functional dependency. To address this issue, we define multivariate and conditional functional chi-squared statistics. We also present an algorithm called CFDF for bivariate discrete function inference via an exclusive-effect strategy, in order to identify a best parent set for a given child. It requires each parent to make sufficient contribution beyond any marginal effect. Simulation studies suggest a marked advantage of our framework over alternatives. Applying the method to transcriptome data in genetically perturbed biological systems, we reproduced combinatorial gene interactions known in the literature. Most importantly, we identified combinatorial patterns from joint RNA and protein data to rebut a dispute on the founding principle of molecular biology. 
    more » « less
  2. Kelso, Janet (Ed.)
    Abstract Motivation Genetic or epigenetic events can rewire molecular networks to induce extraordinary phenotypical divergences. Among the many network rewiring approaches, no model-free statistical methods can differentiate gene-gene pattern changes not attributed to marginal changes. This may obscure fundamental rewiring from superficial changes. Results Here we introduce a model-free Sharma-Song test to determine if patterns differ in the second order, meaning that the deviation of the joint distribution from the product of marginal distributions is unequal across conditions. We prove an asymptotic chi-squared null distribution for the test statistic. Simulation studies demonstrate its advantage over alternative methods in detecting second-order differential patterns. Applying the test on three independent mammalian developmental transcriptome datasets, we report a lower frequency of co-expression network rewiring between human and mouse for the same tissue group than the frequency of rewiring between tissue groups within the same species. We also find secondorder differential patterns between microRNA promoters and genes contrasting cerebellum and liver development in mice. These patterns are enriched in the spliceosome pathway regulating tissue specificity. Complementary to previous mammalian comparative studies mostly driven by first-order effects, our findings contribute an understanding of system-wide second-order gene network rewiring within and across mammalian systems. Second-order differential patterns constitute evidence for fundamentally rewired biological circuitry due to evolution, environment, or disease. Availability The generic Sharma-Song test is available from the R package ‘DiffXTables’ at https://cran.r-project.org/package=DiffXTables. Other code and data are described in Methods. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  3. Abstract Motivation

    The analysis of spatially resolved transcriptome enables the understanding of the spatial interactions between the cellular environment and transcriptional regulation. In particular, the characterization of the gene–gene co-expression at distinct spatial locations or cell types in the tissue enables delineation of spatial co-regulatory patterns as opposed to standard differential single gene analyses. To enhance the ability and potential of spatial transcriptomics technologies to drive biological discovery, we develop a statistical framework to detect gene co-expression patterns in a spatially structured tissue consisting of different clusters in the form of cell classes or tissue domains.

    Results

    We develop SpaceX (spatially dependent gene co-expression network), a Bayesian methodology to identify both shared and cluster-specific co-expression network across genes. SpaceX uses an over-dispersed spatial Poisson model coupled with a high-dimensional factor model which is based on a dimension reduction technique for computational efficiency. We show via simulations, accuracy gains in co-expression network estimation and structure by accounting for (increasing) spatial correlation and appropriate noise distributions. In-depth analysis of two spatial transcriptomics datasets in mouse hypothalamus and human breast cancer using SpaceX, detected multiple hub genes which are related to cognitive abilities for the hypothalamus data and multiple cancer genes (e.g. collagen family) from the tumor region for the breast cancer data.

    Availability and implementation

    The SpaceX R-package is available at github.com/bayesrx/SpaceX.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  4. Abstract Motivation

    Computer inference of biological mechanisms is increasingly approachable due to dynamically rich data sources such as single-cell genomics. Inferred molecular interactions can prioritize hypotheses for wet-lab experiments to expedite biological discovery. However, complex data often come with unwanted biological or technical variations, exposing biases over marginal distribution and sample size in current methods to favor spurious causal relationships.

    Results

    Considering function direction and strength as evidence for causality, we present an adapted functional chi-squared test (AdpFunChisq) that rewards functional patterns over non-functional or independent patterns. On synthetic and three biology datasets, we demonstrate the advantages of AdpFunChisq over 10 methods on overcoming biases that give rise to wide fluctuations in the performance of alternative approaches. On single-cell multiomics data of multiple phenotype acute leukemia, we found that the T-cell surface glycoprotein CD3 delta chain may causally mediate specific genes in the viral carcinogenesis pathway. Using the causality-by-functionality principle, AdpFunChisq offers a viable option for robust causal inference in dynamical systems.

    Availability and implementation

    The AdpFunChisq test is implemented in the R package ‘FunChisq’ (2.5.2 or above) at https://cran.r-project.org/package=FunChisq. All other source code along with pre-processed data is available at Code Ocean https://doi.org/10.24433/CO.2907738.v1

    Supplementary information

    Supplementary materials are available at Bioinformatics online.

     
    more » « less
  5. SUMMARY

    A fast algorithm for the large-scale joint inversion of gravity and magnetic data is developed. The algorithm uses a non-linear Gramian constraint to impose correlation between the density and susceptibility of the reconstructed models. The global objective function is formulated in the space of the weighted parameters, but the Gramian constraint is implemented in the original space, and the non-linear constraint is imposed using two separate Lagrange parameters, one for each model domain. It is significant that this combined approach, using the two spaces provides more similarity between the reconstructed models. Moreover, it is shown theoretically that the gradient for the use of the unweighted space is not a scalar multiple of that used for the weighted space, and hence cannot be accounted for by adjusting the Lagrange parameters. It is assumed that the measured data are obtained on a uniform grid and that a consistent regular discretization of the volume domain is imposed. Then, the sensitivity matrices exhibit a block-Toeplitz-Toeplitz-block structure for each depth layer of the model domain, and both forward and transpose operations with the matrices can be implemented efficiently using two dimensional fast Fourier transforms. This makes it feasible to solve for large scale problems with respect to both computational costs and memory demands, and to solve the non-linear problem by applying iterative methods that rely only on matrix–vector multiplications. As such, the use of the regularized reweighted conjugate gradient algorithm, in conjunction with the structure of the sensitivity matrices, leads to a fast methodology for large-scale joint inversion of geophysical data sets. Numerical simulations demonstrate that it is possible to apply a non-linear joint inversion algorithm, with Lp-norm stabilisers, for the reconstruction of large model domains on a standard laptop computer. It is demonstrated, that while the p = 1 choice provides sparse reconstructed solutions with sharp boundaries, it is also possible to use p = 2 in order to provide smooth and blurred models. The methodology is used for inverting gravity and magnetic data obtained over an area in northwest of Mesoproterozoic St Francois Terrane, southeast of Missouri, USA.

     
    more » « less