skip to main content


Title: COBRAC: a fast implementation of convex biclustering with compression
Abstract   Biclustering is a generalization of clustering used to identify simultaneous grouping patterns in observations (rows) and features (columns) of a data matrix. Recently, the biclustering task has been formulated as a convex optimization problem. While this convex recasting of the problem has attractive properties, existing algorithms do not scale well. To address this problem and make convex biclustering a practical tool for analyzing larger data, we propose an implementation of fast convex biclustering called COBRAC to reduce the computing time by iteratively compressing problem size along the solution path. We apply COBRAC to several gene expression datasets to demonstrate its effectiveness and efficiency. Besides the standalone version for COBRAC, we also developed a related online web server for online calculation and visualization of the downloadable interactive results. Availability The source code and test data are available at https://github.com/haidyi/cvxbiclustr or https://zenodo.org/record/4620218. The web server is available at https://cvxbiclustr.ericchi.com. Supplementary information Supplementary data are available at Bioinformatics online.  more » « less
Award ID(s):
1752692 2201136
PAR ID:
10252252
Author(s) / Creator(s):
; ; ;
Editor(s):
Mathelier, Anthony
Date Published:
Journal Name:
Bioinformatics
ISSN:
1367-4803
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Summary

    Although advances in untargeted metabolomics have made it possible to gather data on thousands of cellular metabolites in parallel, identification of novel metabolites from these datasets remains challenging. To address this need, Metabolic in silico Network Expansions (MINEs) were developed. A MINE is an expansion of known biochemistry which can be used as a list of potential structures for unannotated metabolomics peaks. Here, we present MINE 2.0, which utilizes a new set of biochemical transformation rules that covers 93% of MetaCyc reactions (compared to 25% in MINE 1.0). This results in a 17-fold increase in database size and a 40% increase in MINE database compounds matching unannotated peaks from an untargeted metabolomics dataset. MINE 2.0 is thus a significant improvement to this community resource.

    Availability and implementation

    The MINE 2.0 website can be accessed at https://minedatabase.ci.northwestern.edu. The MINE 2.0 web API documentation can be accessed at https://mine-api.readthedocs.io/en/latest/. The data and code underlying this article are available in the MINE-2.0-Paper repository at https://github.com/tyo-nu/MINE-2.0-Paper. MINE 2.0 source code can be accessed at https://github.com/tyo-nu/MINE-Database (MINE construction), https://github.com/tyo-nu/MINE-Server (backend web API) and https://github.com/tyo-nu/MINE-app (web app).

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  2. Abstract Motivation

    Growth phenotype profiling of genome-wide gene-deletion strains over stress conditions can offer a clear picture that the essentiality of genes depends on environmental conditions. Systematically identifying groups of genes from such high-throughput data that share similar patterns of conditional essentiality and dispensability under various environmental conditions can elucidate how genetic interactions of the growth phenotype are regulated in response to the environment.

    Results

    We first demonstrate that detecting such ‘co-fit’ gene groups can be cast as a less well-studied problem in biclustering, i.e. constant-column biclustering. Despite significant advances in biclustering techniques, very few were designed for mining in growth phenotype data. Here, we propose Gracob, a novel, efficient graph-based method that casts and solves the constant-column biclustering problem as a maximal clique finding problem in a multipartite graph. We compared Gracob with a large collection of widely used biclustering methods that cover different types of algorithms designed to detect different types of biclusters. Gracob showed superior performance on finding co-fit genes over all the existing methods on both a variety of synthetic data sets with a wide range of settings, and three real growth phenotype datasets for E. coli, proteobacteria and yeast.

    Availability and Implementation

    Our program is freely available for download at http://sfb.kaust.edu.sa/Pages/Software.aspx.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  3. Abstract Summary

    We present GWASpro, a high-performance web server for the analyses of large-scale genome-wide association studies (GWAS). GWASpro was developed to provide data analyses for large-scale molecular genetic data, coupled with complex replicated experimental designs such as found in plant science investigations and to overcome the steep learning curves of existing GWAS software tools. GWASpro supports building complex design matrices, by which complex experimental designs that may include replications, treatments, locations and times, can be accounted for in the linear mixed model. GWASpro is optimized to handle GWAS data that may consist of up to 10 million markers and 10 000 samples from replicable lines or hybrids. GWASpro provides an interface that significantly reduces the learning curve for new GWAS investigators.

    Availability and implementation

    GWASpro is freely available at https://bioinfo.noble.org/GWASPRO.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  4. Przytycka, Teresa (Ed.)
    Abstract Summary We present StochSS Live!, a web-based service for modeling, simulation and analysis of a wide range of mathematical, biological and biochemical systems. Using an epidemiological model of COVID-19, we demonstrate the power of StochSS Live! to enable researchers to quickly develop a deterministic or a discrete stochastic model, infer its parameters and analyze the results. Availability and implementation StochSS Live! is freely available at https://live.stochss.org/ Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  5. Abstract Motivation

    The biclustering of large-scale gene expression data holds promising potential for detecting condition-specific functional gene modules (i.e. biclusters). However, existing methods do not adequately address a comprehensive detection of all significant bicluster structures and have limited power when applied to expression data generated by RNA-Sequencing (RNA-Seq), especially single-cell RNA-Seq (scRNA-Seq) data, where massive zero and low expression values are observed.

    Results

    We present a new biclustering algorithm, QUalitative BIClustering algorithm Version 2 (QUBIC2), which is empowered by: (i) a novel left-truncated mixture of Gaussian model for an accurate assessment of multimodality in zero-enriched expression data, (ii) a fast and efficient dropouts-saving expansion strategy for functional gene modules optimization using information divergency and (iii) a rigorous statistical test for the significance of all the identified biclusters in any organism, including those without substantial functional annotations. QUBIC2 demonstrated considerably improved performance in detecting biclusters compared to other five widely used algorithms on various benchmark datasets from E.coli, Human and simulated data. QUBIC2 also showcased robust and superior performance on gene expression data generated by microarray, bulk RNA-Seq and scRNA-Seq.

    Availability and implementation

    The source code of QUBIC2 is freely available at https://github.com/OSU-BMBL/QUBIC2.

    Contact

    czhang87@iu.edu or qin.ma@osumc.edu

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less