skip to main content

Title: Biclustering by sparse canonical correlation analysis

Developing appropriate computational tools to distill biological insights from large‐scale gene expression data has been an important part of systems biology. Considering that gene relationships may change or only exist in a subset of collected samples, biclustering that involves clustering both genes and samples has become in‐creasingly important, especially when the samples are pooled from a wide range of experimental conditions.


In this paper, we introduce a new biclustering algorithm to find subsets of genomic expression features (EFs) (e.g., genes, isoforms, exon inclusion) that show strong “group interactions” under certain subsets of samples. Group interactions are defined by strong partial correlations, or equivalently, conditional dependencies between EFs after removing the influences of a set of other functionally related EFs. Our new biclustering method, named SCCA‐BC, extends an existing method for group interaction inference, which is based on sparse canonical correlation analysis (SCCA) coupled with repeated random partitioning of the gene expression data set.


SCCA‐BC gives sensible results on real data sets and outperforms most existing methods in simulations. Software is available at‐bc.


SCCA‐BC seems to work in numerous conditions and the results seem promising for future extensions. SCCA‐BC has the ability to find different types of bicluster patterns, and it is especially advantageous in identifying a bicluster whose elements share the same progressive and multivariate normal distribution with a dense covariance matrix.

more » « less
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Quantitative Biology
Medium: X Size: p. 56-67
["p. 56-67"]
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    The biclustering of large-scale gene expression data holds promising potential for detecting condition-specific functional gene modules (i.e. biclusters). However, existing methods do not adequately address a comprehensive detection of all significant bicluster structures and have limited power when applied to expression data generated by RNA-Sequencing (RNA-Seq), especially single-cell RNA-Seq (scRNA-Seq) data, where massive zero and low expression values are observed.


    We present a new biclustering algorithm, QUalitative BIClustering algorithm Version 2 (QUBIC2), which is empowered by: (i) a novel left-truncated mixture of Gaussian model for an accurate assessment of multimodality in zero-enriched expression data, (ii) a fast and efficient dropouts-saving expansion strategy for functional gene modules optimization using information divergency and (iii) a rigorous statistical test for the significance of all the identified biclusters in any organism, including those without substantial functional annotations. QUBIC2 demonstrated considerably improved performance in detecting biclusters compared to other five widely used algorithms on various benchmark datasets from E.coli, Human and simulated data. QUBIC2 also showcased robust and superior performance on gene expression data generated by microarray, bulk RNA-Seq and scRNA-Seq.

    Availability and implementation

    The source code of QUBIC2 is freely available at

    Contact or

    Supplementary information

    Supplementary data are available at Bioinformatics online.

    more » « less
  2. Cowen, Lenore (Ed.)
    Abstract Motivation Biclustering has emerged as a powerful approach to identifying functional patterns in complex biological data. However, existing tools are limited by their accuracy and efficiency to recognize various kinds of complex biclusters submerged in ever large datasets. We introduce a novel fast and highly accurate algorithm RecBic to identify various forms of complex biclusters in gene expression datasets. Results We designed RecBic to identify various trend-preserving biclusters, particularly, those with narrow shapes, i.e. clusters where the number of genes is larger than the number of conditions/samples. Given a gene expression matrix, RecBic starts with a column seed, and grows it into a full-sized bicluster by simply repetitively comparing real numbers. When tested on simulated datasets in which the elements of implanted trend-preserving biclusters and those of the background matrix have the same distribution, RecBic was able to identify the implanted biclusters in a nearly perfect manner, outperforming all the compared salient tools in terms of accuracy and robustness to noise and overlaps between the clusters. Moreover, RecBic also showed superiority in identifying functionally related genes in real gene expression datasets. Availability and implementation Code, sample input data and usage instructions are available at the following websites. Code: Data: Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  3. Abstract Motivation

    Metagenomic and metatranscriptomic analyses can provide an abundance of information related to microbial communities. However, straightforward analysis of this data does not provide optimal results, with a required integration of data types being needed to thoroughly investigate these microbiomes and their environmental interactions.


    Here, we present MetaQUBIC, an integrated biclustering-based computational pipeline for gene module detection that integrates both metagenomic and metatranscriptomic data. Additionally, we used this pipeline to investigate 735 paired DNA and RNA human gut microbiome samples, resulting in a comprehensive hybrid gene expression matrix of 2.3 million cross-species genes in the 735 human fecal samples and 155 functional enriched gene modules. We believe both the MetaQUBIC pipeline and the generated comprehensive human gut hybrid expression matrix will facilitate further investigations into multiple levels of microbiome studies.

    Availability and implementation

    The package is freely available at

    Supplementary information

    Supplementary data are available at Bioinformatics online.

    more » « less
  4. Abstract Background

    The eukaryotic genome is capable of producing multiple isoforms from a gene by alternative polyadenylation (APA) during pre-mRNA processing. APA in the 3′-untranslated region (3′-UTR) of mRNA produces transcripts with shorter or longer 3′-UTR. Often, 3′-UTR serves as a binding platform for microRNAs and RNA-binding proteins, which affect the fate of the mRNA transcript. Thus, 3′-UTR APA is known to modulate translation and provides a mean to regulate gene expression at the post-transcriptional level. Current bioinformatics pipelines have limited capability in profiling 3′-UTR APA events due to incomplete annotations and a low-resolution analyzing power: widely available bioinformatics pipelines do not reference actionable polyadenylation (cleavage) sites but simulate 3′-UTR APA only using RNA-seq read coverage, causing false positive identifications. To overcome these limitations, we developed APA-Scan, a robust program that identifies 3′-UTR APA events and visualizes the RNA-seq short-read coverage with gene annotations.


    APA-Scan utilizes either predicted or experimentally validated actionable polyadenylation signals as a reference for polyadenylation sites and calculates the quantity of long and short 3′-UTR transcripts in the RNA-seq data. APA-Scan works in three major steps: (i) calculate the read coverage of the 3′-UTR regions of genes; (ii) identify the potential APA sites and evaluate the significance of the events among two biological conditions; (iii) graphical representation of user specific event with 3′-UTR annotation and read coverage on the 3′-UTR regions. APA-Scan is implemented in Python3. Source code and a comprehensive user’s manual are freely available at


    APA-Scan was applied to both simulated and real RNA-seq datasets and compared with two widely used baselines DaPars and APAtrap. In simulation APA-Scan significantly improved the accuracy of 3′-UTR APA identification compared to the other baselines. The performance of APA-Scan was also validated by 3′-end-seq data and qPCR on mouse embryonic fibroblast cells. The experiments confirm that APA-Scan can detect unannotated 3′-UTR APA events and improve genome annotation.


    APA-Scan is a comprehensive computational pipeline to detect transcriptome-wide 3′-UTR APA events. The pipeline integrates both RNA-seq and 3′-end-seq data information and can efficiently identify the significant events with a high-resolution short reads coverage plots.

    more » « less
  5. Summary Open Research Badges

    This article has earned an Open Data Badge for making publicly available the digitally‐shareable data necessary to reproduce the reported results. The data is available at;

    more » « less