Title: scHiCDiff: detecting differential chromatin interactions in single-cell Hi-C data
Abstract Summary

Here, we present scHiCDiff, a software tool that provides both nonparametric tests and parametric models to detect differential chromatin interactions (DCIs) from single-cell Hi-C data. We thoroughly evaluated the scHiCDiff methods on both simulated and real data. Our results demonstrate that scHiCDiff, especially with the zero-inflated negative binomial model option, can effectively detect reliable and consistent single-cell DCIs between two conditions, thereby facilitating the study of cell type-specific variations of chromatin structure at the single-cell level.
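
As an illustration of the zero-inflated negative binomial (ZINB) idea named above, the following minimal Python sketch writes out the distribution for a single bin pair: a structural zero occurs with probability `pi`, and otherwise the contact count follows a negative binomial with mean `mu` and dispersion `theta`. The parameterization, function names, and example counts are assumptions made for illustration only; this is not the scHiCDiff code, which is implemented in R.

```python
import math


def nb_log_pmf(k, mu, theta):
    """Log-probability of count k under a negative binomial
    with mean mu > 0 and dispersion (size) theta > 0."""
    return (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )


def zinb_pmf(k, pi, mu, theta):
    """P(K = k) when a structural zero occurs with probability pi,
    and otherwise K follows NB(mu, theta)."""
    nb = math.exp(nb_log_pmf(k, mu, theta))
    if k == 0:
        # Zeros come from two sources: structural zeros and NB zeros.
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb


def zinb_log_likelihood(counts, pi, mu, theta):
    """Log-likelihood of observed counts (one per cell) for one bin pair."""
    return sum(math.log(zinb_pmf(k, pi, mu, theta)) for k in counts)


if __name__ == "__main__":
    # Hypothetical contact counts at one bin pair across cells of one condition.
    counts = [0, 0, 3, 0, 1, 0, 5, 2, 0, 0]
    print(zinb_log_likelihood(counts, pi=0.4, mu=2.0, theta=1.5))
```

In a two-condition comparison, one could fit such a model separately and jointly to the two groups of cells and compare the fits, for example with a likelihood ratio test; whether scHiCDiff does exactly this is not stated in the abstract.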

Availability and implementation

scHiCDiff is implemented in R and freely available at GitHub (https://github.com/wmalab/scHiCDiff).

 
NSF-PAR ID: 10470742
Publisher / Repository: Oxford University Press
Journal Name: Bioinformatics
Volume: 39
Issue: 10
ISSN: 1367-4811
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract Motivation

    The three-dimensional organization of chromosomes within the cell nucleus is highly regulated. CCCTC-binding factor (CTCF) is known to be an important architectural protein that mediates long-range chromatin loops. Recent studies have shown that the majority of CTCF binding motif pairs at chromatin loop anchor regions are in convergent orientation. However, it remains unknown whether the genomic context at the sequence level can determine if a convergent CTCF motif pair is able to form a chromatin loop.

    Results

    In this article, we directly ask whether, and which, sequence-based features (other than the motif itself) may be important for establishing CTCF-mediated chromatin loops. We found that motif conservation, measured by ‘branch-of-origin’, which accounts for motif turnover in evolution, is an important feature. We developed a new machine learning algorithm called CTCF-MP, based on word2vec, to demonstrate that sequence-based features alone can predict whether a pair of convergent CTCF motifs will form a loop. Together with functional genomic signals from CTCF ChIP-seq and DNase-seq, CTCF-MP makes highly accurate predictions of whether a convergent CTCF motif pair will form a loop, both within a single cell type and across different cell types. Our work represents an important step toward understanding the sequence determinants that may guide the formation of complex chromatin architectures.

    Availability and implementation

    The source code of CTCF-MP can be accessed at: https://github.com/ma-compbio/CTCF-MP

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  2. Abstract Motivation

    Gene regulatory networks define regulatory relationships between transcription factors and target genes within a biological system, and reconstructing them is essential for understanding cellular growth and function. Methods for inferring and reconstructing networks from genomics data have evolved rapidly over the last decade in response to advances in sequencing technology and machine learning. The scale of data collection has increased dramatically; the largest genome-wide gene expression datasets have grown from thousands of measurements to millions of single cells, and new technologies are on the horizon to increase to tens of millions of cells and above.

    Results

    In this work, we present the Inferelator 3.0, which has been significantly updated to integrate data from distinct cell types to learn context-specific regulatory networks and aggregate them into a shared regulatory network, while retaining the functionality of the previous versions. The Inferelator is able to integrate the largest single-cell datasets and learn cell-type-specific gene regulatory networks. Compared to other network inference methods, the Inferelator learns new and informative Saccharomyces cerevisiae networks from single-cell gene expression data, measured by recovery of a known gold standard. We demonstrate its scaling capabilities by learning networks for multiple distinct neuronal and glial cell types in the developing Mus musculus brain at E18 from a large (1.3 million) single-cell gene expression dataset with paired single-cell chromatin accessibility data.

    Availability and implementation

    The Inferelator software is available on GitHub (https://github.com/flatironinstitute/inferelator) under the MIT license and has been released as Python packages with associated documentation (https://inferelator.readthedocs.io/).

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  3. Abstract Background

    Single-cell RNA sequencing (scRNA-seq) technology has enabled assessment of transcriptome-wide changes at single-cell resolution. Because environmental exposure and genetic background differ across subjects, the subject effect is a major source of variation in scRNA-seq data from multiple subjects and severely confounds cell type-specific differential expression (DE) analysis. Moreover, dropout events are prevalent in scRNA-seq data and produce an excessive number of zeros, which further complicates DE analysis.

    Results

    We developed iDESC to detect cell type specific DE genes between two groups of subjects in scRNA-seq data. iDESC uses a zero-inflated negative binomial mixed model to consider both subject effect and dropouts. The prevalence of dropout events (dropout rate) was demonstrated to be dependent on gene expression level, which is modeled by pooling information across genes. Subject effect is modeled as a random effect in the log-mean of the negative binomial component. We evaluated and compared the performance of iDESC with eleven existing DE analysis methods. Using simulated data, we demonstrated that iDESC had well-controlled type I error and higher power compared to the existing methods. Applications of those methods with well-controlled type I error to three real scRNA-seq datasets from the same tissue and disease showed that the results of iDESC achieved the best consistency between datasets and the best disease relevance.

    Conclusions

    iDESC was able to achieve more accurate and robust DE analysis results by separating subject effect from disease effect with consideration of dropouts to identify DE genes, suggesting the importance of considering subject effect and dropouts in the DE analysis of scRNA-seq data with multiple subjects.

     
  4. Abstract Background

    Crop improvement through cross-population genomic prediction and genome editing requires identification of causal variants at high resolution, within fewer than hundreds of base pairs. Most genetic mapping studies have generally lacked such resolution. In contrast, evolutionary approaches can detect genetic effects at high resolution, but they are limited by shifting selection, missing data, and low depth of multiple-sequence alignments. Here we use genomic annotations to accurately predict nucleotide conservation across angiosperms, as a proxy for fitness effect of mutations.

    Results

    Using only sequence analysis, we annotate nonsynonymous mutations in 25,824 maize gene models, with information from bioinformatics and deep learning. Our predictions are validated by experimental information: within-species conservation, chromatin accessibility, and gene expression. According to gene ontology and pathway enrichment analyses, predicted nucleotide conservation points to genes in central carbon metabolism. Importantly, it improves genomic prediction for fitness-related traits such as grain yield, in elite maize panels, by stringent prioritization of fewer than 1% of single-site variants.

    Conclusions

    Our results suggest that predicting nucleotide conservation across angiosperms may effectively prioritize sites most likely to impact fitness-related traits in crops, without being limited by shifting selection, missing data, and low depth of multiple-sequence alignments. Our approach—Prediction of mutation Impact by Calibrated Nucleotide Conservation (PICNC)—could be useful to select polymorphisms for accurate genomic prediction, and candidate mutations for efficient base editing. The trained PICNC models and predicted nucleotide conservation at protein-coding SNPs in maize are publicly available in CyVerse (https://doi.org/10.25739/hybz-2957).

     
  5. Abstract Motivation

    High throughput chromosome conformation capture (Hi-C) contact matrices are used to predict 3D chromatin structures in eukaryotic cells. High-resolution Hi-C data are less available than low-resolution Hi-C data due to sequencing costs but provide greater insight into the intricate details of 3D chromatin structures such as enhancer–promoter interactions and sub-domains. To provide a cost-effective solution to high-resolution Hi-C data collection, deep learning models are used to predict high-resolution Hi-C matrices from existing low-resolution matrices across multiple cell types.

    Results

    Here, we present two Cascading Residual Networks, HiCARN-1 and HiCARN-2, a convolutional neural network and a generative adversarial network, respectively, that use a novel framework of cascading connections throughout the network to predict Hi-C contact matrices from low-resolution data. As shown by image evaluation and Hi-C reproducibility metrics, both HiCARN models overall outperform state-of-the-art Hi-C resolution enhancement algorithms in predictive accuracy on human and mouse high-resolution Hi-C data downsampled to 1/16, 1/32, 1/64 and 1/100. In addition, validation by extracting topologically associating domains, chromosome 3D structures and chromatin loop predictions from the enhanced data shows that HiCARN can proficiently reconstruct biologically significant regions.

    Availability and implementation

    HiCARN is available as open-source software at https://github.com/OluwadareLab/HiCARN and as a containerized application that can be run on any platform.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
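
The downsampling ratios quoted in the HiCARN abstract (1/16 to 1/100) can be pictured as random thinning of a contact matrix. The sketch below assumes that downsampling means keeping each observed contact independently with probability equal to the ratio; the abstract does not describe the actual procedure, and the function name and toy matrix are hypothetical.

```python
import random


def downsample_contacts(matrix, ratio, seed=0):
    """Thin a symmetric contact-count matrix by keeping each contact
    independently with probability `ratio` (binomial thinning)."""
    rng = random.Random(seed)
    n = len(matrix)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Decide contact by contact whether it survives the thinning.
            kept = sum(1 for _ in range(matrix[i][j]) if rng.random() < ratio)
            out[i][j] = kept
            out[j][i] = kept  # mirror to keep the matrix symmetric
    return out


if __name__ == "__main__":
    # Toy 3 x 3 "high-resolution" contact matrix (hypothetical counts).
    high = [
        [160, 32, 0],
        [32, 80, 16],
        [0, 16, 48],
    ]
    low = downsample_contacts(high, ratio=1 / 16, seed=1)
    for row in low:
        print(row)
```

Pairs of thinned and original matrices of this kind are the sort of input and target that a resolution-enhancement model would be trained on.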

     