

Title: DNAcycP: a deep learning tool for DNA cyclizability prediction
Abstract

DNA mechanical properties play a critical role in every aspect of DNA-dependent biological processes. Recently, a high-throughput assay named loop-seq has been developed to quantify the intrinsic bendability of a massive number of DNA fragments simultaneously. Using the loop-seq data, we develop a software tool, DNAcycP, based on a deep-learning approach for intrinsic DNA cyclizability prediction. We demonstrate that DNAcycP predicts intrinsic DNA cyclizability with high fidelity compared to the experimental data. Using an independent dataset from in vitro selection for enrichment of loopable sequences, we further verified that the predicted cyclizability score, termed C-score, can distinguish DNA fragments of differing loopability. We applied DNAcycP to multiple species and compared the C-scores with available high-resolution chemical nucleosome maps. Our analyses showed that the yeast and mouse genomes share a conserved feature of high DNA bendability spanning nucleosome dyads. Additionally, we extended our analysis to transcription factor binding sites and, surprisingly, found that cyclizability is substantially elevated at CTCF binding sites in the mouse genome. We further demonstrate that this distinct mechanical property is conserved across mammalian species and is inherent to the CTCF-binding DNA motif.

 
Award ID(s):
1764421
NSF-PAR ID:
10365313
Author(s) / Creator(s):
 ;  ;  ;  ;  
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Nucleic Acids Research
Volume:
50
Issue:
6
ISSN:
0305-1048
Page Range / eLocation ID:
p. 3142-3154
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Chromatin looping is important for gene regulation, and studies of 3D chromatin structure across species and cell types have improved our understanding of the principles governing chromatin looping. However, 3D genome evolution and its relationship with natural selection remain largely unexplored. In mammals, the CTCF protein defines the boundaries of most chromatin loops, and variations in CTCF occupancy are associated with looping divergence. While many CTCF binding sites fall within transposable elements (TEs), their contribution to 3D chromatin structural evolution is unknown. Here we report the relative contributions of TE-driven CTCF binding site expansions to conserved and divergent chromatin looping in human and mouse. We demonstrate that TE-derived CTCF binding divergence may explain a large fraction of variable loops. These variable loops contribute significantly to corresponding gene expression variability across cells and species, possibly by refining sub-TAD-scale loop contacts responsible for cell-type-specific enhancer-promoter interactions.

     
  2. Abstract

    Motivation

    The three-dimensional organization of chromosomes within the cell nucleus is highly regulated. It is known that CCCTC-binding factor (CTCF) is an important architectural protein that mediates long-range chromatin loops. Recent studies have shown that the majority of CTCF binding motif pairs at chromatin loop anchor regions are in convergent orientation. However, it remains unknown whether the genomic context at the sequence level can determine if a convergent CTCF motif pair is able to form a chromatin loop.

    Results

    In this article, we directly ask whether, and which, sequence-based features (other than the motif itself) may be important to establish CTCF-mediated chromatin loops. We found that motif conservation measured by ‘branch-of-origin’, which accounts for motif turnover in evolution, is an important feature. We developed a new machine learning algorithm called CTCF-MP, based on word2vec, to demonstrate that sequence-based features alone can predict whether a pair of convergent CTCF motifs would form a loop. Together with functional genomic signals from CTCF ChIP-seq and DNase-seq, CTCF-MP is able to make highly accurate predictions on whether a convergent CTCF motif pair would form a loop in a single cell type and also across different cell types. Our work represents an important step toward understanding the sequence determinants that may guide the formation of complex chromatin architectures.

    Availability and implementation

    The source code of CTCF-MP can be accessed at: https://github.com/ma-compbio/CTCF-MP

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  3. Abstract

    Aging, often considered a result of random cellular damage, can be accurately estimated using DNA methylation profiles, the foundation of pan-tissue epigenetic clocks. Here, we demonstrate the development of universal pan-mammalian clocks, using 11,754 methylation arrays from our Mammalian Methylation Consortium, which encompass 59 tissue types across 185 mammalian species. These predictive models estimate mammalian tissue age with high accuracy (r > 0.96). Age deviations correlate with human mortality risk, mouse somatotropic axis mutations and caloric restriction. We identified specific cytosines with methylation levels that change with age across numerous species. These sites, highly enriched in polycomb repressive complex 2-binding locations, are near genes implicated in mammalian development, cancer, obesity and longevity. Our findings offer new evidence suggesting that aging is evolutionarily conserved and intertwined with developmental processes across all mammals.

     
  4. Abstract

    Genome-wide profiling of chromatin accessibility by DNase-seq or ATAC-seq has been widely used to identify regulatory DNA elements and transcription factor binding sites. However, enzymatic DNA cleavage exhibits intrinsic sequence biases that confound the analysis of chromatin accessibility profiling data. Existing computational tools are limited in their ability to account for such intrinsic biases and are not designed for analyzing single-cell data. Here, we present Simplex Encoded Linear Model for Accessible Chromatin (SELMA), a computational method for systematic estimation of intrinsic cleavage biases from genomic chromatin accessibility profiling data. We demonstrate that SELMA yields accurate and robust bias estimation from both bulk and single-cell DNase-seq and ATAC-seq data. SELMA can utilize internal mitochondrial DNA data to improve bias estimation. We show that transcription factor binding inference from DNase footprints can be improved by incorporating estimated biases using SELMA. Furthermore, we show strong effects of intrinsic biases in single-cell ATAC-seq data, and develop the first single-cell ATAC-seq intrinsic bias correction model to improve cell clustering. SELMA can enhance the performance of existing bioinformatics tools and improve the analysis of both bulk and single-cell chromatin accessibility sequencing data.

     
  5. Abstract

    To understand the process by which new protein functions emerge, we examined how the yeast heterochromatin protein Sir3 arose through gene duplication from the conserved DNA replication protein Orc1. Orc1 is a subunit of the origin recognition complex (ORC), which marks origins of DNA replication. In Saccharomyces cerevisiae, Orc1 also promotes heterochromatin assembly by recruiting the structural proteins Sir1-4 to silencer DNA. In contrast, the paralog of Orc1, Sir3, is a nucleosome-binding protein that spreads across heterochromatic loci in conjunction with other Sir proteins. We previously found that a nonduplicated Orc1 from the yeast Kluyveromyces lactis behaved like ScSir3 but did not have a silencer-binding function like ScOrc1. Moreover, K. lactis lacks Sir1, the protein that interacts directly with ScOrc1 at the silencer. Here, we examined whether the emergence of Sir1 coincided with Orc1 acting as a silencer-binding protein. In the nonduplicated species Torulaspora delbrueckii, which has an ortholog of Sir1 (TdKos3), we found that TdOrc1 spreads across heterochromatic loci independently of ORC, as ScSir3 and KlOrc1 do. This spreading is dependent on the nucleosome binding BAH domain of Orc1 and on Sir2 and Kos3. However, TdOrc1 does not have a silencer-binding function: T. delbrueckii silencers do not require ORC-binding sites to function, and Orc1 and Kos3 do not appear to interact. Instead, Orc1 and Kos3 both spread across heterochromatic loci with other Sir proteins. Thus, Orc1 and Sir1/Kos3 originally had different roles in heterochromatin formation than they do now in S. cerevisiae.

     