Title: Supervised learning of enhancer–promoter specificity based on genome-wide perturbation studies highlights areas for improvement in learning
Abstract
Motivation: Understanding the rules that govern enhancer-driven transcription remains a central unsolved problem in genomics. Now that multiple massively parallel enhancer perturbation assays have been published, there are enough data to learn to predict enhancer–promoter (EP) relationships in a data-driven manner.
Results: We applied machine learning to one of the largest enhancer perturbation studies integrated with transcription factor (TF) and histone modification ChIP-seq. The results uncovered a discrepancy in the prediction of genome-wide data compared to data from targeted experiments. Relative strength of contact was important for prediction, confirming the basic principle of EP regulation. Novel features such as the density of the enhancers/promoters in the genomic region were found to be important, highlighting our lack of understanding of how other elements in the region contribute to the regulation. Several TF peaks were identified that improved the prediction by identifying the negatives and reducing false positives. In summary, integrating genomic assays with enhancer perturbation studies increased the accuracy of the model and provided novel insights into enhancer-driven transcription.
Availability and implementation: The trained models, data, and the source code are available at http://doi.org/10.5281/zenodo.11290386 and https://github.com/HanLabUNLV/sleps.
Award ID(s):
1750532
PAR ID:
10518612
Author(s) / Creator(s):
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
40
Issue:
6
ISSN:
1367-4811
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Transcription factors (TF) are proteins that bind DNA in a sequence-specific manner to regulate gene transcription. Despite their unique intrinsic sequence preferences, in vivo genomic occupancy profiles of TFs differ across cellular contexts. Hence, deciphering the sequence determinants of TF binding, both intrinsic and context-specific, is essential to understand gene regulation and the impact of regulatory, non-coding genetic variation. Biophysical models trained on in vitro TF binding assays can estimate intrinsic affinity landscapes and predict occupancy based on TF concentration and affinity. However, these models cannot adequately explain context-specific, in vivo binding profiles. Conversely, deep learning models, trained on in vivo TF binding assays, effectively predict and explain genomic occupancy profiles as a function of complex regulatory sequence syntax, albeit without a clear biophysical interpretation. To reconcile these complementary models of in vitro and in vivo TF binding, we developed Affinity Distillation (AD), a method that extracts thermodynamic affinities de novo from deep learning models of TF chromatin immunoprecipitation (ChIP) experiments by marginalizing away the influence of genomic sequence context. Applied to neural networks modeling diverse classes of yeast and mammalian TFs, AD predicts energetic impacts of sequence variation within and surrounding motifs on TF binding as measured by diverse in vitro assays with superior dynamic range and accuracy compared to motif-based methods. Furthermore, AD can accurately discern affinities of TF paralogs. Our results highlight thermodynamic affinity as a key determinant of in vivo binding, suggest that deep learning models of in vivo binding implicitly learn high-resolution affinity landscapes, and show that these affinities can be successfully distilled using AD. This new biophysical interpretation of deep learning models enables high-throughput in silico experiments to explore the influence of sequence context and variation on both intrinsic affinity and in vivo occupancy.
  2. Abstract During gene regulation, DNA accessibility is thought to limit the availability of transcription factor (TF) binding sites, while TFs can increase DNA accessibility to recruit additional factors that upregulate gene expression. Given this interplay, the causative regulatory events in the modulation of gene expression remain unknown for the vast majority of genes. We utilized deeply sequenced ATAC-Seq data and site-specific knock-in reporter genes to investigate the relationship between the binding-site resolution dynamics of DNA accessibility and the expression dynamics of the enhancers of Cebpa during macrophage-neutrophil differentiation. While the enhancers upregulate reporter expression during the earliest stages of differentiation, there is little corresponding increase in their total accessibility. Conversely, total accessibility peaks during the last stages of differentiation without any increase in enhancer activity. The accessibility of positions neighboring C/EBP-family TF binding sites, which indicates TF occupancy, does increase significantly during early differentiation, showing that the early upregulation of enhancer activity is driven by TF binding. These results imply that a generalized increase in DNA accessibility is not sufficient, and binding by enhancer-specific TFs is necessary, for the upregulation of gene expression. Additionally, high-coverage ATAC-Seq combined with time-series expression data can infer the sequence of regulatory events at binding-site resolution. 
  3. Abstract Background: Identifying the DNA-binding specificities of transcription factors (TF) is central to understanding gene networks that regulate growth and development. Such knowledge is lacking in oomycetes, a microbial eukaryotic lineage within the stramenopile group. Oomycetes include many important plant and animal pathogens such as the potato and tomato blight agent Phytophthora infestans, which is a tractable model for studying life-stage differentiation within the group. Results: Mining of the P. infestans genome identified 197 genes encoding proteins belonging to 22 TF families. Their chromosomal distribution was consistent with family expansions through unequal crossing-over, which were likely ancient since each family had similar sizes in most oomycetes. Most TFs exhibited dynamic changes in RNA levels through the P. infestans life cycle. The DNA-binding preferences of 123 proteins were assayed using protein-binding oligonucleotide microarrays, which succeeded with 73 proteins from 14 families. Binding sites predicted for representatives of the families were validated by electrophoretic mobility shift or chromatin immunoprecipitation assays. Consistent with the substantial evolutionary distance of oomycetes from traditional model organisms, only a subset of the DNA-binding preferences resembled those of human or plant orthologs. Phylogenetic analyses of the TF families within P. infestans often discriminated clades with canonical and novel DNA targets. Paralogs with similar binding preferences frequently had distinct patterns of expression suggestive of functional divergence. TFs were predicted to either drive life stage-specific expression or serve as general activators based on the representation of their binding sites within total or developmentally-regulated promoters. This projection was confirmed for one TF using synthetic and mutated promoters fused to reporter genes in vivo. Conclusions: We established a large dataset of binding specificities for P. infestans TFs, representing the first in the stramenopile group. This resource provides a basis for understanding transcriptional regulation by linking TFs with their targets, which should help delineate the molecular components of processes such as sporulation and host infection. Our work also yielded insight into TF evolution during the eukaryotic radiation, revealing both functional conservation as well as diversification across kingdoms.
  4. Abstract Mutations in cis-regulatory regions play an important role in the domestication and improvement of crops by altering gene expression. However, assessing the in vivo impact of cis-regulatory elements (CREs) on transcriptional regulation and phenotypic outcomes remains challenging. Previously, we showed that the dominant Barren inflorescence3 (Bif3) mutant of maize (Zea mays) contains a duplicated copy of the homeobox transcription factor gene ZmWUSCHEL1 (ZmWUS1), named ZmWUS1-B. ZmWUS1-B is controlled by a spontaneously generated novel promoter region that dramatically increases its expression and alters patterning and development of young ears. Overexpression of ZmWUS1-B is caused by a unique enhancer region containing multimerized binding sites for type B RESPONSE REGULATORs (RRs), key transcription factors in cytokinin signaling. To better understand how the enhancer increases the expression of ZmWUS1 in vivo, we specifically targeted the ZmWUS1-B enhancer region by CRISPR-Cas9-mediated editing. A series of deletion events with different numbers of type B RR DNA binding motifs (AGATAT) enabled us to determine how the number of AGATAT motifs impacts in vivo expression of ZmWUS1-B and consequently ear development. In combination with dual-luciferase assays in maize protoplasts, our analysis reveals that AGATAT motifs have an additive effect on ZmWUS1-B expression, while the distance separating AGATAT motifs does not appear to have a meaningful impact, indicating that the enhancer activity derives from the sum of individual CREs. These results also suggest that in maize inflorescence development, there is a threshold of buffering capacity for ZmWUS1 overexpression. 
  5. Abstract DNA–transcription factor (TF) interactions are essential for gene regulation. Fully characterizing TF recognition specificities and identifying their genomic binding targets are important to understand TF function and regulatory networks. Recently, the high-throughput sequencing technology HT-SELEX (high-throughput systematic evolution of ligands by exponential enrichment) has been used to measure hundreds of TFs, providing massive datasets that comprise TF binding preferences. However, comprehensive computational modeling is still needed to fully extract and characterize critical TF binding preferences and to distinguish genome-wide binding targets. In this study, we developed a global pairwise model called DCA-Scapes trained with experimental HT-SELEX data. Our approach uncovered high-resolution TF recognition specificity landscapes, enabled the prediction of in vivo binding sequences, and was validated with ChIP-seq (ChIP sequencing) data. In addition, the DCA-Scapes model was utilized to refine the locations of binding regions and accurately identify the binding sites within the ChIP-seq enriched peaks. Moreover, we extended our model to cover the entire human genome, uncovering potential TF target sites that exhibit tissue-specific TF recognition across various cellular environments.