
Title: Identifying transcription factor–DNA interactions using machine learning

Machine learning approaches have been applied to identify transcription factor (TF)–DNA interactions important for gene regulation and expression. However, due to the enormous search space of the genome, it is challenging to build models capable of surveying entire reference genomes, especially in species on which the models were not trained. In this study, we surveyed a variety of methods for classifying epigenomics data in an attempt to improve the detection of DNA bound by 12 members of the auxin response factor (ARF) family in maize and soybean, as assessed by DNA affinity purification and sequencing (DAP-seq). We reduced the genome search space for prediction by surveying only unmethylated regions (UMRs). For identification of DAP-seq binding events within the UMRs, we achieved an average accuracy of 78.72% across the 12 maize ARFs by encoding DNA as k-mer count vectors and using a logistic regression classifier with up-sampling and feature selection. Importantly, feature selection helps to uncover known and potentially novel ARF-binding motifs, demonstrating an independent method for identifying TF-binding sites. Finally, we applied the model built with maize DAP-seq data directly to the soybean genome and found high false-negative rates of more than 40% across the ARF TFs tested. These findings suggest that various methods can be used to predict TF–DNA interactions within and between species, with varying degrees of success.
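
The encoding and up-sampling steps named in the abstract can be illustrated with a minimal sketch. It assumes overlapping k-mers over the A/C/G/T alphabet and random duplication of minority-class samples; the function names, the choice of k = 3, and the toy sequences and labels are illustrative assumptions and are not taken from the study.

```python
import random
from collections import Counter
from itertools import product


def kmer_vocabulary(k, alphabet="ACGT"):
    """Return every k-mer over the alphabet in a fixed lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def count_vectorize(sequence, k, vocabulary):
    """Count overlapping k-mers in one sequence.

    k-mers that are not in the vocabulary (for example, ones containing N)
    are ignored.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts.get(kmer, 0) for kmer in vocabulary]


def upsample_minority(vectors, labels, seed=0):
    """Randomly duplicate samples of smaller classes until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for vec, lab in zip(vectors, labels):
        by_label.setdefault(lab, []).append(vec)
    target = max(len(vecs) for vecs in by_label.values())
    out_vectors, out_labels = [], []
    for lab, vecs in by_label.items():
        extra = [rng.choice(vecs) for _ in range(target - len(vecs))]
        for vec in vecs + extra:
            out_vectors.append(vec)
            out_labels.append(lab)
    return out_vectors, out_labels


# Toy data (hypothetical): label 1 = bound region, label 0 = unbound region.
sequences = ["TGTCTCAA", "ACGTACGT", "GGGGCCCC", "AATTAATT"]
labels = [1, 0, 0, 0]

K = 3
vocab = kmer_vocabulary(K)
X = [count_vectorize(s, K, vocab) for s in sequences]
X_bal, y_bal = upsample_minority(X, labels)
```

In a fuller pipeline, the balanced count vectors would be passed to a logistic regression classifier, and feature selection over the k-mer columns would point to candidate motifs; those steps are omitted here.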

Publisher / Repository:
Oxford University Press
Journal Name:
in silico Plants
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    AUXIN RESPONSE FACTORS (ARFs) are plant-specific transcription factors (TFs) that couple perception of the hormone auxin to gene expression programs essential to all land plants. As with many large TF families, a key question is whether individual members determine developmental specificity by binding distinct target genes. We use DAP-seq to generate genome-wide in vitro TF:DNA interaction maps for fourteen maize ARFs from the evolutionarily conserved A and B clades. Comparative analysis reveals a high degree of binding site overlap for ARFs of the same clade, but largely distinct clade A and B binding. Many sites are, however, co-occupied by ARFs from both clades, suggesting transcriptional coordination for many genes. Among these, we investigate known QTLs and use machine learning to predict the impact of cis-regulatory variation. Overall, large-scale comparative analysis of ARF binding suggests that auxin response specificity may be determined by factors other than individual ARF binding site selection.

  2. Abstract

    Four statistical selection methods for inferring transcription factor (TF)–target gene (TG) pairs were developed by coupling a mean squared error (MSE) or Huber loss function with an elastic net (ENET) or least absolute shrinkage and selection operator (Lasso) penalty. Two methods were also developed for inferring pathway gene regulatory networks (GRNs) by combining a Huber or MSE loss function with a network (Net)-based penalty. To solve these regressions, we refined an accelerated proximal gradient descent (APGD) algorithm to optimize the parameter selection process, resulting in an algorithm that is equally effective but much faster than the commonly used convex optimization solver. Synthetic data generated in a general setting were used to test the four TF–TG identification methods; ENET-based methods performed better than Lasso-based methods. Synthetic data generated from two network settings were used to test Huber-Net and MSE-Net, which outperformed all other methods. The TF–TG identification methods were also tested with SND1 and gl3 overexpression transcriptomic data; Huber-ENET and MSE-ENET outperformed all other methods when genome-wide predictions were performed. The TF–TG identification methods fill the gap left by the lack of a method for genome-wide TG prediction for a TF and have potential for validating ChIP/DAP-seq results, while the two Net-based methods are instrumental for predicting pathway GRNs.

  3. Abstract

    The Heat Shock Factor (HSF) transcription factor family is a central and required component of plant heat stress responses and acquired thermotolerance. The HSF family has dramatically expanded in plant lineages, often including a repertoire of 20 or more genes. Here we assess and compare the composition, heat responsiveness, and chromatin profiles of the HSF families in maize and Setaria viridis (Setaria), two model C4 panicoid grasses. Both species encode a similar number of HSFs, and examples of both conserved and variable expression responses to a heat stress event were observed between the two species. Chromatin accessibility and genome‐wide DNA‐binding profiles were generated to assess the chromatin of HSF family members with distinct responses to heat stress. We observed significant variability for both chromatin accessibility and promoter occupancy within similarly regulated sets of HSFs between Setaria and maize, as well as between syntenic pairs of maize HSFs retained following its most recent genome duplication event. Additionally, we observed the widespread presence of TF binding at HSF promoters in control conditions, even at HSFs that are only expressed in response to heat stress. TF‐binding peaks were typically near putative HSF‐binding sites in HSFs upregulated in response to heat stress, but not in stably expressed or non‐expressed HSFs. These observations collectively support a complex scenario of expansion and subfunctionalization within this transcription factor family and suggest that within‐family HSF transcriptional regulation is a conserved, defining feature of the family.

  4. Abstract

    Characterizing genome-wide binding profiles of transcription factors (TFs) is essential for understanding biological processes. Although techniques have been developed to assess binding profiles within a population of cells, determining them at a single-cell level remains elusive. Here, we report scFAN (single-cell factor analysis network), a deep learning model that predicts genome-wide TF binding profiles in individual cells. scFAN is pretrained on genome-wide bulk assay for transposase-accessible chromatin sequencing (ATAC-seq), DNA sequence, and chromatin immunoprecipitation sequencing (ChIP-seq) data and uses single-cell ATAC-seq to predict TF binding in individual cells. We demonstrate the efficacy of scFAN by both studying sequence motifs enriched within predicted binding peaks and using predicted TFs for discovering cell types. We develop a new metric “TF activity score” to characterize each cell and show that activity scores can reliably capture cell identities. scFAN allows us to discover and study cellular identities and heterogeneity based on chromatin accessibility profiles. 

  5. Abstract

    The stilbenoid pathway is responsible for the production of resveratrol in grapevine (Vitis vinifera L.). A few transcription factors (TFs) have been identified as regulators of this pathway, but the extent of this control has not been deeply studied. Here we show how DNA affinity purification sequencing (DAP‐Seq) allows for genome‐wide TF‐binding site interrogation in grape. We obtained 5190 and 4443 binding events assigned to 4041 and 3626 genes for MYB14 and MYB15, respectively (approximately 40% of peaks located within −10 kb of transcription start sites). DAP‐Seq of MYB14/MYB15 was combined with aggregate gene co‐expression networks (GCNs) built from more than 1400 transcriptomic datasets from leaves, fruits, and flowers to narrow down bound genes to a set of high‐confidence targets. The analysis of MYB14, MYB15, and MYB13, a third uncharacterized member of Subgroup 2 (S2), showed that in addition to the few previously known stilbene synthase (STS) targets, these regulators bind to 30 of 47 STS family genes. Moreover, all three MYBs bind to several PAL, C4H, and 4CL genes, in addition to shikimate pathway genes, the WRKY03 stilbenoid co‐regulator, and resveratrol‐modifying gene candidates, among which ROMT2‐3 were validated enzymatically. A high proportion of DAP‐Seq bound genes were induced in the activated transcriptomes of transient MYB15‐overexpressing grapevine leaves, validating our methodological approach for delimiting TF targets. Overall, Subgroup 2 R2R3‐MYBs appear to play a key role in binding and directly regulating several primary and secondary metabolic steps leading to an increased flux towards stilbenoid production. The integration of DAP‐Seq and reciprocal GCNs offers a rapid framework for gene function characterization using genome‐wide approaches in the context of non‐model plant species and stands as a valid first approach for identifying gene regulatory networks of specialized metabolism.
