Title: Identifying transcription factor–DNA interactions using machine learning
Abstract

Machine learning approaches have been applied to identify transcription factor (TF)–DNA interactions important for gene regulation and expression. However, due to the enormous search space of the genome, it is challenging to build models capable of surveying entire reference genomes, especially in species on which the models were not trained. In this study, we surveyed a variety of methods for classifying epigenomics data in an attempt to improve the detection of DNA bound by 12 members of the auxin response factor (ARF) family in maize and soybean, as assessed by DNA Affinity Purification and sequencing (DAP-seq). We reduced the genome search space for prediction by surveying only unmethylated regions (UMRs). For identifying DAP-seq-binding events within the UMRs, we achieved an average accuracy of 78.72 % across 12 maize ARFs by encoding DNA as k-mer count vectors and using a logistic regression classifier with up-sampling and feature selection. Importantly, feature selection helps to uncover known and potentially novel ARF-binding motifs. This demonstrates an independent method for identification of TF-binding sites. Finally, we applied the model built with maize DAP-seq data directly to the soybean genome and found high false-negative rates, which exceeded 40 % across the ARF TFs tested. The findings in this study suggest the potential use of various methods to predict TF–DNA interactions within and between species with varying degrees of success.
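
As a rough illustration of the encoding and up-sampling steps named in the abstract, the sketch below counts overlapping k-mers in each sequence and then resamples the minority class until the classes are balanced. It is not the authors' code: the function names, the choice of k = 3, and the toy sequences and labels are all assumptions made for illustration, and the logistic regression and feature selection steps are left out.

```python
import random
from collections import Counter
from itertools import product


def kmer_vocabulary(k):
    """All 4**k DNA k-mers in lexicographic order (fixed feature columns)."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def kmer_count_vector(seq, k, vocab):
    """Count overlapping k-mers; k-mers with non-ACGT characters are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get(kmer, 0) for kmer in vocab]


def upsample_minority(X, y, seed=0):
    """Resample smaller classes with replacement until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row, label in zip(X, y):
        by_label.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_label.values())
    X_out, y_out = [], []
    for label, rows in by_label.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


# Toy, made-up data (assumption): label 1 = bound region, label 0 = unbound region.
seqs = ["TGTCGGAC", "AAACCCTT", "GGGAAACC", "CCTTAAGG"]
labels = [1, 0, 0, 0]
K = 3  # hypothetical k-mer length chosen only to keep the example small

vocab = kmer_vocabulary(K)
X = [kmer_count_vector(s, K, vocab) for s in seqs]
X_bal, y_bal = upsample_minority(X, labels)
```

In a fuller pipeline, the balanced matrix `X_bal` and labels `y_bal` would be passed to a logistic regression classifier, and the fitted coefficients for each k-mer column could then be inspected as one possible route to motif-like features.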

 
Award ID(s):
2026554
NSF-PAR ID:
10369987
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
in silico Plants
Volume:
4
Issue:
2
ISSN:
2517-5025
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    AUXIN RESPONSE FACTORS (ARFs) are plant-specific transcription factors (TFs) that couple perception of the hormone auxin to gene expression programs essential to all land plants. As with many large TF families, a key question is whether individual members determine developmental specificity by binding distinct target genes. We use DAP-seq to generate genome-wide in vitro TF:DNA interaction maps for fourteen maize ARFs from the evolutionarily conserved A and B clades. Comparative analysis reveals a high degree of binding site overlap for ARFs of the same clade, but largely distinct clade A and B binding. Many sites are, however, co-occupied by ARFs from both clades, suggesting transcriptional coordination for many genes. Among these, we investigate known QTLs and use machine learning to predict the impact of cis-regulatory variation. Overall, large-scale comparative analysis of ARF binding suggests that auxin response specificity may be determined by factors other than individual ARF binding site selection.

     
  2. Abstract

    Four statistical selection methods for inferring transcription factor (TF)–target gene (TG) pairs were developed by coupling a mean squared error (MSE) or Huber loss function with an elastic net (ENET) or least absolute shrinkage and selection operator (Lasso) penalty. Two methods were also developed for inferring pathway gene regulatory networks (GRNs) by combining a Huber or MSE loss function with a network (Net)-based penalty. To solve these regressions, we improved an accelerated proximal gradient descent (APGD) algorithm to optimize parameter selection processes, resulting in an equally effective but much faster algorithm than the commonly used convex optimization solver. Synthetic data generated in a general setting were used to test the four TF–TG identification methods; ENET-based methods performed better than Lasso-based methods. Synthetic data generated from two network settings were used to test Huber-Net and MSE-Net, which outperformed all other methods. The TF–TG identification methods were also tested with SND1 and gl3 overexpression transcriptomic data; Huber-ENET and MSE-ENET outperformed all other methods when genome-wide predictions were performed. The TF–TG identification methods fill the gap left by the lack of a method for genome-wide TG prediction of a TF and have potential for validating ChIP/DAP-seq results, while the two Net-based methods are instrumental for predicting pathway GRNs.

     
  3. Abstract

    The Heat Shock Factor (HSF) transcription factor family is a central and required component of plant heat stress responses and acquired thermotolerance. The HSF family has dramatically expanded in plant lineages, often including a repertoire of 20 or more genes. Here we assess and compare the composition, heat responsiveness, and chromatin profiles of the HSF families in maize and Setaria viridis (Setaria), two model C4 panicoid grasses. Both species encode a similar number of HSFs, and examples of both conserved and variable expression responses to a heat stress event were observed between the two species. Chromatin accessibility and genome‐wide DNA‐binding profiles were generated to assess the chromatin of HSF family members with distinct responses to heat stress. We observed significant variability for both chromatin accessibility and promoter occupancy within similarly regulated sets of HSFs between Setaria and maize, as well as between syntenic pairs of maize HSFs retained following its most recent genome duplication event. Additionally, we observed the widespread presence of TF binding at HSF promoters in control conditions, even at HSFs that are only expressed in response to heat stress. TF‐binding peaks were typically near putative HSF‐binding sites in HSFs upregulated in response to heat stress, but not in stable or not expressed HSFs. These observations collectively support a complex scenario of expansion and subfunctionalization within this transcription factor family and suggest that within‐family HSF transcriptional regulation is a conserved, defining feature of the family.

     
  4. SUMMARY

    The stilbenoid pathway is responsible for the production of resveratrol in grapevine (Vitis vinifera L.). A few transcription factors (TFs) have been identified as regulators of this pathway but the extent of this control has not been deeply studied. Here we show how DNA affinity purification sequencing (DAP‐Seq) allows for the genome‐wide TF‐binding site interrogation in grape. We obtained 5190 and 4443 binding events assigned to 4041 and 3626 genes for MYB14 and MYB15, respectively (approximately 40% of peaks located within −10 kb of transcription start sites). DAP‐Seq of MYB14/MYB15 was combined with aggregate gene co‐expression networks (GCNs) built from more than 1400 transcriptomic datasets from leaves, fruits, and flowers to narrow down bound genes to a set of high confidence targets. The analysis of MYB14, MYB15, and MYB13, a third uncharacterized member of Subgroup 2 (S2), showed that in addition to the few previously known stilbene synthase (STS) targets, these regulators bind to 30 of 47 STS family genes. Moreover, all three MYBs bind to several PAL, C4H, and 4CL genes, in addition to shikimate pathway genes, the WRKY03 stilbenoid co‐regulator and resveratrol‐modifying gene candidates among which ROMT2‐3 were validated enzymatically. A high proportion of DAP‐Seq bound genes were induced in the activated transcriptomes of transient MYB15‐overexpressing grapevine leaves, validating our methodological approach for delimiting TF targets. Overall, Subgroup 2 R2R3‐MYBs appear to play a key role in binding and directly regulating several primary and secondary metabolic steps leading to an increased flux towards stilbenoid production. The integration of DAP‐Seq and reciprocal GCNs offers a rapid framework for gene function characterization using genome‐wide approaches in the context of non‐model plant species and stands up as a valid first approach for identifying gene regulatory networks of specialized metabolism.

     
  5. Marshall-Colon, Amy (Ed.)
    Abstract Gene regulatory networks (GRNs) are defined by a cascade of transcriptional events by which signals, such as hormones or environmental cues, change development. To understand these networks, it is necessary to link specific transcription factors (TFs) to the downstream gene targets whose expression they regulate. Although multiple methods provide information on the targets of a single TF, moving from groups of co-expressed genes to the TF that controls them is more difficult. To facilitate this bottom-up approach, we have developed a web application named TF DEACoN. This application uses a publicly available Arabidopsis thaliana DNA Affinity Purification (DAP-Seq) data set to search for TFs that show enriched binding to groups of co-regulated genes. We used TF DEACoN to examine groups of transcripts regulated by treatment with the ethylene precursor 1-aminocyclopropane-1-carboxylic acid (ACC), using a transcriptional data set performed with high temporal resolution. We demonstrate the utility of this application when co-regulated genes are divided by timing of response or cell-type-specific information, which provides more information on TF/target relationships than when less defined and larger groups of co-regulated genes are used. This approach predicted TFs that may participate in ethylene-modulated root development including the TF NAM (NO APICAL MERISTEM). We used a genetic approach to show that a mutation in NAM reduces the negative regulation of lateral root development by ACC. The combination of filtering and TF DEACoN used here can be applied to any group of co-regulated genes to predict GRNs that control coordinated transcriptional responses. 