
Title: Identifying transcription factor–DNA interactions using machine learning

Machine learning approaches have been applied to identify transcription factor (TF)–DNA interactions important for gene regulation and expression. However, due to the enormous search space of the genome, it is challenging to build models capable of surveying entire reference genomes, especially for species on which the models were not trained. In this study, we surveyed a variety of methods for classification of epigenomics data in an attempt to improve the detection of DNA bound by 12 members of the auxin response factor (ARF) family in maize and soybean, as assessed by DNA Affinity Purification and sequencing (DAP-seq). To reduce the genome search space, we restricted prediction to unmethylated regions (UMRs). For identification of DAP-seq-binding events within the UMRs, we achieved an average accuracy of 78.72 % across the 12 maize ARFs by encoding DNA as k-mer count vectors and training a logistic regression classifier with up-sampling and feature selection. Importantly, feature selection helps to uncover known and potentially novel ARF-binding motifs. This demonstrates an independent method for identification of TF-binding sites. Finally, we applied the model built with maize DAP-seq data directly to the soybean genome and found high false-negative rates, which exceeded 40 % across the ARF TFs tested. The findings in this study suggest the potential use of various methods to predict TF–DNA interactions within and between species with varying degrees of success.
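The encoding and up-sampling steps named in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' implementation: the function names (`kmer_vocabulary`, `count_vectorize`, `upsample_minority`), the choice of k, and the toy sequences and labels are all assumptions made for illustration. The logistic regression classifier and the feature selection step are not shown.

```python
import random
from itertools import product


def kmer_vocabulary(k):
    """Return all 4**k DNA k-mers in lexicographic order (hypothetical helper)."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def count_vectorize(seq, k, vocab_index):
    """Count overlapping k-mers in seq; k-mers with non-ACGT characters are skipped."""
    vec = [0] * len(vocab_index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = vocab_index.get(seq[i:i + k])
        if j is not None:
            vec[j] += 1
    return vec


def upsample_minority(X, y, seed=0):
    """Resample the minority class with replacement until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(X), list(y)  # nothing to resample from
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy data (assumed): 1 = region overlaps a binding event, 0 = it does not.
K = 2
vocab_index = {kmer: i for i, kmer in enumerate(kmer_vocabulary(K))}
sequences = ["TGTCGG", "ACACAC", "GGGTTT", "TTTGTC"]
labels = [1, 0, 0, 0]

X = [count_vectorize(s, K, vocab_index) for s in sequences]
X_balanced, y_balanced = upsample_minority(X, labels)
print(len(X_balanced), sum(y_balanced))
```

The balanced feature matrix would then be passed to a classifier. Up-sampling should be applied to the training split only, so that duplicated examples do not leak into the evaluation data.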

Publication Date:
Journal Name:
in silico Plants
Oxford University Press
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    AUXIN RESPONSE FACTORS (ARFs) are plant-specific transcription factors (TFs) that couple perception of the hormone auxin to gene expression programs essential to all land plants. As with many large TF families, a key question is whether individual members determine developmental specificity by binding distinct target genes. We use DAP-seq to generate genome-wide in vitro TF:DNA interaction maps for fourteen maize ARFs from the evolutionarily conserved A and B clades. Comparative analysis reveals a high degree of binding site overlap for ARFs of the same clade, but largely distinct clade A and B binding. Many sites are, however, co-occupied by ARFs from both clades, suggesting transcriptional coordination for many genes. Among these, we investigate known QTLs and use machine learning to predict the impact of cis-regulatory variation. Overall, large-scale comparative analysis of ARF binding suggests that auxin response specificity may be determined by factors other than individual ARF binding site selection.

  2. Abstract

    DNA mechanical properties play a critical role in every aspect of DNA-dependent biological processes. Recently, a high-throughput assay named loop-seq has been developed to quantify the intrinsic bendability of a massive number of DNA fragments simultaneously. Using the loop-seq data, we develop a software tool, DNAcycP, based on a deep-learning approach for intrinsic DNA cyclizability prediction. We demonstrate that DNAcycP predicts intrinsic DNA cyclizability with high fidelity compared to the experimental data. Using an independent dataset from in vitro selection for enrichment of loopable sequences, we further verified that the predicted cyclizability score, termed C-score, distinguishes well between DNA fragments with different loopability. We applied DNAcycP to multiple species and compared the C-scores with available high-resolution chemical nucleosome maps. Our analyses showed that both yeast and mouse genomes share a conserved feature of high DNA bendability spanning nucleosome dyads. Additionally, we extended our analysis to transcription factor binding sites and surprisingly found that the cyclizability is substantially elevated at CTCF binding sites in the mouse genome. We further demonstrate that this distinct mechanical property is conserved across mammalian species and is inherent to the CTCF-binding DNA motif.

  3. Marshall-Colon, Amy (Ed.)
    Abstract Gene regulatory networks (GRNs) are defined by a cascade of transcriptional events by which signals, such as hormones or environmental cues, change development. To understand these networks, it is necessary to link specific transcription factors (TFs) to the downstream gene targets whose expression they regulate. Although multiple methods provide information on the targets of a single TF, moving from groups of co-expressed genes to the TF that controls them is more difficult. To facilitate this bottom-up approach, we have developed a web application named TF DEACoN. This application uses a publicly available Arabidopsis thaliana DNA Affinity Purification (DAP-Seq) data set to search for TFs that show enriched binding to groups of co-regulated genes. We used TF DEACoN to examine groups of transcripts regulated by treatment with the ethylene precursor 1-aminocyclopropane-1-carboxylic acid (ACC), using a transcriptional data set performed with high temporal resolution. We demonstrate the utility of this application when co-regulated genes are divided by timing of response or cell-type-specific information, which provides more information on TF/target relationships than when less defined and larger groups of co-regulated genes are used. This approach predicted TFs that may participate in ethylene-modulated root development including the TF NAM (NO APICAL MERISTEM). We used a genetic approach to show that a mutation in NAM reduces the negative regulation of lateral root development by ACC. The combination of filtering and TF DEACoN used here can be applied to any group of co-regulated genes to predict GRNs that control coordinated transcriptional responses.
  4. Abstract

    Identification of transcription factor binding sites (TFBSs) is essential to understanding gene regulation. Designing computational models for accurate prediction of TFBSs is crucial because it is not feasible to experimentally assay all transcription factors (TFs) in all sequenced eukaryotic genomes. Although many methods have been proposed for the identification of TFBSs in humans, methods designed for plants are comparatively underdeveloped. Here, we present PlantBind, a method for integrated prediction and interpretation of TFBSs based on DNA sequences and DNA shape profiles. Built on an attention-based multi-label deep learning framework, PlantBind not only simultaneously predicts the potential binding sites of 315 TFs, but also identifies the motifs bound by transcription factors. During the training process, this model revealed a strong similarity among TF family members with respect to target binding sequences. Trans-species prediction performance using four Zea mays TFs demonstrated the suitability of this model for transfer learning. Overall, this study provides an effective solution for identifying plant TFBSs, which will promote greater understanding of transcriptional regulatory mechanisms in plants.

  5. Polyacetylenic lipids accumulate in various Apiaceae species after pathogen attack, suggesting that these compounds are naturally occurring pesticides and potentially valuable resources for crop improvement. These compounds also promote human health and slow tumor growth. Even though polyacetylenic lipids were discovered decades ago, the biosynthetic pathway underlying their production is largely unknown. To begin filling this gap and ultimately enable polyacetylene engineering, we studied polyacetylenes and their biosynthesis in the major Apiaceae crop carrot (Daucus carota subsp. sativus). Using gas chromatography and mass spectrometry, we identified three known polyacetylenes and assigned provisional structures to two novel polyacetylenes. We also quantified these compounds in carrot leaf, petiole, root xylem, root phloem, and root periderm extracts. Falcarindiol and falcarinol predominated and accumulated primarily in the root periderm. Since the multiple double and triple carbon-carbon bonds that distinguish polyacetylenes from ubiquitous fatty acids are often introduced by Δ12 oleic acid desaturase (FAD2)-type enzymes, we mined the carrot genome for FAD2 genes. We identified a FAD2 family with an unprecedented 24 members and analyzed public, tissue-specific carrot RNA-Seq data to identify coexpressed members with root periderm-enhanced expression. Six candidate genes were heterologously expressed individually and in combination in yeast and Arabidopsis (Arabidopsis thaliana), resulting in the identification of one canonical FAD2 that converts oleic to linoleic acid, three divergent FAD2-like acetylenases that convert linoleic into crepenynic acid, and two bifunctional FAD2s with Δ12 and Δ14 desaturase activity that convert crepenynic into the further desaturated dehydrocrepenynic acid, a polyacetylene pathway intermediate.
These genes can now be used as a basis for discovering other steps of falcarin-type polyacetylene biosynthesis, to modulate polyacetylene levels in plants, and to test the in planta function of these molecules. Many organisms implement specialized biochemical pathways to convert ubiquitous metabolites into bioactive chemical compounds. Since plants comprise the majority of the human diet, specialized plant metabolites play crucial roles not only in crop biology but also in human nutrition. Some asterids produce lipid compounds called polyacetylenes (for review, see Negri, 2015) that exhibit antifungal activity (Garrod et al., 1978; Kemp, 1978; Harding and Heale, 1980, 1981; Olsson and Svensson, 1996) and accumulate in response to fungal phytopathogen attack (De Wit and Kodde, 1981; Elgersma and Liem, 1989). These observations have led to the longstanding hypothesis that polyacetylenes are natural pesticides. These same lipid compounds exhibit cytotoxic activity against human cancer cell lines and slow tumor growth (Fujimoto and Satoh, 1988; Matsunaga et al., 1989, 1990; Cunsolo et al., 1993; Bernart et al., 1996; Kobaek-Larsen et al., 2005; Zidorn et al., 2005), making them important nutritional compounds. The major source of polyacetylenes in the human diet is carrot (Daucus carota L.). Carrot is one of the most important crop species in the Apiaceae, with rapidly increasing worldwide cultivation (Rubatzky et al., 1999; Dawid et al., 2015). The most common carrot polyacetylenes are C17 linear aliphatic compounds containing two conjugated carbon-carbon triple bonds, one or two carbon-carbon double bonds, and a diversity of additional in-chain oxygen-containing functional groups. In carrot, the most abundant of these compounds are falcarinol and falcarindiol (Dawid et al., 2015). Based on their structures, it has been hypothesized that these compounds (alias falcarin-type polyacetylenes) are derived from ubiquitous fatty acids. 
Indeed, biochemical investigations (Haigh et al., 1968; Bohlman, 1988), radio-chemical tracer studies (Barley et al., 1988), and the discovery of pathway intermediates (Jones et al., 1966; Kawazu et al., 1973) implicate a diversion of flux away from linolenate biosynthesis as the entry point into falcarin-type polyacetylene biosynthesis (for review, see Minto and Blacklock, 2008). The final steps of linolenate biosynthesis are the conversion of oleate to linoleate, mediated by fatty acid desaturase 2 (FAD2), and linoleate to linolenate, catalyzed by FAD3. Some plant species contain divergent forms of FAD2 that, instead of or in addition to converting oleate to linoleate, catalyze the installation of unusual in-chain functional groups such as hydroxyl groups, epoxy groups, conjugated double bonds, or carbon-carbon triple bonds into the acyl chain (Badami and Patil, 1980) and thus divert flux from linolenate production into the accumulation of unusual fatty acids. Previous work in parsley (Petroselinum crispum; Apiaceae) identified a divergent form of FAD2 that (1) was up-regulated in response to pathogen treatment and (2) when expressed in soybean embryos resulted in production of the monoyne crepenynate and, by the action of an unassigned enzyme, dehydrocrepenynate (Kirsch et al., 1997; Cahoon et al., 2003). The results of the parsley studies are consistent with a pathogen-responsive, divergent FAD2-mediated pathway that leads to acetylenic fatty acids. However, information regarding the branch point into acetylenic fatty acid production in agriculturally relevant carrot is still largely missing, in particular, the identification and functional characterization of enzymes that can divert carbon flux away from linolenate biosynthesis into the production of dehydrocrepenynate and ultimately falcarin-type polyacetylenes. 
Such genes, once identified, could be used in the future design of transgenic carrot lines with altered polyacetylene content, enabling direct testing of in planta polyacetylene function and potentially the engineering of pathogen-resistant, more nutritious carrots. These genes could also provide the foundation for further investigations of more basic aspects of plant biology, including the evolution of fatty acid-derived natural product biosynthesis pathways across the Asterid clade, as well as the role of these pathways and compounds in plant ecology and plant defense. Recently, a high-quality carrot genome assembly was released (Iorizzo et al., 2016), providing a foundation for genome-enabled studies of Apiaceous species. This study also provided publicly accessible RNA sequencing (RNA-Seq) data from diverse carrot tissues. Using these resources, this study aimed to provide a detailed gas chromatography-based quantification of polyacetylenes in carrot tissues for which RNA-Seq data are available, then combine this information with bioinformatics analysis and heterologous expression to identify and characterize biosynthetic genes that underlie the major entry point into carrot polyacetylene biosynthesis. To achieve these goals, thin-layer chromatography (TLC) was combined with gas chromatography-mass spectrometry (GC-MS) and gas chromatography-flame ionization detection to identify and quantify polyacetylenic metabolites in five different carrot tissues. Then the sequences and tissue expression profiles of potential FAD2 and FAD2-like genes annotated in the D. carota genome were compared with the metabolite data to identify candidate pathway genes, followed by biochemical functionality tests using yeast (Saccharomyces cerevisiae) and Arabidopsis (Arabidopsis thaliana) as heterologous expression systems.