Title: Predicting transcriptional activation domain function using Graph Neural Networks
Abstract: Analyzing the factors that make transcriptional activation domains functional remains a crucial yet challenging task, owing to the significant diversity of their sequences and their intrinsically disordered nature. Almost all existing methods for predicting activation domains rely either on traditional machine learning approaches, such as logistic regression, which are unable to capture complex patterns in data, or on plain convolutional neural networks, and they have been limited in their exploration of structural features. However, there is tremendous potential in inspecting the structural properties of activation domains, and an opportunity to investigate complex relationships between features of residues in the sequence. To address these gaps, we used graph neural networks (GNNs), which represent structural data as nodes and edges and allow nodes to exchange information among themselves. We experimented with two graph formulations, one with residues as nodes and the other with atoms as nodes. A logistic regression model was also developed to analyze feature importance. For all models, several feature combinations were tested. The residue-level GNN model with the feature combination of amino acid type, residue position, acidic/basic/aromatic property and secondary structure performed best, with accuracy, F1 score and AUROC of 97.9%, 71% and 97.1%, respectively, outperforming other existing methods in the literature when applied to the dataset we used. Among the other structure-based features analyzed, the amphipathic property of helices also proved to be an important feature for classification. Logistic regression results showed that the most dominant feature that makes a sequence functional is the frequency of different types of amino acids in the sequence. Our results have consistently shown that functional sequences have more acidic and aromatic residues, whereas basic residues are seen more in non-functional sequences.
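The residue-level graph formulation and the amino acid frequency feature described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`residue_graph`, `message_pass`, `composition`), the sequential-neighbour edge rule, the `window` parameter, and the grouping of D/E as acidic, K/R/H as basic and F/W/Y as aromatic are all assumptions made for illustration. Secondary structure is omitted because it would require an external predictor, and the aggregation step uses a plain neighbour mean with no learned weights.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed residue groupings; the record does not state the exact sets used.
ACIDIC, BASIC, AROMATIC = set("DE"), set("KRH"), set("FWY")


def residue_graph(seq, window=1):
    """Return (node_features, adjacency) with one node per residue.

    Features per node: one-hot amino acid type (20), normalised position (1),
    and acidic/basic/aromatic flags (3). Edges link residues at most `window`
    positions apart (an assumed rule); self-loops are included.
    """
    n = len(seq)
    feats = np.zeros((n, len(AMINO_ACIDS) + 4), dtype=float)
    for i, aa in enumerate(seq):
        feats[i, AMINO_ACIDS.index(aa)] = 1.0
        feats[i, 20] = i / max(n - 1, 1)
        feats[i, 21] = float(aa in ACIDIC)
        feats[i, 22] = float(aa in BASIC)
        feats[i, 23] = float(aa in AROMATIC)
    adj = np.zeros((n, n), dtype=float)
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            adj[i, j] = 1.0
    return feats, adj


def message_pass(feats, adj):
    """One round of mean aggregation: each node averages its neighbours' features."""
    degree = adj.sum(axis=1, keepdims=True)
    return adj @ feats / degree


def composition(seq):
    """Amino acid frequency vector, the kind of feature a logistic regression could use."""
    counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / len(seq)


if __name__ == "__main__":
    feats, adj = residue_graph("DWGGGG")
    hidden = message_pass(feats, adj)
    print(hidden.shape)
    print(composition("DWGGGG"))
```

A trained GNN would replace the plain mean with learned transformations and stack several such rounds before pooling node states into a single functional/non-functional prediction per sequence.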
Award ID(s):
1925646
PAR ID:
10541461
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
bioRxiv
Date Published:
Format(s):
Medium: X
Institution:
bioRxiv
Sponsoring Org:
National Science Foundation
More Like this
  1. Sara Osman Carolina Perdigoto (Ed.)
    Gene expression in all eukaryotes depends critically on the function of transcriptional activation domains of gene activator proteins. The conventional model for activation domain (AD) function is the direct physical recruitment of specific coactivators and transcriptional machinery components. However, ADs are short and astronomically variable sequences, with up to 10^24 possible interchangeable sequence variants for a single gene activator; each variant is intrinsically disordered in structure and interacts with its targets with low specificity and affinity. How these peptides recruit their targets is becoming increasingly difficult to explain, exposing a massive knowledge gap in molecular biology. Here, we show that the single required characteristic of ADs, consistent with their extreme variability, intrinsic structural disorder, and near-stochastic interaction mode, is an amphiphilic aromatic–acidic surfactant-like property. We propose that the AD surfactant, by triggering the local gene-promoter chromatin phase transition, catalyzes the formation of “transcription factory” condensates. We demonstrate that the presence of tryptophan and aspartic acid residues in the AD sequence is sufficient for in vivo functionality, even when present only as a single pair of residues within a 20-amino-acid sequence containing nothing more than 18 additional glycine residues. We demonstrate that the amphipathic α-helix structure, suggested previously as beneficial for AD function, is actually detrimental, and breaking this helix by inserting prolines significantly increases activation domain functionality. The proposed surfactant action mechanism based on near-stochastic interactions implied by the minimalistic activation domains changes not only the paradigm for the explanation of gene activation but also the fundamental biochemistry paradigm based on the specificity of sequence-to-structure-to-functional-interaction. The mechanism of activity regulation by near-stochastic allosteric interactions could easily be applied to other biological processes.
  2. Kaplan, C (Ed.)
    Abstract Transcription factors activate gene expression in development, homeostasis, and stress with DNA binding domains and activation domains. Although there exist excellent computational models for predicting DNA binding domains from protein sequence, models for predicting activation domains from protein sequence have lagged, particularly in metazoans. We recently developed a simple and accurate predictor of acidic activation domains on human transcription factors. Here, we show how the accuracy of this human predictor arises from the clustering of aromatic, leucine, and acidic residues, which together are necessary for acidic activation domain function. When we combine our predictor with the predictions of convolutional neural network (CNN) models trained in yeast, the intersection is more accurate than individual models, emphasizing that each approach carries orthogonal information. We synthesize these findings into a new set of activation domain predictions on human transcription factors. 
  3. Ozkan, Banu (Ed.)
    Abstract Invariant sites are a common feature of amino acid sequence evolution. The presence of invariant sites is frequently attributed to the need to preserve function through site-specific conservation of amino acid residues. Amino acid substitution models without a provision for invariant sites often fit the data significantly worse than those that allow for an excess of invariant sites beyond those predicted by models that only incorporate rate variation among sites (e.g., a Gamma distribution). An alternative is epistasis between sites to preserve residue interactions that can create invariant sites. Through computer-simulated sequence evolution, we evaluated the relative effects of site-specific preferences and site-site couplings in the generation of invariant sites and the modulation of the rate of molecular evolution. In an analysis of ten major families of protein domains with diverse sequence and functional properties, we find that the negative selection imposed by epistasis creates many more invariant sites than site-specific residue preferences alone. Further, epistasis plays an increasingly larger role in creating invariant sites over longer evolutionary periods. Epistasis also dictates rates of domain evolution over time by exerting significant additional purifying selection to preserve site couplings. These patterns illuminate the mechanistic role of epistasis in the processes underlying observed site invariance and evolutionary rates. 
  4. In proteins, proline-aromatic sequences exhibit increased frequencies of cis-proline amide bonds, via proposed C–H/π interactions between the aromatic ring and either the proline ring or the backbone C–Hα of the residue prior to proline. These interactions would be expected to result in tryptophan, as the most electron-rich aromatic residue, exhibiting the highest frequency of cis-proline. However, prior results from bioinformatics studies on proteins and experiments on proline-aromatic sequences in peptides have not revealed a clear correlation between the properties of the aromatic ring and the population of cis-proline. An investigation of the effects of aromatic residue (aromatic ring properties) on the conformation of proline-aromatic sequences was conducted using three distinct approaches: (1) NMR spectroscopy in model peptides of the sequence Ac-TGPAr-NH2 (Ar = encoded and unnatural aromatic amino acids); (2) bioinformatics analysis of structures in proline-aromatic sequences in the PDB; and (3) computational investigation using DFT and MP2 methods on models of proline-aromatic sequences and interactions. C–H/π and hydrophobic interactions were observed to stabilize local structures in both the trans-proline and cis-proline conformations, with both proline amide conformations exhibiting C–H/π interactions between the aromatic ring and Hα of the residue prior to proline (Hα-trans-Pro-aromatic and Hα-cis-Pro-aromatic interactions) and/or with the proline ring (trans-ProH-aromatic and cis-ProH-aromatic interactions). These C–H/π interactions were strongest with tryptophan (Trp) and weakest with cationic histidine (HisH+). Aromatic interactions with histidine were modulated in strength by His ionization state. Proline-aromatic sequences were associated with specific conformational poses, including type I and type VI β-turns. 
C–H/π interactions at the pre-proline Hα, which were stronger than interactions at Pro, stabilize normally less favorable conformations, including the ζ or αL conformations at the pre-proline residue, cis-proline, and/or the g+ χ1 rotamer or αL conformation at the aromatic residue. These results indicate that proline-aromatic sequences, especially Pro-Trp sequences, are loci to nucleate turns, helices, loops, and other local structures in proteins. These results also suggest that mutations that introduce proline-aromatic sequences, such as the R406W mutation that is associated with protein misfolding and aggregation in the microtubule-binding protein tau, might result in substantial induced structure, particularly in intrinsically disordered regions of proteins. 
  5. Abstract The Plant-Conserved Region (P-CR) and the Class-Specific Region (CSR) are two plant-unique sequences in the catalytic core of cellulose synthases (CESAs) for which specific functions have not been established. Here, we used site-directed mutagenesis to replace amino acids and motifs within these sequences predicted to be essential for assembly and function of CESAs. We developed an in vivo method to determine the ability of mutated CesA1 transgenes to complement an Arabidopsis (Arabidopsis thaliana) temperature-sensitive root-swelling1 (rsw1) mutant. Replacement of a Cys residue in the CSR, which blocks dimerization in vitro, rendered the AtCesA1 transgene unable to complement the rsw1 mutation. Examination of the CSR sequences from 33 diverse angiosperm species showed domains of high-sequence conservation in a class-specific manner but with variation in the degrees of disorder, indicating a nonredundant role of the CSR structures in different CESA isoform classes. The Cys residue essential for dimerization was not always located in domains of intrinsic disorder. Expression of AtCesA1 transgene constructs, in which Pro417 and Arg453 were substituted for Ala or Lys in the coiled-coil of the P-CR, were also unable to complement the rsw1 mutation. Despite an expected role for Arg457 in trimerization of CESA proteins, AtCesA1 transgenes with Arg457Ala mutations were able to fully restore the wild-type phenotype in rsw1. Our data support that Cys662 within the CSR and Pro417 and Arg453 within the P-CR of Arabidopsis CESA1 are essential residues for functional synthase complex formation, but our data do not support a specific role for Arg457 in trimerization in native CESA complexes. 