Title: LCD-Composer: an intuitive, composition-centric method enabling the identification and detailed functional mapping of low-complexity domains
Abstract: Low complexity domains (LCDs) in proteins are regions predominantly composed of a small subset of the possible amino acids. LCDs are involved in a variety of normal and pathological processes across all domains of life. Existing methods define LCDs using information-theoretical complexity thresholds, sequence alignment with repetitive regions, or statistical overrepresentation of amino acids relative to whole-proteome frequencies. While these methods have proven valuable, they are all indirectly quantifying amino acid composition, which is the fundamental and biologically-relevant feature related to protein sequence complexity. Here, we present a new computational tool, LCD-Composer, that directly identifies LCDs based on amino acid composition and linear amino acid dispersion. Using LCD-Composer's default parameters, we identified simple LCDs across all organisms available through UniProt and provide the resulting data in an accessible form as a resource. Furthermore, we describe large-scale differences between organisms from different domains of life and explore organisms with extreme LCD content for different LCD classes. Finally, we illustrate the versatility and specificity achievable with LCD-Composer by identifying diverse classes of LCDs using both simple and multifaceted composition criteria. We demonstrate that the ability to dissect LCDs based on these multifaceted criteria enhances the functional mapping and classification of LCDs.
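To make the idea of composition-based identification concrete, the following is a minimal sketch of a sliding-window composition scan. It is not the LCD-Composer implementation: the function name `find_composition_domains`, the window size, and the threshold are hypothetical, illustrative choices, and the linear dispersion criterion mentioned in the abstract is deliberately left out.

```python
from typing import List, Tuple


def find_composition_domains(
    seq: str,
    residues: str = "Q",        # amino acid(s) of interest (assumed example)
    window: int = 20,           # window length (illustrative value)
    min_fraction: float = 0.4,  # minimum composition per window (illustrative value)
) -> List[Tuple[int, int]]:
    """Return merged 0-based, half-open (start, end) spans in which every
    contributing window has at least `min_fraction` of its positions
    occupied by any of `residues`."""
    targets = set(residues)
    spans: List[Tuple[int, int]] = []
    if window <= 0 or len(seq) < window:
        return spans

    # Count target residues in the first window, then update incrementally.
    count = sum(1 for aa in seq[:window] if aa in targets)
    for start in range(len(seq) - window + 1):
        if start > 0:
            count -= seq[start - 1] in targets           # residue leaving the window
            count += seq[start + window - 1] in targets  # residue entering the window
        if count / window >= min_fraction:
            end = start + window
            if spans and start <= spans[-1][1]:
                # Overlapping or touching windows are merged into one domain.
                spans[-1] = (spans[-1][0], end)
            else:
                spans.append((start, end))
    return spans


if __name__ == "__main__":
    # Toy sequence: a five-residue Q tract flanked by alanines.
    print(find_composition_domains("AAAAAQQQQQAAAAA", "Q", window=5, min_fraction=0.8))
```

Grouped amino acids can be passed as a multi-letter string (for example `residues="QN"`), which treats the group as a single composition class in this sketch.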
Award ID(s):
1817622
PAR ID:
10257354
Author(s) / Creator(s):
Date Published:
Journal Name:
NAR Genomics and Bioinformatics
Volume:
3
Issue:
2
ISSN:
2631-9268
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Summary: Low-complexity domains (LCDs) in proteins are regions enriched in a small subset of amino acids. LCDs exist in all domains of life, often have unusual biophysical behavior, and function in both normal and pathological processes. We recently developed an algorithm to identify LCDs based predominantly on amino acid composition thresholds. Here, we have integrated this algorithm with a webserver and augmented it with additional analysis options. Specifically, users can (i) search for LCDs in whole proteomes by setting minimum composition thresholds for individual or grouped amino acids, (ii) submit a known LCD sequence to search for similar LCDs, (iii) search for and plot LCDs within a single protein, (iv) statistically test for enrichment of LCDs within a user-provided protein set and (v) specifically identify proteins with multiple types of LCDs. Availability and implementation: The LCD-Composer server can be accessed at http://lcd-composer.bmb.colostate.edu. The corresponding command-line scripts can be accessed at https://github.com/RossLabCSU/LCD-Composer/tree/master/WebserverScripts.
  2. Abstract Modern life is essentially homochiral, containing D-sugars in nucleic acid backbones and L-amino acids in proteins. Since coded proteins are theorized to have developed from a prebiotic RNA World, the homochirality of L-amino acids observed in all known life presumably resulted from chiral transfer from a homochiral D-RNA World. This transfer would have been mediated by aminoacyl-RNAs defining the genetic code. Previous work on aminoacyl transfer using tRNA mimics has suggested that aminoacylation using D-RNA may be inherently biased toward reactivity with L-amino acids, implying a deterministic path from a D-RNA World to L-proteins. Using a model system of self-aminoacylating D-ribozymes and epimerizable activated amino acid analogs, we test the chiral selectivity of 15 ribozymes derived from an exhaustive search of sequence space. All of the ribozymes exhibit detectable selectivity, and a substantial fraction react preferentially to produce the D-enantiomer of the product. Furthermore, chiral preference is conserved within sequence families. These results are consistent with the transfer of chiral information from RNA to proteins but do not support an intrinsic bias of D-RNA for L-amino acids. Different aminoacylation structures result in different directions of chiral selectivity, such that L-proteins need not emerge from a D-RNA World. 
  3. Nature encodes the information required for life in two fundamental biopolymers: nucleic acids and proteins. Peptide nucleic acid (PNA), a synthetic analog comprised of nucleobases arrayed along a pseudopeptide backbone, has the ability to combine the power of nucleic acids to encode information with the versatility of amino acids to encode structure and function. Historically, PNA has been perceived as a simple nucleic acid mimic having desirable properties such as high biostability and strong affinity for complementary nucleic acids. In this feature article, we aim to adjust this perception by highlighting the ability of PNA to act as a peptide mimic and showing the largely untapped potential to encode information in the amino acid sequence. First, we provide an introduction to PNA and discuss the use of conjugation to impart tunable properties to the biopolymer. Next, we describe the integration of functional groups directly into the PNA backbone to impart specific physical properties. Lastly, we highlight the use of these integrated amino acid side chains to encode peptide-like sequences in the PNA backbone, imparting novel activity and function and demonstrating the ability of PNA to simultaneously mimic both a peptide and a nucleic acid. 
  4. The current “consensus” order in which amino acids were added to the genetic code is based on potentially biased criteria, such as the absence of sulfur-containing amino acids from the Urey–Miller experiment which lacked sulfur. More broadly, abiotic abundance might not reflect biotic abundance in the organisms in which the genetic code evolved. Here, we instead identify which protein domains date to the last universal common ancestor (LUCA) and then infer the order of recruitment from deviations of their ancestrally reconstructed amino acid frequencies from the still-ancient post-LUCA controls. We find that smaller amino acids were added to the code earlier, with no additional predictive power in the previous consensus order. Metal-binding (cysteine and histidine) and sulfur-containing (cysteine and methionine) amino acids were added to the genetic code much earlier than previously thought. Methionine and histidine were added to the code earlier than expected from their molecular weights and glutamine later. Early methionine availability is compatible with inferred early use of S-adenosylmethionine and early histidine with its purine-like structure and the demand for metal binding. Even more ancient protein sequences—those that had already diversified into multiple distinct copies prior to LUCA—have significantly higher frequencies of aromatic amino acids (tryptophan, tyrosine, phenylalanine, and histidine) and lower frequencies of valine and glutamic acid than single-copy LUCA sequences. If at least some of these sequences predate the current code, then their distinct enrichment patterns provide hints about earlier, alternative genetic codes. 
  5. Johnson, Patricia J (Ed.)
    ABSTRACT Analyses of codon usage in eukaryotes suggest that amino acid usage responds to GC pressure so AT-biased substitutions drive higher usage of amino acids with AT-ending codons. Here, we combine single-cell transcriptomics and phylogenomics to explore codon usage patterns in foraminifera, a diverse and ancient clade of predominantly uncultivable microeukaryotes. We curate data from 1,044 gene families in 49 individuals representing 28 genera, generating perhaps the largest existing dataset of data from a predominantly uncultivable clade of protists, to analyze compositional bias and codon usage. We find extreme variation in composition, with a median GC content at fourfold degenerate silent sites below 3% in some species and above 75% in others. The most AT-biased species are distributed among diverse non-monophyletic lineages. Surprisingly, despite the extreme variation in compositional bias, amino acid usage is highly conserved across all foraminifera. By analyzing nucleotide, codon, and amino acid composition within this diverse clade of amoeboid eukaryotes, we expand our knowledge of patterns of genome evolution across the eukaryotic tree of life. IMPORTANCE Patterns of molecular evolution in protein-coding genes reflect trade-offs between substitution biases and selection on both codon and amino acid usage. Most analyses of these factors in microbial eukaryotes focus on model species such as Acanthamoeba, Plasmodium, and yeast, where substitution bias is a primary contributor to patterns of amino acid usage. Foraminifera, an ancient clade of single-celled eukaryotes, present a conundrum, as we find highly conserved amino acid usage underlain by divergent nucleotide composition, including extreme AT-bias at silent sites among multiple non-sister lineages. We speculate that these paradoxical patterns are enabled by the dynamic genome structure of foraminifera, whose life cycles can include genome endoreplication and chromatin extrusion.