Title: Alignment-Integrated Reconstruction of Ancestral Sequences Improves Accuracy
Abstract Ancestral sequence reconstruction (ASR) uses an alignment of extant protein sequences, a phylogeny describing the history of the protein family and a model of the molecular-evolutionary process to infer the sequences of ancient proteins, allowing researchers to directly investigate the impact of sequence evolution on protein structure and function. Like all statistical inferences, ASR can be sensitive to violations of its underlying assumptions. Previous studies have shown that, whereas phylogenetic uncertainty has only a very weak impact on ASR accuracy, uncertainty in the protein sequence alignment can more strongly affect inferred ancestral sequences. Here, we show that errors in sequence alignment can produce errors in ASR across a range of realistic and simplified evolutionary scenarios. Importantly, sequence reconstruction errors can lead to errors in estimates of structural and functional properties of ancestral proteins, potentially undermining the reliability of analyses relying on ASR. We introduce an alignment-integrated ASR approach that combines information from many different sequence alignments. We show that integrating alignment uncertainty improves ASR accuracy and the accuracy of downstream structural and functional inferences, often performing as well as highly accurate structure-guided alignment. Given the growing evidence that sequence alignment errors can impact the reliability of ASR studies, we recommend that future studies incorporate approaches to mitigate the impact of alignment uncertainty. Probabilistic modeling of insertion and deletion events has the potential to radically improve ASR accuracy when the model reflects the true underlying evolutionary history, but further studies are required to thoroughly evaluate the reliability of these approaches under realistic conditions.
Award ID(s):
1817942
PAR ID:
10351543
Author(s) / Creator(s):
;
Editor(s):
dos Reis, Mario
Date Published:
Journal Name:
Genome Biology and Evolution
Volume:
12
Issue:
9
ISSN:
1759-6653
Page Range / eLocation ID:
1549 to 1565
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Sequence alignment is an essential method in bioinformatics and the basis of many analyses, including phylogenetic inference, ancestral sequence reconstruction, and gene annotation. Sequencing artifacts and errors made during genome assembly, such as abiological frameshifts and incorrect early stop codons, can impact downstream analyses leading to erroneous conclusions in comparative and functional genomic studies. More significantly, while indels can occur both within and between codons in natural sequences, most amino-acid- and codon-based aligners assume that indels only occur between codons. This mismatch between biology and alignment algorithms produces suboptimal alignments and errors in downstream analyses. To address these issues, we present COATi, a statistical, codon-aware pairwise aligner that supports complex insertion–deletion models and can handle artifacts present in genomic data. COATi allows users to reduce the amount of discarded data while generating more accurate sequence alignments. COATi can infer indels both within and between codons, leading to improved sequence alignments. We applied COATi to a dataset containing orthologous protein-coding sequences from humans and gorillas and conclude that 41% of indels occurred between codons, agreeing with previous work in other species. We also applied COATi to semiempirical benchmark alignments and find that it outperforms several popular alignment programs on several measures of alignment quality and accuracy. 
  2. Abstract Ancestral sequence reconstruction (ASR) is a powerful tool to study the evolution of proteins and thus gain deep insight into the relationships among protein sequence, structure, and function. A major barrier to its broad use is the complexity of the task: it requires multiple software packages, complex file manipulations, and expert phylogenetic knowledge. Here we introduce topiary, a software pipeline that aims to overcome this barrier. To use topiary, users prepare a spreadsheet with a handful of sequences. Topiary then: (1) Infers the taxonomic scope for the ASR study and finds relevant sequences by BLAST; (2) Does taxonomically informed sequence quality control and redundancy reduction; (3) Constructs a multiple sequence alignment; (4) Generates a maximum‐likelihood gene tree; (5) Reconciles the gene tree to the species tree; (6) Reconstructs ancestral amino acid sequences; and (7) Determines branch supports. The pipeline returns annotated evolutionary trees, spreadsheets with sequences, and graphical summaries of ancestor quality. This is achieved by integrating modern phylogenetics software (Muscle5, RAxML‐NG, GeneRax, and PastML) with online databases (NCBI and the Open Tree of Life). In this paper, we introduce non‐expert readers to the steps required for ASR, describe the specific design choices made in topiary, provide a detailed protocol for users, and then validate the pipeline using datasets from a broad collection of protein families. Topiary is freely available for download: https://github.com/harmslab/topiary.
  3. Norwick, Katja (Ed.)
    Abstract In plants, miRNA production is orchestrated by a suite of proteins that control transcription of the pri-miRNA gene, post-transcriptional processing and nuclear export of the mature miRNA. Post-transcriptional processing of miRNAs is controlled by a pair of physically interacting proteins, hyponastic leaves 1 (HYL1) and Dicer-like 1 (DCL1). However, the evolutionary history and structural basis of the HYL1–DCL1 interaction are unknown. Here we use ancestral sequence reconstruction and functional characterization of ancestral HYL1 in vitro and in Arabidopsis thaliana to better understand the origin and evolution of the HYL1–DCL1 interaction and its impact on miRNA production and plant development. We found the ancestral plant HYL1 evolved high affinity for both double-stranded RNA (dsRNA) and its DCL1 partner before the divergence of mosses from seed plants (∼500 Ma), and these high-affinity interactions remained largely conserved throughout plant evolutionary history. Structural modeling and molecular binding experiments suggest that the second of two dsRNA-binding motifs (DSRMs) in HYL1 may interact tightly with the first of two C-terminal DCL1 DSRMs to mediate the HYL1–DCL1 physical interaction necessary for efficient miRNA production. Transgenic expression of the nearly 200 Ma-old ancestral flowering-plant HYL1 in A. thaliana was sufficient to rescue many key aspects of plant development disrupted by HYL1− knockout and restored near-native miRNA production, suggesting that the functional partnership of HYL1–DCL1 originated very early in and was strongly conserved throughout the evolutionary history of terrestrial plants. Overall, our results are consistent with a model in which miRNA-based gene regulation evolved as part of a conserved plant “developmental toolkit.”
  4. Gram-positive bacteria are some of the earliest known life forms, diverging from gram-negative bacteria 2 billion years ago. These organisms utilize sortase enzymes to attach proteins to their peptidoglycan cell wall, a structural feature that distinguishes the two types of bacteria. The transpeptidase activity of sortases makes them an important tool in protein engineering applications, e.g., in sortase-mediated ligations or sortagging. However, due to relatively low catalytic efficiency, there are ongoing efforts to create better sortase variants for these uses. Here, we use bioinformatics tools, principal component analysis and ancestral sequence reconstruction, in combination with protein biochemistry, to analyze natural sequence variation in these enzymes. Principal component analysis on the sortase superfamily distinguishes previously described classes and identifies regions of relatively high sequence variation in structurally-conserved loops within each sortase family, including those near the active site. Using ancestral sequence reconstruction, we determined sequences of ancestral Staphylococcus and Streptococcus Class A sortase proteins. Enzyme assays revealed that the ancestral Streptococcus enzyme is relatively active and shares similar sequence variation with other Class A Streptococcus sortases. Taken together, we highlight how natural sequence variation can be utilized to investigate this important protein family, arguing that these and similar techniques may be used to discover or design sortases with increased catalytic efficiency and/or selectivity for sortase-mediated ligation experiments.
  5. Abstract The nitrogenase metalloenzyme family, essential for supplying fixed nitrogen to the biosphere, is one of life's key biogeochemical innovations. The three forms of nitrogenase differ in their metal dependence, each binding either a FeMo‐, FeV‐, or FeFe‐cofactor where the reduction of dinitrogen takes place. The history of nitrogenase metal dependence has been of particular interest due to the possible implication that ancient marine metal availabilities have significantly constrained nitrogenase evolution over geologic time. Here, we reconstructed the evolutionary history of nitrogenases, and combined phylogenetic reconstruction, ancestral sequence inference, and structural homology modeling to evaluate the potential metal dependence of ancient nitrogenases. We find that active‐site sequence features can reliably distinguish extant Mo‐nitrogenases from V‐ and Fe‐nitrogenases and that inferred ancestral sequences at the deepest nodes of the phylogeny suggest these ancient proteins most resemble modern Mo‐nitrogenases. Taxa representing early‐branching nitrogenase lineages lack one or more biosynthetic nifE and nifN genes that both contribute to the assembly of the FeMo‐cofactor in studied organisms, suggesting that early Mo‐nitrogenases may have utilized an alternate and/or simplified pathway for cofactor biosynthesis. Our results underscore the profound impacts that protein‐level innovations likely had on shaping global biogeochemical cycles throughout the Precambrian, in contrast to organism‐level innovations that characterize the Phanerozoic Eon.