Phylogenomic analyses have revolutionized the study of biodiversity, but they have revealed that estimated tree topologies can depend, at least in part, on the subset of the genome that is analyzed. For example, estimates of trees for avian orders differ if protein-coding or non-coding data are analyzed. The bird tree is a good study system because the historical signal for relationships among orders is very weak, which should permit subtle non-historical signals to be identified, while monophyly of orders is strongly corroborated, allowing identification of strong non-historical signals. Hydrophobic amino acids in mitochondrially-encoded proteins, which are expected to be found in transmembrane helices, have been hypothesized to be associated with non-historical signals. We tested this hypothesis by comparing the evolution of transmembrane helices and extramembrane segments of mitochondrial proteins from 420 bird species, sampled from most avian orders. We estimated amino acid exchangeabilities for both structural environments and assessed the performance of phylogenetic analysis using each data type. We compared those relative exchangeabilities with values calculated using a substitution matrix for transmembrane helices estimated using a variety of nuclear- and mitochondrially-encoded proteins, allowing us to compare the bird-specific mitochondrial models with a general model of transmembrane protein evolution. To complement our amino acid analyses, we examined the impact of protein structure on patterns of nucleotide evolution. Models of transmembrane and extramembrane sequence evolution for amino acids and nucleotides exhibited striking differences, but there was no evidence for strong topological data type effects. However, incorporating protein structure into analyses of mitochondrially-encoded proteins improved model fit. Thus, we believe that considering protein structure will improve analyses of mitogenomic data, both in birds and in other taxa.
GTRpmix: A Linked General Time-Reversible Model for Profile Mixture Models
Abstract: Profile mixture models capture distinct biochemical constraints on the amino acid substitution process at different sites in proteins. These models feature a mixture of time-reversible models with a common matrix of exchangeabilities and distinct sets of equilibrium amino acid frequencies known as profiles. Combining the exchangeability matrix with each profile generates the matrix of instantaneous rates of amino acid exchange for that profile. Currently, empirically estimated exchangeability matrices (e.g. the LG matrix) are widely used for phylogenetic inference under profile mixture models. However, these were estimated using a single profile and are unlikely to be optimal for profile mixture models. Here, we describe the GTRpmix model, which allows maximum likelihood estimation of a common exchangeability matrix under any profile mixture model. We show that exchangeability matrices estimated under profile mixture models differ from the LG matrix, dramatically improving model fit and topological estimation accuracy for empirical test cases. Because the GTRpmix model is computationally expensive, we provide two exchangeability matrices estimated from large concatenated phylogenomic supermatrices to be used for phylogenetic analyses. One, called Eukaryotic Linked Mixture (ELM), is designed for phylogenetic analysis of proteins encoded by nuclear genomes of eukaryotes, and the other, Eukaryotic and Archaeal Linked mixture (EAL), for reconstructing relationships between eukaryotes and Archaea. These matrices, combined with profile mixture models, fit data better and have improved topology estimation relative to the LG matrix combined with the same mixture models. Starting with version 2.3.1, IQ-TREE2 allows users to estimate linked exchangeabilities (i.e. amino acid exchange rates) under profile mixture models.
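The statement that combining the exchangeability matrix with each profile generates a rate matrix can be illustrated with a minimal sketch. It assumes the standard time-reversible construction, Q[i, j] = S[i, j] * pi[j] for i ≠ j, with each row summing to zero. The function names, the random toy inputs, and the per-profile normalization to one expected substitution per unit time are illustrative assumptions, not the GTRpmix or IQ-TREE2 implementation.

```python
import numpy as np


def rate_matrix(S, pi):
    """Build a time-reversible rate matrix from exchangeabilities S and a profile pi.

    Assumes S is symmetric and pi is a vector of positive frequencies summing to 1.
    """
    S = np.asarray(S, dtype=float)
    pi = np.asarray(pi, dtype=float)
    Q = S * pi[np.newaxis, :]            # off-diagonal: Q[i, j] = S[i, j] * pi[j]
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # each row sums to zero
    # Illustrative choice: scale so the expected rate under pi equals 1.
    expected_rate = -np.dot(pi, np.diag(Q))
    return Q / expected_rate


def mixture_rate_matrices(S, profiles):
    """One rate matrix per profile, all sharing ("linking") the same S."""
    return [rate_matrix(S, p) for p in profiles]


def make_toy_inputs(n_states=20, n_profiles=3, seed=0):
    """Hypothetical random inputs; real analyses would use estimated values."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.uniform(0.1, 2.0, size=(n_states, n_states)), k=1)
    S = upper + upper.T                  # symmetric exchangeabilities
    profiles = rng.dirichlet(np.ones(n_states), size=n_profiles)
    return S, profiles


if __name__ == "__main__":
    S, profiles = make_toy_inputs()
    for k, Q in enumerate(mixture_rate_matrices(S, profiles)):
        print(f"profile {k}: max |row sum| = {np.abs(Q.sum(axis=1)).max():.2e}")
```

Because every mixture class reuses the same S, only the profiles differ between classes; this sharing is what "linked" exchangeabilities refers to in the abstract.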
- PAR ID: 10539836
- Publisher / Repository: Oxford University Press
- Date Published:
- Journal Name: Molecular Biology and Evolution
- Volume: 41
- Issue: 9
- ISSN: 0737-4038
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract Motivation: Most proteins perform their biological functions through interactions with other proteins in cells. Amino acid mutations, especially those occurring at protein interfaces, can change the stability of protein–protein interactions (PPIs) and impact their functions, which may cause various human diseases. Quantitative estimation of the binding affinity changes (ΔΔGbind) caused by mutations can provide critical information for protein function annotation and genetic disease diagnoses. Results: We present SSIPe, which combines protein interface profiles, collected from structural and sequence homology searches, with a physics-based energy function for accurate ΔΔGbind estimation. To offset the statistical limits of the PPI structure and sequence databases, amino acid-specific pseudocounts were introduced to enhance the profile accuracy. SSIPe was evaluated on large-scale experimental data containing 2204 mutations from 177 proteins, where training and test datasets were stringently separated with the sequence identity between proteins from the two datasets below 30%. The Pearson correlation coefficient between estimated and experimental ΔΔGbind was 0.61 with a root-mean-square-error of 1.93 kcal/mol, which was significantly better than the other methods. Detailed data analyses revealed that the major advantage of SSIPe over other traditional approaches lies in the novel combination of the physical energy function with the new knowledge-based interface profile. SSIPe also considerably outperformed a former profile-based method (BindProfX) due to the newly introduced sequence profiles and optimized pseudocount technique that allows for consideration of amino acid-specific prior mutation probabilities. Availability and implementation: Web-server/standalone program, source code and datasets are freely available at https://zhanglab.ccmb.med.umich.edu/SSIPe and https://github.com/tommyhuangthu/SSIPe.
Supplementary information: Supplementary data are available at Bioinformatics online.
-
Johnson, Patricia J (Ed.) ABSTRACT: Analyses of codon usage in eukaryotes suggest that amino acid usage responds to GC pressure so AT-biased substitutions drive higher usage of amino acids with AT-ending codons. Here, we combine single-cell transcriptomics and phylogenomics to explore codon usage patterns in foraminifera, a diverse and ancient clade of predominantly uncultivable microeukaryotes. We curate data from 1,044 gene families in 49 individuals representing 28 genera, generating perhaps the largest existing dataset from a predominantly uncultivable clade of protists, to analyze compositional bias and codon usage. We find extreme variation in composition, with a median GC content at fourfold degenerate silent sites below 3% in some species and above 75% in others. The most AT-biased species are distributed among diverse non-monophyletic lineages. Surprisingly, despite the extreme variation in compositional bias, amino acid usage is highly conserved across all foraminifera. By analyzing nucleotide, codon, and amino acid composition within this diverse clade of amoeboid eukaryotes, we expand our knowledge of patterns of genome evolution across the eukaryotic tree of life. IMPORTANCE: Patterns of molecular evolution in protein-coding genes reflect trade-offs between substitution biases and selection on both codon and amino acid usage. Most analyses of these factors in microbial eukaryotes focus on model species such as Acanthamoeba, Plasmodium, and yeast, where substitution bias is a primary contributor to patterns of amino acid usage. Foraminifera, an ancient clade of single-celled eukaryotes, present a conundrum, as we find highly conserved amino acid usage underlain by divergent nucleotide composition, including extreme AT-bias at silent sites among multiple non-sister lineages.
We speculate that these paradoxical patterns are enabled by the dynamic genome structure of foraminifera, whose life cycles can include genome endoreplication and chromatin extrusion.
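The GC content at fourfold degenerate silent sites mentioned in this abstract can be computed along the following lines. This is only a sketch: it assumes the standard genetic code (eight codon families whose third position is fully degenerate) and an in-frame coding sequence, and the function name `gc4` is hypothetical rather than taken from the study.

```python
# First two bases of the eight fourfold degenerate codon families in the
# standard genetic code: Leu (CT), Val (GT), Ser (TC), Pro (CC),
# Thr (AC), Ala (GC), Arg (CG), Gly (GG).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def gc4(cds):
    """Fraction of G or C at third positions of fourfold degenerate codons.

    Assumes `cds` is an in-frame coding sequence. Codons whose third base
    is ambiguous (e.g. N) are skipped. Returns None if no fourfold
    degenerate codon is present.
    """
    seq = cds.upper().replace("U", "T")
    gc = 0
    total = 0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon[:2] in FOURFOLD_PREFIXES and codon[2] in "ACGT":
            total += 1
            if codon[2] in "GC":
                gc += 1
    return gc / total if total else None


if __name__ == "__main__":
    # GCT, GCC and GGA are fourfold degenerate; ATG is not.
    print(gc4("GCTGCCATGGGA"))
```

A per-species median, as reported in the abstract, would then be taken over the values computed for each gene.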
-
Abstract Genetic code expansion technology allows for the use of noncanonical amino acids (ncAAs) to create semisynthetic organisms for both biochemical and biomedical applications. However, exogenous feeding of chemically synthesized ncAAs at high concentrations is required to compensate for the inefficient cellular uptake and incorporation of these components into proteins, especially in the case of eukaryotic cells and multicellular organisms. To generate organisms capable of autonomously biosynthesizing an ncAA and incorporating it into proteins, we have engineered a metabolic pathway for the synthesis of O‐methyltyrosine (OMeY). Specifically, we endowed organisms with a marformycins biosynthetic pathway‐derived methyltransferase that efficiently converts tyrosine to OMeY in the presence of the co‐factor S‐adenosylmethionine. The resulting cells can produce and site‐specifically incorporate OMeY into proteins at much higher levels than cells exogenously fed OMeY. To understand the structural basis for the substrate selectivity of the transferase, we solved the X‐ray crystal structures of the ligand‐free and tyrosine‐bound enzymes. Most importantly, we have extended this OMeY biosynthetic system to both mammalian cells and the zebrafish model to enhance the utility of genetic code expansion. The creation of autonomous eukaryotes using a 21st amino acid will make genetic code expansion technology more applicable to multicellular organisms, providing valuable vertebrate models for biological and biomedical research.
-
Abstract Complex N-glycans are asparagine (N)-linked branched sugar chains attached to secretory proteins in eukaryotes. They are produced by modification of N-linked oligosaccharide structures in the endoplasmic reticulum and Golgi apparatus. Complex N-glycans formed in the Golgi apparatus are often assigned specific roles unique to the host organism, with their roles in plants remaining largely unknown. Using inhibitor (kifunensine, KIF) hypersensitivity as read out, we identified Arabidopsis mutants that require complex N-glycan modification. Among >100 KIF-sensitive mutants, one showing abnormal secretory organelles and a salt-sensitive phenotype contained a point mutation leading to amino acid replacement (G69R) in ARFA1E, a small Arf1-GTPase family protein presumably involved in vesicular transport. In vitro assays showed that the G69R exchange interferes with protein activation. In vivo, ARFA1EG69R caused dominant-negative effects, altering the morphology of the endoplasmic reticulum, Golgi apparatus, and trans-Golgi network (TGN). Post-Golgi transport (endocytosis/endocytic recycling) of the essential glycoprotein KORRIGAN1, one of the KIF sensitivity targets, is slowed down constitutively as well as under salt stress in the ARFA1EG69R mutant. Because regulated cycling of plasma membrane proteins is required for stress tolerance of the host plants, the ARFA1EG69R mutant established a link between KIF-targeted luminal glycoprotein functions/dynamics and cytosolic regulators of vesicle transport in endosome-/cell wall-associated tolerance mechanisms.
