
Title: Measuring Phylogenetic Information of Incomplete Sequence Data
Abstract Widely used approaches for extracting phylogenetic information from aligned sets of molecular sequences rely upon probabilistic models of nucleotide substitution or amino-acid replacement. The phylogenetic information that can be extracted depends on the number of columns in the sequence alignment and will be decreased when the alignment contains gaps due to insertion or deletion events. Motivated by the measurement of information loss, we suggest assessment of the effective sequence length (ESL) of an aligned data set. The ESL can differ from the actual number of columns in a sequence alignment because of the presence of alignment gaps. Furthermore, the estimation of phylogenetic information is affected by model misspecification. Inevitably, the actual process of molecular evolution differs from the probabilistic models employed to describe this process. This disparity means the amount of phylogenetic information in an actual sequence alignment will differ from the amount in a simulated data set of equal size, which motivated us to develop a new test for model adequacy. Via theory and empirical data analysis, we show how to disentangle the effects of gaps and model misspecification. By comparing the Fisher information of actual and simulated sequences, we identify which alignment sites and tree branches are most affected by gaps and model misspecification. [Fisher information; gaps; insertion; deletion; indel; model adequacy; goodness-of-fit test; sequence alignment.]
Ho, Simon
Award ID(s):
Publication Date:
Journal Name:
Systematic Biology
Page Range or eLocation-ID:
630 to 648
Sponsoring Org:
National Science Foundation
More Like this
  1. dos Reis, Mario (Ed.)
    Abstract Ancestral sequence reconstruction (ASR) uses an alignment of extant protein sequences, a phylogeny describing the history of the protein family and a model of the molecular-evolutionary process to infer the sequences of ancient proteins, allowing researchers to directly investigate the impact of sequence evolution on protein structure and function. Like all statistical inferences, ASR can be sensitive to violations of its underlying assumptions. Previous studies have shown that, whereas phylogenetic uncertainty has only a very weak impact on ASR accuracy, uncertainty in the protein sequence alignment can more strongly affect inferred ancestral sequences. Here, we show that errors in sequence alignment can produce errors in ASR across a range of realistic and simplified evolutionary scenarios. Importantly, sequence reconstruction errors can lead to errors in estimates of structural and functional properties of ancestral proteins, potentially undermining the reliability of analyses relying on ASR. We introduce an alignment-integrated ASR approach that combines information from many different sequence alignments. We show that integrating alignment uncertainty improves ASR accuracy and the accuracy of downstream structural and functional inferences, often performing as well as highly accurate structure-guided alignment. Given the growing evidence that sequence alignment errors can impact the reliability of ASR studies, we recommend that future studies incorporate approaches to mitigate the impact of alignment uncertainty. Probabilistic modeling of insertion and deletion events has the potential to radically improve ASR accuracy when the model reflects the true underlying evolutionary history, but further studies are required to thoroughly evaluate the reliability of these approaches under realistic conditions.
  2. Abstract Arbuscular mycorrhizal fungi (AMF; Glomeromycota) are difficult to culture; therefore, establishing a robust amplicon-based approach to taxa identification is imperative to describe AMF diversity. Further, due to low and biased sampling of AMF taxa, molecular databases do not represent the breadth of AMF diversity, making database matching approaches suboptimal. Therefore, a full description of AMF diversity requires a tool to determine sequence-based placement in the Glomeromycota clade. Nonetheless, commonly used gene regions, including the SSU and ITS, do not enable reliable phylogenetic placement. Here, we present an improved database and pipeline for the phylogenetic determination of AMF using amplicons from the large subunit (LSU) rRNA gene. We improve our database and backbone tree by including additional outgroup sequences. We also improve an existing bioinformatics pipeline by aligning forward and reverse reads separately, using a universal alignment for all tree building, and implementing a BLAST screening prior to tree building to remove non-homologous sequences. Finally, we present a script to extract AMF belonging to 11 major families as well as an amplicon sequencing variant (ASV) version of our pipeline. We test the utility of the pipeline by testing the placement of known AMF, known non-AMF, and Acaulospora sp. spore sequences. This work represents the most comprehensive database and pipeline for phylogenetic placement of AMF LSU amplicon sequences within the Glomeromycota clade.
  3. ABSTRACT It is well known that the polarized continuum emission from magnetically aligned dust grains is determined to a large extent by local magnetic field structure. However, the observed significant anticorrelation between polarization fraction and column density may be strongly affected, perhaps even dominated by variations in grain alignment efficiency with local conditions, in contrast to standard assumptions of a spatially homogeneous grain alignment efficiency. Here we introduce a generic way to incorporate heterogeneous grain alignment into synthetic polarization observations of molecular clouds (MCs), through a simple model where the grain alignment efficiency depends on the local gas density as a power law. We justify the model using results derived from radiative torque alignment theory. The effects of power-law heterogeneous alignment models on synthetic observations of simulated MCs are presented. We find that the polarization fraction-column density correlation can be brought into agreement with observationally determined values through heterogeneous alignment, though there remains degeneracy with the relative strength of cloud-scale magnetized turbulence and the mean magnetic field orientation relative to the observer. We also find that the dispersion in polarization angles-polarization fraction correlation remains robustly correlated despite the simultaneous changes to both observables in the presence of heterogeneous alignment.
  4. Ponty, Yann (Ed.)
    Abstract Motivation: Detecting subtle biologically relevant patterns in protein sequences often requires the construction of a large and accurate multiple sequence alignment (MSA). Methods for constructing MSAs are usually evaluated using benchmark alignments, which, however, typically contain very few sequences and are therefore inappropriate when dealing with large numbers of proteins. Results: eCOMPASS addresses this problem using a statistical measure of relative alignment quality based on direct coupling analysis (DCA): To maintain protein structural integrity over evolutionary time, substitutions at one residue position typically result in compensating substitutions at other positions. eCOMPASS computes the statistical significance of the congruence between high scoring directly coupled pairs and 3D contacts in corresponding structures, which depends upon properly aligned homologous residues. We illustrate eCOMPASS using both simulated and real MSAs. Availability and Implementation: The eCOMPASS executable, C++ open source code and input data sets are available at Supplementary information: Supplementary data are available at Bioinformatics online.
  5. Abstract Target enrichment (such as Hyb-Seq) is a well-established high throughput sequencing method that has been increasingly used for phylogenomic studies. Unfortunately, current widely used pipelines for analysis of target enrichment data do not have a vigorous procedure to remove paralogs in target enrichment data. In this study, we develop a pipeline we call Putative Paralogs Detection (PPD) to better address putative paralogs from enrichment data. The new pipeline is an add-on to the existing HybPiper pipeline, and the entire pipeline applies criteria in both sequence similarity and heterozygous sites at each locus in the identification of paralogs. Users may adjust the thresholds of sequence identity and heterozygous sites to identify and remove paralogs according to the level of phylogenetic divergence of their group of interest. The new pipeline also removes highly polymorphic sites attributed to errors in sequence assembly and gappy regions in the alignment. We demonstrated the value of the new pipeline using empirical data generated from Hyb-Seq and the Angiosperm 353 kit for two woody genera Castanea (Fagaceae, Fagales) and Hamamelis (Hamamelidaceae, Saxifragales). Comparisons of datasets showed that the PPD identified many more putative paralogs than the popular method HybPiper. Comparisons of tree topologies and divergence times showed evident differences between data from HybPiper and data from our new PPD pipeline. We further evaluated the accuracy and error rates of PPD by BLAST mapping of putative paralogous and orthologous sequences to a reference genome sequence of Castanea mollissima. Compared to HybPiper alone, PPD identified substantially more paralogous gene sequences that mapped to multiple regions of the reference genome (31 genes for PPD compared with 4 genes for HybPiper alone). In conjunction with HybPiper, paralogous genes identified by both pipelines can be removed resulting in the construction of more robust orthologous gene datasets for phylogenomic and divergence time analyses. Our study demonstrates the value of Hyb-Seq with data derived from the Angiosperm 353 probe set for elucidating species relationships within a genus, and argues for the importance of additional steps to filter paralogous genes and poorly aligned regions (e.g., as occur through assembly errors), such as our new PPD pipeline described in this study.