skip to main content

Title: A multiple alignment workflow shows the effect of repeat masking and parameter tuning on alignment in plants

Alignments of multiple genomes are a cornerstone of comparative genomics, but generating these alignments remains technically challenging and often impractical. We developed themsa_pipelineworkflow ( to allow practical and sensitive multiple alignment of diverged plant genomes and calculation of conservation scores with minimal user inputs. As high repeat content and genomic divergence are substantial challenges in plant genome alignment, we also explored the effect of different masking approaches and parameters of the LAST aligner using genome assemblies of 33 grass species. Compared with conventional masking with RepeatMasker, a masking approach based onk‐mers (nucleotide sequences ofklength) increased the alignment rate of coding sequence and noncoding functional regions by 25 and 14%, respectively. We further found that default alignment parameters generally perform well, but parameter tuning can increase the alignment rate for noncoding functional regions by over 52% compared with default LAST settings. Finally, by increasing alignment sensitivity from the default baseline, parameter tuning can increase the number of noncoding sites that can be scored for conservation by over 76%. Overall, tuning of masking and alignment parameters can generate optimized multiple alignments to drive biological discovery in plants.

more » « less
Award ID(s):
1907343 1822330
Author(s) / Creator(s):
 ;  ;  ;  ;  ;  ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
The Plant Genome
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Alignments of multiple genomes are a cornerstone of comparative genomics, but generating these alignments remains technically challenging and often impractical. We developed the msa_pipeline workflow ( based on the LAST aligner to allow practical and sensitive multiple alignment of diverged plant genomes with minimal user inputs. Our workflow only requires a set of genomes in FASTA format as input. The workflow outputs multiple alignments in MAF format, and includes utilities to help calculate genome-wide conservation scores. As high repeat content and genomic divergence are substantial challenges in plant genome alignment, we also explored the impact of different masking approaches and alignment parameters using genome assemblies of 33 grass species. Compared to conventional masking with RepeatMasker, a k-mer masking approach increased the alignment rate of CDS and non-coding functional regions by 25% and 14% respectively. We further found that default alignment parameters generally perform well, but parameter tuning can increase the alignment rate for non-coding functional regions by over 52% compared to default LAST settings. Finally, by increasing alignment sensitivity from the default baseline, parameter tuning can increase the number of non-coding sites that can be scored for conservation by over 76%. 
    more » « less
  2. INTRODUCTION A major challenge in genomics is discerning which bases among billions alter organismal phenotypes and affect health and disease risk. Evidence of past selective pressure on a base, whether highly conserved or fast evolving, is a marker of functional importance. Bases that are unchanged in all mammals may shape phenotypes that are essential for organismal health. Bases that are evolving quickly in some species, or changed only in species that share an adaptive trait, may shape phenotypes that support survival in specific niches. Identifying bases associated with exceptional capacity for cellular recovery, such as in species that hibernate, could inform therapeutic discovery. RATIONALE The power and resolution of evolutionary analyses scale with the number and diversity of species compared. By analyzing genomes for hundreds of placental mammals, we can detect which individual bases in the genome are exceptionally conserved (constrained) and likely to be functionally important in both coding and noncoding regions. By including species that represent all orders of placental mammals and aligning genomes using a method that does not require designating humans as the reference species, we explore unusual traits in other species. RESULTS Zoonomia’s mammalian comparative genomics resources are the most comprehensive and statistically well-powered produced to date, with a protein-coding alignment of 427 mammals and a whole-genome alignment of 240 placental mammals representing all orders. We estimate that at least 10.7% of the human genome is evolutionarily conserved relative to neutrally evolving repeats and identify about 101 million significantly constrained single bases (false discovery rate < 0.05). We cataloged 4552 ultraconserved elements at least 20 bases long that are identical in more than 98% of the 240 placental mammals. Many constrained bases have no known function, illustrating the potential for discovery using evolutionary measures. Eighty percent are outside protein-coding exons, and half have no functional annotations in the Encyclopedia of DNA Elements (ENCODE) resource. Constrained bases tend to vary less within human populations, which is consistent with purifying selection. Species threatened with extinction have few substitutions at constrained sites, possibly because severely deleterious alleles have been purged from their small populations. By pairing Zoonomia’s genomic resources with phenotype annotations, we find genomic elements associated with phenotypes that differ between species, including olfaction, hibernation, brain size, and vocal learning. We associate genomic traits, such as the number of olfactory receptor genes, with physical phenotypes, such as the number of olfactory turbinals. By comparing hibernators and nonhibernators, we implicate genes involved in mitochondrial disorders, protection against heat stress, and longevity in this physiologically intriguing phenotype. Using a machine learning–based approach that predicts tissue-specific cis - regulatory activity in hundreds of species using data from just a few, we associate changes in noncoding sequence with traits for which humans are exceptional: brain size and vocal learning. CONCLUSION Large-scale comparative genomics opens new opportunities to explore how genomes evolved as mammals adapted to a wide range of ecological niches and to discover what is shared across species and what is distinctively human. High-quality data for consistently defined phenotypes are necessary to realize this potential. Through partnerships with researchers in other fields, comparative genomics can address questions in human health and basic biology while guiding efforts to protect the biodiversity that is essential to these discoveries. Comparing genomes from 240 species to explore the evolution of placental mammals. Our new phylogeny (black lines) has alternating gray and white shading, which distinguishes mammalian orders (labeled around the perimeter). Rings around the phylogeny annotate species phenotypes. Seven species with diverse traits are illustrated, with black lines marking their branch in the phylogeny. Sequence conservation across species is described at the top left. IMAGE CREDIT: K. MORRILL 
    more » « less
  3. Abstract Motivation The success of genome sequencing techniques has resulted in rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information to the modeling of structure and function of unknown proteins. There are however no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved. Results We developed DeepMSA, a new open-source method for sensitive MSA construction, which has homologous sequences and alignments created from multi-sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was utilized to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which resulted in an accuracy increase in long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs are performed for homologous structure identification, where the average TM-score of the template alignments has over 7.5% increases with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. It is noted that all these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of the DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library. Availability and implementation Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  4. INTRODUCTION Resolving the role that different environmental forces may have played in the apparent explosive diversification of modern placental mammals is crucial to understanding the evolutionary context of their living and extinct morphological and genomic diversity. RATIONALE Limited access to whole-genome sequence alignments that sample living mammalian biodiversity has hampered phylogenomic inference, which until now has been limited to relatively small, highly constrained sequence matrices often representing <2% of a typical mammalian genome. To eliminate this sampling bias, we used an alignment of 241 whole genomes to comprehensively identify and rigorously analyze noncoding, neutrally evolving sequence variation in coalescent and concatenation-based phylogenetic frameworks. These analyses were followed by validation with multiple classes of phylogenetically informative structural variation. This approach enabled the generation of a robust time tree for placental mammals that evaluated age variation across hundreds of genomic loci that are not restricted by protein coding annotations. RESULTS Coalescent and concatenation phylogenies inferred from multiple treatments of the data were highly congruent, including support for higher-level taxonomic groupings that unite primates+colugos with treeshrews (Euarchonta), bats+cetartiodactyls+perissodactyls+carnivorans+pangolins (Scrotifera), all scrotiferans excluding bats (Fereuungulata), and carnivorans+pangolins with perissodactyls (Zooamata). However, because these approaches infer a single best tree, they mask signatures of phylogenetic conflict that result from incomplete lineage sorting and historical hybridization. Accordingly, we also inferred phylogenies from thousands of noncoding loci distributed across chromosomes with historically contrasting recombination rates. Throughout the radiation of modern orders (such as rodents, primates, bats, and carnivores), we observed notable differences between locus trees inferred from the autosomes and the X chromosome, a pattern typical of speciation with gene flow. We show that in many cases, previously controversial phylogenetic relationships can be reconciled by examining the distribution of conflicting phylogenetic signals along chromosomes with variable historical recombination rates. Lineage divergence time estimates were notably uniform across genomic loci and robust to extensive sensitivity analyses in which the underlying data, fossil constraints, and clock models were varied. The earliest branching events in the placental phylogeny coincide with the breakup of continental landmasses and rising sea levels in the Late Cretaceous. This signature of allopatric speciation is congruent with the low genomic conflict inferred for most superordinal relationships. By contrast, we observed a second pulse of diversification immediately after the Cretaceous-Paleogene (K-Pg) extinction event superimposed on an episode of rapid land emergence. Greater geographic continuity coupled with tumultuous climatic changes and increased ecological landscape at this time provided enhanced opportunities for mammalian diversification, as depicted in the fossil record. These observations dovetail with increased phylogenetic conflict observed within clades that diversified in the Cenozoic. CONCLUSION Our genome-wide analysis of multiple classes of sequence variation provides the most comprehensive assessment of placental mammal phylogeny, resolves controversial relationships, and clarifies the timing of mammalian diversification. We propose that the combination of Cretaceous continental fragmentation and lineage isolation, followed by the direct and indirect effects of the K-Pg extinction at a time of rapid land emergence, synergistically contributed to the accelerated diversification rate of placental mammals during the early Cenozoic. The timing of placental mammal evolution. Superordinal mammalian diversification took place in the Cretaceous during periods of continental fragmentation and sea level rise with little phylogenomic discordance (pie charts: left, autosomes; right, X chromosome), which is consistent with allopatric speciation. By contrast, the Paleogene hosted intraordinal diversification in the aftermath of the K-Pg mass extinction event, when clades exhibited higher phylogenomic discordance consistent with speciation with gene flow and incomplete lineage sorting. 
    more » « less
  5. Abstract

    Summary: While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes–Cantor. If we can estimate frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance as opposed to homology. We show in simulation that alignment-free distances can be highly accurate when genomes evolve under the assumed models and study the accuracy on assembled and unassembled biological data.

    Availability and implementation

    Our software is available open source at

    Supplementary information

    Supplementary data are available at Bioinformatics Advances online.

    more » « less