skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: eCOMPASS: evaluative comparison of multiple protein alignments by statistical score
Abstract Motivation Detecting subtle biologically relevant patterns in protein sequences often requires the construction of a large and accurate multiple sequence alignment (MSA). Methods for constructing MSAs are usually evaluated using benchmark alignments, which, however, typically contain very few sequences and are therefore inappropriate when dealing with large numbers of proteins. Results eCOMPASS addresses this problem using a statistical measure of relative alignment quality based on direct coupling analysis (DCA): To maintain protein structural integrity over evolutionary time, substitutions at one residue position typically result in compensating substitutions at other positions. eCOMPASS computes the statistical significance of the congruence between high scoring directly coupled pairs and 3D contacts in corresponding structures, which depends upon properly aligned homologous residues. We illustrate eCOMPASS using both simulated and real MSAs. Availability and Implementation The eCOMPASS executable, C ++ open source code and input data sets are available at https://www.igs.umaryland.edu/labs/neuwald/software/compass. Supplementary information Supplementary data are available at Bioinformatics online.  more » « less
Award ID(s):
1817942
PAR ID:
10238805
Author(s) / Creator(s):
; ;
Editor(s):
Ponty, Yann
Date Published:
Journal Name:
Bioinformatics
ISSN:
1367-4803
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract MotivationMultiple sequence alignments (MSAs) of homologous sequences contain information on structural and functional constraints and their evolutionary histories. Despite their importance for many downstream tasks, such as structure prediction, MSA generation is often treated as a separate pre-processing step, without any guidance from the application it will be used for. ResultsHere, we implement a smooth and differentiable version of the Smith–Waterman pairwise alignment algorithm that enables jointly learning an MSA and a downstream machine learning system in an end-to-end fashion. To demonstrate its utility, we introduce SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction. We find that SMURF learns MSAs that mildly improve contact prediction on a diverse set of protein and RNA families. As a proof of concept, we demonstrate that by connecting our differentiable alignment module to AlphaFold2 and maximizing predicted confidence, we can learn MSAs that improve structure predictions over the initial MSAs. Interestingly, the alignments that improve AlphaFold predictions are self-inconsistent and can be viewed as adversarial. This work highlights the potential of differentiable dynamic programming to improve neural network pipelines that rely on an alignment and the potential dangers of optimizing predictions of protein sequences with methods that are not fully understood. Availability and implementationOur code and examples are available at: https://github.com/spetti/SMURF. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less
  2. Abstract Motivation The success of genome sequencing techniques has resulted in rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information to the modeling of structure and function of unknown proteins. There are however no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved. Results We developed DeepMSA, a new open-source method for sensitive MSA construction, which has homologous sequences and alignments created from multi-sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was utilized to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which resulted in an accuracy increase in long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs are performed for homologous structure identification, where the average TM-score of the template alignments has over 7.5% increases with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. It is noted that all these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of the DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library. Availability and implementation https://zhanglab.ccmb.med.umich.edu/DeepMSA/. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  3. Robinson, Peter (Ed.)
    Abstract Motivation In molecular epidemiology, the identification of clusters of transmissions typically requires the alignment of viral genomic sequence data. However, existing methods of multiple sequence alignment (MSA) scale poorly with respect to the number of sequences. Results ViralMSA is a user-friendly reference-guided MSA tool that leverages the algorithmic techniques of read mappers to enable the MSA of ultra-large viral genome datasets. It scales linearly with the number of sequences, and it is able to align tens of thousands of full viral genomes in seconds. However, alignments produced by ViralMSA omit insertions with respect to the reference genome. Availability and implementation ViralMSA is freely available at https://github.com/niemasd/ViralMSA as an open-source software project. Contact niema@ucsd.edu Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  4. Boeva, Valentina (Ed.)
    Abstract Summary Multiple sequence alignment is an initial step in many bioinformatics pipelines, including phylogeny estimation, protein structure prediction and taxonomic identification of reads produced in amplicon or metagenomic datasets, etc. Yet, alignment estimation is challenging on datasets that exhibit substantial sequence length heterogeneity, and especially when the datasets have fragmentary sequences as a result of including reads or contigs generated by next-generation sequencing technologies. Here, we examine techniques that have been developed to improve alignment estimation when datasets contain substantial numbers of fragmentary sequences. We find that MAGUS, a recently developed MSA method, is fairly robust to fragmentary sequences under many conditions, and that using a two-stage approach where MAGUS is used to align selected ‘backbone sequences’ and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models further improves alignment accuracy. The combination of MAGUS with the ensemble of eHMMs (i.e. MAGUS+eHMMs) clearly improves on UPP, the previous leading method for aligning datasets with high levels of fragmentation. Availability and implementation UPP is available on https://github.com/smirarab/sepp, and MAGUS is available on https://github.com/vlasmirnov/MAGUS. MAGUS+eHMMs can be performed by running MAGUS to obtain the backbone alignment, and then using the backbone alignment as an input to UPP. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  5. Abstract MotivationDespite advances in method development for multiple sequence alignment over the last several decades, the alignment of datasets exhibiting substantial sequence length heterogeneity, especially when the input sequences include very short sequences (either as a result of sequencing technologies or of large deletions during evolution) remains an inadequately solved problem. ResultsWe present HMMerge, a method to compute an alignment of datasets exhibiting high sequence length heterogeneity, or to add short sequences into a given ‘backbone’ alignment. HMMerge builds on the technique from its predecessor alignment methods, UPP and WITCH, which build an ensemble of profile HMMs to represent the backbone alignment and add the remaining sequences into the backbone alignment using the ensemble. HMMerge differs from UPP and WITCH by building a new ‘merged’ HMM from the ensemble, and then using that merged HMM to align the query sequences. We show that HMMerge is competitive with WITCH, with an advantage over WITCH when adding very short sequences into backbone alignments. Availability and implementationHMMerge is freely available at https://github.com/MinhyukPark/HMMerge. Supplementary informationSupplementary data are available at Bioinformatics Advances online. 
    more » « less