Abstract Motivation Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. Results Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these “gold standard” Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM, and vg to align more reads correctly. Availability and implementation Source code implemented in C ++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license. Supplementary information Supplementary data are available at Bioinformatics online.
Ribbon: intuitive visualization for complex genomic variation
Abstract Summary Ribbon is an alignment visualization tool that shows how alignments are positioned within both the reference and read contexts, giving an intuitive view that enables a better understanding of structural variants and the read evidence supporting them. Ribbon was born out of a need to curate complex structural variant calls and determine whether each was well supported by long-read evidence, and it uses the same intuitive visualization method to shed light on contig alignments from genome-to-genome comparisons. Availability and implementation Ribbon is freely available online at http://genomeribbon.com/ and is open-source at https://github.com/marianattestad/ribbon. Supplementary information Supplementary data are available at Bioinformatics online.
- Birol, Inanc
- Award ID(s):
- Publication Date:
- NSF-PAR ID:
- Journal Name:
- Page Range or eLocation-ID:
- 413 to 415
- Sponsoring Org:
- National Science Foundation
More Like this
Robinson, Peter (Ed.)Abstract Motivation In molecular epidemiology, the identification of clusters of transmissions typically requires the alignment of viral genomic sequence data. However, existing methods of multiple sequence alignment (MSA) scale poorly with respect to the number of sequences. Results ViralMSA is a user-friendly reference-guided MSA tool that leverages the algorithmic techniques of read mappers to enable the MSA of ultra-large viral genome datasets. It scales linearly with the number of sequences, and it is able to align tens of thousands of full viral genomes in seconds. However, alignments produced by ViralMSA omit insertions with respect to the reference genome. Availability and implementation ViralMSA is freely available at https://github.com/niemasd/ViralMSA as an open-source software project. Contact firstname.lastname@example.org Supplementary information Supplementary data are available at Bioinformatics online.
DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteinsAbstract Motivation The success of genome sequencing techniques has resulted in rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information to the modeling of structure and function of unknown proteins. There are however no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved. Results We developed DeepMSA, a new open-source method for sensitive MSA construction, which has homologous sequences and alignments created from multi-sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was utilized to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which resulted in an accuracy increase in long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs are performed for homologous structure identification, where the average TM-score of the template alignments has over 7.5% increases with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. It is notedmore »
Abstract Motivation Variation graph representations are projected to either replace or supplement conventional single genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogues of genetic variants for many species now exist, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping. Results In this work, we propose a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most δ differences. This framework leads to a rich set of problems based on the types of variants [e.g. single nucleotide polymorphisms (SNPs), indels or structural variants (SVs)], and whether the goal is to minimize the number of positions at which variants are listed or to minimize the total number of variants listed. We classify the computational complexity of these problems and provide efficient algorithms along with their software implementation when feasible. We empirically evaluate the magnitude of graph reduction achieved in human chromosome variation graphs using multiple α and δ parameter values corresponding to short and long-read resequencing characteristics. When our algorithm is run with parameter settings amenable to long-readmore »
Stamatakis, Alexandros (Ed.)Abstract Motivation Comparative genome analysis of two or more whole-genome sequenced (WGS) samples is at the core of most applications in genomics. These include the discovery of genomic differences segregating in populations, case-control analysis in common diseases and diagnosing rare disorders. With the current progress of accurate long-read sequencing technologies (e.g. circular consensus sequencing from PacBio sequencers), we can dive into studying repeat regions of the genome (e.g. segmental duplications) and hard-to-detect variants (e.g. complex structural variants). Results We propose a novel framework for comparative genome analysis through the discovery of strings that are specific to one genome (‘samples-specific’ strings). We have developed a novel, accurate and efficient computational method for the discovery of sample-specific strings between two groups of WGS samples. The proposed approach will give us the ability to perform comparative genome analysis without the need to map the reads and is not hindered by shortcomings of the reference genome and mapping algorithms. We show that the proposed approach is capable of accurately finding sample-specific strings representing nearly all variation (>98%) reported across pairs or trios of WGS samples using accurate long reads (e.g. PacBio HiFi data). Availability and implementation Data, code and instructions for reproducing the resultsmore »