skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: New alignment-based sequence extraction software (ALiBaSeq) and its utility for deep level phylogenetics
Despite many bioinformatic solutions for analyzing sequencing data, few options exist for targeted sequence retrieval from whole genomic sequencing (WGS) data with the ultimate goal of generating a phylogeny. Available tools especially struggle at deep phylogenetic levels and necessitate amino-acid space searches, which may increase rates of false positive results. Many tools are also difficult to install and may lack adequate user resources. Here, we describe a program that uses freely available similarity search tools to find homologs in assembled WGS data with unparalleled freedom to modify parameters. We evaluate its performance compared to other commonly used bioinformatics tools on two divergent insect species (>200 My) for which annotated genomes exist, and on one large set each of highly conserved and more variable loci. Our software is capable of retrieving orthologs from well-curated or unannotated, low or high depth shotgun, and target capture assemblies as well or better than other software as assessed by recovering the most genes with maximal coverage and with a low rate of false positives throughout all datasets. When assessing this combination of criteria, ALiBaSeq is frequently the best evaluated tool for gathering the most comprehensive and accurate phylogenetic alignments on all types of data tested. The software (implemented in Python), tutorials, and manual are freely available at https://github.com/AlexKnyshov/alibaseq .  more » « less
Award ID(s):
1655769
PAR ID:
10226445
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
PeerJ
Volume:
9
ISSN:
2167-8359
Page Range / eLocation ID:
e11019
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Russell, Schwartz (Ed.)
    Abstract Motivation With growing genome-wide molecular datasets from next-generation sequencing, phylogenetic networks can be estimated using a variety of approaches. These phylogenetic networks include events like hybridization, gene flow or horizontal gene transfer explicitly. However, the most accurate network inference methods are computationally heavy. Methods that scale to larger datasets do not calculate a full likelihood, such that traditional likelihood-based tools for model selection are not applicable to decide how many past hybridization events best fit the data. We propose here a goodness-of-fit test to quantify the fit between data observed from genome-wide multi-locus data, and patterns expected under the multi-species coalescent model on a candidate phylogenetic network. Results We identified weaknesses in the previously proposed TICR test, and proposed corrections. The performance of our new test was validated by simulations on real-world phylogenetic networks. Our test provides one of the first rigorous tools for model selection, to select the adequate network complexity for the data at hand. The test can also work for identifying poorly inferred areas on a network. Availability and implementation Software for the goodness-of-fit test is available as a Julia package at https://github.com/cecileane/QuartetNetworkGoodnessFit.jl. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  2. Abstract Laboratory strains, cell lines, and other genetic materials change hands frequently in the life sciences. Despite evidence that such materials are subject to mix-ups, contamination, and accumulation of secondary mutations, verification of strains and samples is not an established part of many experimental workflows. With the plummeting cost of next generation technologies, it is conceivable that whole genome sequencing (WGS) could be applied to routine strain and sample verification in the future. To demonstrate the need for strain validation by WGS, we sequenced haploid yeast segregants derived from a popular commercial mutant collection and identified several unexpected mutations. We determined that available bioinformatics tools may be ill-suited for verification and highlight the importance of finishing reference genomes for commonly used laboratory strains. 
    more » « less
  3. Abstract Advances in whole-genome sequencing (WGS) promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from WGS data presents a substantial number of challenges and a plethora of SV detection methods have been developed. Currently, evidence that investigators can use to select appropriate SV detection tools is lacking. In this article, we have evaluated the performance of SV detection tools on mouse and human WGS data using a comprehensive polymerase chain reaction-confirmed gold standard set of SVs and the genome-in-a-bottle variant set, respectively. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of the SV detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance as the SV detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low- and ultralow-pass sequencing data as well as for different deletion length categories. 
    more » « less
  4. Björkroth, Johanna (Ed.)
    ABSTRACT Giardia duodenalis (syn. Giardia lamblia , Giardia intestinalis ) is the causative agent of giardiasis, one of the most common diarrheal infections in humans. Evolutionary relationships among G. duodenalis genotypes (or subtypes) of assemblage B, one of two genetic assemblages causing the majority of human infections, remain unclear due to poor phylogenetic resolution of current typing methods. In this study, we devised a methodology to identify new markers for a streamlined multilocus sequence typing (MLST) scheme based on comparisons of all core genes against the phylogeny of whole-genome sequences (WGS). Our analysis identified three markers with resolution comparable to that of WGS data. Using newly designed PCR primers for our novel MLST loci, we typed an additional 68 strains of assemblage B. Analyses of these strains and previously determined genome sequences showed that genomes of this assemblage can be assigned to 16 clonal complexes, each with unique gene content that is apparently tuned to differential virulence and ecology. Obtaining new genomes of Giardia spp. and other eukaryotic microbial pathogens remains challenging due to difficulties in culturing the parasites in the laboratory. Hence, the methods described here are expected to be widely applicable to other pathogens of interest and advance our understanding of their ecology and evolution. IMPORTANCE Giardia duodenalis assemblage B is a major waterborne pathogen and the most commonly identified genotype causing human giardiasis worldwide. The lack of morphological characters for classification requires the use of molecular techniques for strain differentiation; however, the absence of scalable and affordable next-generation sequencing (NGS)-based typing methods has prevented meaningful advancements in high-resolution molecular typing for further understanding of the evolution and epidemiology of assemblage B. Prior studies have reported high sequence diversity but low phylogenetic resolution at standard loci in assemblage B, highlighting the necessity of identifying new markers for accurate and robust molecular typing. Data from comparative analyses of available genomes in this study identified three loci that together form a novel high-resolution typing scheme with high concordance to whole-genome-based phylogenomics and which should aid in future public health endeavors related to this parasite. In addition, data from newly characterized strains suggest evidence of biogeographic and ecologic endemism. 
    more » « less
  5. Viruses play crucial roles in the ecology of microbial communities, yet they remain relatively understudied in their native environments. Despite many advancements in high-throughput whole-genome sequencing (WGS), sequence assembly, and annotation of viruses, the reconstruction of full-length viral genomes directly from metagenomic sequencing is possible only for the most abundant phages and requires long-read sequencing technologies. Additionally, the prediction of their cellular hosts remains difficult from conventional metagenomic sequencing alone. To address these gaps in the field and to accelerate the study of viruses directly in their native microbiomes, we developed an end-to-end bioinformatics platform for viral genome reconstruction and host attribution from metagenomic data using proximity-ligation sequencing (i.e., Hi-C). We demonstrate the capabilities of the platform by recovering and characterizing the metavirome of a variety of metagenomes, including a fecal microbiome that has also been sequenced with accurate long reads, allowing for the assessment and benchmarking of the new methods. The platform can accurately extract numerous near-complete viral genomes even from highly fragmented short-read assemblies and can reliably predict their cellular hosts with minimal false positives. To our knowledge, this is the first software for performing these tasks. Being significantly cheaper than long-read sequencing of comparable depth, the incorporation of proximity-ligation sequencing in microbiome research shows promise to greatly accelerate future advancements in the field. 
    more » « less