Abstract. High-throughput sequencing-based methods for bulked segregant analysis (BSA) allow for the rapid identification of genetic markers associated with traits of interest. BSA studies have successfully identified qualitative (binary) and quantitative trait loci (QTLs) using QTL mapping. However, most require population structures that fit the models available and a reference genome. Instead, high-throughput short-read sequencing can be combined with BSA of k-mers (BSA-k-mer) to map traits that appear refractory to standard approaches. This method can be applied to any organism and is particularly useful for species with genomes diverged from the closest sequenced genome. It is also instrumental when dealing with highly heterozygous and potentially polyploid genomes without phased haplotype assemblies and for which a single haplotype can control a trait. Finally, it is flexible in terms of population structure. Here, we apply the BSA-k-mer method for the rapid identification of candidate regions related to seed spot and seed size in diploid potato. Using a mixture of F1 and F2 individuals from a cross between 2 highly heterozygous parents, candidate sequences were identified for each trait using the BSA-k-mer approach. Using parental reads, we were able to determine the parental origin of the loci. Finally, we mapped the identified k-mers to a closely related potato genome to validate the method and determine the genomic loci underlying these sequences. The location identified for the seed spot matches previously identified loci associated with pigmentation in potato. The loci associated with seed size are novel. Both loci are relevant in future breeding toward true seeds in potato.
                            RecruitPlotEasy: An Advanced Read Recruitment Plot Tool for Assessing Metagenomic Population Abundance and Genetic Diversity
                        
                    
    
Mapping of short metagenomic (or metatranscriptomic) read data to reference isolate or single-cell genomes or metagenome-assembled genomes (MAGs) to assess microbial population relative abundance and/or structure represents an essential task of many studies across environmental and clinical settings. The filtering for the quality of the read match and assessment of read mapping results are frequently performed without visual aids or with the assistance of visualizations produced through ad-hoc, in-house approaches. Here, we introduce RecruitPlotEasy, a fully automated, user-friendly pipeline for these purposes that integrates statistical approaches to quantify intra-population sequence and gene-content diversity and identify co-occurring relative populations in the sample. Hence, RecruitPlotEasy should also greatly facilitate population genetics studies. RecruitPlotEasy is implemented in Python and R and is freely available as open-source software under the Artistic License 2.0 from https://github.com/KGerhardt/RecruitPlotEasy.
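The data behind a read recruitment plot can be illustrated with a minimal Python sketch. This is not RecruitPlotEasy's implementation: the `Hit` record, the function names, the bin sizes, and the 95% identity cutoff are all assumptions chosen for illustration. The sketch counts aligned bases in a grid of reference position by percent identity, then estimates mean sequencing depth from the hits at or above a chosen identity cutoff.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """One read-to-reference alignment (hypothetical record layout)."""
    start: int       # 0-based start on the reference
    end: int         # end on the reference, exclusive
    identity: float  # percent identity of the alignment, 0-100


def recruitment_matrix(hits: Iterable[Hit], genome_len: int,
                       pos_bin: int = 1000, id_bin: float = 1.0,
                       min_identity: float = 70.0) -> Counter:
    """Count aligned bases per (position bin, identity bin) cell."""
    counts: Counter = Counter()
    for hit in hits:
        if hit.identity < min_identity:
            continue  # below the plotted identity range
        id_idx = int((min(hit.identity, 100.0) - min_identity) // id_bin)
        pos = max(hit.start, 0)
        end = min(hit.end, genome_len)
        # Split the alignment across every position bin it overlaps.
        while pos < end:
            bin_idx = pos // pos_bin
            bin_end = min((bin_idx + 1) * pos_bin, end)
            counts[(bin_idx, id_idx)] += bin_end - pos
            pos = bin_end
    return counts


def mean_depth(hits: Iterable[Hit], genome_len: int,
               cutoff: float = 95.0) -> float:
    """Mean depth from hits at or above an (assumed) identity cutoff."""
    bases = sum(h.end - h.start for h in hits if h.identity >= cutoff)
    return bases / genome_len


example = [Hit(0, 150, 99.0), Hit(950, 1100, 96.5), Hit(400, 550, 82.0)]
print(recruitment_matrix(example, genome_len=2000))
print(mean_depth(example, genome_len=2000))
```

In a real analysis the hits would come from a read mapper's output, and the identity cutoff would be chosen by inspecting where the plot shows a gap between the target population and more distant relatives.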
- PAR ID: 10354248
- Date Published:
- Journal Name: Frontiers in Bioinformatics
- Volume: 1
- ISSN: 2673-7647
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Abstract. Background: In modern sequencing experiments, quickly and accurately identifying the sources of the reads is a crucial need. In metagenomics, where each read comes from one of potentially many members of a community, it can be important to identify the exact species the read is from. In other settings, it is important to distinguish which reads are from the targeted sample and which are from potential contaminants. In both cases, identification of the correct source of a read enables further investigation of relevant reads, while minimizing wasted work. This task is particularly challenging for long reads, which can have a substantial error rate that obscures the origins of each read. Results: Existing tools for the read classification problem are often alignment or index-based, but such methods can have large time and/or space overheads. In this work, we investigate the effectiveness of several sampling and sketching-based approaches for read classification. In these approaches, a chosen sampling or sketching algorithm is used to generate a reduced representation (a “screen”) of potential source genomes for a query readset before reads are streamed in and compared against this screen. Using a query read’s similarity to the elements of the screen, the methods predict the source of the read. Such an approach requires limited pre-processing, stores and works with only a subset of the input data, and is able to perform classification with a high degree of accuracy. Conclusions: The sampling and sketching approaches investigated include uniform sampling, methods based on MinHash and its weighted and order variants, a minimizer-based technique, and a novel clustering-based sketching approach. We demonstrate the effectiveness of these techniques both in identifying the source microbial genomes for reads from a metagenomic long read sequencing experiment, and in distinguishing between long reads from organisms of interest and potential contaminant reads. We then compare these approaches to existing alignment, index and sketching-based tools for read classification, and demonstrate how such a method is a viable alternative for determining the source of query reads. Finally, we present a reference implementation of these approaches at https://github.com/arun96/sketching.
- Allele-specific expression has been used to elucidate various biological mechanisms, such as genomic imprinting and gene expression variation caused by genetic changes in cis-regulatory elements. However, existing methods for obtaining allele-specific expression from RNA-seq reads do not adequately and efficiently remove various biases, such as reference bias, where reads containing the alternative allele do not map to the reference transcriptome, or ambiguous mapping bias, where reads containing the reference allele map differently from reads containing the alternative allele. We present Ornaments, a computational tool for rapid and accurate estimation of allele-specific expression at unphased heterozygous loci from RNA-seq reads while correcting for allele-specific read mapping bias. Ornaments removes reference bias by mapping reads to a personalized transcriptome, and ambiguous mapping bias by probabilistically assigning reads to multiple transcripts and variant loci they map to. Ornaments is a lightweight extension of kallisto, a popular tool for fast RNA-seq quantification, that improves the efficiency and accuracy of WASP, a popular tool for bias correction in allele-specific read mapping. Our experiments on simulated and human lymphoblastoid cell-line RNA-seq reads with the genomes of the 1000 Genomes Project show that Ornaments is more accurate than WASP and kallisto and nearly as efficient as kallisto per sample, and despite the additional cost of constructing a personalized index for multiple samples, an order of magnitude faster than WASP. In addition, Ornaments detected imprinted transcripts with higher sensitivity, compared to WASP which detected the imprinted signals only at the gene level.
- Abstract. Motivation: Oxford Nanopore Technologies sequencing devices support adaptive sequencing, in which undesired reads can be ejected from a pore in real time. This feature allows targeted sequencing aided by computational methods for mapping partial reads, rather than complex library preparation protocols. However, existing mapping methods either require a computationally expensive base-calling procedure before using aligners to map partial reads or work well only on small genomes. Results: In this work, we present a new streaming method that can map nanopore raw signals for real-time selective sequencing. Rather than converting read signals to bases, we propose to convert reference genomes to signals and fully operate in the signal space. Our method features a new way to index reference genomes using k-d trees, a novel seed selection strategy and a seed chaining algorithm tailored toward the current signal characteristics. We implemented the method as a tool Sigmap. Then we evaluated it on both simulated and real data and compared it to the state-of-the-art nanopore raw signal mapper Uncalled. Our results show that Sigmap yields comparable performance on mapping yeast simulated raw signals, and better mapping accuracy on mapping yeast real raw signals with a 4.4× speedup. Moreover, our method performed well on mapping raw signals to genomes of size >100 Mbp and correctly mapped 11.49% more real raw signals of green algae, which leads to a significantly higher F1-score (0.9354 versus 0.8660). Availability and implementation: Sigmap code is accessible at https://github.com/haowenz/sigmap. Supplementary information: Supplementary data are available at Bioinformatics online.
- Gralnick, Jeffrey A. (Ed.) ABSTRACT Reconstructing microbial genomes from metagenomic short-read data can be challenging due to the unknown and uneven complexity of microbial communities. This complexity encompasses highly diverse populations, which often include strain variants. Reconstructing high-quality genomes is a crucial part of the metagenomic workflow, as subsequent ecological and metabolic inferences depend on their accuracy, quality, and completeness. In contrast to microbial communities in other ecosystems, there has been no systematic assessment of genome-centric metagenomic workflows for drinking water microbiomes. In this study, we assessed the performance of a combination of assembly and binning strategies for time series drinking water metagenomes that were collected over 6 months. The goal of this study was to identify the combination of assembly and binning approaches that result in high-quality and -quantity metagenome-assembled genomes (MAGs), representing most of the sequenced metagenome. Our findings suggest that the metaSPAdes coassembly strategies had the best performance, as they resulted in larger and less fragmented assemblies, with at least 85% of the sequence data mapping to contigs greater than 1 kbp. Furthermore, a combination of metaSPAdes coassembly strategies and MetaBAT2 produced the highest number of medium-quality MAGs while capturing at least 70% of the metagenomes based on read recruitment. Utilizing different assembly/binning approaches also assists in the reconstruction of unique MAGs from closely related species that would have otherwise collapsed into a single MAG using a single workflow. Overall, our study suggests that leveraging multiple binning approaches with different metaSPAdes coassembly strategies may be required to maximize the recovery of good-quality MAGs.
IMPORTANCE Drinking water contains phylogenetically diverse groups of bacteria, archaea, and eukarya that affect the esthetic quality of water, water infrastructure, and public health. Taxonomic, metabolic, and ecological inferences of the drinking water microbiome depend on the accuracy, quality, and completeness of genomes that are reconstructed through the application of genome-resolved metagenomics. Using time series metagenomic data, we present reproducible genome-centric metagenomic workflows that result in high-quality and -quantity genomes, which more accurately represent the sequenced drinking water microbiome. These genome-centric metagenomic workflows will allow for improved taxonomic and functional potential analysis that offers enhanced insights into the stability and dynamics of drinking water microbial communities.
 An official website of the United States government