Abstract We present a new method and software tool that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect to large panels like the 1000 Genomes Project while reducing the reference bias that results when aligning to a single linear reference. The method can infer accurate genotypes using less time and memory than existing graph-based methods. It is implemented in an open-source software tool available at https://github.com/alshai/rowbowt.
Accurate short-read alignment through r-index-based pangenome indexing
Aligning to a linear reference genome can result in a higher percentage of reads going unmapped or being incorrectly mapped owing to variations not captured by the reference, otherwise known as reference bias. Recently, in efforts to mitigate reference bias, there has been a movement to switch to using pangenomes, a collection of genomes, as the reference. In this paper, we introduce Moni-align, the first short-read pangenome aligner built on the r-index, a variation of the classical FM-index that can index collections of genomes in O(r)-space, where r is the number of runs in the Burrows–Wheeler transform. Moni-align uses a seed-and-extend strategy for aligning reads, utilizing maximal exact matches as seeds, which can be efficiently obtained with the r-index. Using both simulated and real short-read data sets, we demonstrate that Moni-align achieves alignment accuracy comparable to vg map and vg giraffe, the leading pangenome aligners. Although currently best suited for aligning to localized pangenomes owing to computational constraints, Moni-align offers a robust foundation for future optimizations that could further broaden its applicability.
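To make the quantity r concrete, the following is a minimal Python sketch that builds a Burrows–Wheeler transform by sorting rotations and counts its runs of equal characters. It is an illustration only and not Moni-align's implementation: the function names `bwt` and `count_runs`, the `$` terminator, and the toy sequences are all assumptions made for this example, and the quadratic rotation sort would not scale to real genomes.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations.

    Illustrative only: sorting all rotations is far too slow for real genomes.
    Assumes '$' does not occur in the text and sorts before every other symbol.
    """
    if "$" in text:
        raise ValueError("text must not contain the terminator '$'")
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters (the r in O(r) space)."""
    if not s:
        return 0
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


if __name__ == "__main__":
    # Classic textbook example.
    print(bwt("banana"), count_runs(bwt("banana")))

    # Toy "pangenome": several identical copies of a made-up sequence.
    genome = "ACGTTGCAAGCT"
    collection = genome * 8
    print(len(collection) + 1, count_runs(bwt(collection)))
```

On the repeated toy collection, the number of runs stays far below the text length, which is the property that lets an index using O(r) space remain small for collections of similar genomes.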
- Award ID(s): 2029552
- PAR ID: 10609222
- Publisher / Repository: CSHL
- Date Published:
- Journal Name: Genome Research
- ISSN: 1088-9051
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract Pangenomes are becoming increasingly popular data structures for genomics analyses due to their ability to compactly represent the genetic diversity within populations. Constructing a pangenome graph, however, is still a time-consuming and expensive process. A promising approach for pangenome construction consists of progressively augmenting a pangenome graph with additional high-quality assemblies. Currently, there is no method for augmenting a pangenome graph with unassembled reads from newly sequenced samples without first aligning the reads to a reference genome and performing variant calling and genotyping on the new individuals. In this work, we present the first assembly-free and mapping-free approach for augmenting an existing pangenome graph using unassembled long reads from an individual not already present in the pangenome. Our approach consists of finding sample-specific sequences in reads using efficient indexes, clustering reads corresponding to the same novel variant(s), and then building a consensus sequence to be added to the pangenome graph for each variant separately. Using simulated reads based on Human Pangenome Reference Consortium (HPRC) assemblies, we demonstrate the effectiveness of the proposed approach for progressively augmenting the pangenome with long reads, without the need for de novo assembly or predicting genetic variants of the new sample. The software is freely available at https://github.com/ldenti/palss.
-
Abstract Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array. By incorporating minimizers, SPUMONI 2’s index is 65 times smaller than minimap2’s for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency in practical scenarios such as adaptive sampling, contamination detection and multi-class metagenomics classification.
-
Summary: Improvements in nanopore sequencing necessitate efficient classification methods, including pre-filtering and adaptive sampling algorithms that enrich for reads of interest. Signal-based approaches circumvent the computational bottleneck of basecalling. But past methods for signal-based classification do not scale efficiently to large, repetitive references like pangenomes, limiting their utility to partial references or individual genomes. We introduce Sigmoni: a rapid, multiclass classification method based on the r-index that scales to references of hundreds of Gbps. Sigmoni quantizes nanopore signal into a discrete alphabet of picoamp ranges. It performs rapid, approximate matching using matching statistics, classifying reads based on distributions of picoamp matching statistics and co-linearity statistics, all in linear query time without the need for seed-chain-extend. Sigmoni is 10–100× faster than previous methods for adaptive sampling in host depletion experiments with improved accuracy, and can query reads against large microbial or human pangenomes. Sigmoni is the first signal-based tool to scale to a complete human genome and pangenome while remaining fast enough for adaptive sampling applications. Availability and implementation: Sigmoni is implemented in Python, and is available open-source at https://github.com/vshiv18/sigmoni
-
Abstract Background: Capturing the genetic diversity of wild relatives is crucial for improving crops because wild species are valuable sources of agronomic traits that are essential to enhance the sustainability and adaptability of domesticated cultivars. Genetic diversity across a genus can be captured in super-pangenomes, which provide a framework for interpreting genomic variations. Results: Here we report the sequencing, assembly, and annotation of nine wild North American grape genomes, which are phased and scaffolded at chromosome scale. We generate a reference-unbiased super-pangenome using pairwise whole-genome alignment methods, revealing the extent of the genomic diversity among wild grape species from sequence to gene level. The pangenome graph captures genomic variation between haplotypes within a species and across the different species, and it accurately assesses the similarity of hybrids to their parents. The species selected to build the pangenome are a great representation of the genus, as illustrated by capturing known allelic variants in the sex-determining region and for Pierce’s disease resistance loci. Using pangenome-wide association analysis, we demonstrate the utility of the super-pangenome by effectively mapping short reads from genus-wide samples and identifying loci associated with salt tolerance in natural populations of grapes. Conclusions: This study highlights how a reference-unbiased super-pangenome can reveal the genetic basis of adaptive traits from wild relatives and accelerate crop breeding research.

