

Title: Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests
For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can:
- build an augmented FM-index over the genomes in the tree, concatenated in left-to-right order;
- for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes;
- find the minimum and maximum values stored in that interval;
- take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.
This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation:
- a KATKA kernel, which discards characters that are not in the first or last occurrence of any k_max-tuple, for a parameter k_max;
- a minimizer digest;
- a KATKA kernel of a minimizer digest.
With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
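The kernel construction described in the abstract can be illustrated with a minimal sketch. This is a simplified, hedged version: it keeps exactly the characters covered by the first or last occurrence of some k_max-mer, and omits the separator characters a real KATKA kernel inserts between non-adjacent kept runs.

```python
def katka_kernel(seq, k_max):
    """Simplified KATKA-kernel sketch: keep only the characters covered by
    the first or last occurrence of some k_max-mer.  A real kernel also
    inserts separators between non-adjacent kept runs, omitted here."""
    first, last = {}, {}
    for i in range(len(seq) - k_max + 1):
        kmer = seq[i:i + k_max]
        first.setdefault(kmer, i)   # start of the first occurrence
        last[kmer] = i              # start of the last occurrence
    keep = [False] * len(seq)
    for i in set(first.values()) | set(last.values()):
        for j in range(i, i + k_max):
            keep[j] = True
    return "".join(c for c, kept in zip(seq, keep) if kept)
```

For a run like "AAAAAA" with k_max = 2, only the first and last occurrences of "AA" survive, so the middle characters are discarded; a sequence with all-distinct k_max-mers is kept whole.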
Award ID(s):
2029552
PAR ID:
10604926
Author(s) / Creator(s):
Editor(s):
Liberti, Leo
Publisher / Repository:
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Date Published:
Volume:
301
ISSN:
1868-8969
ISBN:
978-3-95977-325-6
Page Range / eLocation ID:
10:1-10:13
Subject(s) / Keyword(s):
Taxonomic classification metagenomics KATKA maximal exact matches string kernels minimizer digests Theory of computation → Pattern matching
Format(s):
Medium: X Size: 13 pages; 770375 bytes Other: application/pdf
Right(s):
Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract
Background: In modern sequencing experiments, quickly and accurately identifying the sources of the reads is a crucial need. In metagenomics, where each read comes from one of potentially many members of a community, it can be important to identify the exact species the read is from. In other settings, it is important to distinguish which reads are from the targeted sample and which are from potential contaminants. In both cases, identification of the correct source of a read enables further investigation of relevant reads, while minimizing wasted work. This task is particularly challenging for long reads, which can have a substantial error rate that obscures the origins of each read.
Results: Existing tools for the read classification problem are often alignment- or index-based, but such methods can have large time and/or space overheads. In this work, we investigate the effectiveness of several sampling and sketching-based approaches for read classification. In these approaches, a chosen sampling or sketching algorithm is used to generate a reduced representation (a "screen") of potential source genomes for a query readset before reads are streamed in and compared against this screen. Using a query read's similarity to the elements of the screen, the methods predict the source of the read. Such an approach requires limited pre-processing, stores and works with only a subset of the input data, and is able to perform classification with a high degree of accuracy.
Conclusions: The sampling and sketching approaches investigated include uniform sampling, methods based on MinHash and its weighted and order variants, a minimizer-based technique, and a novel clustering-based sketching approach. We demonstrate the effectiveness of these techniques both in identifying the source microbial genomes for reads from a metagenomic long read sequencing experiment, and in distinguishing between long reads from organisms of interest and potential contaminant reads. We then compare these approaches to existing alignment-, index- and sketching-based tools for read classification, and demonstrate how such a method is a viable alternative for determining the source of query reads. Finally, we present a reference implementation of these approaches at https://github.com/arun96/sketching.
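The screen-and-classify idea above can be sketched minimally. This is an illustrative assumption-laden toy, not the paper's implementation: bottom-s MinHash with blake2b as the hash, and k, s, and the genome names chosen arbitrarily for the example.

```python
import hashlib

def minhash_sketch(seq, k=21, s=32):
    """Bottom-s MinHash sketch: the s smallest 64-bit hash values over
    all k-mers of seq (illustrative parameters, not the paper's)."""
    hashes = {int.from_bytes(
                  hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(),
                  "big")
              for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]

def sketch_similarity(a, b):
    """Fraction of sketch a's values also present in sketch b."""
    return len(set(a) & set(b)) / len(a)

def classify(read, screen, k=21, s=32):
    """Predict the screen entry whose sketch is most similar to the read's."""
    rs = minhash_sketch(read, k, s)
    return max(screen, key=lambda name: sketch_similarity(rs, screen[name]))
```

Here the "screen" is just a dict from genome name to sketch; a read is assigned to the genome whose sketch it overlaps most.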
  2. Allele-specific expression has been used to elucidate various biological mechanisms, such as genomic imprinting and gene expression variation caused by genetic changes in cis-regulatory elements. However, existing methods for obtaining allele-specific expression from RNA-seq reads do not adequately and efficiently remove various biases, such as reference bias, where reads containing the alternative allele do not map to the reference transcriptome, or ambiguous mapping bias, where reads containing the reference allele map differently from reads containing the alternative allele. We present Ornaments, a computational tool for rapid and accurate estimation of allele-specific expression at unphased heterozygous loci from RNA-seq reads while correcting for allele-specific read mapping bias. Ornaments removes reference bias by mapping reads to a personalized transcriptome, and ambiguous mapping bias by probabilistically assigning reads to the multiple transcripts and variant loci they map to. Ornaments is a lightweight extension of kallisto, a popular tool for fast RNA-seq quantification, that improves the efficiency and accuracy of WASP, a popular tool for bias correction in allele-specific read mapping. Our experiments on simulated and human lymphoblastoid cell-line RNA-seq reads with the genomes of the 1000 Genomes Project show that Ornaments is more accurate than WASP and kallisto and nearly as efficient as kallisto per sample, and, despite the additional cost of constructing a personalized index for multiple samples, an order of magnitude faster than WASP. In addition, Ornaments detected imprinted transcripts with higher sensitivity, compared to WASP, which detected the imprinted signals only at the gene level.
  3.
    Abstract A fundamental question appears in many bioinformatics applications: Does a sequencing read belong to a large dataset of genomes from some broad taxonomic group, even when the closest match in the set is evolutionarily divergent from the query? For example, low-coverage genome sequencing (skimming) projects either assemble the organelle genome or compute genomic distances directly from unassembled reads. Using unassembled reads needs contamination detection because samples often include reads from unintended groups of species. Similarly, assembling the organelle genome needs distinguishing organelle and nuclear reads. While k-mer-based methods have shown promise in read-matching, prior studies have shown that existing methods are insufficiently sensitive for contamination detection. Here, we introduce a new read-matching tool called CONSULT that tests whether k-mers from a query fall within a user-specified distance of the reference dataset using locality-sensitive hashing. Taking advantage of large memory machines available nowadays, CONSULT libraries accommodate tens of thousands of microbial species. Our results show that CONSULT has higher true-positive and lower false-positive rates of contamination detection than leading methods such as Kraken-II and improves distance calculation from genome skims. We also demonstrate that CONSULT can distinguish organelle reads from nuclear reads, leading to dramatic improvements in skim-based mitochondrial assemblies. 
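CONSULT's actual index is far more engineered; the locality-sensitive-hashing idea it builds on — bucketing k-mers by randomly sampled positions so that nearby k-mers likely collide — can be sketched as follows. The table count, sample size, and seed here are arbitrary illustrative choices, not CONSULT's parameters.

```python
import random

def make_lsh_tables(ref_kmers, k=8, n_tables=4, sample=5, seed=0):
    """Each table hashes k-mers by a random subset of positions; k-mers
    within small Hamming distance of a reference k-mer are likely to
    share a bucket in at least one table."""
    rng = random.Random(seed)
    tables = []
    for _ in range(n_tables):
        pos = sorted(rng.sample(range(k), sample))
        buckets = {}
        for km in ref_kmers:
            key = "".join(km[p] for p in pos)
            buckets.setdefault(key, []).append(km)
        tables.append((pos, buckets))
    return tables

def within_distance(query, tables, d=1):
    """True if some bucketed reference k-mer is within Hamming distance d
    of the query (candidates come only from colliding buckets)."""
    for pos, buckets in tables:
        key = "".join(query[p] for p in pos)
        for km in buckets.get(key, []):
            if sum(a != b for a, b in zip(query, km)) <= d:
                return True
    return False
```

A query identical to a reference k-mer always collides; a query far from every reference k-mer falls in empty buckets and is rejected without any full-distance scan.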
  4. Carbone, Alessandra; El-Kebir, Mohammed (Ed.)
    Compressed full-text indexes are one of the main success stories of bioinformatics data structures, but even they struggle to handle some DNA readsets. This may seem surprising since, at least when dealing with short reads from the same individual, the readset will be highly repetitive and, thus, highly compressible. If we are not careful, however, this advantage can be more than offset by two disadvantages: first, since most base pairs are included in at least tens of reads each, the uncompressed readset is likely to be at least an order of magnitude larger than the individual's uncompressed genome; second, these indexes usually pay some space overhead for each string they store, and the total overhead can be substantial when dealing with millions of reads. The most successful compressed full-text indexes for readsets so far are based on the Extended Burrows-Wheeler Transform (EBWT) and use a sorting heuristic to try to reduce the space overhead per read, but they still treat the reads as separate strings and thus may not take full advantage of the readset's structure. For example, if we have already assembled an individual's genome from the readset, then we can usually use it to compress the readset well: e.g., we store the gap-coded list of reads' starting positions; we store the list of their lengths, which is often highly compressible; and we store information about the sequencing errors, which are rare with short reads. There is nowhere, however, where we can plug an assembled genome into the EBWT. In this paper we show how to use one or more assembled or partially assembled genomes as the basis for a compressed full-text index of its readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that.
Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset, our preliminary experiments show that eliminating separator characters from the EBWT reduces the number of runs by 19%, from 220 million to 178 million, and using the XBWT reduces it by a further 15%, to 150 million.
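The gap-coding of read start positions mentioned in the abstract is a standard technique; a minimal sketch (the function names are ours, not the paper's):

```python
def gap_encode(starts):
    """Encode sorted alignment start positions as the first value plus
    successive differences; the gaps are small and compress well when
    coverage is dense."""
    return [starts[0]] + [b - a for a, b in zip(starts, starts[1:])]

def gap_decode(gaps):
    """Invert gap_encode by taking prefix sums."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out
```

With high coverage, most gaps are 0 or 1, so the encoded list is far more compressible than the raw positions.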
  5. METHODS: Soil samples (6 total) were collected at the Stordalen Mire site in 2019 from two depths (1-5 & 20-24 cm below ground) across three habitats (Palsa, Bog, and Fen). DNA was extracted based on the protocol described by Li et al. (2024). For short reads, libraries were prepared at the Joint Genome Institute (JGI) with the KAPA Hyperprep kit and sequenced with Illumina NovaSeq 6000. For long reads, libraries were prepared with the SMRTbell Express Template Prep Kit 2.0 (PacBio), then sequenced using PacBio Sequel IIe at JGI. PacBio data was processed at JGI to form filtered CCS (Circular Consensus Sequencing) reads.
Assemblies were generated with short-only, long-only, and hybrid read sources:
- Short-only was assembled with metaSPAdes (v3.15.4) using Aviary (v0.5.3) with default parameters.
- Long-only was assembled with metaFlye (v2.9-b1768) using Aviary (v0.5.3) with default parameters.
- Hybrid assembly was performed using Aviary (v0.5.3) with default parameters. This involved a step-down procedure with long-read assembly through metaFlye (v2.9-b1768), followed by short-read polishing by Racon (v1.4.3), Pilon (v1.24) and then Racon again. Next, reads that didn't map to high-quality metaFlye contigs were hybrid assembled with SPAdes (--meta option) and binned out with MetaBAT2 (v2.1.5). For each bin, the reads within the bin were hybrid assembled using Unicycler (v0.4.8). The high-coverage metaFlye contigs and Unicycler contigs were then combined to form the assembly fasta file.
Genome recovery was performed using Aviary (v0.5.3) with samples chosen for differential abundance binning by Bin Chicken (v0.4.2) using SingleM metapackage S3.0.5. This involved initial read mapping through CoverM (v0.6.1) using minimap2 (v2.18) and binning by MetaBAT, MetaBAT2 (v2.1.5), VAMB (v3.0.2), SemiBin (v1.3.1), Rosella (v0.4.2), CONCOCT (v1.1.0) and MaxBin2 (v2.2.7). Genomes were analyzed using CheckM2 (v1.0.2) and clustered at 95% ANI using Galah (v0.4.0).
FILES: EMERGE_MAGs_2019_long-short-hybrid.tar.gz - Archive containing the MAG files (.fna). metadata_MAGs_2019_EMERGE.tsv - Table containing source sample names and accessions, GTDB classifications, CheckM2 quality information, NCBI GenomeBatch- and MIMAG(6.0)-formatted attributes, and other metadata for the MAGs.
FUNDING: This research is a contribution of the EMERGE Biology Integration Institute (https://emerge-bii.github.io/), funded by the National Science Foundation, Biology Integration Institutes Program, Award # 2022070. This study was also funded by the Genomic Science Program of the United States Department of Energy Office of Biological and Environmental Research, grant #s DE-SC0004632, DE-SC0010580, and DE-SC0016440. We thank the Swedish Polar Research Secretariat and SITES for the support of the work done at the Abisko Scientific Research Station. SITES is supported by the Swedish Research Council's grant 4.3-2021-00164. Data from the Joint Genome Institute (JGI) was collected under BER Support Science Proposal 503530 (DOI: 10.46936/10.25585/60001148), conducted by the U.S. Department of Energy Joint Genome Institute (https://ror.org/04xm1d337), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.