This content will become publicly available on February 28, 2026

Title: Run-length compressed metagenomic read classification with SMEM-finding and tagging
Abstract Metagenomic read classification is a fundamental task in computational biology, yet it remains challenging due to the scale, diversity, and complexity of sequencing datasets. We propose a novel, run-length compressed index based on the move structure that enables efficient multi-class metagenomic classification in O(r) space, where r is the number of character runs in the BWT of the reference text. Our method identifies all super-maximal exact matches (SMEMs) of length at least L between a read and the reference dataset and associates each SMEM with one class identifier using a sampled tag array. A consensus algorithm then compacts these SMEMs with their class identifier into a single classification per read. We are the first to perform run-length compressed read classification based on full SMEMs instead of semi-SMEMs. We evaluate our approach on both long and short reads in two conceptually distinct datasets: a large bacterial pan-genome with few metagenomic classes and a smaller 16S rRNA gene database spanning thousands of genera or classes. Our method consistently outperforms SPUMONI 2 in accuracy and runtime while maintaining the same asymptotic memory complexity of O(r). Compared to Cliffy, we demonstrate better memory efficiency while achieving superior accuracy on the simpler dataset and comparable performance on the more complex one. Overall, our implementation carefully balances accuracy, runtime, and memory usage, offering a versatile solution for metagenomic classification across diverse datasets. The open-source C++11 implementation is available at https://github.com/biointec/tagger under the AGPL-3.0 license.
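The pipeline described in the abstract (find SMEMs of length at least L, tag each with one class, then reduce to a single call per read) can be illustrated with a small Python sketch. This is not the authors' implementation: it uses brute-force substring search instead of a run-length compressed move-structure index, it tags each SMEM with the class of the first reference that contains it rather than consulting a sampled tag array, and the length-weighted vote is only an assumed stand-in for the paper's consensus algorithm. The names `find_smems`, `classify`, `references`, and `min_len` are hypothetical.

```python
from collections import Counter


def find_smems(read, references, min_len):
    """Return (start, end, class_id) for every SMEM of length >= min_len.

    references is a list of (class_id, sequence) pairs. A match read[start:end]
    is super-maximal if it occurs in some reference and cannot be extended to
    the left or right within the read and still occur.
    Brute force only; a real index would avoid rescanning the references.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")

    def occurs(sub):
        # Assumption: tag with the first class whose sequence contains sub.
        for class_id, seq in references:
            if sub in seq:
                return class_id
        return None

    smems = []
    prev_end = 0  # furthest match end reached from any earlier start
    end = 0
    for start in range(len(read)):
        # The longest-match end never decreases as start moves right.
        end = max(end, start)
        while end < len(read) and occurs(read[start:end + 1]) is not None:
            end += 1
        # A right-maximal match is also left-maximal only if it reaches
        # further right than every match that started earlier.
        if end > prev_end and end - start >= min_len:
            smems.append((start, end, occurs(read[start:end])))
        prev_end = max(prev_end, end)
    return smems


def classify(read, references, min_len):
    """Pick the class with the largest total SMEM length (assumed consensus)."""
    votes = Counter()
    for start, end, class_id in find_smems(read, references, min_len):
        votes[class_id] += end - start
    if not votes:
        return None  # no SMEM of sufficient length: read stays unclassified
    return votes.most_common(1)[0][0]


references = [("A", "ACGTACGTTT"), ("B", "GGGCCCAAAT")]
print(find_smems("ACGTTTGGGCC", references, 4))
print(classify("ACGTTTGGGCC", references, 4))
```

In this toy input the read yields one SMEM per class, and the longer one decides the call. Raising `min_len` discards short, less specific matches, which is the role the threshold L plays in the abstract.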
Award ID(s):
2029552
PAR ID:
10609225
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
bioRxiv
Date Published:
Format(s):
Medium: X
Institution:
RECOMB SEQ 2025
Sponsoring Org:
National Science Foundation
More Like this
  1. Zhu, Shanfeng (Ed.)
    We present TIPP3 and TIPP3-fast, new tools for abundance profiling in metagenomic datasets. Like its predecessor, TIPP2, the TIPP3 pipeline uses a maximum likelihood approach to place reads into labeled taxonomies using marker genes, but it achieves superior accuracy to TIPP2 by enabling the use of much larger taxonomies through improved algorithmic techniques. We show that TIPP3 is generally more accurate than leading methods for abundance profiling in two important contexts: when reads come from genomes not already in a public database (i.e., novel genomes) and when reads contain sequencing errors. We also show that TIPP3-fast has slightly lower accuracy than TIPP3, but is also generally more accurate than other leading methods and uses a small fraction of TIPP3’s runtime. Additionally, we highlight the potential benefits of restricting abundance profiling methods to those reads that map to marker genes (i.e., using a filtered marker-gene based analysis), which we show typically improves accuracy. TIPP3 is freely available at https://github.com/c5shen/TIPP3.
  2. Abstract Horizontal gene transfer (HGT) occurring within microbiomes is linked to complex environmental and ecological dynamics that are challenging to replicate in controlled settings. Consequently, most extant studies of microbiome HGT are either simplistic experimental settings with tenuous relevance to real microbiomes or correlative studies that assume that HGT potential is a function of the relative abundance of mobile genetic elements (MGEs), the vehicles of HGT. Here we introduce Kairos as a bioinformatic tool deployed in Nextflow for detecting HGT events “in situ,” i.e., within a microbiome, through analysis of time-series metagenomic sequencing data. The in-situ framework proposed here leverages available metagenomic data from a longitudinally sampled microbiome to assess whether the chronological occurrence of potential donors, recipients, and putatively transferred regions could plausibly have arisen due to HGT over a range of defined time periods. The centerpiece of the Kairos workflow is a novel competitive read alignment method that enables discernment of even very similar genomic sequences, such as those produced by MGE-associated recombination. A key advantage of Kairos is its reliance on assemblies rather than metagenome assembled genomes (MAGs), which avoids systematic exclusion of accessory genes associated with the binning process. In an example test case of real-world data, use of assemblies directly produced a 264-fold increase in the number of antibiotic resistance genes included in the analysis of HGT compared to analysis of MAGs with MetaCHIP. Further, in silico evaluation of contig taxonomy was performed to assess the accuracy of classification for both chromosomally- and MGE-derived sequences, indicating a high degree of accuracy even for conjugative plasmids up to the level of class or order. Thus, Kairos enables the analysis of very recent HGT events, making it suitable for studying rapid prokaryotic adaptation in environmental systems without disturbing the ornate ecological dynamics associated with microbiomes. Current versions of the Kairos workflow are available here: https://github.com/clb21565/kairos.
  3. Abstract We present a new method and software tool that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect to large panels like the 1000 Genomes Project while reducing the reference bias that results when aligning to a single linear reference. The method can infer accurate genotypes in less time and memory compared to existing graph-based methods. It is implemented in the open-source software tool available at https://github.com/alshai/rowbowt.
  4. Clustering is a fundamental task in machine learning. One of the most successful and broadly used algorithms is DBSCAN, a density-based clustering algorithm. DBSCAN requires ϵ-nearest neighbor graphs of the input dataset, which are computed with range-search algorithms and spatial data structures like KD-trees. Despite many efforts to design scalable implementations for DBSCAN, existing work is limited to low-dimensional datasets, as constructing ϵ-nearest neighbor graphs can be expensive in high dimensions. This article introduces a modified DBSCAN, using k-nearest neighbor (kNN) graphs to improve efficiency. We outline conditions for kNN-DBSCAN to match DBSCAN’s results and present a parallel implementation using OpenMP and MPI for shared and distributed memory systems. Testing on datasets up to 32 dimensions, we achieve remarkable scalability. Our implementation clusters one billion 3D points in under one second on 28K cores at TACC’s Frontera system. In a larger run, we cluster 65 billion points in 20 dimensions in under 40 seconds using 114,688 cores. Our method is up to 37× faster than state-of-the-art parallel DBSCAN on a 20-dimensional dataset with 4 million points. Code is available at https://github.com/ut-padas/knndbscan.
  5. Abstract Immunotherapies have shown promising results in treating patients with hematological malignancies like multiple myeloma, which is an incurable but treatable bone marrow-resident plasma cell cancer. Choosing the most efficacious treatment for a patient remains a challenge in such cancers. However, pre-clinical assays involving patient-derived tumor cells co-cultured in an ex vivo reconstruction of the immune-tumor micro-environment have gained considerable attention over the past decade. Such assays can characterize a patient’s response to several therapeutic agents, including immunotherapies, in a high-throughput manner, where bright-field images of tumor (target) cells interacting with effector cells (T cells, Natural Killer (NK) cells, and macrophages) are captured once every 30 minutes for up to six days. Cell detection, tracking, and classification of thousands of cells of two or more types in each frame is bound to test the limits of some of the most advanced computer vision tools developed to date and requires a specialized approach. We propose TLCellClassifier (time-lapse cell classifier) for live cell detection, cell tracking, and cell type classification, with enhanced accuracy and efficiency obtained by integrating convolutional neural networks (CNN), metric learning, and long short-term memory (LSTM) networks, respectively. State-of-the-art computer vision software like KTH-SE and YOLOv8 are compared with TLCellClassifier, which shows improved accuracy in detection (CNN) and tracking (metric learning). A two-stage LSTM-based cell type classification method is implemented to distinguish between multiple myeloma (tumor/target) cells and macrophages/monocytes (immune/effector cells). Validation of cell type classification was done using both synthetic datasets and ex vivo experiments involving patient-derived tumor/immune cells. Availability and implementation: https://github.com/QibingJiang/cell classification ml