Title: SPUMONI 2: improved classification using a pangenome index of minimizer digests
Abstract Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array. By incorporating minimizers, SPUMONI 2’s index is 65 times smaller than minimap2’s for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency in practical scenarios such as adaptive sampling, contamination detection and multi-class metagenomics classification.
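As a rough illustration of why minimizers shrink an index, the sketch below computes a minimizer digest of a sequence: for every window of `w` consecutive k-mers it keeps only the smallest one, so the digest has fewer k-mers than the input. This is not SPUMONI 2's implementation. The function name `minimizer_digest`, the parameters `k` and `w`, and the use of lexicographic order (real tools typically order k-mers by a hash) are all assumptions made for the example.

```python
def minimizer_digest(seq: str, k: int, w: int) -> list[str]:
    """Return the minimizers of seq, one per run of windows sharing a minimizer.

    Hypothetical helper for illustration only: k is the k-mer length and
    w is the number of consecutive k-mers per window.
    """
    # All overlapping k-mers of the sequence, in order.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    digest = []
    last_pos = -1
    for start in range(len(kmers) - w + 1):
        # Position of the lexicographically smallest k-mer in this window
        # (leftmost position wins on ties).
        pos = min(range(start, start + w), key=lambda i: (kmers[i], i))
        # Adjacent windows often share a minimizer; record it only once.
        if pos != last_pos:
            digest.append(kmers[pos])
            last_pos = pos
    return digest


if __name__ == "__main__":
    print(minimizer_digest("ACGTACGT", k=3, w=2))
```

Indexing the digest instead of the full sequence is what trades a small amount of resolution for a smaller index; larger `w` keeps fewer k-mers.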
Award ID(s):
2029552 2013998
PAR ID:
10414403
Publisher / Repository:
Springer Science + Business Media
Journal Name:
Genome Biology
Volume:
24
Issue:
1
ISSN:
1474-760X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background: Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a “similar sequence”. Traditionally, “similar sequence” was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. Results: In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in $\mathcal{O}(|t| + |p| + \ell^2)$ time and $\mathcal{O}(\ell \log \ell)$ space, where $|t|$ is the number of $k$-mers inside the sketch of the reference, $|p|$ is the number of $k$-mers inside the read’s sketch and $\ell$ is the number of times that $k$-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm’s performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2.
  2. Abstract Objective. Intracortical brain–computer interfaces (iBCIs) have demonstrated the ability to enable point and click as well as reach and grasp control for people with tetraplegia. However, few studies have investigated iBCIs during long-duration discrete movements that would enable common computer interactions such as ‘click-and-hold’ or ‘drag-and-drop’. Approach. Here, we examined the performance of multi-class and binary (attempt/no-attempt) classification of neural activity in the left precentral gyrus of two BrainGate2 clinical trial participants performing hand gestures for 1, 2, and 4 s in duration. We then designed a novel ‘latch decoder’ that utilizes parallel multi-class and binary decoding processes and evaluated its performance on data from isolated sustained gesture attempts and a multi-gesture drag-and-drop task. Main results. Neural activity during sustained gestures revealed a marked decrease in the discriminability of hand gestures sustained beyond 1 s. Compared to standard direct decoding methods, the latch decoder demonstrated substantial improvement in decoding accuracy for gestures performed independently or in conjunction with simultaneous 2D cursor control. Significance. This work highlights the unique neurophysiologic response patterns of sustained gesture attempts in human motor cortex and demonstrates a promising decoding approach that could enable individuals with tetraplegia to intuitively control a wider range of consumer electronics using an iBCI.
  3. Alkan, Can (Ed.)
    Abstract Motivation: Detection of structural variants (SVs) from the alignment of sample DNA reads to the reference genome is an important problem in understanding human diseases. Long reads that can span repeat regions, along with an accurate alignment of these long reads, play an important role in identifying novel SVs. Long-read sequencers, such as nanopore sequencing, can address this problem by providing very long reads but with high error rates, making accurate alignment challenging. Many errors induced by nanopore sequencing have a bias because of the physics of the sequencing process, and proper utilization of these error characteristics can play an important role in designing a robust aligner for SV detection problems. In this article, we design and evaluate HQAlign, an aligner for SV detection using nanopore sequenced reads. The key ideas of HQAlign include (i) using base-called nanopore reads along with the nanopore physics to improve alignments for SVs, (ii) incorporating SV-specific changes to the alignment pipeline, and (iii) adapting these into an existing state-of-the-art long-read aligner pipeline, minimap2 (v2.24), for efficient alignments. Results: We show that HQAlign captures about 4%–6% complementary SVs across different datasets, which are missed by minimap2 alignments, while having a standalone performance on par with minimap2 for real nanopore reads data. For the common SV calls between HQAlign and minimap2, HQAlign improves the start and the end breakpoint accuracy by about 10%–50% for SVs across different datasets. Moreover, HQAlign improves the alignment rate to 89.35% from minimap2's 85.64% for nanopore reads alignment to the recent telomere-to-telomere CHM13 assembly, and to 86.65% from 83.48% for nanopore reads alignment to the GRCh37 human genome. Availability and implementation: https://github.com/joshidhaivat/HQAlign.git.
  4. Abstract Widespread manganese-sensing transcriptional riboswitches effect the dependable gene regulation needed for bacterial manganese homeostasis in changing environments. Riboswitches – like most structured RNAs – are believed to fold co-transcriptionally, subject to both ligand binding and transcription events; yet how these processes are orchestrated for robust regulation is poorly understood. Through a combination of single-molecule and bulk approaches, we discover how a single Mn2+ ion and the transcribing RNA polymerase (RNAP), paused immediately downstream by a DNA template sequence, are coordinated by the bridging switch helix P1.1 in the representative Lactococcus lactis riboswitch. This coordination achieves a heretofore-overlooked semi-docked global conformation of the nascent RNA, P1.1 base pair stabilization, transcription factor NusA ejection, and RNAP pause extension, thereby enforcing transcription readthrough. Our work demonstrates how a central, adaptable RNA helix functions analogous to a molecular fulcrum of a first-class lever system to integrate disparate signals for finely balanced gene expression control.
  5. Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; Oh, A. (Ed.)
    Machine Learning (ML) research has focused on maximizing the accuracy of predictive tasks. ML models, however, are increasingly complex, resource intensive, and costlier to deploy in resource-constrained environments. These issues are exacerbated for prediction tasks with sequential classification on progressively transitioned stages with a “happens-before” relation between them. We argue that it is possible to “unfold” a monolithic single multi-class classifier, typically trained for all stages using all data, into a series of single-stage classifiers. Each single-stage classifier can be cascaded gradually from cheaper to more expensive binary classifiers that are trained using only the necessary data modalities or features required for that stage. UnfoldML is a cost-aware and uncertainty-based dynamic 2D prediction pipeline for multi-stage classification that enables (1) navigation of the accuracy/cost tradeoff space, (2) reducing the spatio-temporal cost of inference by orders of magnitude, and (3) early prediction on proceeding stages. UnfoldML achieves orders of magnitude better cost in clinical settings, while detecting multi-stage disease development in real time. It achieves within 0.1% accuracy from the highest-performing multi-class baseline, while saving close to 20X on spatio-temporal cost of inference and enabling earlier (3.5 hrs) disease onset prediction. We also show that UnfoldML generalizes to image classification, where it can predict different levels of labels (from coarse to fine) given different levels of abstraction of an image, saving close to 5X cost with as little as 0.4% accuracy reduction.