skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.
Attention:The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 7:00 AM ET to 7:30 AM ET on Friday, April 24 due to maintenance. We apologize for the inconvenience.


Title: SPUMONI 2: improved classification using a pangenome index of minimizer digests
Abstract Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array. By incorporating minimizers, SPUMONI 2’s index is 65 times smaller than minimap2’s for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency in practical scenarios such as adaptive sampling, contamination detection and multi-class metagenomics classification.  more » « less
Award ID(s):
2029552 2013998
PAR ID:
10451197
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Genome Biology
Volume:
24
Issue:
1
ISSN:
1474-760X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract BackgroundGiven a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a “similar sequence”. Traditionally, “similar sequence” was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. ResultsIn this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in$$\mathcal {O} (|t| + |p| + \ell ^2)$$ O ( | t | + | p | + 2 ) time and$$\mathcal {O} (\ell \log \ell )$$ O ( log ) space, where |t| is the number of$$k$$ k -mers inside the sketch of the reference, |p| is the number of$$k$$ k -mers inside the read’s sketch and$$\ell$$ is the number of times that$$k$$ k -mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm’s performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2. 
    more » « less
  2. Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; Oh, A. (Ed.)
    Machine Learning (ML) research has focused on maximizing the accuracy of predictive tasks. ML models, however, are increasingly more complex, resource intensive, and costlier to deploy in resource-constrained environments. These issues are exacerbated for prediction tasks with sequential classification on progressively transitioned stages with “happens-before” relation between them.We argue that it is possible to “unfold” a monolithic single multi-class classifier, typically trained for all stages using all data, into a series of single-stage classifiers. Each single- stage classifier can be cascaded gradually from cheaper to more expensive binary classifiers that are trained using only the necessary data modalities or features required for that stage. UnfoldML is a cost-aware and uncertainty-based dynamic 2D prediction pipeline for multi-stage classification that enables (1) navigation of the accuracy/cost tradeoff space, (2) reducing the spatio-temporal cost of inference by orders of magnitude, and (3) early prediction on proceeding stages. UnfoldML achieves orders of magnitude better cost in clinical settings, while detecting multi- stage disease development in real time. It achieves within 0.1% accuracy from the highest-performing multi-class baseline, while saving close to 20X on spatio- temporal cost of inference and earlier (3.5hrs) disease onset prediction. We also show that UnfoldML generalizes to image classification, where it can predict different level of labels (from coarse to fine) given different level of abstractions of a image, saving close to 5X cost with as little as 0.4% accuracy reduction. 
    more » « less
  3. Hutter F., Kersting K. (Ed.)
    A quantification learning task estimates class ratios or class distribution given a test set. Quantification learning is useful for a variety of application domains such as commerce, public health, and politics. For instance, it is desirable to automatically estimate the proportion of customer satisfaction in different aspects from product reviews to improve customer relationships. We formulate the quantification learning problem as a maximum likelihood problem and propose the first end-to-end Deep Quantification Network (DQN) framework. DQN jointly learns quantification feature representations and directly predicts the class distribution. Compared to classification-based quantification methods, DQN avoids three separate steps: classification of individual instances, calculation of the predicted class ratios, and class ratio adjustment to account for classification errors. We evaluated DQN on four public datasets, ranging from movie and product reviews to multi-class news. We compared DQN against six existing quantification methods and conducted a sensitivity analysis of DQN performance. Compared to the best existing method in our study, (1) DQN reduces Mean Absolute Error (MAE) by about 35%. (2) DQN uses around 40% less training samples to achieve a comparable MAE. 
    more » « less
  4. Pairwise sequence alignment is one of the most computationally intensive kernels in genomic data analysis, accounting for more than 90% of the runtime for key bioinformatics applications. This method is particularly expensive for third generation sequences due to the high computational cost of analyzing sequences of length between 1Kb and 1Mb. Given the quadratic overhead of exact pairwise algorithms for long alignments, the community primarily relies on approximate algorithms that search only for high-quality alignments and stop early when one is not found. In this work, we present the first GPU optimization of the popular X-drop alignment algorithm, that we named LOGAN. Results show that our high performance multi-GPU implementation achieves up to 181.6 GCUPS and speed-ups up to 6.6 and 30.7 using 1 and 6 NVIDIA Tesla V100, respectively, over the state-of-the-art software running on two IBM Power9 processors using 168 CPU threads, with equivalent accuracy. We also demonstrate a 2.3 LOGAN speed-up versus ksw2, a state-of-art vectorized algorithm for sequence alignment implemented in minimap2, a long-read mapping software. To highlight the impact of our work on a real-world application, we couple LOGAN with a many-to-many long-read alignment software called BELLA, and demonstrate that our implementation improves the overall BELLA runtime by up to 10.6. Finally, we adapt the Roofline model for LOGAN and demonstrate that our implementation is near optimal on the NVIDIA Tesla V100s. 
    more » « less
  5. Abstract Objective. Intracortical brain–computer interfaces (iBCIs) have demonstrated the ability to enable point and click as well as reach and grasp control for people with tetraplegia. However, few studies have investigated iBCIs during long-duration discrete movements that would enable common computer interactions such as ‘click-and-hold’ or ‘drag-and-drop’.Approach. Here, we examined the performance of multi-class and binary (attempt/no-attempt) classification of neural activity in the left precentral gyrus of two BrainGate2 clinical trial participants performing hand gestures for 1, 2, and 4 s in duration. We then designed a novel ‘latch decoder’ that utilizes parallel multi-class and binary decoding processes and evaluated its performance on data from isolated sustained gesture attempts and a multi-gesture drag-and-drop task.Main results. Neural activity during sustained gestures revealed a marked decrease in the discriminability of hand gestures sustained beyond 1 s. Compared to standard direct decoding methods, the Latch decoder demonstrated substantial improvement in decoding accuracy for gestures performed independently or in conjunction with simultaneous 2D cursor control.Significance. This work highlights the unique neurophysiologic response patterns of sustained gesture attempts in human motor cortex and demonstrates a promising decoding approach that could enable individuals with tetraplegia to intuitively control a wider range of consumer electronics using an iBCI. 
    more » « less