skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on March 31, 2026

Title: KNN-DBSCAN: a DBSCAN in high dimensions
Clustering is a fundamental task in machine learning. One of the most successful and broadly used algorithms is DBSCAN, a density-based clustering algorithm. DBSCAN requires ϵ-nearest neighbor graphs of the input dataset, which are computed with range-search algorithms and spatial data structures like KD-trees. Despite many efforts to design scalable implementations for DBSCAN, existing work is limited to low-dimensional datasets, as constructing ϵ-nearest neighbor graphs can be expensive in high-dimensions. This article introduces a modified DBSCAN, usingk-nearest neighbor (kNN) graphs to improve efficiency. We outline conditions forkNN-DBSCAN to match DBSCAN’s results and present a parallel implementation using OpenMP and MPI for shared and distributed memory systems. Testing on datasets up to 32 dimensions, we achieve remarkable scalability. Our implementation clusters one billion 3D points in under one second on 28K cores at TACC’s Frontera system. In a larger run, we cluster 65 billion points in 20 dimensions in under 40 seconds using 114,688 cores. Our method is up to 37× faster than state-of-the-art parallel DBSCAN on a 20-dimensional dataset with 4 million points. Code is available athttps://github.com/ut-padas/knndbscan.  more » « less
Award ID(s):
2204226
PAR ID:
10642509
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
Association for Computing Machinery
Date Published:
Journal Name:
ACM Transactions on Parallel Computing
Volume:
12
Issue:
1
ISSN:
2329-4949
Page Range / eLocation ID:
1 to 27
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Discovery of target‐binding molecules, such as aptamers and peptides, is usually performed with the use of high‐throughput experimental screening methods. These methods typically generate large datasets of sequences of target‐binding molecules, which can be enriched with high affinity binders. However, the identification of the highest affinity binders from these large datasets often requires additional low‐throughput experiments or other approaches. Bioinformatics‐based analyses could be helpful to better understand these large datasets and identify the parts of the sequence space enriched with high affinity binders.BinderSpaceis an open‐source Python package that performs motif analysis, sequence space visualization, clustering analyses, and sequence extraction from clusters of interest. The motif analysis, resulting in text‐based and visual output of motifs, can also provide heat maps of previously measured user‐defined functional properties for all the motif‐containing molecules. Users can also run principal component analysis (PCA) and t‐distributed stochastic neighbor embedding (t‐SNE) analyses on whole datasets and on motif‐related subsets of the data. Functionally important sequences can also be highlighted in the resulting PCA and t‐SNE maps. If points (sequences) in two‐dimensional maps in PCA or t‐SNE space form clusters, users can perform clustering analyses on their data, and extract sequences from clusters of interest. We demonstrate the use ofBinderSpaceon a dataset of oligonucleotides binding to single‐wall carbon nanotubes in the presence and absence of a bioanalyte, and on a dataset of cyclic peptidomimetics binding to bovine carbonic anhydrase protein.BinderSpaceis openly accessible to the public via the GitHub website:https://github.com/vukoviclab/BinderSpace. 
    more » « less
  2. Abstract Metagenomic read classification is a fundamental task in computational biology, yet it remains challenging due to the scale, diversity, and complexity of sequencing datasets. We propose a novel, run-length compressed index based on the move structure that enables efficient multi-class metagenomic classification inO(r) space, whereris the number of character runs in the BWT of the reference text. Our method identifies all super-maximal exact matches (SMEMs) of length at leastLbetween a read and the reference dataset and associates each SMEM with one class identifier using a sampled tag array. A consensus algorithm then compacts these SMEMs with their class identifier into a single classification per read. We are the first to perform run-length compressed read classification based on full SMEMs instead of semi-SMEMs. We evaluate our approach on both long and short reads in two conceptually distinct datasets: a large bacterial pan-genome with few metagenomic classes and a smaller 16S rRNA gene database spanning thousands of genera or classes. Our method consistently outperforms SPUMONI 2 in accuracy and runtime while maintaining the same asymptotic memory complexity ofO(r). Compared to Cliffy, we demonstrate better memory efficiency while achieving superior accuracy on the simpler dataset and comparable performance on the more complex one. Overall, our implementation carefully balances accuracy, runtime, and memory usage, offering a versatile solution for metagenomic classification across diverse datasets. The open-source C++11 implementation is available athttps://github.com/biointec/taggerunder the AGPL-3.0 license. 
    more » « less
  3. Cluster detection is important and widely used in a variety of applications, including public health, public safety, transportation, and so on. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging, because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. As a classical topic in data mining and learning, a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single “hotspot” of over-density and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed-up the proposed approach. Experiment results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10–20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from thousands/hundreds of spurious detections to none or just a few across 100 datasets), and the acceleration methods can improve the efficiency for both clustered and non-clustered data. 
    more » « less
  4. Abstract The widespread misuse of antibiotics has escalated antibiotic resistance into a critical global public health concern. Beyond antibiotics, metals function as antibacterial agents. Metal resistance genes (MRGs) enable bacteria to tolerate metal-based antibacterials and may also foster antibiotic resistance within bacterial communities through co-selection. Thus, predicting bacterial MRGs is vital for elucidating their involvement in antibiotic resistance and metal tolerance mechanisms. The “best hit” approach is mainly utilized to identify and annotate MRGs. This method is sensitive to cutoff values and produces a high false negative rate. Other than the best hit approach, only a few antimicrobial resistance (AMR) detection tools exist for predicting MRGs. However, these tools lack comprehensive annotation for MRGs conferring resistance to multiple metals. To address such limitations, we introduce DeepMRG, a deep learning-based multi-label classifier, to predict bacterial MRGs. Because a bacterial MRG can confer resistance to multiple metals, DeepMRG is designed as a multi-label classifier capable of predicting multiple metal labels associated with an MRG. It leverages bit score-based similarity distribution of sequences with experimentally verified MRGs. To ensure unbiased model evaluation, we employed a clustering method to partition our dataset into six subsets, five for cross-validation and one for testing, with non-homologous sequences, mitigating the impact of sequence homology. DeepMRG consistently achieved high overall F1-scores and significantly reduced false negative rates across a wide range of datasets. It can be used to predict bacterial MRGs in metagenomic or isolate assemblies. The web server of DeepMRG can be accessed athttps://deepmrg.cs.vt.edu/deepmrgand the source code is available athttps://github.com/muhit-emon/DeepMRGunder the MIT license. 
    more » « less
  5. Abstract This paper investigates convex quadratic optimization problems involvingnindicator variables, each associated with a continuous variable, particularly focusing on scenarios where the matrixQdefining the quadratic term is positive definite and its sparsity pattern corresponds to the adjacency matrix of a tree graph. We introduce a graph-based dynamic programming algorithm that solves this problem in time and memory complexity of$$\mathcal {O}(n^2)$$ O ( n 2 ) . Central to our algorithm is a precise parametric characterization of the cost function across various nodes of the graph corresponding to distinct variables. Our computational experiments conducted on both synthetic and real-world datasets demonstrate the superior performance of our proposed algorithm compared to existing algorithms and state-of-the-art mixed-integer optimization solvers. An important application of our algorithm is in the real-time inference of Gaussian hidden Markov models from data affected by outlier noise. Using a real on-body accelerometer dataset, we solve instances of this problem with over 30,000 variables in under a minute, and its online variant within milliseconds on a standard computer. A Python implementation of our algorithm is available athttps://github.com/aareshfb/Tree-Parametric-Algorithm.git. 
    more » « less