Title: GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs
Abstract: Genome search and/or classification typically involves finding the best-match database (reference) genomes and has become increasingly challenging due to the growing number of available database genomes and the fact that traditional methods do not scale well with large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) to estimate genomic distance, with a graph-based nearest-neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and will scale well with billions of genomes based on a database-splitting strategy. Further, GSearch implements a three-step search strategy depending on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.
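The core idea pairs two techniques: a fixed-size MinHash-style sketch per genome, so genomic distance can be estimated from sketches alone, and an HNSW graph over those sketches, so the best matches are found without comparing a query against every database genome. The following minimal Python sketch is illustrative only, not the GSearch implementation: it uses a simple bottom-k MinHash and a brute-force scan where GSearch would traverse its HNSW index, and the k-mer size, sketch size, and toy genome strings are arbitrary choices.

```python
import hashlib

K, SKETCH_SIZE = 8, 64  # toy parameters; real tools use larger values

def kmer_hashes(seq, k=K):
    """Hash every k-mer of a genome to a 64-bit integer."""
    for i in range(len(seq) - k + 1):
        h = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8)
        yield int.from_bytes(h.digest(), "little")

def sketch(seq, s=SKETCH_SIZE):
    """Bottom-s MinHash sketch: the s smallest distinct k-mer hash values."""
    return sorted(set(kmer_hashes(seq)))[:s]

def jaccard_estimate(sk_a, sk_b, s=SKETCH_SIZE):
    """Estimate Jaccard similarity of two genomes from their sketches."""
    a, b = set(sk_a), set(sk_b)
    merged = sorted(a | b)[:s]
    return sum(1 for h in merged if h in a and h in b) / len(merged)

def best_match(query, database):
    """Brute-force best match; GSearch replaces this O(N) scan with a
    search over a Hierarchical Navigable Small World graph."""
    q = sketch(query)
    return max(database.items(), key=lambda kv: jaccard_estimate(q, kv[1]))

db = {name: sketch(seq) for name, seq in {
    "genome_A": "ACGT" * 300,
    "genome_B": "TTGACCA" * 200,
}.items()}
print(best_match("ACGT" * 250 + "GGGG", db)[0])  # -> genome_A
```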
Award ID(s): 2129823, 1759831
PAR ID: 10524015
Author(s) / Creator(s):
Publisher / Repository: Oxford University Press
Date Published:
Journal Name: Nucleic Acids Research
Volume: 52
Issue: 16
ISSN: 0305-1048
Format(s): Medium: X
Size(s): p. e74-e74
Sponsoring Org: National Science Foundation
More Like this
  1.
    Abstract Motivation: The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem. Results: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata's state space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using ∼29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using ∼84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using ∼126 GB of memory, and over 16 h using ∼289 GB of memory, respectively. Availability and implementation: Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish. Supplementary information: Supplementary data are available at Bioinformatics online.
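The vertex-as-automaton idea can be pictured as follows: each de Bruijn graph vertex only needs to remember which single-nucleotide extensions have been observed on each of its sides, and a side that has seen more than one distinct extension marks a unitig boundary in the compacted graph. The toy Python sketch below is an illustration of that bookkeeping, not Cuttlefish's C++ implementation: it tracks the extension sets explicitly (Cuttlefish compresses them into a few automaton-state bits per vertex) and ignores canonical k-mers and reverse complements for brevity.

```python
from collections import defaultdict

def vertex_states(sequences, k=5):
    """Classify each k-mer vertex by its observed left/right extensions,
    the per-vertex information Cuttlefish tracks as a small automaton."""
    left, right = defaultdict(set), defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            km = seq[i:i + k]
            left[km]; right[km]  # touch so isolated k-mers are recorded too
            if i > 0:
                left[km].add(seq[i - 1])      # nucleotide preceding the k-mer
            if i + k < len(seq):
                right[km].add(seq[i + k])     # nucleotide following the k-mer
    states = {}
    for km in left:
        if len(left[km]) <= 1 and len(right[km]) <= 1:
            states[km] = "linear"     # internal to (or an endpoint of) a unitig
        else:
            states[km] = "branching"  # a unitig boundary in the compacted graph
    return states

if __name__ == "__main__":
    print(vertex_states(["ACGTACGTT", "ACGTACGAA"]))
```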
  2. Abstract Summary: Improvements in nanopore sequencing necessitate efficient classification methods, including pre-filtering and adaptive sampling algorithms that enrich for reads of interest. Signal-based approaches circumvent the computational bottleneck of basecalling, but past methods for signal-based classification do not scale efficiently to large, repetitive references like pangenomes, limiting their utility to partial references or individual genomes. We introduce Sigmoni: a rapid, multiclass classification method based on the r-index that scales to references of hundreds of Gbp. Sigmoni quantizes nanopore signal into a discrete alphabet of picoamp ranges. It performs rapid, approximate matching using matching statistics, classifying reads based on distributions of picoamp matching statistics and co-linearity statistics, all in linear query time without the need for seed-chain-extend. Sigmoni is 10–100× faster than previous methods for adaptive sampling in host-depletion experiments, with improved accuracy, and can query reads against large microbial or human pangenomes. Sigmoni is the first signal-based tool to scale to a complete human genome and pangenome while remaining fast enough for adaptive sampling applications. Availability and implementation: Sigmoni is implemented in Python, and is available open-source at https://github.com/vshiv18/sigmoni.
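The two steps named here, quantizing raw current into a small alphabet and computing matching statistics against a reference, are easy to illustrate. The Python sketch below is a generic illustration, not Sigmoni's code: the bin count and picoamp range are made-up parameters, and the matching statistics are computed naively, whereas Sigmoni computes them in linear time over an r-index.

```python
import numpy as np

def quantize(signal_pA, n_bins=6, lo=60.0, hi=140.0):
    """Map raw nanopore current samples (picoamps) to a discrete alphabet
    by uniform binning; bin count and range here are illustrative only."""
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]
    bins = np.digitize(signal_pA, edges)           # values 0 .. n_bins-1
    return "".join(chr(ord("A") + b) for b in bins)

def matching_statistics(query, reference):
    """Naive matching statistics: MS[i] = length of the longest prefix of
    query[i:] occurring anywhere in reference. Reads whose MS values are
    consistently long against a reference are assigned to that class."""
    ms = []
    for i in range(len(query)):
        length = 0
        while i + length < len(query) and query[i:i + length + 1] in reference:
            length += 1
        ms.append(length)
    return ms

read_signal = [72.1, 75.0, 101.3, 99.8, 130.2, 68.4]
reference = quantize([70, 74, 100, 98, 131, 69, 90, 120])
print(matching_statistics(quantize(read_signal), reference))
```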
  3. Abstract: Big graph databases provide strong modeling capabilities and efficient querying for complex applications. Subgraph isomorphism, which finds exact matches of a query graph in the database, is a challenging problem to solve efficiently. Current subgraph isomorphism approaches are mostly based on the pruning strategy proposed by Ullmann. These techniques have two significant drawbacks: first, they cannot efficiently handle complex queries, and second, their implementations need large indexes that require substantial memory. In this paper, we describe a new subgraph isomorphism approach, the HyGraph algorithm, that is efficient both in querying and in the memory required for index creation. We compare the HyGraph algorithm with two popular existing approaches, GraphQL and Cypher, using complexity measures and experimentally using three big graph data sets: (1) a country-level population database, (2) a simulated bank database, and (3) a publicly available World Cup big graph database. It is shown that the HyGraph solution performs as well as or significantly better than the competing algorithms for query operations on these big databases, making it an excellent candidate for subgraph isomorphism queries in real scenarios.
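Ullmann-style matching, the family of methods referenced above, is a backtracking search over candidate vertex mappings with aggressive pruning. The sketch below is a simplified, generic member of that family (not HyGraph's implementation): it prunes candidates by degree and by consistency with already-mapped neighbors, and it enumerates edge-preserving embeddings of a toy query in a toy data graph.

```python
def subgraph_isomorphisms(query, data):
    """Backtracking subgraph matching with degree-based pruning.
    Graphs are dicts: node -> set of neighbor nodes (undirected)."""
    q_nodes = sorted(query, key=lambda n: -len(query[n]))  # dense query nodes first

    def consistent(qn, dn, mapping):
        # every already-mapped query neighbor must map to a data neighbor of dn
        return all(mapping[nb] in data[dn] for nb in query[qn] if nb in mapping)

    def backtrack(i, mapping, used):
        if i == len(q_nodes):
            yield dict(mapping)
            return
        qn = q_nodes[i]
        for dn in data:
            if dn in used or len(data[dn]) < len(query[qn]):
                continue  # prune: already used, or too few neighbors
            if consistent(qn, dn, mapping):
                mapping[qn] = dn
                used.add(dn)
                yield from backtrack(i + 1, mapping, used)
                del mapping[qn]
                used.discard(dn)

    yield from backtrack(0, {}, set())

# toy query: a triangle; toy data graph: a square with one diagonal
query = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
data = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(list(subgraph_isomorphisms(query, data)))
```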
  4. Abstract: For over 10 years, ModelSEED has been a primary resource for the construction of draft genome-scale metabolic models based on annotated microbial or plant genomes. Now being released, the biochemistry database serves as the foundation of biochemical data underlying ModelSEED and KBase. The biochemistry database embodies several properties that, taken together, distinguish it from other published biochemistry resources by: (i) including compartmentalization, transport reactions, charged molecules and proton balancing on reactions; (ii) being extensible by the user community, with all data stored in GitHub; and (iii) design as a biochemical 'Rosetta Stone' to facilitate comparison and integration of annotations from many different tools and databases. The database was constructed by combining chemical data from many resources, applying standard transformations, identifying redundancies and computing thermodynamic properties. The ModelSEED biochemistry database is continually tested using flux balance analysis to ensure the biochemical network is modeling-ready and capable of simulating diverse phenotypes. Ontologies can be designed to aid in comparing and reconciling metabolic reconstructions that differ in how they represent various metabolic pathways. ModelSEED now includes 33,978 compounds and 36,645 reactions, available as a set of extensible files on GitHub, and available to search at https://modelseed.org/biochem and KBase.
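The modeling-readiness check mentioned above, flux balance analysis, amounts to a linear program: maximize a target flux (e.g. biomass) subject to steady-state mass balance and flux bounds. The example below is a toy three-reaction network solved with SciPy, not ModelSEED data; the stoichiometry and bounds are made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Toy flux balance analysis. Metabolites: A, B.
# Reactions: R1 (uptake -> A), R2 (A -> B), R3 ("biomass", B ->).
S = np.array([
    [1, -1,  0],   # mass balance for A
    [0,  1, -1],   # mass balance for B
])
bounds = [(0, 10), (0, 1000), (0, 1000)]  # uptake capped at 10 flux units
c = np.zeros(3)
c[2] = -1.0                               # maximize v_biomass (linprog minimizes)

res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
print("max biomass flux:", res.x[2])      # -> 10.0, limited by the uptake bound
```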
  5. Abstract: Carbohydrate Active EnZymes (CAZymes) are critically important for microbial communities to thrive in carbohydrate-rich environments such as animal guts, agricultural soils, forest floors, and ocean sediments. Since 2017, microbiome sequencing and assembly have produced numerous metagenome-assembled genomes (MAGs). We have updated our dbCAN-seq database (https://bcb.unl.edu/dbCAN_seq) to include the following new data and features: (i) ∼498 000 CAZymes and ∼169 000 CAZyme gene clusters (CGCs) from 9421 MAGs of four ecological (human gut, human oral, cow rumen, and marine) environments; (ii) Glycan substrates for 41 447 (24.54%) CGCs, inferred by two novel approaches (dbCAN-PUL homology search and eCAMI subfamily majority voting), which agreed on substrate assignments for 4183 CGCs; (iii) A redesigned CGC page to include the graphical display of CGC gene compositions, the alignment of query CGC and subject PUL (polysaccharide utilization loci) of dbCAN-PUL, and the eCAMI subfamily table to support the predicted substrates; (iv) A statistics page to organize all the data for easy CGC access according to substrates and taxonomic phyla; and (v) A batch download page. In summary, this updated dbCAN-seq database highlights glycan substrates predicted for CGCs from microbiomes. Future work will implement the substrate prediction function in our dbCAN2 web server.
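The subfamily majority-voting step can be pictured as a simple tally over the CAZyme genes in a cluster: each gene's subfamily carries a substrate annotation, and the cluster inherits the most-supported substrate. The sketch below is a generic illustration with hypothetical subfamily-to-substrate labels, not the eCAMI or dbCAN-seq code.

```python
from collections import Counter

# Hypothetical subfamily -> substrate annotations (illustrative only).
SUBFAMILY_SUBSTRATE = {"GH13_sub1": "starch", "GH13_sub2": "starch", "GT2_sub7": "cellulose"}

def vote_substrate(cgc_subfamilies, min_votes=2):
    """Assign a CAZyme gene cluster (CGC) the substrate supported by the
    most member subfamilies; return None if support is too weak."""
    tally = Counter(SUBFAMILY_SUBSTRATE[s] for s in cgc_subfamilies
                    if s in SUBFAMILY_SUBSTRATE)
    if not tally:
        return None
    substrate, votes = tally.most_common(1)[0]
    return substrate if votes >= min_votes else None

print(vote_substrate(["GH13_sub1", "GH13_sub2", "GT2_sub7"]))  # -> starch
```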