Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation

Kille, Bryce; Garrison, Erik; Treangen, Todd J; Phillippy, Adam M

doi:10.1093/bioinformatics/btad512

This content will become publicly available on September 1, 2024

Abstract

Motivation

The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates.
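For readers unfamiliar with the quantity being estimated, the following is a minimal illustrative sketch of the exact Jaccard similarity between the k-mer sets of two sequences. The function names `kmer_set` and `jaccard` and the toy sequences are assumptions made for this example and are not taken from MashMap.

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; taken to be 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two toy sequences that differ by a single substitution near the end.
a = kmer_set("ACGTACGTGA", 4)
b = kmer_set("ACGTACGTCA", 4)
print(jaccard(a, b))  # 4 shared k-mers out of 8 distinct k-mers -> 0.5
```

Computing this exactly requires every k-mer of both sequences, which is why sketching schemes such as minimizers and minhash are used to approximate it from a sample.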

Results

To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications.
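One way to picture the scheme described above is the naive sketch below, which assumes that a minmer is any k-mer that ranks among the `s` lowest-hash k-mers in at least one window of `w` consecutive k-mers, so that `s = 1` recovers ordinary minimizers. The hash function, the tie-breaking by position, the parameter names, and the treatment of repeated k-mers as distinct positions are all simplifying assumptions for illustration. This is not MashMap's implementation, which uses a rolling computation rather than re-sorting each window.

```python
import hashlib


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (an illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minmers(seq, k, w, s):
    """Return sorted start positions of k-mers that are among the s
    lowest-hash k-mers of at least one window of w consecutive k-mers."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = range(start, start + w)
        # Rank positions in the window by hash; ties go to the leftmost position.
        lowest = sorted(window, key=lambda i: (hashes[i], i))[:s]
        selected.update(lowest)
    return sorted(selected)


seq = "ACGTTGCATGCAAGCTTAGC"
minimizer_positions = minmers(seq, k=5, w=4, s=1)  # s = 1: minimizer scheme
minmer_positions = minmers(seq, k=5, w=4, s=2)     # s = 2: two samples per window
print(minimizer_positions, minmer_positions)
```

Under these assumptions, every window contributes at least `s` sampled k-mers, and raising `s` can only add positions to the selected set.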

Availability and implementation

MashMap3 is available at https://github.com/marbl/MashMap.

Award ID(s): 2126387

NSF-PAR ID: 10469202

Author(s) / Creator(s): Kille, Bryce; Garrison, Erik; Treangen, Todd J; Phillippy, Adam M

Editor(s): Robinson, Peter

Publisher / Repository: Oxford University Press

Date Published: 2023-09-01

Journal Name: Bioinformatics

Volume: 39

Issue: 9

ISSN: 1367-4811

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Journal Article:
https://doi.org/10.1093/bioinformatics/btad512
