skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on January 31, 2026

Title: k -mer approaches for biodiversity genomics
The wide array of currently available genomes displays a wonderful diversity in size, composition, and structure and is quickly expanding thanks to several global biodiversity genomics initiatives. However, sequencing of genomes, even with the latest technologies, can still be challenging for both technical (e.g., small physical size, contaminated samples, or access to appropriate sequencing platforms) and biological reasons (e.g., germline-restricted DNA, variable ploidy levels, sex chromosomes, or very large genomes). In recent years,k-mer-based techniques have become popular to overcome some of these challenges. They are based on the simple process of dividing the analyzed sequences (e.g., raw reads or genomes) into a set of subsequences of lengthk, calledk-mers, and then analyzing the frequency or sequences of thosek-mers. Analyses based onk-mers allow for a rapid and intuitive assessment of complex sequencing data sets. Here, we provide a comprehensive review to the theoretical properties and practical applications ofk-mers in biodiversity genomics with a special focus on genome modeling.  more » « less
Award ID(s):
2216612
PAR ID:
10591504
Author(s) / Creator(s):
; ; ; ; ; ;
Publisher / Repository:
Cold Spring Harbor Laboratory Press
Date Published:
Journal Name:
Genome Research
ISSN:
1088-9051
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract High-throughput sequencing-based methods for bulked segregant analysis (BSA) allow for the rapid identification of genetic markers associated with traits of interest. BSA studies have successfully identified qualitative (binary) and quantitative trait loci (QTLs) using QTL mapping. However, most require population structures that fit the models available and a reference genome. Instead, high-throughput short-read sequencing can be combined with BSA of k-mers (BSA-k-mer) to map traits that appear refractory to standard approaches. This method can be applied to any organism and is particularly useful for species with genomes diverged from the closest sequenced genome. It is also instrumental when dealing with highly heterozygous and potentially polyploid genomes without phased haplotype assemblies and for which a single haplotype can control a trait. Finally, it is flexible in terms of population structure. Here, we apply the BSA-k-mer method for the rapid identification of candidate regions related to seed spot and seed size in diploid potato. Using a mixture of F1 and F2 individuals from a cross between 2 highly heterozygous parents, candidate sequences were identified for each trait using the BSA-k-mer approach. Using parental reads, we were able to determine the parental origin of the loci. Finally, we mapped the identified k-mers to a closely related potato genome to validate the method and determine the genomic loci underlying these sequences. The location identified for the seed spot matches with previously identified loci associated with pigmentation in potato. The loci associated with seed size are novel. Both loci are relevant in future breeding toward true seeds in potato. 
    more » « less
  2. Characterization of antibiotic resistance genes (ARGs) from high-throughput sequencing data of metagenomics and cultured bacterial samples is a challenging task, with the need to account for both computational (e.g., string algorithms) and biological (e.g., gene transfers, rearrangements) aspects. Curated ARG databases exist together with assorted ARG classification approaches (e.g., database alignment, machine learning). Besides ARGs that naturally occur in bacterial strains or are acquired through mobile elements, there are chromosomal genes that can render a bacterium resistant to antibiotics through point mutations, i.e., ARG variants (ARGVs). While ARG repositories also collect ARGVs, there are only a few tools that are able to identify ARGVs from metagenomics and high throughput sequencing data, with a number of limitations (e.g., pre-assembly,a posterioriverification of mutations, or specification of species). In this work we present thek-mer, i.e., strings of fixed lengthk, ARGV analyzer – KARGVA – an open-source, multi-platform tool that provides: (i) anad hoc, large ARGV database derived from multiple sources; (ii) input capability for various types of high-throughput sequencing data; (iii) a three-way, hash-based,k-mer search setup to process data efficiently, linkingk-mers to ARGVs,k-mers to point mutations, and ARGVs tok-mers, respectively; (iv) a statistical filter on sequence classification to reduce type I and II errors. On semi-synthetic data, KARGVA provides very high accuracy even in presence of high sequencing errors or mutations (99.2 and 86.6% accuracy within 1 and 5% base change rates, respectively), and genome rearrangements (98.2% accuracy), with robust performance onad hocfalse positive sets. On data from the worldwide MetaSUB consortium, comprising 3,700+ metagenomics experiments, KARGVA identifies more ARGVs than Resistance Gene Identifier (4.8x) and PointFinder (6.8x), yet all predictions are below the expected false positive estimates. The prevalence of ARGVs is correlated to ARGs but ecological characteristics do not explain well ARGV variance. KARGVA is publicly available athttps://github.com/DataIntellSystLab/KARGVAunder MIT license. 
    more » « less
  3. Abstract Alignments of multiple genomes are a cornerstone of comparative genomics, but generating these alignments remains technically challenging and often impractical. We developed themsa_pipelineworkflow (https://bitbucket.org/bucklerlab/msa_pipeline) to allow practical and sensitive multiple alignment of diverged plant genomes and calculation of conservation scores with minimal user inputs. As high repeat content and genomic divergence are substantial challenges in plant genome alignment, we also explored the effect of different masking approaches and parameters of the LAST aligner using genome assemblies of 33 grass species. Compared with conventional masking with RepeatMasker, a masking approach based onk‐mers (nucleotide sequences ofklength) increased the alignment rate of coding sequence and noncoding functional regions by 25 and 14%, respectively. We further found that default alignment parameters generally perform well, but parameter tuning can increase the alignment rate for noncoding functional regions by over 52% compared with default LAST settings. Finally, by increasing alignment sensitivity from the default baseline, parameter tuning can increase the number of noncoding sites that can be scored for conservation by over 76%. Overall, tuning of masking and alignment parameters can generate optimized multiple alignments to drive biological discovery in plants. 
    more » « less
  4. Dinoflagellates of the family Symbiodiniaceae are predominantly essential symbionts of corals and other marine organisms. Recent research reveals extensive genome sequence divergence among Symbiodiniaceae taxa and high phylogenetic diversity hidden behind subtly different cell morphologies. Using an alignment-free phylogenetic approach based on sub-sequences of fixed length k (i.e. k -mers), we assessed the phylogenetic signal among whole-genome sequences from 16 Symbiodiniaceae taxa (including the genera of Symbiodinium , Breviolum , Cladocopium , Durusdinium and Fugacium ) and two strains of Polarella glacialis as outgroup. Based on phylogenetic trees inferred from k -mers in distinct genomic regions (i.e. repeat-masked genome sequences, protein-coding sequences, introns and repeats) and in protein sequences, the phylogenetic signal associated with protein-coding DNA and the encoded amino acids is largely consistent with the Symbiodiniaceae phylogeny based on established markers, such as large subunit rRNA. The other genome sequences (introns and repeats) exhibit distinct phylogenetic signals, supporting the expected differential evolutionary pressure acting on these regions. Our analysis of conserved core k -mers revealed the prevalence of conserved k -mers (>95% core 23-mers among all 18 genomes) in annotated repeats and non-genic regions of the genomes. We observed 180 distinct repeat types that are significantly enriched in genomes of the symbiotic versus free-living Symbiodinium taxa, suggesting an enhanced activity of transposable elements linked to the symbiotic lifestyle. We provide evidence that representation of alignment-free phylogenies as dynamic networks enhances the ability to generate new hypotheses about genome evolution in Symbiodiniaceae. These results demonstrate the potential of alignment-free phylogenetic methods as a scalable approach for inferring comprehensive, unbiased whole-genome phylogenies of dinoflagellates and more broadly of microbial eukaryotes. 
    more » « less
  5. The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate error profiles of long-read sequencing platforms and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and the spaced k-mers and weighted k-mers embedding methods are highly accurate in predicting error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers to effectively evaluate ML models and gain a better understanding of the approach for accurate identification of critical SARS-CoV-2 genome sequences. 
    more » « less