The wide array of currently available genomes displays a wonderful diversity in size, composition, and structure and is quickly expanding thanks to several global biodiversity genomics initiatives. However, sequencing of genomes, even with the latest technologies, can still be challenging for both technical (e.g., small physical size, contaminated samples, or access to appropriate sequencing platforms) and biological reasons (e.g., germline-restricted DNA, variable ploidy levels, sex chromosomes, or very large genomes). In recent years,k-mer-based techniques have become popular to overcome some of these challenges. They are based on the simple process of dividing the analyzed sequences (e.g., raw reads or genomes) into a set of subsequences of lengthk, calledk-mers, and then analyzing the frequency or sequences of thosek-mers. Analyses based onk-mers allow for a rapid and intuitive assessment of complex sequencing data sets. Here, we provide a comprehensive review to the theoretical properties and practical applications ofk-mers in biodiversity genomics with a special focus on genome modeling.
more »
« less
Abakus: Accelerating k -mer Counting with Storage Technology
This work seeks to leverage Processing-with-storage-technology (PWST) to accelerate a key bioinformatics kernel calledk-mer counting, which involves processing large files of sequence data on the disk to build a histogram of fixed-size genome sequence substrings and thereby entails prohibitively high I/O overhead. In particular, this work proposes a set of accelerator designs called Abakus that offer varying degrees of tradeoffs in terms of performance, efficiency, and hardware implementation complexity. The key to these designs is a set of domain-specific hardware extensions to accelerate the key operations fork-mer counting at various levels of the SSD hierarchy, with the goal of enhancing the limited computing capabilities of conventional SSDs, while exploiting the parallelism of the multi-channel, multi-way SSDs. Our evaluation suggests that Abakus can achieve 8.42×, 6.91×, and 2.32× speedup over the CPU-, GPU-, and near-data processing solutions.
more »
« less
- Award ID(s):
- 2120019
- PAR ID:
- 10534719
- Publisher / Repository:
- ACM Digital Library
- Date Published:
- Journal Name:
- ACM Transactions on Architecture and Code Optimization
- Volume:
- 21
- Issue:
- 1
- ISSN:
- 1544-3566
- Page Range / eLocation ID:
- 1 to 26
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
SEquence Evaluation throughk-mer Representation (SEEKR) is a method of sequence comparison that uses sequence substrings calledk-mers to quantify the nonlinear similarity between nucleic acid species. We describe the development of new functions within SEEKR that enable end-users to estimateP-values that ascribe statistical significance to SEEKR-derived similarities, as well as visualize different aspects ofk-mer similarity. We apply the new functions to identify chromatin-enriched lncRNAs that containXIST-like sequence features, and we demonstrate the utility of applying SEEKR on lncRNA fragments to identify potential RNA-protein interaction domains. We also highlight ways in which SEEKR can be applied to augment studies of lncRNA conservation, and we outline the best practice of visualizing RNA-seq read density to evaluate support for lncRNA annotations before their in-depth study in cell types of interest.more » « less
-
Abstract Motivation Despite numerous RNA-seq samples available at large databases, most RNA-seq analysis tools are evaluated on a limited number of RNA-seq samples. This drives a need for methods to select a representative subset from all available RNA-seq samples to facilitate comprehensive, unbiased evaluation of bioinformatics tools. In sequence-based approaches for representative set selection (e.g. a k-mer counting approach that selects a subset based on k-mer similarities between RNA-seq samples), because of the large numbers of available RNA-seq samples and of k-mers/sequences in each sample, computing the full similarity matrix using k-mers/sequences for the entire set of RNA-seq samples in a large database (e.g. the SRA) has memory and runtime challenges; this makes direct representative set selection infeasible with limited computing resources. Results We developed a novel computational method called ‘hierarchical representative set selection’ to handle this challenge. Hierarchical representative set selection is a divide-and-conquer-like algorithm that breaks representative set selection into sub-selections and hierarchically selects representative samples through multiple levels. We demonstrate that hierarchical representative set selection can achieve summarization quality close to that of direct representative set selection, while largely reducing runtime and memory requirements of computing the full similarity matrix (up to 8.4× runtime reduction and 5.35× memory reduction for 10 000 and 12 000 samples respectively that could be practically run with direct subset selection). We show that hierarchical representative set selection substantially outperforms random sampling on the entire SRA set of RNA-seq samples, making it a practical solution to representative set selection on large databases like the SRA. Availability and implementation The code is available at https://github.com/Kingsford-Group/hierrepsetselection and https://github.com/Kingsford-Group/jellyfishsim. Supplementary information Supplementary data are available at Bioinformatics online.more » « less
-
Abstract A colored de Bruijn graph (also called a set of k-mer sets), is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. There have been numerous indexing data structures proposed that allow to store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools and all datasets, with no other tool able to consistently achieve less than 44% space overhead. The software is available athttp://github.com/medvedevgroup/ESSColor.more » « less
-
Abstract MotivationSequences equivalent to their reverse complements (i.e. double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g. sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: The canonical representation (k-nonical space). ResultsThe effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate for them. In particular, we show that large stretches of the genome (“sketching deserts”) are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate for these effects: (i) a new procedure that adapts existing sketching methods to k-nonical space and (ii) an optimization procedure to directly design new sketching methods for k-nonical space. Availability and implementationThe code used in this analysis is available under a permissive license at https://github.com/Kingsford-Group/mdsscope.more » « less
An official website of the United States government

