skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: k -nonical space: sketching with reverse complements
Abstract MotivationSequences equivalent to their reverse complements (i.e. double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g. sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: The canonical representation (k-nonical space). ResultsThe effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate for them. In particular, we show that large stretches of the genome (“sketching deserts”) are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate for these effects: (i) a new procedure that adapts existing sketching methods to k-nonical space and (ii) an optimization procedure to directly design new sketching methods for k-nonical space. Availability and implementationThe code used in this analysis is available under a permissive license at https://github.com/Kingsford-Group/mdsscope.  more » « less
Award ID(s):
2232121
PAR ID:
10554409
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
40
Issue:
11
ISSN:
1367-4811
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract PremiseA probe set was previously designed to target 384 nuclear loci in the Melastomataceae family; however, when trying to use it, we encountered several practical and conceptual problems, such as the presence of sequences in reverse complement, intronic regions with stop codons, and other issues. This raised concerns regarding the use of this probe set for sequence recovery in Melastomataceae. MethodsIn order to correct these issues, we cleaned the Melastomataceae probe set, extended it with additional sequences, and compared its performance with the original version. ResultsThe final probe set targets 396 putative nuclear loci represented by 6009 template sequences. The probe set has been made available, along with details on the cleaning process, for reproducibility. We show that the new probe set performs better than the original version in terms of sequence recovery. DiscussionThis updated, extended, and cleaned probe set will improve the availability of phylogenomic resources across the Melastomataceae family. It is fully compatible with sequence recovery and extraction pipelines. The cleaning process can also be applied to any plant‐targeting probe set that would need to be cleaned or updated if new genomic resources for the targeted taxa become available. 
    more » « less
  2. Abstract MotivationThe mapping from codon to amino acid is surjective due to codon degeneracy, suggesting that codon space might harbor higher information content. Embeddings from the codon language model have recently demonstrated success in various protein downstream tasks. However, predictive models for residue-level tasks such as phosphorylation sites, arguably the most studied Post-Translational Modification (PTM), and PTM sites prediction in general, have predominantly relied on representations in amino acid space. ResultsWe introduce a novel approach for predicting phosphorylation sites by utilizing codon-level information through embeddings from the codon adaptation language model (CaLM), trained on protein-coding DNA sequences. Protein sequences are first reverse-translated into reliable coding sequences by mapping UniProt sequences to their corresponding NCBI reference sequences and extracting the exact coding sequences from their GenBank format using a dynamic programming-based global pairwise alignment. The resulting coding sequences are encoded using the CaLM encoder to generate codon-aware embeddings, which are subsequently integrated with amino acid-aware embeddings obtained from a protein language model, through an early fusion strategy. Next, a window-level representation of the site of interest, retaining the full sequence context, is constructed from the fused embeddings. A ConvBiGRU network extracts feature maps that capture spatiotemporal correlations between proximal residues within the window. This is followed by a prediction head based on a Kolmogorov-Arnold network (KAN) using the derivative of gaussian wavelet transform to generate the inference for the site. The overall model, dubbed CaLMPhosKAN, performs better than the existing approaches across multiple datasets. Availability and implementationCaLMPhosKAN is publicly available at https://github.com/KCLabMTU/CaLMPhosKAN. 
    more » « less
  3. Abstract MotivationModern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly, etc.) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to handle the ever-growing large-scale data. Seeding methods using kmers (substrings of length k) have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates as kmers cannot tolerate errors. ResultsWe propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order overall length-k strings. Finding the smallest subsequence of a string by enumeration is impractical as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and the probability of hash collision using the ABC order is close to the Jaccard index. We then show that SubseqHash overwhelmingly outperforms the substring-based seeding methods in producing high-quality seed-matches for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling the high error rates and we expect it to be widely adapted for long-reads analysis. Availability and implementationSubseqHash is freely available at https://github.com/Shao-Group/subseqhash. 
    more » « less
  4. Abstract MotivationSketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences. ResultsWe show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e. the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool. Availability and implementationScripts to reproduce our experiments are available at https://github.com/medvedevgroup/minimizer-jaccard-estimator/tree/main/reproduce. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less
  5. A longstanding observation, which was partially proven in \cite{LNW14,AHLW16}, is that any turnstile streaming algorithm can be implemented as a linear sketch (the reverse is trivially true). We study the relationship between turnstile streaming and linear sketching algorithms in more detail, giving both new separations and new equivalences between the two models. It was shown in \cite{LNW14} that, if a turnstile algorithm works for arbitrarily long streams with arbitrarily large coordinates at intermediate stages of the stream, then the turnstile algorithm is equivalent to a linear sketch. We show separations of the opposite form: if either the stream length or the maximum value of the stream are substantially restricted, there exist problems where linear sketching is exponentially harder than turnstile streaming. A further limitation of the \cite{LNW14} equivalence is that the turnstile sketching algorithm is neither explicit nor uniform, but requires an exponentially long advice string. We show how to remove this limitation for deterministic streaming algorithms: we give an explicit small-space algorithm that takes the streaming algorithm and computes an equivalent module. 
    more » « less