NSF PAR Search | NSF Public Access Repository


k-nonical space: sketching with reverse complements

https://doi.org/10.1093/bioinformatics/btae629

Marçais, Guillaume; Elder, C S; Kingsford, Carl; Nikolski, Macha (ed.) (October 2024, Bioinformatics)

Abstract. Motivation: Sequences equivalent to their reverse complements (i.e. double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g. sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: the canonical representation (k-nonical space). Results: The effect of using the canonical representation with sketching methods is understudied and poorly understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate them. In particular, we show that large stretches of the genome (“sketching deserts”) are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate these effects: (i) a new procedure that adapts existing sketching methods to k-nonical space and (ii) an optimization procedure to directly design new sketching methods for k-nonical space. Availability and implementation: The code used in this analysis is available under a permissive license at https://github.com/Kingsford-Group/mdsscope.
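
As an illustration of the folding step described in the abstract, here is a minimal Python sketch of canonical k-mers. It assumes the common convention of taking the lexicographically smaller of a k-mer and its reverse complement; the abstract does not say which convention the authors use, and the function names below are hypothetical, not taken from the mdsscope code.

```python
# Complement table for the DNA alphabet (assumption: uppercase A, C, G, T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Fold a k-mer and its reverse complement into one representative.

    Assumed convention: the lexicographically smaller of the two.
    """
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(seq: str, k: int) -> list[str]:
    """List the canonical form of every k-mer of seq, left to right."""
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]
```

With this convention, a sequence and its reverse complement yield the same multiset of canonical k-mers, which is the property that lets a sketch ignore which strand was read.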
Revisiting the complexity of and algorithms for the graph traversal edit distance and its variants

https://doi.org/10.1186/s13015-024-00262-6

Qiu, Yutong; Shen, Yihang; Kingsford, Carl (December 2024, Algorithms for Molecular Biology)

Abstract. The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly. Ebrahimpour Boroojeny et al. (2018) propose two ILP formulations for GTED and claim that GTED is polynomially solvable because the linear programming relaxation of one of the ILPs always yields optimal integer solutions. The claim that GTED is polynomially solvable contradicts the complexity results of existing string-to-graph matching problems. We resolve this conflict in complexity results by proving that GTED is NP-complete and showing that the ILPs proposed by Ebrahimpour Boroojeny et al. do not solve GTED but instead solve for a lower bound of GTED and are not solvable in polynomial time. In addition, we provide the first two correct ILP formulations of GTED and evaluate their empirical efficiency. These results provide solid algorithmic foundations for comparing genome graphs and point to the direction of heuristics. The source code to reproduce experimental results is available at https://github.com/Kingsford-Group/gtednewilp/.
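
To make the definition in the abstract concrete, the following Python sketch computes GTED by brute force on tiny graphs: it enumerates every Eulerian trail in each graph, spells out the edge labels, and takes the minimum edit distance over all pairs of spelled strings. This is not the authors' ILP formulation; it is an exponential-time illustration only. The graph encoding (a list of `(source, target, label)` edges) and all function names are assumptions made for this example, and unit-cost Levenshtein distance is assumed as the edit distance.

```python
from itertools import product


def eulerian_trail_strings(edges):
    """Return the set of strings spelled by trails that use every edge once.

    edges: list of (source, target, label) tuples (hypothetical encoding).
    Exponential-time backtracking; intended for very small graphs only.
    """
    results = set()
    used = [False] * len(edges)

    def extend(node, spelled):
        if len(spelled) == len(edges):
            results.add("".join(spelled))
            return
        for i, (u, v, label) in enumerate(edges):
            if not used[i] and u == node:
                used[i] = True
                spelled.append(label)
                extend(v, spelled)
                spelled.pop()
                used[i] = False

    for start in {u for u, _, _ in edges}:
        extend(start, [])
    return results


def edit_distance(a, b):
    """Unit-cost Levenshtein distance via the standard row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def gted_bruteforce(edges1, edges2):
    """Minimum edit distance over all pairs of Eulerian-trail strings.

    Raises ValueError if either graph has no Eulerian trail.
    """
    s1 = eulerian_trail_strings(edges1)
    s2 = eulerian_trail_strings(edges2)
    return min(edit_distance(a, b) for a, b in product(s1, s2))
```

For two three-edge cycles labeled A, C, G and A, C, T, the trails spell rotations of "ACG" and "ACT", so the brute-force distance is 1. The number of trails grows quickly with graph size, which is consistent with the hardness result stated in the abstract.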
Full Text Available
Detecting m6A RNA modification from nanopore sequencing using a semisupervised learning framework

https://doi.org/10.1101/gr.278960.124

Teng, Haotian; Stoiber, Marcus; Bar-Joseph, Ziv; Kingsford, Carl (November 2024, Genome Research)

Direct nanopore-based RNA sequencing can be used to detect posttranscriptional base modifications, such as N6-methyladenosine (m6A) methylation, based on the electric current signals produced by the distinct chemical structures of modified bases. A key challenge is the scarcity of adequate training data with known methylation modifications. We present Xron, a hybrid encoder–decoder framework that delivers a direct methylation-distinguishing basecaller by training on synthetic RNA data and immunoprecipitation (IP)-based experimental data in two steps. First, we generate data with more diverse modification combinations through in silico cross-linking. Second, we use this data set to train an end-to-end neural network basecaller followed by fine-tuning on IP-based experimental data with label smoothing. The trained neural network basecaller outperforms existing methylation detection methods on both read-level and site-level prediction scores. Xron is a standalone, end-to-end m6A-distinguishing basecaller capable of detecting methylated bases directly from raw sequencing signals, enabling de novo methylome assembly.
Full Text Available
How Much Data Is Sufficient to Learn High-Performing Algorithms?

https://doi.org/10.1145/3676278

Balcan, Maria-Florina; DeBlasio, Dan; Dick, Travis; Kingsford, Carl; Sandholm, Tuomas; Vitercik, Ellen (October 2024, Journal of the ACM)

Algorithms often have tunable parameters that impact performance metrics such as runtime and solution quality. For many algorithms used in practice, no parameter settings admit meaningful worst-case bounds, so the parameters are made available for the user to tune. Alternatively, parameters may be tuned implicitly within the proof of a worst-case approximation ratio or runtime bound. Worst-case instances, however, may be rare or nonexistent in practice. A growing body of research has demonstrated that a data-driven approach to parameter tuning can lead to significant improvements in performance. This approach uses a training set of problem instances sampled from an unknown, application-specific distribution and returns a parameter setting with strong average performance on the training set. We provide techniques for deriving generalization guarantees that bound the difference between the algorithm’s average performance over the training set and its expected performance on the unknown distribution. Our results apply no matter how the parameters are tuned, be it via an automated or manual approach. The challenge is that for many types of algorithms, performance is a volatile function of the parameters: slightly perturbing the parameters can cause a large change in behavior. Prior research [e.g., 12, 16, 20, 62] has proved generalization bounds by employing case-by-case analyses of greedy algorithms, clustering algorithms, integer programming algorithms, and selling mechanisms. We streamline these analyses with a general theorem that applies whenever an algorithm’s performance is a piecewise-constant, piecewise-linear, or, more generally, piecewise-structured function of its parameters. Our results, which are tight up to logarithmic factors in the worst case, also imply novel bounds for configuring dynamic programming algorithms from computational biology.
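
The data-driven tuning loop described in the abstract can be sketched in a few lines of Python. The example below is a toy, not the paper's method: it assumes a hypothetical one-parameter "posted price" mechanism whose performance is piecewise-linear in the parameter, draws instances from a uniform distribution chosen only for illustration, picks the candidate with the best training average, and then measures the gap between training and held-out average performance. All names and constants are assumptions.

```python
import random


def performance(param: float, instance: float) -> float:
    """Toy performance: revenue of posting price `param` to a buyer whose
    value is `instance` (hypothetical). Piecewise-linear in `param`:
    equal to param while param <= instance, then drops to 0."""
    return param if param <= instance else 0.0


def average_performance(param: float, instances: list[float]) -> float:
    """Empirical mean performance of one parameter setting."""
    return sum(performance(param, x) for x in instances) / len(instances)


def tune(candidates: list[float], training_set: list[float]) -> float:
    """Return the candidate with the best average training performance."""
    return max(candidates, key=lambda p: average_performance(p, training_set))


# Assumed setup: instances drawn uniformly from [0, 1); the seed is arbitrary.
rng = random.Random(0)
training_set = [rng.random() for _ in range(200)]
held_out_set = [rng.random() for _ in range(20000)]
candidates = [i / 20 for i in range(21)]

best = tune(candidates, training_set)
# Held-out average stands in for expected performance on the distribution.
gap = abs(average_performance(best, training_set)
          - average_performance(best, held_out_set))
print(f"best parameter: {best}, generalization gap estimate: {gap:.4f}")
```

The quantity `gap` is an empirical stand-in for the difference that the generalization guarantees bound. The sharp drop in `performance` just past `param == instance` is a small instance of the volatility the abstract mentions.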
Full Text Available
Sketching Methods with Small Window Guarantee Using Minimum Decycling Sets

https://doi.org/10.1089/cmb.2024.0544

Marçais, Guillaume; DeBlasio, Dan; Kingsford, Carl (July 2024, Journal of Computational Biology)

Full Text Available
Graph-Based Genome Inference from Hi-C Data

Shen, Y; Yu, L; Qiu, Y; Zhang, T; Kingsford, Carl (May 2024, Springer Nature)

Full Text Available
A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements

Elder, CS; Hoang, M; Ferdosi, M; Kingsford, C (May 2024, Springer Nature)

Full Text Available
Efficient Heterogeneous Meta-Learning via Channel Shuffling Modulation

Hoang, M; Kingsford, C (January 2024, OpenReview)

Full Text Available
Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme

https://doi.org/10.1089/cmb.2023.0212

Hoang, Minh; Marçais, Guillaume; Kingsford, Carl (January 2024, Journal of Computational Biology)

Full Text Available
Computationally Efficient High-Dimensional Bayesian Optimization via Variable Selection

Shen, Yihang; Kingsford, Carl (January 2023, AutoML Conference 2023)

Full Text Available
