Title: Applying the Safe-And-Complete Framework to Practical Genome Assembly
Despite the long history of genome assembly research, there remains a large gap between the theoretical and practical work. There is practical software with little theoretical underpinning of accuracy on one hand and theoretical algorithms which have not been adopted in practice on the other. In this paper we attempt to bridge the gap between theory and practice by showing how the theoretical safe-and-complete framework can be integrated into existing assemblers in order to improve contiguity. The optimal algorithm in this framework, called the omnitig algorithm, has not been used in practice due to its complexity and its lack of robustness to real data. Instead, we pursue a simplified notion of omnitigs (simple omnitigs), giving an efficient algorithm to compute them and demonstrating their safety under certain conditions. We modify two assemblers (wtdbg2 and Flye) by replacing their unitig algorithm with the simple omnitig algorithm. We test our modifications using real HiFi data from the D. melanogaster and the C. elegans genomes. Our modified algorithms lead to a substantial improvement in alignment-based contiguity, with negligible additional computational costs and either no or a small increase in the number of misassemblies.
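For orientation, the baseline that the abstract says is being replaced (the unitig step) can be sketched as follows. This is a minimal illustration of maximal non-branching path extraction on a plain directed graph, not the simple omnitig algorithm from the paper and not the code used in wtdbg2 or Flye. The function name `unitigs` and the edge-list input are assumptions made for this example, and real assemblers work on bidirected graphs with reverse complements, which the sketch ignores.

```python
from collections import defaultdict


def unitigs(edges):
    """Return maximal non-branching paths of a directed graph (illustrative only)."""
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    def mergeable(u, v):
        # Edge u->v can be compacted only if it is u's sole exit and v's sole entry.
        return len(succ[u]) == 1 and len(pred[v]) == 1

    def is_start(v):
        # v opens a new unitig unless it can be merged onto its unique predecessor.
        return not (len(pred[v]) == 1 and mergeable(pred[v][0], v))

    seen, result = set(), []
    for v in sorted(nodes):
        if not is_start(v):
            continue
        path = [v]
        seen.add(v)
        # Extend to the right while the next edge is non-branching on both ends.
        while len(succ[path[-1]]) == 1 and mergeable(path[-1], succ[path[-1]][0]):
            nxt = succ[path[-1]][0]
            path.append(nxt)
            seen.add(nxt)
        result.append(path)

    # Nodes never reached from a start lie on isolated cycles; report each cycle once.
    for v in sorted(nodes):
        if v in seen:
            continue
        path = [v]
        seen.add(v)
        nxt = succ[v][0]
        while nxt not in seen:
            path.append(nxt)
            seen.add(nxt)
            nxt = succ[nxt][0]
        result.append(path)
    return result


print(unitigs([("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")]))
# [['a', 'b', 'c'], ['d'], ['e']]
```

The path a, b, c is compacted because every edge on it is the only way out of its tail and the only way into its head; the branch at c ends the unitig. Omnitigs, as described in the abstract, aim to extend such safe walks further, which is where the reported contiguity gain would come from.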
Award ID(s):
2138585 1931531
PAR ID:
10616429
Author(s) / Creator(s):
; ; ;
Editor(s):
Pissis, Solon P; Sung, Wing-Kin
Publisher / Repository:
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Date Published:
Volume:
312
ISSN:
1868-8969
ISBN:
978-3-95977-340-9
Page Range / eLocation ID:
8:1-8:16
Subject(s) / Keyword(s):
Genome assembly; Omnitigs; Safe-and-complete framework; graph algorithm; HiFi sequencing data; Assembly evaluation; Applied computing → Computational biology; Mathematics of computing → Paths and connectivity problems; Theory of computation → Graph algorithms analysis
Format(s):
Medium: X Size: 16 pages; 1032186 bytes Other: application/pdf
Size(s):
16 pages 1032186 bytes
Right(s):
Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper we describe a new parallel algorithm called Fast Adaptive Sequencing Technique (FAST) for maximizing a monotone submodular function under a cardinality constraint k. This algorithm achieves the optimal 1-1/e approximation guarantee and is orders of magnitude faster than the state-of-the-art on a variety of experiments over real-world data sets. Following recent work by Balkanski & Singer (2018a), there has been a great deal of research on algorithms whose theoretical parallel runtime is exponentially faster than algorithms used for submodular maximization over the past 40 years. However, while these new algorithms are fast in terms of asymptotic worst-case guarantees, it is computationally infeasible to use them in practice even on small data sets because the number of rounds and queries they require depends on large constants and high-degree polynomials in terms of precision and confidence. The design principles behind the FAST algorithm we present here are a significant departure from those of recent theoretically fast algorithms. Rather than optimize for asymptotic theoretical guarantees, the design of FAST introduces several new techniques that achieve remarkable practical and theoretical parallel runtimes. The approximation guarantee obtained by FAST is arbitrarily close to 1-1/e, and its asymptotic parallel runtime (adaptivity) is O(log(n) log^2(log k)) using O(n log log(k)) total queries. We show that FAST is orders of magnitude faster than any algorithm for submodular maximization we are aware of, including hyper-optimized parallel versions of state-of-the-art serial algorithms, by running experiments on large data sets.
  2. The construction of bounded-degree plane geometric spanners has been a focus of interest since 2002 when Bose, Gudmundsson, and Smid proposed the first algorithm to construct such spanners. To date, 11 algorithms have been designed with various tradeoffs in degree and stretch-factor. We have implemented these sophisticated spanner algorithms in C++ using the CGAL library and experimented with them using large synthetic and real-world pointsets. Our experiments have revealed their practical behavior and real-world efficacy. We share the implementations via GitHub for broader uses and future research. We design and engineer EstimateStretchFactor, a simple practical algorithm, which can estimate stretch-factors (obtains lower bounds on the exact stretch-factors) of geometric spanners, a challenging problem for which no practical algorithm is known yet. In our experiments with bounded-degree plane geometric spanners, we found that EstimateStretchFactor estimated stretch-factors almost precisely. Further, it gave linear runtime performance in practice for the pointset distributions considered in this work, making it much faster than the naive Dijkstra-based algorithm for calculating stretch-factors.
  3. Parameter-free stochastic gradient descent (PFSGD) algorithms do not require setting learning rates while achieving optimal theoretical performance. In practical applications, however, there remains an empirical gap between tuned stochastic gradient descent (SGD) and PFSGD. In this paper, we close the empirical gap with a new parameter-free algorithm based on continuous-time Coin-Betting on truncated models. The new update is derived through the solution of an Ordinary Differential Equation (ODE) and solved in a closed form. We show empirically that this new parameter-free algorithm outperforms algorithms with the "best default" learning rates and almost matches the performance of finely tuned baselines without anything to tune. 
  4. Due to mainstream adoption of cloud computing and its rapidly increasing usage of energy, the efficient management of cloud computing resources has become an important issue. A key challenge in managing the resources lies in the volatility of their demand. While a wide variety of online algorithms (e.g. Receding Horizon Control, Online Balanced Descent) have been designed, it is hard for cloud operators to pick the right algorithm. In particular, these algorithms vary greatly in their usage of predictions and in their performance guarantees. This paper aims at studying an automatic algorithm selection scheme in real time. To do this, we empirically study the prediction errors from real-world cloud computing traces. Results show that prediction errors differ across prediction algorithms, across virtual machines, and over the time horizon. Based on these observations, we propose a simple prediction error model and prove upper bounds on the dynamic regret of several online algorithms. We then apply the empirical and theoretical results to create a simple online meta-algorithm that chooses the best algorithm on the fly. Numerical simulations demonstrate that the performance of the designed policy is close to that of the best algorithm in hindsight.
  5. To this day, the maximum clique problem remains a computationally challenging problem. Indeed, despite researchers’ best efforts, there exist unsolved benchmark instances with 1,000 vertices. However, relatively simple algorithms solve real-life instances with millions of vertices in a few seconds. Why is this the case? Why is the problem apparently so easy in many naturally occurring networks? In this paper, we provide an explanation. First, we observe that the graph’s clique number ω is very near to the graph’s degeneracy d in most real-life instances. This observation motivates a main contribution of this paper, which is an algorithm for the maximum clique problem that runs in time polynomial in the size of the graph, but exponential in the gap [Formula: see text] between the clique number ω and its degeneracy-based upper bound d+1. When this gap [Formula: see text] can be treated as a constant, as is often the case for real-life graphs, the proposed algorithm runs in time [Formula: see text]. This provides a rigorous explanation for the apparent easiness of these instances despite the intractability of the problem in the worst case. Further, our implementation of the proposed algorithm is actually practical: it is competitive with the best approaches from the literature.
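As a small illustration of the degeneracy bound mentioned in the last abstract above (clique number ω at most d+1), the sketch below computes the degeneracy d of an undirected graph by repeatedly removing a minimum-degree vertex. It is not the maximum clique algorithm from that paper; the function name `degeneracy` and the adjacency-dict input are assumptions for this example, and the quadratic-time vertex selection is chosen for brevity rather than speed.

```python
def degeneracy(adj):
    """Degeneracy of an undirected simple graph given as {node: set(neighbours)}."""
    # Work on a copy so the caller's graph is left untouched.
    remaining = {v: set(ns) for v, ns in adj.items()}
    d = 0
    while remaining:
        # Peel off a vertex of minimum degree in the remaining graph.
        v = min(remaining, key=lambda x: len(remaining[x]))
        d = max(d, len(remaining[v]))
        for u in remaining[v]:
            remaining[u].discard(v)
        del remaining[v]
    return d


# A triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(degeneracy(graph))  # 2, so any clique has at most 3 vertices
```

In this toy graph the largest clique is the triangle, so ω = 3 = d + 1 and the gap between ω and the bound is zero, the regime the abstract describes as common in real-life instances.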