Search for: All records

Award ID contains: 1844234

  1. Free, publicly-accessible full text available April 1, 2025
  2. Free, publicly-accessible full text available March 1, 2025
  3. Free, publicly-accessible full text available February 20, 2025
  4. This paper introduces SyncSignature, the first fully parallelizable algorithmic framework for tree similarity joins under edit distance. SyncSignature makes use of implicit-synchronized signature generation schemes, which allow for an efficient and parallelizable candidate-generation procedure via hash join. Our experiments on large real-world datasets show that the proposed algorithms under the SyncSignature framework significantly outperform the state-of-the-art algorithm in a parallel computation environment. For datasets with big trees, they also outperform the state-of-the-art algorithms by a notable margin in a centralized, single-threaded computation environment. To complement and guide the experimental study, we also provide a thorough theoretical analysis of all proposed signature generation schemes.
    Free, publicly-accessible full text available August 28, 2024
  5. Free, publicly-accessible full text available June 27, 2024
  6. null (Ed.)
  7. null (Ed.)
  8. Yann Ponty (Ed.)
    Abstract
    Motivation: Third-generation sequencing techniques, such as the Single Molecule Real Time technique from PacBio and the MinION technique from Oxford Nanopore, can generate long, error-prone sequencing reads, which pose new challenges for fragment assembly algorithms. In this paper, we study the overlap detection problem for error-prone reads, which is the first and most critical step in de novo fragment assembly. We observe that none of the state-of-the-art methods achieves ideal accuracy for overlap detection (both precision and recall are relatively low) because of the high sequencing error rates, especially when the overlaps between reads are relatively short (e.g. <2000 bases). This limitation appears inherent to these algorithms because they use q-gram-based seeds under the seed-extension framework.
    Results: We propose smooth q-gram, a variant of q-gram that captures q-gram pairs within small edit distances, and design a novel algorithm for detecting overlapping reads using smooth q-gram-based seeds. We implemented the algorithm and tested it on both PacBio and Nanopore sequencing datasets. Our benchmarking results show that our algorithm outperforms the existing q-gram-based overlap detection algorithms, especially for reads with relatively short overlaps.
    Availability and implementation: The source code of our implementation in C++ is available at https://github.com/FIGOGO/smoothq.
    Supplementary information: Supplementary data are available at Bioinformatics online.
  9. null (Ed.)