Title: Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach
Abstract The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files since most existing tools are either general-purpose or specialized for short read data. We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring, which focuses only on the base sequences in the FASTQ file, uses just 0.35–0.65 bits per base, which is 3–6× lower than general-purpose compressors like gzip. NanoSpring is competitive in compression ratio and compression resource usage with the state-of-the-art tool CoLoRd while being significantly faster at decompression when using multiple threads (> 4× faster decompression with 20 threads). NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring.
Award ID(s):
2106467
PAR ID:
10478478
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Scientific Reports
Date Published:
Journal Name:
Scientific Reports
Volume:
13
Issue:
1
ISSN:
2045-2322
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background: Genome assembly, which involves reconstructing a target genome, relies on scaffolding methods to organize and link partially assembled fragments. The rapid evolution of long read sequencing technologies toward more accurate long reads, coupled with the continued use of short read technologies, has created a unique need for hybrid assembly workflows. The construction of accurate genomic scaffolds in hybrid workflows is complicated due to scale, sequencing technology diversity (e.g., short vs. long reads, contigs or partial assemblies), and repetitive regions within a target genome. Results: In this paper, we present a new parallel workflow for hybrid genome scaffolding that would allow combining pre-constructed partial assemblies with newly sequenced long reads toward an improved assembly. More specifically, the workflow, called , is aimed at generating long scaffolds of a target genome, from two sets of input sequences: an already constructed partial assembly of contigs, and a set of newly sequenced long reads. Our scaffolding approach internally uses an alignment-free mapping step to build a ⟨contig, contig⟩ graph using long reads as linking information. Subsequently, this graph is used to generate scaffolds. We present and evaluate a graph-theoretic “wiring” heuristic to perform this scaffolding step. To enable efficient workload management in a parallel setting, we use a batching technique that partitions the scaffolding tasks so that the more expensive alignment-based assembly step at the end can be efficiently parallelized. This step also allows the use of any standalone assembler for generating the final scaffolds. Conclusions: Our experiments with on a variety of input genomes, and comparison against two state-of-the-art hybrid scaffolders demonstrate that is able to generate longer and more accurate scaffolds substantially faster.
In almost all cases, the scaffolds produced by are at least an order of magnitude longer (in some cases two orders) than the scaffolds produced by state-of-the-art tools. runs significantly faster too, reducing time-to-solution from hours to minutes for most input cases. We also performed a coverage experiment by varying the sequencing coverage depth for long reads, which demonstrated the potential of to generate significantly longer scaffolds in low coverage settings (1×–10×).
  2. Abstract FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing, which takes parameters that let us tune the average length of the phrases, instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates the performance of count over all state-of-the-art methods with a moderate increase in the memory. The source code for PFP-FM is available at https://github.com/AaronHong1024/afm.
  3. Abstract Database peptide search is the primary computational technique for identifying peptides from the mass spectrometry (MS) data. Graphical Processing Units (GPU) computing is now ubiquitous in the current-generation of high-performance computing (HPC) systems, yet its application in the database peptide search domain remains limited. Part of the reason is the use of sub-optimal algorithms in the existing GPU-accelerated methods resulting in significantly inefficient hardware utilization. In this paper, we design and implement a new-age CPU-GPU HPC framework, called GiCOPS, for efficient and complete GPU-acceleration of the modern database peptide search algorithms on supercomputers. Our experimentation shows that the GiCOPS exhibits between 1.2 to 5× speed improvement over its CPU-only predecessor, HiCOPS, and over 10× improvement over several existing GPU-based database search algorithms for sufficiently large experiment sizes. We further assess and optimize the performance of our framework using the Roofline Model and report near-optimal results for several metrics including computations per second, occupancy rate, memory workload, branch efficiency and shared memory performance. Finally, the CPU-GPU methods and optimizations proposed in our work for complex integer- and memory-bounded algorithmic pipelines can also be extended to accelerate the existing and future peptide identification algorithms. GiCOPS is now integrated with our umbrella HPC framework HiCOPS and is available at: https://github.com/pcdslab/gicops.
  4. Abstract Deep Neural Networks (DNNs) are increasingly deployed in critical applications, where ensuring their safety and robustness is paramount. We present $$_\text{CAV25}$$, a high-performance DNN verification tool that uses the DPLL(T) framework and supports a wide range of network architectures and activation functions. Since its debut in VNN-COMP’23, in which it achieved the New Participant Award and ranked 4th overall, $$_\text{CAV25}$$ has advanced significantly, achieving second place in VNN-COMP’24. This paper presents and evaluates the latest development of $$_\text{CAV25}$$, focusing on the versatility, ease of use, and competitive performance of the tool. $$_\text{CAV25}$$ is available at: https://github.com/dynaroars/neuralsat.
  5. Abstract Given a suitable solution V(t, x) to the Korteweg–de Vries equation on the real line, we prove global well-posedness for initial data $$u(0,x) \in V(0,x) + H^{-1}(\mathbb{R})$$. Our conditions on V do include regularity but do not impose any assumptions on spatial asymptotics. We show that periodic profiles $$V(0,x)\in H^5(\mathbb{R}/\mathbb{Z})$$ satisfy our hypotheses. In particular, we can treat localized perturbations of the much-studied periodic traveling wave solutions (cnoidal waves) of KdV. In the companion paper Laurens (Nonlinearity. 35(1):343–387, 2022. https://doi.org/10.1088/1361-6544/ac37f5) we show that smooth step-like initial data also satisfy our hypotheses. We employ the method of commuting flows introduced in Killip and Vişan (Ann. Math. (2) 190(1):249–305, 2019. https://doi.org/10.4007/annals.2019.190.1.4) where $$V\equiv 0$$. In that setting, it is known that $$H^{-1}(\mathbb{R})$$ is sharp in the class of $$H^s(\mathbb{R})$$ spaces.