On de novo Bridging Paired-end RNA-seq Data

Li, Xiang; Shao, Mingfu

doi:10.1145/3584371.3612987

High-throughput short-read RNA-seq protocols often produce paired-end reads, leaving the middle portion of each fragment unsequenced. We explore whether the full-length fragments can be computationally reconstructed from the two sequenced ends in the absence of a reference genome, a problem we refer to as de novo bridging. Solving this problem provides longer, more informative RNA-seq reads and benefits downstream RNA-seq analyses such as transcript assembly, expression quantification, and differential splicing analysis. However, de novo bridging is a challenging and complicated task owing to alternative splicing, transcript noise, and sequencing errors. It remains unclear whether the data provide sufficient information for accurate bridging, let alone whether efficient algorithms exist to determine the true bridges. Methods have been proposed to bridge paired-end reads in the presence of a reference genome (called reference-based bridging), but these algorithms are far from scaling to de novo bridging, as the underlying compacted de Bruijn graph (cdBG) used in the latter task often contains millions of vertices and edges. We designed a new truncated Dijkstra’s algorithm for this problem, and proposed a novel algorithm that reuses the shortest path tree to avoid running the truncated Dijkstra’s algorithm from scratch for every vertex, yielding a further speedup. These techniques result in scalable algorithms that can bridge all paired-end reads in a cdBG with millions of vertices. Our experiments showed that paired-end RNA-seq reads can be accurately bridged to a large extent. The resulting tool is freely available at https://github.com/Shao-Group/rnabridge-denovo.
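
To illustrate the general idea of truncating Dijkstra's algorithm, the following is a minimal sketch of a distance-bounded search on a weighted directed graph. It is not the authors' implementation: the function name `truncated_dijkstra`, the adjacency-list graph format, and the `max_dist` bound (standing in for something like a maximum fragment length) are assumptions made for illustration, and the objective actually optimized by the tool may differ.

```python
import heapq


def truncated_dijkstra(graph, source, max_dist):
    """Shortest distances from `source`, restricted to vertices within `max_dist`.

    graph: dict mapping a vertex to a list of (neighbor, weight) pairs,
           with non-negative weights (hypothetical format for this sketch).
    Returns a dict {vertex: distance} containing only vertices whose
    shortest distance from `source` is at most `max_dist`.
    """
    dist = {source: 0}
    heap = [(0, source)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry; u was already finalized
        settled.add(u)
        for v, w in graph.get(u, ()):
            nd = d + w
            # Truncation: never explore beyond the distance bound.
            if nd <= max_dist and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Toy graph standing in for a (much larger) compacted de Bruijn graph.
toy_graph = {
    "a": [("b", 2), ("c", 5)],
    "b": [("c", 1), ("d", 4)],
    "c": [("d", 10)],
    "d": [],
}
reachable = truncated_dijkstra(toy_graph, "a", 5)
```

Because the search never expands beyond the bound, its cost depends on the size of the explored neighborhood rather than on the whole graph, which is what makes this style of search plausible on graphs with millions of vertices.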
