Search for: All records

Award ID contains: 2019797

Note: Clicking a Digital Object Identifier (DOI) link takes you to an external site maintained by the publisher. Some full-text articles may not be available free of charge during the publisher's embargo period.

  1. Motivation: High-throughput RNA sequencing has become indispensable for decoding gene activities, yet the challenge of reconstructing full-length transcripts persists. Traditional single-sample assemblers frequently produce fragmented transcripts, especially on single-cell RNA-seq data. While algorithms designed for assembling multiple samples exist, they encounter various limitations. Results: We present Aletsch, a new assembler for multiple bulk or single-cell RNA-seq samples. Aletsch incorporates several algorithmic innovations, including a “bridging” system that can effectively integrate multiple samples to restore junctions missed in individual samples, and a new graph-decomposition algorithm that leverages “supporting” information across multiple samples to guide the decomposition of complex vertices. A standout feature of Aletsch is its use of a random forest model with 50 well-designed features for scoring transcripts. We demonstrate its robust adaptability across different chromosomes, datasets, and species. Our experiments, conducted on RNA-seq data from several protocols, firmly demonstrate that Aletsch significantly outperforms existing meta-assemblers. For example, when measured with the partial area under the precision-recall curve (pAUC, constrained by precision), Aletsch surpasses the leading assemblers TransMeta by 22.9%–62.1% and PsiCLASS by 23.0%–175.5% on human datasets. Availability and implementation: Aletsch is freely available at https://github.com/Shao-Group/aletsch. Scripts that reproduce the experimental results of this manuscript are available at https://github.com/Shao-Group/aletsch-test.
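     As a rough illustration of the transcript-scoring idea above, here is a minimal sketch, assuming scikit-learn and synthetic stand-in features; the actual 50 features and training setup are defined in the paper and are not reproduced here.

```python
# Hypothetical sketch of random-forest transcript scoring; feature values,
# labels, and dimensions are synthetic stand-ins, not Aletsch's real features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Each row is an assembled transcript; each of the 50 columns is a feature
# (e.g. coverage statistics, number of supporting samples, junction evidence).
X_train = rng.random((1000, 50))
y_train = rng.integers(0, 2, size=1000)   # 1 = matches a reference transcript

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# The positive-class probability serves as the confidence score that would
# drive a precision-recall curve such as the pAUC reported above.
scores = model.predict_proba(rng.random((5, 50)))[:, 1]
print(scores)
```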
  2. Motivation: Many tasks in sequence analysis ask to identify biologically related sequences in a large set. The edit distance, being a sensible model for both evolution and sequencing error, is widely used in these tasks as a measure. The resulting computational problem, recognizing all pairs of sequences within a small edit distance, turns out to be exceedingly difficult, since the edit distance is notoriously expensive to compute and all-versus-all comparison is simply not feasible with millions or billions of sequences. Among many attempts, we recently proposed locality-sensitive bucketing (LSB) functions to meet this challenge. Formally, a (d₁, d₂)-LSB function sends sequences into multiple buckets with the guarantee that pairs of sequences of edit distance at most d₁ can be found within the same bucket while those of edit distance at least d₂ do not share any. LSB functions generalize locality-sensitive hashing (LSH) functions and admit favorable properties, a notable highlight being that optimal LSB functions exist for certain (d₁, d₂). LSB functions hold the potential to solve the above problems optimally, but the existence of LSB functions for more general (d₁, d₂) remains unclear, let alone constructing them for practical use. Results: In this work, we aim to utilize machine learning techniques to train LSB functions. With the development of a novel loss function and insights into neural network structures that can potentially extend beyond this specific task, we obtained LSB functions that exhibit nearly perfect accuracy for certain (d₁, d₂), matching our theoretical results, and high accuracy for many others. Compared to the state-of-the-art LSH method Order Min Hash, the trained LSB functions achieve a 2- to 5-fold improvement in the sensitivity of recognizing similar sequences. An experiment on analyzing erroneous cell barcode data is also included to demonstrate an application of the trained LSB functions. Availability and implementation: The code for the training process and the structure of trained models are freely available at https://github.com/Shao-Group/lsb-learn.
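     Purely as an illustration of what training a bucketing function can mean, here is a minimal PyTorch sketch: soft bucket memberships trained with a hinge-style pairwise loss. The paper's actual loss function and architecture are novel and differ from this toy.

```python
# Purely illustrative PyTorch sketch of a learned bucketing function; the
# paper's loss and network structure are novel and are NOT reproduced here.
import torch
import torch.nn as nn

B, K, ALPHA = 64, 20, 4              # buckets, sequence length, alphabet size

net = nn.Sequential(                 # one-hot sequences -> soft memberships
    nn.Flatten(),
    nn.Linear(K * ALPHA, 128), nn.ReLU(),
    nn.Linear(128, B), nn.Sigmoid(),
)

def overlap(a, b):
    # Soft proxy for "shares at least one bucket": best joint membership.
    return (net(a) * net(b)).max(dim=1).values

def loss(close_a, close_b, far_a, far_b, margin=0.5):
    # Pairs within edit distance d1 should overlap (pushed toward 1);
    # pairs at distance >= d2 should share no bucket (pushed toward 0).
    return (torch.relu(margin - overlap(close_a, close_b)).mean()
            + torch.relu(overlap(far_a, far_b) - (1 - margin)).mean())

def random_onehot(n):                # random stand-in "sequences"
    return torch.eye(ALPHA)[torch.randint(ALPHA, (n, K))]

# One toy optimization step.
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
opt.zero_grad()
l = loss(random_onehot(32), random_onehot(32),
         random_onehot(32), random_onehot(32))
l.backward()
opt.step()
print(float(l))
```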
  3. Motivation: Modern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to handle the ever-growing large-scale data. Seeding methods using kmers (substrings of length k) have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates as kmers cannot tolerate errors. Results: We propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order over all length-k strings. Finding the smallest subsequence of a string by enumeration is impractical as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed the ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and that the probability of hash collision under the ABC order is close to the Jaccard index. We then show that SubseqHash overwhelmingly outperforms substring-based seeding methods in producing high-quality seed-matches for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling high error rates, and we expect it to be widely adopted for long-read analysis. Availability and implementation: SubseqHash is freely available at https://github.com/Shao-Group/subseqhash.
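     For intuition only, the impractical enumeration baseline mentioned above is easy to state. In this sketch the plain lexicographic order stands in for the ABC order, a deliberate simplification, since the ABC order and its polynomial-time algorithm are the paper's actual contribution.

```python
# Brute-force SubseqHash baseline: the smallest length-k subsequence of s,
# here under the plain lexicographic order. Enumerating all C(n, k)
# subsequences is exactly the impractical approach the abstract rules out.
from itertools import combinations

def min_subsequence(s: str, k: int) -> str:
    return min("".join(c) for c in combinations(s, k))

print(min_subsequence("GATTACA", 3))   # 'AAA'
```

     Under the lexicographic order a simple greedy algorithm would also work; the point of the ABC order is to keep the minimum computable in polynomial time while giving hash-like collision behavior across all length-k strings.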
  4. Background: Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Results: In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be (d₁, d₂)-sensitive if any two sequences within an edit distance of d₁ are mapped into at least one shared bucket, and any two sequences with distance at least d₂ are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of (d₁, d₂) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds on these two parameters in different settings and show that some of our constructed LSB functions are optimal. Conclusion: These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights into the hardness of designing ungapped LSH functions.
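     The definition is concrete enough to check directly. Below is a small sketch that tests the (d₁, d₂)-sensitive property on an explicit set of sequences, using one-deletion neighborhoods as the candidate bucketing: for fixed-length sequences, two strings share a one-deletion bucket only if their edit distance is at most 2, so this candidate behaves as a (1, 3)-sensitive function. The example values are invented.

```python
# Check the (d1, d2)-sensitive bucketing property on explicit sequences.
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    dp = list(range(len(b) + 1))            # rolling-array Levenshtein
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def is_lsb(buckets_of, seqs, d1, d2):
    """buckets_of: sequence -> set of bucket ids."""
    for a, b in combinations(seqs, 2):
        d, shared = edit_distance(a, b), buckets_of(a) & buckets_of(b)
        if d <= d1 and not shared:
            return False                    # a close pair was separated
        if d >= d2 and shared:
            return False                    # a far pair collided
    return True

# Candidate bucketing: all strings obtained by deleting one character.
buckets_of = lambda s: {s[:i] + s[i + 1:] for i in range(len(s))}
print(is_lsb(buckets_of, ["ACGT", "ACGA", "TTTT"], d1=1, d2=3))   # True
```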
  5. Circular RNA (circRNA) is a class of RNA molecules that form a closed loop, with their 5′ and 3′ ends covalently bonded. CircRNAs are known to be more stable than linear RNAs, have distinct properties and functions, and are promising biomarkers. Existing methods for assembling circRNAs rely heavily on annotated transcriptomes, and hence exhibit unsatisfactory accuracy when a high-quality transcriptome is unavailable. We present TERRACE, a new algorithm for full-length assembly of circRNAs from paired-end total RNA-seq data. TERRACE uses the splice graph as the underlying data structure to organize the splicing and coverage information. We transform the problem of assembling circRNAs into finding paths that “bridge” the three fragments in the splice graph induced by back-spliced reads. We adopt a definition for optimal bridging paths and a dynamic programming algorithm to calculate such optimal paths. TERRACE features an efficient algorithm to detect back-spliced reads missed by RNA-seq aligners, contributing to its much-improved sensitivity. It also incorporates a new machine-learning approach, trained to assign a confidence score to each assembled circRNA, which is shown to be superior to using abundance for scoring. On both simulated and biological datasets, TERRACE consistently outperforms existing methods by a large margin in sensitivity while achieving better or comparable precision. In particular, when annotations are not provided, TERRACE assembles 123%–413% more correct circRNAs than state-of-the-art methods. TERRACE presents a significant advance in assembling full-length circRNAs from RNA-seq data, and we expect it to be widely used in future research on circRNAs.
  6. Pissis, Solon P; Sung, Wing-Kin (Ed.)
    Modern sequencing technologies allow for the addition of short-sequence tags, known as anchors, to both ends of a captured molecule. Anchors are useful in assembling the full-length sequence of a captured molecule as they can be used to accurately determine its endpoints. One representative anchor-enabled technology is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol. LoopSeq Solo also achieves ultra-high sequencing depth and high purity of short reads covering the entire captured molecule. Despite the availability of many assembly methods, constructing full-length sequences from such anchor-enabled, ultra-high-coverage sequencing data remains challenging due to the complexity of the underlying assembly graphs and the lack of specific algorithms leveraging anchors. We present Anchorage, a novel assembler that performs anchor-guided assembly for ultra-high-depth sequencing data. Anchorage starts with a kmer-based approach for precise estimation of molecule lengths. It then formulates the assembly problem as finding an optimal path that connects the two nodes determined by anchors in the underlying compact de Bruijn graph. The optimality is defined as maximizing the weight of the smallest node while matching the estimated sequence length. Anchorage uses a modified dynamic programming algorithm to efficiently find the optimal path. Through both simulations and real data, we show that Anchorage outperforms existing assembly methods, particularly in the presence of sequencing artifacts. Anchorage fills the gap in assembling anchor-enabled data. We anticipate its broad use as anchor-enabled sequencing technologies become prevalent. Anchorage is freely available at https://github.com/Shao-Group/anchorage; the scripts and documents that can reproduce all experiments in this manuscript are available at https://github.com/Shao-Group/anchorage-test.
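     The stated optimality criterion, maximizing the weight of the smallest node on the path while matching the estimated length, admits a clean dynamic program. The sketch below illustrates the idea on a toy DAG with invented node lengths and weights; it is not Anchorage's implementation.

```python
# Among s->t paths whose node lengths sum to a target, maximize the weight
# of the smallest node (the bottleneck). Toy graph; values are invented.
from functools import lru_cache

# node -> (length contribution, coverage weight), plus out-edges.
nodes = {"s": (3, 9), "a": (2, 7), "b": (2, 4), "t": (3, 8)}
edges = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}

@lru_cache(maxsize=None)
def best(v: str, remaining: int) -> float:
    """Best bottleneck over v->t paths using exactly `remaining` length."""
    length, weight = nodes[v]
    if remaining < length:
        return float("-inf")                 # overshoots the target length
    if v == "t":
        return weight if remaining == length else float("-inf")
    return max((min(weight, best(u, remaining - length)) for u in edges[v]),
               default=float("-inf"))

# Path s->a->t (lengths 3+2+3 = 8) wins with bottleneck min(9, 7, 8) = 7,
# beating s->b->t whose bottleneck is only 4.
print(best("s", 8))
```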
  7. Robinson-Rechavi, Marc (Ed.)
    Transcript annotations play a critical role in gene expression analysis as they serve as a reference for quantifying isoform-level expression. The two main sources of annotations are RefSeq and Ensembl/GENCODE, but discrepancies between their methodologies and information resources can lead to significant differences. It has been demonstrated that the choice of annotation can have a significant impact on gene expression analysis. Furthermore, transcript assembly is closely linked to annotations: assembling the large volume of available RNA-seq data is an effective data-driven way to construct annotations, and annotations often serve as benchmarks to evaluate the accuracy of assembly methods. However, the influence of different annotations on transcript assembly is not yet fully understood. We investigate the impact of annotations on transcript assembly. Surprisingly, we observe that opposite conclusions can arise when evaluating assemblers with different annotations. To understand this striking phenomenon, we compare the structural similarity of annotations at various levels and find that the primary structural difference across annotations occurs at the intron-chain level. Next, we examine the biotypes of annotated and assembled transcripts and uncover a significant bias towards annotating and assembling transcripts with intron retentions, which explains the contradictory conclusions above. We develop a standalone tool, available at https://github.com/Shao-Group/irtool, that can be combined with an assembler to generate an assembly without intron retentions. We evaluate the performance of such a pipeline and offer guidance for selecting appropriate assembly tools for different application scenarios.
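     To make the intron-chain level concrete, here is a small sketch with invented coordinates: transcripts are reduced to their ordered introns, so two transcripts with different start/end sites but identical introns still match.

```python
# Intron-chain-level comparison of two annotations (coordinates invented).
# Two transcripts match iff their ordered intron coordinates are identical.
def intron_chain(exons):
    """exons: sorted list of (start, end); introns are the gaps between."""
    return tuple((e1[1], e2[0]) for e1, e2 in zip(exons, exons[1:]))

annot_a = {intron_chain([(100, 200), (300, 400), (500, 650)]),
           intron_chain([(100, 200), (500, 650)])}
annot_b = {intron_chain([(110, 200), (300, 400), (500, 640)]),  # same introns,
           intron_chain([(100, 250), (500, 650)])}              # different

shared = annot_a & annot_b
print(len(shared), "of", len(annot_a), "intron chains shared")  # 1 of 2
```

     Note how the first transcripts of the two annotations match despite different transcript boundaries; that tolerance is exactly why structural comparison at the intron-chain level is informative.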
  8. The high-throughput short-read RNA-seq protocols often produce paired-end reads, with the middle portion of the fragments being unsequenced. We explore whether the full-length fragments can be computationally reconstructed from the two sequenced ends in the absence of a reference genome, a problem we refer to here as de novo bridging. Solving this problem provides longer, more informative RNA-seq reads, and benefits downstream RNA-seq analyses such as transcript assembly, expression quantification, and differential splicing analysis. However, de novo bridging is a challenging and complicated task owing to alternative splicing, transcript noise, and sequencing errors. It remains unclear whether the data provides sufficient information for accurate bridging, let alone whether efficient algorithms exist that determine the true bridges. Methods have been proposed to bridge paired-end reads in the presence of a reference genome (called reference-based bridging), but those algorithms are far from scaling to de novo bridging, as the underlying compacted de Bruijn graph (cdBG) used in the latter task often contains millions of vertices and edges. We designed a new truncated Dijkstra's algorithm for this problem, and proposed a novel algorithm that reuses the shortest path tree to avoid running the truncated Dijkstra's algorithm from scratch for all vertices, yielding a further speedup. These innovative techniques result in scalable algorithms that can bridge all paired-end reads in a cdBG with millions of vertices. Our experiments showed that paired-end RNA-seq reads can be accurately bridged to a large extent. The resulting tool is freely available at https://github.com/Shao-Group/rnabridge-denovo.
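     A minimal sketch of the truncated-Dijkstra idea, assuming a simple adjacency-list graph: standard Dijkstra that discards any path exceeding a distance cap, so only a bounded neighborhood of the source is ever explored. The shortest-path-tree reuse across start vertices is the paper's further contribution and is not shown here.

```python
# Dijkstra truncated at a distance cap: paths longer than `cap` are pruned,
# so the search explores only a small neighborhood of the source.
import heapq

def truncated_dijkstra(graph, source, cap):
    """graph: node -> list of (neighbor, weight). Distances within cap."""
    dist, heap = {source: 0}, [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue                          # stale heap entry
        for u, w in graph.get(v, []):
            nd = d + w
            if nd <= cap and nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist

g = {"a": [("b", 2), ("c", 5)], "b": [("c", 1), ("d", 9)], "c": [("d", 2)]}
print(truncated_dijkstra(g, "a", cap=6))  # {'a': 0, 'b': 2, 'c': 3, 'd': 5}
```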
  9. In computational biology, k-mers and edit distance are fundamental concepts. However, little is known about the metric space of all k-mers equipped with the edit distance. In this work, we explore the structure of the k-mer space by studying its maximal independent sets (MISs). An MIS is a sparse sketch of all k-mers with nice theoretical properties, and therefore admits critical applications in clustering, indexing, hashing, and sketching large-scale sequencing data, particularly data with high error rates. Finding an MIS is a challenging problem, as the size of a k-mer space grows geometrically with respect to k. We propose three algorithms for this problem. The first and most intuitive one uses a greedy strategy. The second method implements two techniques to avoid redundant comparisons by taking advantage of the locality property of the k-mer space and estimated bounds on the edit distance. The last algorithm avoids expensive calculations of the edit distance by translating the edit distance into the shortest path in a specifically designed graph. These algorithms are implemented, and the calculated MISs of k-mer spaces, together with their statistical properties, are reported and analyzed for k up to 15. Source code is freely available at https://github.com/Shao-Group/kmerspace.
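     The greedy strategy (the first of the three algorithms) is straightforward to sketch on a tiny k-mer space; the edit-distance threshold d below is illustrative.

```python
# Greedy maximal independent set of the k-mer space: scan all 4^k k-mers and
# keep one whenever it is not within edit distance d of any k-mer kept so far.
from itertools import product

def edit_distance(a, b):
    dp = list(range(len(b) + 1))            # rolling-array Levenshtein
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def greedy_mis(k, d):
    mis = []
    for kmer in ("".join(p) for p in product("ACGT", repeat=k)):
        if all(edit_distance(kmer, m) > d for m in mis):
            mis.append(kmer)                # independent of all kept k-mers
    return mis

print(len(greedy_mis(4, 1)))   # size of a greedy MIS of the 4-mer space
```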
  10. Boucher, Christina; Rahmann, Sven (Ed.)
    Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rate, but encounter much reduced sensitivity on data with high error rate. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be (d₁, d₂)-sensitive if any two sequences within an edit distance of d₁ are mapped into at least one shared bucket, and any two sequences with distance at least d₂ are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of (d₁,d₂) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. These results provide theoretical foundations for their practical use in analyzing sequences with high error rate while also providing insights for the hardness of designing ungapped LSH functions. 