The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files since most existing tools are either general-purpose or specialized for short read data. We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring, which focuses only on the base sequences in the FASTQ file, uses just 0.35–0.65 bits per base which is 3–6
Direct-sequencing technologies, such as Oxford Nanopore’s, are delivering long RNA reads with great efficacy and convenience. These technologies afford an ability to detect post-transcriptional modifications at a single-molecule resolution, promising new insights into the functional roles of RNA. However, realizing this potential requires new tools to analyze and explore this type of data.
Here, we present Sequoia, a visual analytics tool that allows users to interactively explore nanopore sequences. Sequoia combines a Python-based backend with a multi-view visualization interface, enabling users to import raw nanopore sequencing data in a Fast5 format, cluster sequences based on electric-current similarities, and drill-down onto signals to identify properties of interest. We demonstrate the application of Sequoia by generating and analyzing ~ 500k reads from direct RNA sequencing data of human HeLa cell line. We focus on comparing signal features from m6A and m5C RNA modifications as the first step towards building automated classifiers. We show how, through iterative visual exploration and tuning of dimensionality reduction parameters, we can separate modified RNA sequences from their unmodified counterparts. We also document new, qualitative signal signatures that characterize these modifications from otherwise normal RNA bases, which we were able to discover from the visualization.
Sequoia’s interactive features complement existing computational approaches in nanopore-based RNA workflows. The insights gleaned through visual analysis should help users in developing rationales, hypotheses, and insights into the dynamic nature of RNA. Sequoia is available at
- NSF-PAR ID:
- 10272949
- Publisher / Repository:
- Springer Science + Business Media
- Date Published:
- Journal Name:
- BMC Genomics
- Volume:
- 22
- Issue:
- 1
- ISSN:
- 1471-2164
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract lower than general purpose compressors like gzip. NanoSpring is competitive in compression ratio and compression resource usage with the state-of-the-art tool CoLoRd while being significantly faster at decompression when using multiple threads (> 4$$\times$$ faster decompression with 20 threads). NanoSpring is available on GitHub at$$\times$$ https://github.com/qm2/NanoSpring . -
Abstract Background The pan-genome of a species is the union of the genes and non-coding sequences present in all individuals (cultivar, accessions, or strains) within that species.
Results Here we introduce PGV, a reference-agnostic representation of the pan-genome of a species based on the notion of consensus ordering. Our experimental results demonstrate that PGV enables an intuitive, effective and interactive visualization of a pan-genome by providing a genome browser that can elucidate complex structural genomic variations.
Conclusions The PGV software can be installed via conda or downloaded from
https://github.com/ucrbioinfo/PGV . The companion PGV browser athttp://pgv.cs.ucr.edu can be tested using example bed tracks available from the GitHub page. -
Abstract Background Differential correlation networks are increasingly used to delineate changes in interactions among biomolecules. They characterize differences between omics networks under two different conditions, and can be used to delineate mechanisms of disease initiation and progression.
Results We present a new R package, , that facilitates the estimation and visualization of differential correlation networks using multiple correlation measures and inference methods. The software is implemented in , and , and is available at
https://github.com/sqyu/CorDiffViz . Visualization has been tested for the Chrome and Firefox web browsers. A demo is available athttps://diffcornet.github.io/CorDiffViz/demo.html .Conclusions Our software offers considerable flexibility by allowing the user to interact with the visualization and choose from different estimation methods and visualizations. It also allows the user to easily toggle between correlation networks for samples under one condition and differential correlations between samples under two conditions. Moreover, the software facilitates integrative analysis of cross-correlation networks between two omics data sets.
-
Abstract Background Fusion of RNA-binding proteins (RBPs) to RNA base-editing enzymes (such as APOBEC1 or ADAR) has emerged as a powerful tool for the discovery of RBP binding sites. However, current methods that analyze sequencing data from RNA-base editing experiments are vulnerable to false positives due to off-target editing, genetic variation and sequencing errors.
Results We present FLagging Areas of RNA-editing Enrichment (FLARE), a Snakemake-based pipeline that builds on the outputs of the SAILOR edit site discovery tool to identify regions statistically enriched for RNA editing. FLARE can be configured to analyze any type of RNA editing, including C to U and A to I. We applied FLARE to C-to-U editing data from a RBFOX2-APOBEC1 STAMP experiment, to show that our approach attains high specificity for detecting RBFOX2 binding sites. We also applied FLARE to detect regions of exogenously introduced as well as endogenous A-to-I editing.
Conclusions FLARE is a fast and flexible workflow that identifies significantly edited regions from RNA-seq data. The FLARE codebase is available at
https://github.com/YeoLab/FLARE . -
Abstract Although plastid genome (plastome) structure is highly conserved across most seed plants, investigations during the past two decades have revealed several disparately related lineages that experienced substantial rearrangements. Most plastomes contain a large inverted repeat and two single‐copy regions, and a few dispersed repeats; however, the plastomes of some taxa harbour long repeat sequences (>300 bp). These long repeats make it challenging to assemble complete plastomes using short‐read data, leading to misassemblies and consensus sequences with spurious rearrangements. Single‐molecule, long‐read sequencing has the potential to overcome these challenges, yet there is no consensus on the most effective method for accurately assembling plastomes using long‐read data. We generated a pipeline,
p last idG enomeA ssemblyU singL ong‐read data (ptGAUL), to address the problem of plastome assembly using long‐read data from Oxford Nanopore Technologies (ONT) or Pacific Biosciences platforms. We demonstrated the efficacy of the ptGAUL pipeline using 16 published long‐read data sets. We showed that ptGAUL quickly produces accurate and unbiased assemblies using only ~50× coverage of plastome data. Additionally, we deployed ptGAUL to assemble four newJuncus (Juncaceae) plastomes using ONT long reads. Our results revealed many long repeats and rearrangements inJuncus plastomes compared with basal lineages of Poales. The ptGAUL pipeline is available on GitHub:https://github.com/Bean061/ptgaul .