Title: VariantFoldRNA: a flexible, containerized, and scalable pipeline for genome-wide riboSNitch prediction
Abstract Single nucleotide polymorphisms (SNPs) can alter RNA structure by changing the proportions of existing conformations or leading to new conformations in the structural ensemble. Such structure-changing SNPs, or riboSNitches, have been associated with diseases in humans and climate adaptation in plants. While several computational tools are available for predicting whether an SNP is a riboSNitch, these tools were generally developed to analyze individual RNAs and are not optimized for genome-wide analyses. To fill this gap, we developed VariantFoldRNA, a flexible, containerized, and automated pipeline for genome-wide prediction of riboSNitches. Our pipeline automatically installs all dependencies, can be run locally or on high-performance clusters, and is modular, enabling the user to customize the analysis for the research question of interest. VariantFoldRNA can predict riboSNitches genome-wide at user-specified temperatures and splicing conditions, opening the door to novel analyses. The pipeline is an open-source command-line tool that is freely available at https://github.com/The-Bevilacqua-Lab/variantfoldrna.
Award ID(s):
2122357
PAR ID:
10594489
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
NAR Genomics and Bioinformatics
Volume:
7
Issue:
2
ISSN:
2631-9268
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Luigi Martelli, Pier (Ed.)
    Abstract Motivation: Advanced publicly available sequencing data from large populations have enabled informative genome-wide association studies (GWAS) that associate SNPs with phenotypic traits of interest. Many publicly available tools able to perform GWAS have been developed in response to increased demand. However, these tools lack a comprehensive pipeline that includes pre-GWAS analysis, such as outlier removal, data transformation and calculation of Best Linear Unbiased Predictions or Best Linear Unbiased Estimates. Post-GWAS analysis, such as haploblock analysis and candidate gene identification, is also lacking. Results: Here, we present Holistic Analysis with Pre- and Post-Integration (HAPPI) GWAS, an open-source GWAS tool able to perform pre-GWAS, GWAS and post-GWAS analysis in an automated pipeline using the command-line interface. Availability and implementation: HAPPI GWAS is written in R for any Unix-like operating systems and is available on GitHub (https://github.com/Angelovici-Lab/HAPPI.GWAS.git). Supplementary information: Supplementary data are available at Bioinformatics online.
  2. ABSTRACT For most species, transcriptome data are much more readily available than genome data. Without a reference genome, gene calling is cumbersome and inaccurate because of the high degree of redundancy in de novo transcriptome assemblies. To simplify and increase the accuracy of de novo transcriptome assembly in the absence of a reference genome, we developed UnigeneFinder. Combining several clustering methods, UnigeneFinder substantially reduces the redundancy typical of raw transcriptome assemblies. This pipeline offers an effective solution to the problem of inflated transcript numbers, achieving a closer representation of the actual underlying genome. UnigeneFinder performs comparably or better, compared with existing tools, on plant species with varying genome complexities. UnigeneFinder is the only available transcriptome redundancy solution that fully automates the generation of primary transcript, coding region, and protein sequences, analogous to those available for high‐quality reference genomes. These features, coupled with the pipeline’s cross‐platform implementation, focus on automation, and an accessible, user‐friendly interface, make UnigeneFinder a useful tool for many downstream sequence‐based analyses in nonmodel organisms lacking a reference genome, including differential gene expression analysis, accurate ortholog identification, functional enrichments, and evolutionary analyses. UnigeneFinder also runs efficiently both on high‐performance computing (HPC) systems and personal computers, further reducing barriers to use. 
  3. Abstract Understanding how genetic diversity is distributed across spatiotemporal scales in species of conservation or management concern is critical for identifying large‐scale mechanisms affecting local conservation status and implementing large‐scale biodiversity monitoring programmes. However, cross‐scale surveys of genetic diversity are often impractical within single studies, and combining datasets to increase spatiotemporal coverage is frequently impeded by using different sets of molecular markers. Recently developed molecular tools make surveys based on standardized single‐nucleotide polymorphism (SNP) panels more feasible than ever, but require existing genomic information. Here, we conduct the first survey of genome‐wide SNPs across the native range of brook trout (Salvelinus fontinalis), a cold‐adapted species that has been the focus of considerable conservation and management effort across eastern North America. Our dataset can be leveraged to easily design SNP panels that allow datasets to be combined for large‐scale analyses. We performed restriction site‐associated DNA sequencing for wild brook trout from 82 locations spanning much of the native range and domestic brook trout from 24 hatchery strains used in stocking efforts. We identified over 24,000 SNPs distributed throughout the brook trout genome. We explored the ability of these SNPs to resolve relationships across spatial scales, including population structure and hatchery admixture. Our dataset captures a wide spectrum of genetic diversity in native brook trout, offering a valuable resource for developing SNP panels. We highlight potential applications of this resource with the goal of increasing the integration of genomic information into decision‐making for brook trout and other species of conservation or management concern. 
  4. Abstract Motivation: Heritability, the proportion of variation in a trait that can be explained by genetic variation, is an important parameter in efforts to understand the genetic architecture of complex phenotypes as well as in the design and interpretation of genome-wide association studies. Attempts to understand the heritability of complex phenotypes attributable to genome-wide single nucleotide polymorphism (SNP) variation data have motivated the analysis of large datasets as well as the development of sophisticated tools to estimate heritability in these datasets. Linear mixed models (LMMs) have emerged as a key tool for heritability estimation where the parameters of the LMMs, i.e. the variance components, are related to the heritability attributable to the SNPs analyzed. Likelihood-based inference in LMMs, however, poses serious computational burdens. Results: We propose a scalable randomized algorithm for estimating variance components in LMMs. Our method is based on a method-of-moment estimator that has a runtime complexity O(NMB) for N individuals and M SNPs (where B is a parameter that controls the number of random matrix-vector multiplications). Further, by leveraging the structure of the genotype matrix, we can reduce the time complexity to O(NMB / max(log_3 N, log_3 M)). We demonstrate the scalability and accuracy of our method on simulated as well as on empirical data. On standard hardware, our method computes heritability on a dataset of 500 000 individuals and 100 000 SNPs in 38 min. Availability and implementation: The RHE-reg software is made freely available to the research community at: https://github.com/sriramlab/RHE-reg.
  5. Abstract Species living in changing environments require the acclimatization of individual organisms, which may be significantly influenced by allele specific expression (ASE). Data from RNA-seq experiments can be used to identify and quantify the expressed alleles. However, conventional allele matching to the reference genome creates a mapping bias towards the reference allele that prevents a reliable estimation of the allele counts. We developed a pipeline that allows identification and unbiased quantification of the alleles corresponding to an RNA-seq dataset, without any previous knowledge of the haplotype. To achieve the unbiased mapping, we generate two pseudogenomes by substituting the alternative alleles on the reference genome. The SNPs are further called against each pseudogenome, providing two SNP data-sets that are averaged for calculation of the allele depth to be merged in a final SNP calling file. The pipeline presented here can calculate ASE in non-model organisms and can be applied to previous RNA-seq data-sets for expanding studies in gene expression regulation. 
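The substitution step described in item 5 (placing alternative alleles onto the reference genome) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `build_pseudogenome`, the in-memory dictionary of contigs, and the tuple layout of the SNP records are hypothetical, and the abstract does not specify how the two pseudogenomes differ from each other, so only a single substitution pass is shown.

```python
def build_pseudogenome(reference, snps):
    """Return a copy of `reference` with each SNP's alternative allele substituted.

    reference: dict mapping contig name -> sequence string (hypothetical layout)
    snps: iterable of (contig, position, ref_allele, alt_allele), 1-based positions
    """
    # Work on mutable lists so single bases can be replaced in place.
    pseudo = {name: list(seq) for name, seq in reference.items()}
    for contig, position, ref_allele, alt_allele in snps:
        if len(ref_allele) != 1 or len(alt_allele) != 1:
            raise ValueError("only single-nucleotide substitutions are handled")
        index = position - 1  # convert the 1-based coordinate to a 0-based index
        observed = pseudo[contig][index]
        # Guard against coordinate or strand mix-ups before substituting.
        if observed.upper() != ref_allele.upper():
            raise ValueError(
                f"{contig}:{position} expected {ref_allele}, found {observed}"
            )
        pseudo[contig][index] = alt_allele
    return {name: "".join(bases) for name, bases in pseudo.items()}


# Toy input; the sequence and SNP records are made up for illustration.
reference = {"chr1": "ACGTACGT"}
snps = [("chr1", 2, "C", "T"), ("chr1", 8, "T", "G")]
pseudogenome = build_pseudogenome(reference, snps)
```

Checking the reference allele before substituting is a design choice of this sketch: it surfaces off-by-one coordinate errors early instead of silently producing a wrong pseudogenome.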
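The randomized method-of-moments idea summarized in item 4 can be illustrated with a small pure-Python sketch. It is not the RHE-reg implementation: the abstract gives no equations, so this assumes the standard moment-matching system for a model with covariance sigma_g^2 K + sigma_e^2 I, where K = X X^T / M, and assumes random sign vectors for the trace estimate. The function names and the list-of-rows matrix layout are hypothetical. Each random vector costs two passes over the N x M genotype matrix, which is where an O(NMB) runtime for B vectors comes from.

```python
import random


def xt_times(X, v):
    """Compute X^T v for X stored as N rows of length M."""
    out = [0.0] * len(X[0])
    for row, v_i in zip(X, v):
        for j, x_ij in enumerate(row):
            out[j] += x_ij * v_i
    return out


def x_times(X, w):
    """Compute X w for X stored as N rows of length M."""
    return [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) for row in X]


def moment_estimates(X, y, num_vectors=10, seed=0):
    """Estimate (sigma_g^2, sigma_e^2) by moment matching, never forming K."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    # tr(K) is exact: the squared Frobenius norm of X divided by M.
    tr_k = sum(x * x for row in X for x in row) / m
    # tr(K^2) is estimated as the average of ||K z||^2 over random sign vectors z.
    tr_k2 = 0.0
    for _ in range(num_vectors):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        kz = [value / m for value in x_times(X, xt_times(X, z))]
        tr_k2 += sum(value * value for value in kz)
    tr_k2 /= num_vectors
    # y^T K y = ||X^T y||^2 / M, and y^T y.
    yky = sum(value * value for value in xt_times(X, y)) / m
    yy = sum(value * value for value in y)
    # Solve the 2x2 system:
    #   tr(K^2) * sg + tr(K) * se = y^T K y
    #   tr(K)   * sg + N     * se = y^T y
    det = tr_k2 * n - tr_k * tr_k
    if det == 0:
        raise ValueError("variance components are not identifiable for this X")
    sigma_g2 = (n * yky - tr_k * yy) / det
    sigma_e2 = (tr_k2 * yy - tr_k * yky) / det
    return sigma_g2, sigma_e2


# Toy input chosen so that K is diagonal; the values are made up for illustration.
X = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
y = [1.0, 2.0, 3.0]
sigma_g2, sigma_e2 = moment_estimates(X, y)
```

On real data the trace estimate is noisy, and a larger number of random vectors reduces its variance at proportional cost. The further speed-up mentioned in the abstract, which exploits the structure of the genotype matrix, is not reproduced here.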