Title: RCPA: An Open‐Source R Package for Data Processing, Differential Analysis, Consensus Pathway Analysis, and Visualization
Abstract Identifying impacted pathways is important because it provides insights into the biology underlying conditions beyond the detection of differentially expressed genes. Because of the importance of such analysis, more than 100 pathway analysis methods have been developed thus far. Despite the availability of many methods, it is challenging for biomedical researchers to learn and properly perform pathway analysis. First, the sheer number of methods makes it challenging to learn and choose the correct method for a given experiment. Second, computational methods require users to be savvy with coding syntax, and comfortable with command‐line environments, areas that are unfamiliar to most life scientists. Third, as learning tools and computational methods are typically implemented only for a few species (i.e., human and some model organisms), it is difficult to perform pathway analysis on other species that are not included in many of the current pathway analysis tools. Finally, existing pathway tools do not allow researchers to combine, compare, and contrast the results of different methods and experiments for both hypothesis testing and analysis purposes. To address these challenges, we developed an open‐source R package for Consensus Pathway Analysis (RCPA) that allows researchers to conveniently: (1) download and process data from NCBI GEO; (2) perform differential analysis using established techniques developed for both microarray and sequencing data; (3) perform both gene set enrichment, as well as topology‐based pathway analysis using different methods that seek to answer different research hypotheses; (4) combine methods and datasets to find consensus results; and (5) visualize analysis results and explore significantly impacted pathways across multiple analyses. 
This protocol provides many example code snippets with detailed explanations and supports the analysis of more than 1000 species, two pathway databases, three differential analysis techniques, eight pathway analysis tools, six meta‐analysis methods, and two consensus analysis techniques. The package is freely available on the CRAN repository. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC.
Basic Protocol 1: Processing Affymetrix microarrays
Basic Protocol 2: Processing Agilent microarrays
Support Protocol: Processing RNA sequencing (RNA‐Seq) data
Basic Protocol 3: Differential analysis of microarray data (Affymetrix and Agilent)
Basic Protocol 4: Differential analysis of RNA‐Seq data
Basic Protocol 5: Gene set enrichment analysis
Basic Protocol 6: Topology‐based (TB) pathway analysis
Basic Protocol 7: Data integration and visualization
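The abstract mentions combining results from several methods and datasets through meta-analysis. As a rough illustration of that idea only, the sketch below combines per-pathway p-values with Fisher's method, a widely known technique chosen here as an assumption; the abstract does not say which six meta-analysis methods the package implements, and this is not the RCPA API. The function name `fisher_combine`, the pathway names, and the p-values are all hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (illustrative sketch)."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Fisher's statistic is X = -2 * sum(ln p_i); under the null hypothesis it
    # follows a chi-square distribution with 2k degrees of freedom.
    half_x = -sum(math.log(p) for p in p_values)
    # For even degrees of freedom the chi-square survival function has a
    # closed form: P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return min(1.0, math.exp(-half_x) * total)


# Hypothetical p-values for two pathways, each from three separate analyses.
pathway_p = {
    "pathway_A": [0.04, 0.10, 0.03],
    "pathway_B": [0.60, 0.45, 0.80],
}
combined = {name: fisher_combine(ps) for name, ps in pathway_p.items()}

# Rank pathways by combined evidence, smallest combined p-value first.
for name, p in sorted(combined.items(), key=lambda kv: kv[1]):
    print(f"{name}: combined p = {p:.4g}")
```

Fisher's method assumes the input p-values are independent, which may not hold when several methods are run on the same dataset; other combination rules make different trade-offs.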
Award ID(s):
2343019 2203236
PAR ID:
10505348
Publisher / Repository:
Wiley
Journal Name:
Current Protocols
Volume:
4
Issue:
5
ISSN:
2691-1299
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract This manuscript describes the development of a resource module that is part of a learning platform named ‘NIGMS Sandbox for Cloud-based Learning’ (https://github.com/NIGMS/NIGMS-Sandbox). The module delivers learning materials on Cloud-based Consensus Pathway Analysis in an interactive format that uses appropriate cloud resources for data access and analyses. Pathway analysis is important because it allows us to gain insights into biological mechanisms underlying conditions. But the availability of many pathway analysis methods, the requirement of coding skills, and the focus of current tools on only a few species all make it very difficult for biomedical researchers to self-learn and perform pathway analysis efficiently. Furthermore, there is a lack of tools that allow researchers to compare analysis results obtained from different experiments and different analysis methods to find consensus results. To address these challenges, we have designed a cloud-based, self-learning module that provides consensus results among established, state-of-the-art pathway analysis techniques to provide students and researchers with necessary training and example materials. The training module consists of five Jupyter Notebooks that provide complete tutorials for the following tasks: (i) process expression data, (ii) perform differential analysis, visualize and compare the results obtained from four differential analysis methods (limma, t-test, edgeR, DESeq2), (iii) process three pathway databases (GO, KEGG and Reactome), (iv) perform pathway analysis using eight methods (ORA, CAMERA, KS test, Wilcoxon test, FGSEA, GSA, SAFE and PADOG) and (v) combine results of multiple analyses. We also provide examples, source code, explanations and instructional videos for trainees to complete each Jupyter Notebook. The module supports the analysis for many model (e.g. human, mouse, fruit fly, zebra fish) and non-model species. 
The module is publicly available at https://github.com/NIGMS/Consensus-Pathway-Analysis-in-the-Cloud. The overall genesis of the Sandbox is described in the editorial NIGMS Sandbox [1] at the beginning of this Supplement. This module delivers learning materials on the analysis of bulk and single-cell ATAC-seq data in an interactive format that uses appropriate cloud resources for data access and analyses. 
  2. Abstract Background Recent development of bioinformatics tools for Next Generation Sequencing data has facilitated complex analyses and prompted large scale experimental designs for comparative genomics. When combined with the advances in network inference tools, this can lead to powerful methodologies for mining genomics data, allowing development of pipelines that stretch from sequence reads mapping to network inference. However, integrating various methods and tools available over different platforms requires a programmatic framework to fully exploit their analytic capabilities. Integrating multiple genomic analysis tools faces challenges from standardization of input and output formats, normalization of results for performing comparative analyses, to developing intuitive and easy to control scripts and interfaces for the genomic analysis pipeline. Results We describe here NetSeekR, a network analysis R package that includes the capacity to analyze time series of RNA-Seq data, to perform correlation and regulatory network inferences and to use network analysis methods to summarize the results of a comparative genomics study. The software pipeline includes alignment of reads, differential gene expression analysis, correlation network analysis, regulatory network analysis, gene ontology enrichment analysis and network visualization of differentially expressed genes. The implementation provides support for multiple RNA-Seq read mapping methods and allows comparative analysis of the results obtained by different bioinformatics methods. Conclusion Our methodology increases the level of integration of genomics data analysis tools to network inference, facilitating hypothesis building, functional analysis and genomics discovery from large scale NGS data. When combined with network analysis and simulation tools, the pipeline allows for developing systems biology methods using large scale genomics data. 
  3. Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq data sets, as well as the performance of the myriad of methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on the quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies. 
  4. RNA sequencing (RNA-seq) has emerged as a prominent resource for transcriptomic analysis due to its ability to measure gene expression in a highly sensitive and accurate manner. With the increasing availability of RNA-seq data analysis from clinical studies and patient samples, the development of effective visualization tools for RNA-seq analysis has become increasingly important to help clinicians and biomedical researchers better understand the complex patterns of gene expression associated with health and disease. This review aims to outline the current state-of-the-art data visualization techniques and tools commonly used to frame clinical inferences from RNA-seq data and point out their benefits, applications, and limitations. A systematic review of English articles using PubMed, Scopus, Web of Science, and IEEE Xplore databases was performed. Search terms included “RNA-seq”, “visualization”, “plots”, and “clinical”. Only full-text studies reported between 2017 and 2024 were included for analysis. Following PRISMA guidelines, a total of 126 studies were identified, of which 33 studies met the inclusion criteria. We found that 18% of studies have visualization techniques and tools for circular RNA-seq data, 56% for single-cell RNA-seq data, 23% for bulk RNA-seq data, and 3% for long non-coding RNA-seq data. Overall, this review provides a comprehensive overview of the common visualization tools and their potential applications, which is a useful resource for researchers and clinicians interested in using RNA-seq data for various clinical purposes (e.g., diagnosis or prognosis). 
  5. Abstract Promoters and the noncoding sequences that drive their function are fundamental aspects of genes that are critical to their regulation. The transcription preinitiation complex binds and assembles on promoters where it facilitates transcription. The transcription start site (TSS) is located downstream of the promoter sequence and is defined as the location in the genome where polymerase begins transcribing DNA into RNA. Knowing the location of TSSs is useful for annotation of genes, identification of non‐coding sequences important to gene regulation, detection of alternative TSSs, and understanding of 5′ UTR content. Several existing techniques make it possible to accurately identify TSSs, but are often difficult to perform experimentally, require large amounts of input RNA, or are unable to identify a large number of TSSs from a single sample. Many of these protocols take advantage of template switching reverse transcriptases (TSRTs), which reliably place an adaptor at the 5′ end of a first strand synthesis of cDNA. Here, we introduce a protocol that exploits TSRT activity combined with rolling circle amplification to identify TSSs with several unique advantages over existing methods. Sequence adaptors are placed on the 5′ and 3′ end of the full‐length cDNA copy of a transcript. A splint compatible with those adaptors is then used to circularize the full‐length cDNA. Linear DNA containing concatemers of the cDNA are generated using rolling circle amplification, and a sequencing library is formed by fragmenting the concatemers. This protocol is straightforward to execute, requiring limited bench time with relatively stable reagents. Using extremely low amounts of RNA input, this protocol produces large numbers of accurate, deduplicated TSSs genome wide. © 2023 The Authors. Current Protocols published by Wiley Periodicals LLC. 
     Basic Protocol 1: Splint generation
     Basic Protocol 2: RNA extraction
     Basic Protocol 3: cDNA synthesis
     Basic Protocol 4: cDNA circularization and amplification
     Basic Protocol 5: Library generation