Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
Abstract Background Quantification of gene expression from RNA-seq data is a prerequisite for transcriptome analysis such as differential gene expression analysis and gene co-expression network construction. Individual RNA-seq experiments are larger and combining multiple experiments from sequence repositories can result in datasets with thousands of samples. Processing hundreds to thousands of RNA-seq data can result in challenges related to data management, access to sufficient computational resources, navigation of high-performance computing (HPC) systems, installation of required software dependencies, and reproducibility. Processing of larger and deeper RNA-seq experiments will become more common as sequencing technology matures. Results GEMmaker, is a nf-core compliant, Nextflow workflow, that quantifies gene expression from small to massive RNA-seq datasets. GEMmaker ensures results are highly reproducible through the use of versioned containerized software that can be executed on a single workstation, institutional compute cluster, Kubernetes platform or the cloud. GEMmaker supports popular alignment and quantification tools providing results in raw and normalized formats. GEMmaker is unique in that it can scale to process thousands of local or remote stored samples without exceeding available data storage. Conclusions Workflows that quantify gene expression are not new, and many already address issues of portability, reusability, and scale in terms of access to CPUs. GEMmaker provides these benefits and adds the ability to scale despite low data storage infrastructure. This allows users to process hundreds to thousands of RNA-seq samples even when data storage resources are limited. GEMmaker is freely available and fully documented with step-by-step setup and execution instructions.more » « less
Abstract Gene co-expression networks (GCNs) provide multiple benefits to molecular research including hypothesis generation and biomarker discovery. Transcriptome profiles serve as input for GCN construction and are derived from increasingly larger studies with samples across multiple experimental conditions, treatments, time points, genotypes, etc. Such experiments with larger numbers of variables confound discovery of true network edges, exclude edges and inhibit discovery of context (or condition) specific network edges. To demonstrate this problem, a 475-sample dataset is used to show that up to 97% of GCN edges can be misleading because correlations are false or incorrect. False and incorrect correlations can occur when tests are applied without ensuring assumptions are met, and pairwise gene expression may not meet test assumptions if the expression of at least one gene in the pairwise comparison is a function of multiple confounding variables. The ‘one-size-fits-all’ approach to GCN construction is therefore problematic for large, multivariable datasets. Recently, the Knowledge Independent Network Construction toolkit has been used in multiple studies to provide a dynamic approach to GCN construction that ensures statistical tests meet assumptions and confounding variables are addressed. Additionally, it can associate experimental context for each edge of the network resulting in context-specific GCNs (csGCNs). To help researchers recognize such challenges in GCN construction, and the creation of csGCNs, we provide a review of the workflow.more » « less