skip to main content


Title: Life science data analysis workflow development using the bioextract server leveraging the iPlant collaborative cyberinfrastructure
Summary

In order to handle the vast quantities of biological data gener6ated by high‐throughput experimental technologies, the BioExtract Server (bioextract.org) has leveraged iPlant Collaborative (www.iplantcollaborative.org) functionality to help address big data storage and analysis issues in the bioinformatics field. The BioExtract Server is a Web‐based, workflow‐enabling system that offers researchers a flexible environment for analyzing genomic data. It provides researchers with the ability to save a series of BioExtract Server tasks (e.g., query a data source, save a data extract, and execute an analytic tool) as a workflow and the opportunity for researchers to share their data extracts, analytic tools, and workflows with collaborators. The iPlant Collaborative is a community of researchers, educators, and students working to enrich science through the development of cyberinfrastructure—the physical computing resources, collaborative environment, virtual machine resources, and interoperable analysis software and data services—that are essential components of modern biology. The iPlant AGAVE Advanced Programming Interface, developed through the iPlant Collaborative, is a hosted, Software‐as‐a‐Service resource providing access to a collection of high performance computing and cloud resources. Leveraging AGAVE, the BioExtract Server gives researchers easy access to multiple high performance computers and delivers computation and storage as dynamically allocated resources via the Internet. © 2014 The Authors.Concurrency and Computation: Practice and Experiencepublished by John Wiley & Sons Ltd.

 
more » « less
NSF-PAR ID:
10196597
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Concurrency and Computation: Practice and Experience
Volume:
27
Issue:
2
ISSN:
1532-0626
Page Range / eLocation ID:
p. 408-419
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    The Molecular Sciences Software Institute's (MolSSI) Quantum Chemistry Archive (QCArchive) project is an umbrella name that covers both a central server hosted by MolSSI for community data and the Python‐based software infrastructure that powers automated computation and storage of quantum chemistry (QC) results. The MolSSI‐hosted central server provides the computational molecular sciences community a location to freely access tens of millions of QC computations for machine learning, methodology assessment, force‐field fitting, and more through a Python interface. Facile, user‐friendly mining of the centrally archived quantum chemical data also can be achieved through web applications found athttps://qcarchive.molssi.org. The software infrastructure can be used as a standalone platform to compute, structure, and distribute hundreds of millions of QC computations for individuals or groups of researchers at any scale. The QCArchiveInfrastructureis open‐source (BSD‐3C), code repositories can be found athttps://github.com/MolSSI, and releases can be downloaded via PyPI and Conda.

    This article is categorized under:

    Electronic Structure Theory > Ab Initio Electronic Structure Methods

    Software > Quantum Chemistry

    Data Science > Computer Algorithms and Programming

     
    more » « less
  2. Introduction

    Big graphs like social network user interactions and customer rating matrices require significant computing resources to maintain. Data owners are now using public cloud resources for storage and computing elasticity. However, existing solutions do not fully address the privacy and ownership protection needs of the key involved parties: data contributors and the data owner who collects data from contributors.

    Methods

    We propose a Trusted Execution Environment (TEE) based solution: TEE-Graph for graph spectral analysis of outsourced graphs in the cloud. TEEs are new CPU features that can enable much more efficient confidential computing solutions than traditional software-based cryptographic ones. Our approach has several unique contributions compared to existing confidential graph analysis approaches. (1) It utilizes the unique TEE properties to ensure contributors' new privacy needs, e.g., the right of revocation for shared data. (2) It implements efficient access-pattern protection with a differentially private data encoding method. And (3) it implements TEE-based special analysis algorithms: the Lanczos method and the Nystrom method for efficiently handling big graphs and protecting confidentiality from compromised cloud providers.

    Results

    The TEE-Graph approach is much more efficient than software crypto approaches and also immune to access-pattern-based attacks. Compared with the best-known software crypto approach for graph spectral analysis, PrivateGraph, we have seen that TEE-Graph has 103−105times lower computation, storage, and communication costs. Furthermore, the proposed access-pattern protection method incurs only about 10%-25% of the overall computation cost.

    Discussion

    Our experimentation showed that TEE-Graph performs significantly better and has lower costs than typical software approaches. It also addresses the unique ownership and access-pattern issues that other TEE-related graph analytics approaches have not sufficiently studied. The proposed approach can be extended to other graph analytics problems with strong ownership and access-pattern protection.

     
    more » « less
  3. Abstract

    Standardized identification of genotypes is necessary in animals that reproduce asexually and form large clonal populations such as coral. We developed a high-resolution hybridization-based genotype array coupled with an analysis workflow and database for the most speciose genus of coral,Acropora, and their symbionts. We designed the array to co-analyze host and symbionts based on bi-allelic single nucleotide polymorphisms (SNP) markers identified from genomic data of the two CaribbeanAcroporaspecies as well as their dominant dinoflagellate symbiont,Symbiodinium ‘fitti’.SNPs were selected to resolve multi-locus genotypes of host (called genets) and symbionts (called strains), distinguish host populations and determine ancestry of coral hybrids between Caribbean acroporids. Pacific acroporids can also be genotyped using a subset of the SNP loci and additional markers enable the detection of symbionts belonging to the generaBreviolum, Cladocopium, andDurusdinium. Analytic tools to produce multi-locus genotypes of hosts based on these SNP markers were combined in a workflow called theStandardTools forAcroporidGenotyping (STAG). The STAG workflow and database are contained within a customized Galaxy environment (https://coralsnp.science.psu.edu/galaxy/), which allows for consistent identification of host genet and symbiont strains and serves as a template for the development of arrays for additional coral genera. STAG data can be used to track temporal and spatial changes of sampled genets necessary for restoration planning and can be applied to downstream genomic analyses. Using STAG, we uncover bi-directional hybridization between and population structure within Caribbean acroporids and detect a cryptic Acroporid species in the Pacific.

     
    more » « less
  4. Abstract Background Quantification of gene expression from RNA-seq data is a prerequisite for transcriptome analysis such as differential gene expression analysis and gene co-expression network construction. Individual RNA-seq experiments are larger and combining multiple experiments from sequence repositories can result in datasets with thousands of samples. Processing hundreds to thousands of RNA-seq data can result in challenges related to data management, access to sufficient computational resources, navigation of high-performance computing (HPC) systems, installation of required software dependencies, and reproducibility. Processing of larger and deeper RNA-seq experiments will become more common as sequencing technology matures. Results GEMmaker, is a nf-core compliant, Nextflow workflow, that quantifies gene expression from small to massive RNA-seq datasets. GEMmaker ensures results are highly reproducible through the use of versioned containerized software that can be executed on a single workstation, institutional compute cluster, Kubernetes platform or the cloud. GEMmaker supports popular alignment and quantification tools providing results in raw and normalized formats. GEMmaker is unique in that it can scale to process thousands of local or remote stored samples without exceeding available data storage. Conclusions Workflows that quantify gene expression are not new, and many already address issues of portability, reusability, and scale in terms of access to CPUs. GEMmaker provides these benefits and adds the ability to scale despite low data storage infrastructure. This allows users to process hundreds to thousands of RNA-seq samples even when data storage resources are limited. GEMmaker is freely available and fully documented with step-by-step setup and execution instructions. 
    more » « less
  5. Workflow management systems (WMSs) are commonly used to organize/automate sequences of tasks as workflows to accelerate scientific discoveries. During complex workflow modeling, a local interactive workflow environment is desirable, as users usually rely on their rich, local environments for fast prototyping and refinements before they consider using more powerful computing resources. However, existing WMSs do not simultaneously support local interactive workflow environments and HPC resources. In this paper, we present an on-demand access mechanism to remote HPC resources from desktop/laptopbased workflow management software to compose, monitor and analyze scientific workflows in the CyberWater project. Cyber- Water is an open-data and open-modeling software framework for environmental and water communities. In this work, we extend the open-model, open-data design of CyberWater with on-demand HPC accessing capacity. In particular, we design and implement the LaunchAgent library, which can be integrated into the local desktop environment to allow on-demand usage of remote resources for hydrology-related workflows. LaunchAgent manages authentication to remote resources, prepares the computationally-intensive or data-intensive tasks as batch jobs, submits jobs to remote resources, and monitors the quality of services for the users. LaunchAgent interacts seamlessly with other existing components in CyberWater, which is now able to provide advantages of both feature-rich desktop software experience and increased computation power through on-demand HPC/Cloud usage. In our evaluations, we demonstrate how a hydrology workflow that consists of both local and remote tasks can be constructed and show that the added on-demand HPC/Cloud usage helps speeding up hydrology workflows while allowing intuitive workflow configurations and execution using a desktop graphical user interface. 
    more » « less