Smith, Daniel G. A.; Altarawy, Doaa; Burns, Lori A.; Welborn, Matthew; Naden, Levi N.; Ward, Logan; Ellis, Sam; Pritchard, Benjamin P.; Crawford, T. Daniel (WIREs Computational Molecular Science)
Abstract
The Molecular Sciences Software Institute's (MolSSI) Quantum Chemistry Archive (QCArchive) project is an umbrella name that covers both a central server hosted by MolSSI for community data and the Python‐based software infrastructure that powers automated computation and storage of quantum chemistry (QC) results. The MolSSI‐hosted central server provides the computational molecular sciences community a location to freely access tens of millions of QC computations for machine learning, methodology assessment, force‐field fitting, and more through a Python interface. Facile, user‐friendly mining of the centrally archived quantum chemical data can also be achieved through web applications found at https://qcarchive.molssi.org. The software infrastructure can be used as a standalone platform to compute, structure, and distribute hundreds of millions of QC computations for individuals or groups of researchers at any scale. The QCArchive Infrastructure is open‐source (BSD‐3C), code repositories can be found at https://github.com/MolSSI, and releases can be downloaded via PyPI and Conda.
This article is categorized under:
Electronic Structure Theory > Ab Initio Electronic Structure Methods
Software > Quantum Chemistry
Data Science > Computer Algorithms and Programming
Big graphs like social network user interactions and customer rating matrices require significant computing resources to maintain. Data owners are now using public cloud resources for storage and computing elasticity. However, existing solutions do not fully address the privacy and ownership protection needs of the key involved parties: data contributors and the data owner who collects data from contributors.
Methods
We propose a Trusted Execution Environment (TEE) based solution: TEE-Graph for graph spectral analysis of outsourced graphs in the cloud. TEEs are new CPU features that can enable much more efficient confidential computing solutions than traditional software-based cryptographic ones. Our approach has several unique contributions compared to existing confidential graph analysis approaches. (1) It utilizes the unique TEE properties to ensure contributors' new privacy needs, e.g., the right of revocation for shared data. (2) It implements efficient access-pattern protection with a differentially private data encoding method. And (3) it implements TEE-based special analysis algorithms: the Lanczos method and the Nystrom method for efficiently handling big graphs and protecting confidentiality from compromised cloud providers.
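The Lanczos method named above can be illustrated with a minimal sketch in plain NumPy. This is not the TEE-Graph implementation: it runs outside any enclave, omits the access-pattern protection, and the function name `lanczos_eigenvalues`, the random start vector, and the use of full reorthogonalization are assumptions made for illustration only. It shows how a few matrix-vector products reduce a symmetric graph matrix to a small tridiagonal matrix whose eigenvalues (Ritz values) approximate the extreme eigenvalues of the original.

```python
import numpy as np

def lanczos_eigenvalues(A, k, seed=0):
    """Approximate eigenvalues of a symmetric matrix A with k Lanczos steps.

    Illustrative only: uses full reorthogonalization for numerical
    stability, which is simple but not the cheapest variant.
    """
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)

    Q = np.zeros((n, k))        # Lanczos basis vectors
    alphas = np.zeros(k)        # diagonal of the tridiagonal matrix
    betas = np.zeros(k - 1)     # off-diagonal of the tridiagonal matrix
    q_prev = np.zeros(n)
    beta = 0.0

    for j in range(k):
        Q[:, j] = q
        w = A @ q - beta * q_prev
        alphas[j] = q @ w
        w -= alphas[j] * q
        # Remove any remaining components along earlier basis vectors.
        w -= Q[:, : j + 1] @ (Q[:, : j + 1].T @ w)
        beta = np.linalg.norm(w)
        if j == k - 1 or beta < 1e-12:
            steps = j + 1       # stop at k steps or on an invariant subspace
            break
        betas[j] = beta
        q_prev, q = q, w / beta

    off = betas[: steps - 1]
    T = np.diag(alphas[:steps]) + np.diag(off, 1) + np.diag(off, -1)
    return np.linalg.eigvalsh(T)  # Ritz values, ascending


if __name__ == "__main__":
    # Small random undirected graph as a stand-in for a "big graph".
    rng = np.random.default_rng(1)
    upper = np.triu(rng.random((40, 40)) < 0.3, 1).astype(float)
    adjacency = upper + upper.T
    print("largest Ritz value:", lanczos_eigenvalues(adjacency, 20).max())
    print("largest eigenvalue:", np.linalg.eigvalsh(adjacency).max())
```

Only the products `A @ q` touch the full matrix, which is why the method suits large sparse graphs; the eigenvalue solve itself happens on the small `k`-by-`k` tridiagonal matrix.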
Results
The TEE-Graph approach is much more efficient than software crypto approaches and also immune to access-pattern-based attacks. Compared with the best-known software crypto approach for graph spectral analysis, PrivateGraph, we have seen that TEE-Graph has 10^3 to 10^5 times lower computation, storage, and communication costs. Furthermore, the proposed access-pattern protection method incurs only about 10%-25% of the overall computation cost.
Discussion
Our experimentation showed that TEE-Graph performs significantly better and has lower costs than typical software approaches. It also addresses the unique ownership and access-pattern issues that other TEE-related graph analytics approaches have not sufficiently studied. The proposed approach can be extended to other graph analytics problems with strong ownership and access-pattern protection.
Kitchen, S. A.; Von Kuster, G.; Kuntz, K. L. Vasquez; Reich, H. G.; Miller, W.; Griffin, S.; Fogarty, Nicole D.; Baums, I. B. (Scientific Reports)
Abstract
Standardized identification of genotypes is necessary in animals that reproduce asexually and form large clonal populations such as coral. We developed a high-resolution hybridization-based genotype array coupled with an analysis workflow and database for the most speciose genus of coral, Acropora, and their symbionts. We designed the array to co-analyze host and symbionts based on bi-allelic single nucleotide polymorphism (SNP) markers identified from genomic data of the two Caribbean Acropora species as well as their dominant dinoflagellate symbiont, Symbiodinium ‘fitti’. SNPs were selected to resolve multi-locus genotypes of host (called genets) and symbionts (called strains), distinguish host populations and determine ancestry of coral hybrids between Caribbean acroporids. Pacific acroporids can also be genotyped using a subset of the SNP loci and additional markers enable the detection of symbionts belonging to the genera Breviolum, Cladocopium, and Durusdinium. Analytic tools to produce multi-locus genotypes of hosts based on these SNP markers were combined in a workflow called the Standard Tools for Acroporid Genotyping (STAG). The STAG workflow and database are contained within a customized Galaxy environment (https://coralsnp.science.psu.edu/galaxy/), which allows for consistent identification of host genet and symbiont strains and serves as a template for the development of arrays for additional coral genera. STAG data can be used to track temporal and spatial changes of sampled genets necessary for restoration planning and can be applied to downstream genomic analyses. Using STAG, we uncover bi-directional hybridization between and population structure within Caribbean acroporids and detect a cryptic Acroporid species in the Pacific.
Hadish, John A.; Biggs, Tyler D.; Shealy, Benjamin T.; Bender, M. Reed; McKnight, Coleman B.; Wytko, Connor; Smith, Melissa C.; Feltus, F. Alex; Honaas, Loren; Ficklin, Stephen P. (BMC Bioinformatics)
Abstract
Background
Quantification of gene expression from RNA-seq data is a prerequisite for transcriptome analysis such as differential gene expression analysis and gene co-expression network construction. Individual RNA-seq experiments are growing larger, and combining multiple experiments from sequence repositories can result in datasets with thousands of samples. Processing hundreds to thousands of RNA-seq samples can result in challenges related to data management, access to sufficient computational resources, navigation of high-performance computing (HPC) systems, installation of required software dependencies, and reproducibility. Processing of larger and deeper RNA-seq experiments will become more common as sequencing technology matures.
Results
GEMmaker is an nf-core compliant Nextflow workflow that quantifies gene expression from small to massive RNA-seq datasets. GEMmaker ensures results are highly reproducible through the use of versioned containerized software that can be executed on a single workstation, institutional compute cluster, Kubernetes platform or the cloud. GEMmaker supports popular alignment and quantification tools, providing results in raw and normalized formats. GEMmaker is unique in that it can scale to process thousands of locally or remotely stored samples without exceeding available data storage.
Conclusions
Workflows that quantify gene expression are not new, and many already address issues of portability, reusability, and scale in terms of access to CPUs. GEMmaker provides these benefits and adds the ability to scale despite limited data storage infrastructure. This allows users to process hundreds to thousands of RNA-seq samples even when data storage resources are limited. GEMmaker is freely available and fully documented with step-by-step setup and execution instructions.
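The abstract does not describe how GEMmaker stays within a storage limit, so the following is only a generic sketch of one way such a constraint can be respected: group samples into batches whose combined size fits a budget, then process and clean up one batch before fetching the next. The function name `batches_within_budget`, the sample IDs, and the sizes are all hypothetical and are not part of GEMmaker.

```python
from typing import Dict, Iterator, List

def batches_within_budget(sizes: Dict[str, int], budget: int) -> Iterator[List[str]]:
    """Yield lists of sample IDs whose combined size never exceeds `budget`.

    `sizes` maps a sample ID to its on-disk size (any consistent unit).
    Samples are taken in the order given; this is a greedy grouping,
    not an optimal packing.
    """
    batch: List[str] = []
    used = 0
    for sample, size in sizes.items():
        if size > budget:
            raise ValueError(f"sample {sample!r} alone exceeds the storage budget")
        if used + size > budget:
            yield batch          # caller processes and deletes this batch first
            batch, used = [], 0
        batch.append(sample)
        used += size
    if batch:
        yield batch


if __name__ == "__main__":
    # Hypothetical sample sizes in gigabytes and a 8 GB working area.
    sample_sizes = {"a": 4, "b": 3, "c": 5, "d": 2}
    for group in batches_within_budget(sample_sizes, budget=8):
        print("download, quantify, then delete:", group)
```

Because each batch is removed before the next is fetched, peak disk usage is bounded by the budget regardless of how many samples are queued.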
Li, Feng; Chen, Ranran; Fu, Yuankun; Song, Fengguang; Liang, Yao; Ranawaka, Isuru; Pamidighantam, Sudhakar; Luna, Daniel; Liang, Xu (The 17th IEEE International Conference on eScience)
Workflow management systems (WMSs) are commonly used to organize and automate sequences of tasks as workflows to accelerate scientific discoveries. During complex workflow modeling, a local interactive workflow environment is desirable, as users usually rely on their rich, local environments for fast prototyping and refinements before they consider using more powerful computing resources. However, existing WMSs do not simultaneously support local interactive workflow environments and HPC resources. In this paper, we present an on-demand access mechanism to remote HPC resources from desktop/laptop-based workflow management software to compose, monitor and analyze scientific workflows in the CyberWater project. CyberWater is an open-data and open-modeling software framework for environmental and water communities. In this work, we extend the open-model, open-data design of CyberWater with on-demand HPC access capability. In particular, we design and implement the LaunchAgent library, which can be integrated into the local desktop environment to allow on-demand usage of remote resources for hydrology-related workflows. LaunchAgent manages authentication to remote resources, prepares the computationally-intensive or data-intensive tasks as batch jobs, submits jobs to remote resources, and monitors the quality of service for the users. LaunchAgent interacts seamlessly with other existing components in CyberWater, which is now able to provide the advantages of both a feature-rich desktop software experience and increased computation power through on-demand HPC/Cloud usage. In our evaluations, we demonstrate how a hydrology workflow that consists of both local and remote tasks can be constructed and show that the added on-demand HPC/Cloud usage helps speed up hydrology workflows while allowing intuitive workflow configuration and execution using a desktop graphical user interface.
Lushbough, Carol M., Gnimpieba, Etienne Z., and Dooley, Rion. "Life science data analysis workflow development using the bioextract server leveraging the iPlant collaborative cyberinfrastructure". Concurrency and Computation: Practice and Experience 27 (2). Wiley Blackwell (John Wiley & Sons). https://doi.org/10.1002/cpe.3237. https://par.nsf.gov/biblio/10196597.