NSF PAR Search | NSF Public Access Repository

MDRepo—an open data warehouse for community-contributed molecular dynamics simulations of proteins

https://doi.org/10.1093/nar/gkae1109

Roy, Amitava; Ward, Ethan; Choi, Illyoung; Cosi, Michele; Edgin, Tony; Hughes, Travis S; Islam, Md Shafayet; Khan, Asif M; Kolekar, Aakash; Rayl, Mariah; et al (November 2024, Nucleic Acids Research)

Abstract: Molecular Dynamics (MD) simulation of biomolecules provides important insights into conformational changes and dynamic behavior, revealing critical information about folding and interactions with other molecules. The collection of simulations stored in computers across the world holds immense potential to serve as training data for future Machine Learning models that will transform the prediction of structure, dynamics, drug interactions, and more. Ideally, there should exist an open access repository that enables scientists to submit and store their MD simulations of proteins and protein-drug interactions, and to find, retrieve, analyze, and visualize simulations produced by others. However, despite the ubiquity of MD simulation in structural biology, no such repository exists; as a result, simulations are instead stored in scattered locations without uniform metadata or access protocols. Here, we introduce MDRepo, a robust infrastructure that provides a relatively simple process for standardized community contribution of simulations, activates common downstream analyses on stored data, and enables search, retrieval, and visualization of contributed data. MDRepo is built on top of the open-source CyVerse research cyber-infrastructure, and is capable of storing petabytes of simulations, while providing high bandwidth upload and download capabilities and laying a foundation for cloud-based access to its stored data.
MDRepo – an open environment for data warehousing and knowledge discovery from molecular dynamics simulations

https://doi.org/10.1101/2024.07.11.602903

Roy, Amitava; Ward, Ethan; Choi, Illyoung; Cosi, Michele; Edgin, Tony; Hughes, Travis S; Islam, Md Shafayet; Khan, Asif M; Kolekar, Aakash; Rayl, Mariah; et al (July 2024, bioRxiv)

Background: Molecular Dynamics (MD) simulation of biomolecules provides important insights into conformational changes and dynamic behavior, revealing critical information about folding and interactions with other molecules. This enables advances in drug discovery and the design of therapeutic interventions. The collection of simulations stored in computers across the world holds immense potential to serve as training data for future Machine Learning models that will transform the prediction of structure, dynamics, drug interactions, and more. A need: Ideally, there should exist an open access repository that enables scientists to submit and store their MD simulations of proteins and protein-drug interactions, and to find, retrieve, analyze, and visualize simulations produced by others. However, despite the ubiquity of MD simulation in structural biology, no such repository exists; as a result, simulations are instead stored in scattered locations without uniform metadata or access protocols. A solution: Here, we introduce MDRepo, a robust infrastructure that supports a relatively simple process for standardized community contribution of simulations, activates common downstream analyses on stored data, and enables search, retrieval, and visualization of contributed data. MDRepo is built on top of the open-source CyVerse research cyberinfrastructure, and is capable of storing petabytes of simulations, while providing high bandwidth upload and download capabilities and laying a foundation for cloud-based access to its stored data.
Cloud Computing for Research and Education Gets a Sweet Upgrade with CACAO

https://doi.org/10.1145/3569951.3597555

Skidmore, Edwin; Cosi, Michele; Swetnam, Tyson; Merchant, Nirav; Xu, Zhouyun; Choi, Illyoung; Davey, Sean; Frady, Jeremy; Wall, Mariah; Yung, Michelle (July 2023, ACM)
PhytoOracle: Scalable, modular phenomics data processing pipelines

https://doi.org/10.3389/fpls.2023.1112973

Gonzalez, Emmanuel M.; Zarei, Ariyan; Hendler, Nathanial; Simmons, Travis; Zarei, Arman; Demieville, Jeffrey; Strand, Robert; Rozzi, Bruno; Calleja, Sebastian; Ellingson, Holly; et al (March 2023, Frontiers in Plant Science)

As phenomics data volume and dimensionality increase due to advancements in sensor technology, there is an urgent need to develop and implement scalable data processing pipelines. Current phenomics data processing pipelines lack modularity, extensibility, and processing distribution across sensor modalities and phenotyping platforms. To address these challenges, we developed PhytoOracle (PO), a suite of modular, scalable pipelines for processing large volumes of field phenomics RGB, thermal, PSII chlorophyll fluorescence 2D images, and 3D point clouds. PhytoOracle aims to (i) improve data processing efficiency; (ii) provide an extensible, reproducible computing framework; and (iii) enable data fusion of multi-modal phenomics data. PhytoOracle integrates open-source distributed computing frameworks for parallel processing on high-performance computing, cloud, and local computing environments. Each pipeline component is available as a standalone container, providing transferability, extensibility, and reproducibility. The PO pipeline extracts and associates individual plant traits across sensor modalities and collection time points, representing a unique multi-system approach to addressing the genotype-phenotype gap. To date, PO supports lettuce and sorghum phenotypic trait extraction, with a goal of widening the range of supported species in the future. At the maximum number of cores tested in this study (1,024 cores), PO processing times were: 235 minutes for 9,270 RGB images (140.7 GB), 235 minutes for 9,270 thermal images (5.4 GB), and 13 minutes for 39,678 PSII images (86.2 GB). These processing times represent end-to-end processing, from raw data to fully processed numerical phenotypic trait data. Repeatability values of 0.39-0.95 (bounding area), 0.81-0.95 (axis-aligned bounding volume), 0.79-0.94 (oriented bounding volume), 0.83-0.95 (plant height), and 0.81-0.95 (number of points) were observed in Field Scanalyzer data. We also show the ability of PO to process drone data with a repeatability of 0.55-0.95 (bounding area).
StarBLAST: a scalable BLAST+ solution for the classroom

https://doi.org/10.21105/jose.00102

Cosi, Michele; Forstedt, J.J.; Gonzalez, Emmanuel; Xu, Zhuoyun; Peri, Sateesh; Tuteja, Reetu; Blumberg, Kai; Campbell, Tanner; Merchant, Nirav; Lyons, Eric (April 2021, Journal of Open Source Education)