Abstract This article introduces a general processing framework to effectively utilize waveform data stored on modern cloud platforms. The focus is hybrid processing schemes for which a local system drives processing. We show that downloading files and doing all processing locally is problematic even when the local system is a high-performance computing (HPC) cluster. Benchmark tests with parallel processing show that approach always creates a bottleneck as the volume of data being handled increases with more processes pulling data. We find a hybrid model for which processing to reduce the volume of data transferred from the cloud servers to the local system can dramatically improve processing time. Tests implemented with the Massively Parallel Analysis System for Seismology (MsPASS) utilizing Amazon Web Service’s (AWS) Lambda service yield throughput comparable to processing day files on a local HPC file system. Given the ongoing migration of seismology data to cloud storage, our results show doing some or all processing on the cloud will be essential for any processing involving large volumes of data.
more »
« less
Training Data for the NeonTreeEvaluation Benchmark
This dataset is the large training data files for the NeonTreeEvaluation Benchmark for individual tree detection from airborne imagery. For each geographic site, given by the NEON four letter code (e.g HARV -> Harvard Forest), there are three files: a RGB image, a LiDAR tile, and a 426 band hyperpspectral file. For more information on the benchmark, and the corresponding R package, see https://github.com/weecology/NeonTreeEvaluation_package</p>
more »
« less
- Award ID(s):
- 1926542
- PAR ID:
- 10453017
- Publisher / Repository:
- Zenodo
- Date Published:
- Edition / Version:
- 0.1
- Subject(s) / Keyword(s):
- trees machine learning LiDAR Hyperspectral RGB object detection
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Variability-aware analysis is critical for ensuring the quality of configurable C software. An important step toward the development of variability-aware analysis at scale is to transform real-world C software that uses both C and preprocessor into pure C code, by replacing the preprocessor's compile-time variability with C's runtime-variability. In this work, we design and implement a desugaring tool, SugarC, that transforms away real-world preprocessor usage. SugarC augments C's formal grammar specification with translation rules, performs simultaneous type checking during desugaring, and introduces numerous optimizations to address challenges that appear in real-world preprocessor usage. The experiments on DesugarBench, a benchmark consisting of 108 manually-created programs, show that SugarC supports many more language features than two existing desugaring tools. When applied on three real-world configurable C software, SugarC desugared 774 out of 813 files in the three programs, taking at most ten minutes in the worst case and less than two minutes for 95% of the C files.more » « less
-
The InterPlanetary File System (IPFS) has recently gained considerable attention. While prior research has focused on understanding its performance characterization and application support, it remains unclear: (1) what kind of files/content are stored in IPFS, (2) who are providing these files, (3) are these files always accessible, and (4) what affects the file access performance. To answer these questions, in this paper, we perform measurement and analysis on over 4 million files associated with CIDs (content IDs) that appeared in publicly available IPFS datasets. Our results reveal the following key findings: (1) Mixed file accessibility: while IPFS is not designed for a permanent storage, accessing a non-trivial portion of files, such as those of NFTs and video streams, often requires multiple retrieval attempts, potentially blocking NFT transactions and negatively affecting the user experience. (2) Dominance of NFT (non-fungible token) and video files: about 50% of stored files are NFT-related, followed by a large portion of video files, among which about half are pirated movies and adult content. (3) Centralization of content providers: a small number of peers (top-50), mostly cloud nodes hosted by tech companies, serve a large portion (95%) of files, deviating from IPFS's intended design goal. (4) High variation of downloading throughput and lookup time: large file retrievals experience lower average throughput due to more overhead for resolving file chunk CIDs, and looking up files hosted by non-cloud nodes takes longer. We hope that our findings can offer valuable insights for (1) IPFS application developers to take into consideration these characteristics when building applications on top of IPFS, and (2) IPFS system developers to improve IPFS and similar systems to be developed for Web3.more » « less
-
{"Abstract":["This data set for the manuscript entitled "Design of Peptides that Fold and Self-Assemble on Graphite" includes all files needed to run and analyze the simulations described in the this manuscript in the molecular dynamics software NAMD, as well as the output of the simulations. The files are organized into directories corresponding to the figures of the main text and supporting information. They include molecular model structure files (NAMD psf or Amber prmtop format), force field parameter files (in CHARMM format), initial atomic coordinates (pdb format), NAMD configuration files, Colvars configuration files, NAMD log files, and NAMD output including restart files (in binary NAMD format) and trajectories in dcd format (downsampled to 10 ns per frame). Analysis is controlled by shell scripts (Bash-compatible) that call VMD Tcl scripts or python scripts. These scripts and their output are also included.<\/p>\n\nVersion: 2.0<\/p>\n\nChanges versus version 1.0 are the addition of the free energy of folding, adsorption, and pairing calculations (Sim_Figure-7) and shifting of the figure numbers to accommodate this addition.<\/p>\n\n\nConventions Used in These Files\n===============================<\/p>\n\nStructure Files\n----------------\n- graph_*.psf or sol_*.psf (original NAMD (XPLOR?) format psf file including atom details (type, charge, mass), as well as definitions of bonds, angles, dihedrals, and impropers for each dipeptide.)<\/p>\n\n- graph_*.pdb or sol_*.pdb (initial coordinates before equilibration)\n- repart_*.psf (same as the above psf files, but the masses of non-water hydrogen atoms have been repartitioned by VMD script repartitionMass.tcl)\n- freeTop_*.pdb (same as the above pdb files, but the carbons of the lower graphene layer have been placed at a single z value and marked for restraints in NAMD)\n- amber_*.prmtop (combined topology and parameter files for Amber force field simulations)\n- repart_amber_*.prmtop (same as the above prmtop files, but the masses of non-water hydrogen atoms have been repartitioned by ParmEd)<\/p>\n\nForce Field Parameters\n----------------------\nCHARMM format parameter files:\n- par_all36m_prot.prm (CHARMM36m FF for proteins)\n- par_all36_cgenff_no_nbfix.prm (CGenFF v4.4 for graphene) The NBFIX parameters are commented out since they are only needed for aromatic halogens and we use only the CG2R61 type for graphene.\n- toppar_water_ions_prot_cgenff.str (CHARMM water and ions with NBFIX parameters needed for protein and CGenFF included and others commented out)<\/p>\n\nTemplate NAMD Configuration Files\n---------------------------------\nThese contain the most commonly used simulation parameters. They are called by the other NAMD configuration files (which are in the namd/ subdirectory):\n- template_min.namd (minimization)\n- template_eq.namd (NPT equilibration with lower graphene fixed)\n- template_abf.namd (for adaptive biasing force)<\/p>\n\nMinimization\n-------------\n- namd/min_*.0.namd<\/p>\n\nEquilibration\n-------------\n- namd/eq_*.0.namd<\/p>\n\nAdaptive biasing force calculations\n-----------------------------------\n- namd/eabfZRest7_graph_chp1404.0.namd\n- namd/eabfZRest7_graph_chp1404.1.namd (continuation of eabfZRest7_graph_chp1404.0.namd)<\/p>\n\nLog Files\n---------\nFor each NAMD configuration file given in the last two sections, there is a log file with the same prefix, which gives the text output of NAMD. For instance, the output of namd/eabfZRest7_graph_chp1404.0.namd is eabfZRest7_graph_chp1404.0.log.<\/p>\n\nSimulation Output\n-----------------\nThe simulation output files (which match the names of the NAMD configuration files) are in the output/ directory. Files with the extensions .coor, .vel, and .xsc are coordinates in NAMD binary format, velocities in NAMD binary format, and extended system information (including cell size) in text format. Files with the extension .dcd give the trajectory of the atomic coorinates over time (and also include system cell information). Due to storage limitations, large DCD files have been omitted or replaced with new DCD files having the prefix stride50_ including only every 50 frames. The time between frames in these files is 50 * 50000 steps/frame * 4 fs/step = 10 ns. The system cell trajectory is also included for the NPT runs are output/eq_*.xst.<\/p>\n\nScripts\n-------\nFiles with the .sh extension can be found throughout. These usually provide the highest level control for submission of simulations and analysis. Look to these as a guide to what is happening. If there are scripts with step1_*.sh and step2_*.sh, they are intended to be run in order, with step1_*.sh first.<\/p>\n\n\nCONTENTS\n========<\/p>\n\nThe directory contents are as follows. The directories Sim_Figure-1 and Sim_Figure-8 include README.txt files that describe the files and naming conventions used throughout this data set.<\/p>\n\nSim_Figure-1: Simulations of N-acetylated C-amidated amino acids (Ac-X-NHMe) at the graphite\u2013water interface.<\/p>\n\nSim_Figure-2: Simulations of different peptide designs (including acyclic, disulfide cyclized, and N-to-C cyclized) at the graphite\u2013water interface.<\/p>\n\nSim_Figure-3: MM-GBSA calculations of different peptide sequences for a folded conformation and 5 misfolded/unfolded conformations.<\/p>\n\nSim_Figure-4: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite\u2013water interface at 370 K.<\/p>\n\nSim_Figure-5: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite\u2013water interface at 295 K.<\/p>\n\nSim_Figure-5_replica: Temperature replica exchange molecular dynamics simulations for the peptide cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) with 20 replicas for temperatures from 295 to 454 K.<\/p>\n\nSim_Figure-6: Simulation of the peptide molecule cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) in free solution (no graphite).<\/p>\n\nSim_Figure-7: Free energy calculations for folding, adsorption, and pairing for the peptide CHP1404 (sequence: cyc(GTGSGTG-GPGG-GCGTGTG-SGPG)). For folding, we calculate the PMF as function of RMSD by replica-exchange umbrella sampling (in the subdirectory Folding_CHP1404_Graphene/). We make the same calculation in solution, which required 3 seperate replica-exchange umbrella sampling calculations (in the subdirectory Folding_CHP1404_Solution/). Both PMF of RMSD calculations for the scrambled peptide are in Folding_scram1404/. For adsorption, calculation of the PMF for the orientational restraints and the calculation of the PMF along z (the distance between the graphene sheet and the center of mass of the peptide) are in Adsorption_CHP1404/ and Adsorption_scram1404/. The actual calculation of the free energy is done by a shell script ("doRestraintEnergyError.sh") in the 1_free_energy/ subsubdirectory. Processing of the PMFs must be done first in the 0_pmf/ subsubdirectory. Finally, files for free energy calculations of pair formation for CHP1404 are found in the Pair/ subdirectory.<\/p>\n\nSim_Figure-8: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) where the peptides are far above the graphene\u2013water interface in the initial configuration.<\/p>\n\nSim_Figure-9: Two replicates of a simulation of nine peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite\u2013water interface at 370 K.<\/p>\n\nSim_Figure-9_scrambled: Two replicates of a simulation of nine peptide molecules with the control sequence cyc(GGTPTTGGGGGGSGGPSGTGGC) at the graphite\u2013water interface at 370 K.<\/p>\n\nSim_Figure-10: Adaptive biasing for calculation of the free energy of the folded peptide as a function of the angle between its long axis and the zigzag directions of the underlying graphene sheet.<\/p>\n\n <\/p>"],"Other":["This material is based upon work supported by the US National Science Foundation under grant no. DMR-1945589. A majority of the computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by NSF grants CHE-1726332, CNS-1006860, EPS-1006860, and EPS-0919443. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562, through allocation BIO200030."]}more » « less
-
Existing malicious code detection techniques demand the integration of multiple tools to detect different malware patterns, often suffering from high misclassification rates. Therefore, malicious code detection techniques could be enhanced by adopting advanced, more automated approaches to achieve high accuracy and a low misclassification rate. The goal of this study is to aid security analysts in detecting malicious packages by empirically studying the effectiveness of Large Language Models (LLMs) in detecting malicious code. We present SocketAI, a malicious code review workflow to detect malicious code. To evaluate the effectiveness SocketAI, we leverage a benchmark dataset of 5,115 npm packages, of which 2,180 packages have malicious code. We conducted a baseline comparison of GPT-3 and GPT-4 models with the state-of-the-art CodeQL static analysis tool, using 39 custom CodeQL rules developed in prior research to detect malicious Javascript code. We also compare the effectiveness of static analysis as a pre-screener with SocketAI workflow, measuring the number of files that need to be analyzed and the associated costs. Additionally, we performed a qualitative study to understand the types of malicious packages detected or missed by our workflow. Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 scores, respectively. GPT-4 achieves higher accuracy with 99% precision and 97% F1 scores, while GPT-3 offers a more cost-effective balance at 91% precision and 94% F1 scores. Prescreening files with a static analyzer reduces the number of files requiring LLM analysis by 77.9% and decreases costs by 60.9% for GPT-3 and 76.1% for GPT-4. Our qualitative analysis identified data theft, execution of arbitrary code, and suspicious domain categories as the top detected malicious packages.more » « less
An official website of the United States government
