Title: Simulation Files for Organic Contaminants and Atmospheric Nitrogen at the Graphene–Water Interface

This data set provides files needed to run the simulations described in the manuscript entitled "Organic contaminants and atmospheric nitrogen at the graphene–water interface: A simulation study" using the molecular dynamics software NAMD and LAMMPS. The output of the simulations, as well as scripts used to analyze this output, are also included. The files are organized into directories corresponding to the figures of the main text and supplementary information. They include molecular model structure files (NAMD psf), force field parameter files (in CHARMM format), initial atomic coordinates (pdb format), NAMD or LAMMPS configuration files, Colvars configuration files, NAMD log files, and NAMD output including restart files (in binary NAMD format) and some trajectories in dcd format (downsampled). Analysis is controlled by shell scripts (Bash-compatible) that call VMD Tcl scripts. A modified LAMMPS C++ source file is also included.

This material is based upon work supported by the National Science Foundation under Grant No. DMR-1945589. A majority of the computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by NSF grants CHE-1726332, CNS-1006860, EPS-1006860, and EPS-0919443. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) and Stampede2 at the Texas Advanced Computing Center through allocation BIO200030, which is supported by National Science Foundation grant number ACI-1548562. 
Award ID(s):
1945589
NSF-PAR ID:
10343234
Author(s) / Creator(s):
; ;
Publisher / Repository:
Zenodo
Date Published:
Edition / Version:
1.0
Subject(s) / Keyword(s):
["molecular dynamics","NAMD","graphite","graphene","adsorption","contamination","alkanes","graphite-water interface","graphene-water interface","ABF","adaptive biasing force","Colvars","self-assembly","GCMC","grand canonical Monte Carlo","LAMMPS"]
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This data set for the manuscript entitled "Design of Peptides that Fold and Self-Assemble on Graphite" includes all files needed to run and analyze the simulations described in this manuscript in the molecular dynamics software NAMD, as well as the output of the simulations. The files are organized into directories corresponding to the figures of the main text and supporting information. They include molecular model structure files (NAMD psf or Amber prmtop format), force field parameter files (in CHARMM format), initial atomic coordinates (pdb format), NAMD configuration files, Colvars configuration files, NAMD log files, and NAMD output including restart files (in binary NAMD format) and trajectories in dcd format (downsampled to 10 ns per frame). Analysis is controlled by shell scripts (Bash-compatible) that call VMD Tcl scripts or Python scripts. These scripts and their output are also included.

    Version: 2.0

    Changes versus version 1.0 are the addition of the free energy of folding, adsorption, and pairing calculations (Sim_Figure-7) and shifting of the figure numbers to accommodate this addition.


    Conventions Used in These Files
    ===============================

    Structure Files
    ----------------
    - graph_*.psf or sol_*.psf (original NAMD (XPLOR?) format psf file including atom details (type, charge, mass), as well as definitions of bonds, angles, dihedrals, and impropers for each dipeptide.)

    - graph_*.pdb or sol_*.pdb (initial coordinates before equilibration)
    - repart_*.psf (same as the above psf files, but the masses of non-water hydrogen atoms have been repartitioned by VMD script repartitionMass.tcl)
    - freeTop_*.pdb (same as the above pdb files, but the carbons of the lower graphene layer have been placed at a single z value and marked for restraints in NAMD)
    - amber_*.prmtop (combined topology and parameter files for Amber force field simulations)
    - repart_amber_*.prmtop (same as the above prmtop files, but the masses of non-water hydrogen atoms have been repartitioned by ParmEd)

    Force Field Parameters
    ----------------------
    CHARMM format parameter files:
    - par_all36m_prot.prm (CHARMM36m FF for proteins)
    - par_all36_cgenff_no_nbfix.prm (CGenFF v4.4 for graphene) The NBFIX parameters are commented out since they are only needed for aromatic halogens and we use only the CG2R61 type for graphene.
    - toppar_water_ions_prot_cgenff.str (CHARMM water and ions with NBFIX parameters needed for protein and CGenFF included and others commented out)

    Template NAMD Configuration Files
    ---------------------------------
    These contain the most commonly used simulation parameters. They are called by the other NAMD configuration files (which are in the namd/ subdirectory):
    - template_min.namd (minimization)
    - template_eq.namd (NPT equilibration with lower graphene fixed)
    - template_abf.namd (for adaptive biasing force)

    Minimization
    -------------
    - namd/min_*.0.namd

    Equilibration
    -------------
    - namd/eq_*.0.namd

    Adaptive biasing force calculations
    -----------------------------------
    - namd/eabfZRest7_graph_chp1404.0.namd
    - namd/eabfZRest7_graph_chp1404.1.namd (continuation of eabfZRest7_graph_chp1404.0.namd)

    Log Files
    ---------
    For each NAMD configuration file given in the last two sections, there is a log file with the same prefix, which gives the text output of NAMD. For instance, the output of namd/eabfZRest7_graph_chp1404.0.namd is eabfZRest7_graph_chp1404.0.log.
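The naming rule above can be sketched in a few lines of Python. This is only an illustration: the helper name `log_name_for` is hypothetical, and the sketch returns just the file name because the text does not say which directory holds the logs.

```python
from pathlib import PurePosixPath


def log_name_for(config_path: str) -> str:
    """Return the log file name that shares its prefix with a NAMD configuration file.

    Only the final ".namd" extension is swapped for ".log"; the directory part
    is dropped because the location of the log files is not stated here.
    """
    return PurePosixPath(config_path).with_suffix(".log").name


print(log_name_for("namd/eabfZRest7_graph_chp1404.0.namd"))
```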

    Simulation Output
    -----------------
    The simulation output files (which match the names of the NAMD configuration files) are in the output/ directory. Files with the extensions .coor, .vel, and .xsc are coordinates in NAMD binary format, velocities in NAMD binary format, and extended system information (including cell size) in text format, respectively. Files with the extension .dcd give the trajectory of the atomic coordinates over time (and also include system cell information). Due to storage limitations, large DCD files have been omitted or replaced with new DCD files having the prefix stride50_, which include only every 50th frame. The time between frames in these files is 50 * 50000 steps/frame * 4 fs/step = 10 ns. The system cell trajectories for the NPT runs are also included as output/eq_*.xst.
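The frame-spacing arithmetic for the stride50_ files can be reproduced with a short Python sketch. The three constants come from the sentence above; the function and variable names are illustrative only.

```python
STRIDE = 50                # every 50th frame is kept in the stride50_ files
STEPS_PER_FRAME = 50_000   # simulation steps between frames in the original DCD
TIMESTEP_FS = 4.0          # femtoseconds per simulation step


def time_between_frames_ns(stride: int = STRIDE,
                           steps_per_frame: int = STEPS_PER_FRAME,
                           timestep_fs: float = TIMESTEP_FS) -> float:
    """Time between consecutive frames of a downsampled trajectory, in ns."""
    femtoseconds = stride * steps_per_frame * timestep_fs
    return femtoseconds / 1.0e6  # 1 ns = 1e6 fs


print(time_between_frames_ns())
```

Changing `stride` to 1 gives the spacing of the original, non-downsampled trajectories under the same assumptions.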

    Scripts
    -------
    Files with the .sh extension can be found throughout. These usually provide the highest level control for submission of simulations and analysis. Look to these as a guide to what is happening. If there are scripts with step1_*.sh and step2_*.sh, they are intended to be run in order, with step1_*.sh first.


    CONTENTS
    ========

    The directory contents are as follows. The directories Sim_Figure-1 and Sim_Figure-8 include README.txt files that describe the files and naming conventions used throughout this data set.

    Sim_Figure-1: Simulations of N-acetylated C-amidated amino acids (Ac-X-NHMe) at the graphite–water interface.

    Sim_Figure-2: Simulations of different peptide designs (including acyclic, disulfide cyclized, and N-to-C cyclized) at the graphite–water interface.

    Sim_Figure-3: MM-GBSA calculations of different peptide sequences for a folded conformation and 5 misfolded/unfolded conformations.

    Sim_Figure-4: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite–water interface at 370 K.

    Sim_Figure-5: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite–water interface at 295 K.

    Sim_Figure-5_replica: Temperature replica exchange molecular dynamics simulations for the peptide cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) with 20 replicas for temperatures from 295 to 454 K.

    Sim_Figure-6: Simulation of the peptide molecule cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) in free solution (no graphite).

    Sim_Figure-7: Free energy calculations for folding, adsorption, and pairing for the peptide CHP1404 (sequence: cyc(GTGSGTG-GPGG-GCGTGTG-SGPG)). For folding, we calculate the PMF as a function of RMSD by replica-exchange umbrella sampling (in the subdirectory Folding_CHP1404_Graphene/). We make the same calculation in solution, which required 3 separate replica-exchange umbrella sampling calculations (in the subdirectory Folding_CHP1404_Solution/). Both PMF of RMSD calculations for the scrambled peptide are in Folding_scram1404/. For adsorption, calculation of the PMF for the orientational restraints and the calculation of the PMF along z (the distance between the graphene sheet and the center of mass of the peptide) are in Adsorption_CHP1404/ and Adsorption_scram1404/. The actual calculation of the free energy is done by a shell script ("doRestraintEnergyError.sh") in the 1_free_energy/ subsubdirectory. Processing of the PMFs must be done first in the 0_pmf/ subsubdirectory. Finally, files for free energy calculations of pair formation for CHP1404 are found in the Pair/ subdirectory.

    Sim_Figure-8: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) where the peptides are far above the graphene–water interface in the initial configuration.

    Sim_Figure-9: Two replicates of a simulation of nine peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite–water interface at 370 K.

    Sim_Figure-9_scrambled: Two replicates of a simulation of nine peptide molecules with the control sequence cyc(GGTPTTGGGGGGSGGPSGTGGC) at the graphite–water interface at 370 K.

    Sim_Figure-10: Adaptive biasing for calculation of the free energy of the folded peptide as a function of the angle between its long axis and the zigzag directions of the underlying graphene sheet.

     

    This material is based upon work supported by the US National Science Foundation under grant no. DMR-1945589. A majority of the computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by NSF grants CHE-1726332, CNS-1006860, EPS-1006860, and EPS-0919443. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562, through allocation BIO200030. 
  2. The intended use of this archive is to facilitate meta-analysis of the Data Observation Network for Earth (DataONE, [1]). 

    DataONE is a distributed infrastructure that provides information about earth observation data. This dataset was derived from the DataONE network using Preston [2] between 17 October 2018 and 6 November 2018, resolving 335,213 urls at an average retrieval rate of about 5 seconds per url, or 720 files per hour, resulting in a gzip-compressed tar archive of 837.3 MB.
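As a rough consistency check on these figures, the following Python sketch converts the quoted retrieval rate to files per hour and compares the implied crawl time with the stated date window. It assumes the urls were retrieved one after another at a constant rate, which the description does not state; all names are illustrative.

```python
from datetime import date

URLS_RESOLVED = 335_213   # number of urls resolved, from the description
SECONDS_PER_URL = 5.0     # "about 5 seconds per url" (approximate)


def files_per_hour(seconds_per_url: float) -> float:
    """Convert an average retrieval time per url into files per hour."""
    return 3600.0 / seconds_per_url


def crawl_duration_days(n_urls: int, seconds_per_url: float) -> float:
    """Total time for a strictly sequential crawl (an assumption), in days."""
    return n_urls * seconds_per_url / 86_400.0


rate = files_per_hour(SECONDS_PER_URL)
estimated_days = crawl_duration_days(URLS_RESOLVED, SECONDS_PER_URL)
window_days = (date(2018, 11, 6) - date(2018, 10, 17)).days

print(f"{rate:.0f} files per hour")
print(f"about {estimated_days:.1f} days of crawling within a {window_days}-day window")
```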

    The archive associates 325,757 unique metadata urls [3] to 202,063 unique ecological metadata files [4]. Also, the DataONE search index was captured to establish provenance of how the dataset descriptors were found and acquired. During the creation of the snapshot (or crawl), 15,389 urls [5], or 4.7% of urls, did not successfully resolve. 

    To facilitate discovery, the record of the Preston snapshot crawl is included in the preston-ls-* files. These files are derived from the rdf/nquad file with hash://sha256/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f . This file can also be found in the data.tar.gz at data/8c/67/e0/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f/data . For more information about concepts and format, please see [2].

    To extract all EML files from the included Preston archive, first extract the hashes associated with EML files using:

    cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' '\n' | grep "hash://" | sort | uniq > eml-hashes.txt

    Next, extract data.tar.gz using:

    ~/preston-archive$ tar xzf data.tar.gz 

    then use Preston to extract each hash using something like:

    ~/preston-archive$ preston get hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa
    <eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml_1.1" packageId="doi:10.18739/A24P9Q" system="https://arcticdata.io" scope="system" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 ~/development/eml/eml.xsd">
      <dataset>
        <alternateIdentifier>urn:x-wmo:md:org.aoncadis.www::d76bc3b5-7b19-11e4-8526-00c0f03d5b7c</alternateIdentifier>
        <alternateIdentifier>d76bc3b5-7b19-11e4-8526-00c0f03d5b7c</alternateIdentifier>
        <title>Airglow Image Data 2011 4 of 5</title>
    ...
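For readers who prefer not to chain shell tools, the hash-extraction pipeline in the first step can be approximated in Python. This sketch mirrors the grep/cut/tr/sort/uniq stages one by one; it assumes only what the pipeline itself implies, namely that preston-ls.tsv.gz is a gzip-compressed, tab-separated text file. The function names are hypothetical, and the sort order may differ from `sort` under some locales.

```python
import gzip


def eml_hashes(lines):
    """Collect the unique hash:// values from fields 1 and 3 of the matching lines."""
    hashes = set()
    for line in lines:
        # grep "Version" | grep -v "deeplinker" | grep -v "query/solr"
        if "Version" not in line or "deeplinker" in line or "query/solr" in line:
            continue
        fields = line.rstrip("\n").split("\t")
        # cut -f1,3 | tr '\t' '\n' | grep "hash://"
        for value in fields[0:1] + fields[2:3]:
            if "hash://" in value:
                hashes.add(value)
    # sort | uniq
    return sorted(hashes)


def eml_hashes_from_file(path="preston-ls.tsv.gz"):
    """Apply eml_hashes to a gzip-compressed listing file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return eml_hashes(handle)


# Synthetic lines for illustration only; they are not taken from the archive.
sample = [
    "https://example.org/meta/1\thttp://purl.org/pav/hasVersion\thash://sha256/bb\n",
    "https://example.org/meta/2\thttp://purl.org/pav/hasVersion\thash://sha256/aa\n",
    "https://example.org/meta/3\thttp://purl.org/pav/hasVersion\thash://sha256/aa\n",
    "https://example.org/query/solr/4\thttp://purl.org/pav/hasVersion\thash://sha256/cc\n",
]
print(eml_hashes(sample))
```

Writing the returned list to eml-hashes.txt, one hash per line, would correspond to the output of the shell command.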

    Alternatively, without using Preston, you can extract the data using the naming convention:

    data/[x]/[y]/[z]/[hash]/data

    where x is the first 2 characters of the hash, y the second 2 characters, z the third 2 characters, and hash the full sha256 content hash of the EML file.

    For example, the hash hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa can be found in the file: data/00/00/2d/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa/data . For more information, see [2].
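The naming convention can be expressed as a small Python helper. The function name `content_path` is hypothetical; the layout follows the data/[x]/[y]/[z]/[hash]/data rule described above.

```python
def content_path(hash_uri: str) -> str:
    """Map a hash://sha256/... identifier to its relative path in the extracted archive."""
    prefix = "hash://sha256/"
    if not hash_uri.startswith(prefix):
        raise ValueError(f"not a sha256 content hash: {hash_uri!r}")
    digest = hash_uri[len(prefix):]
    # x, y, z are the first, second, and third pairs of characters of the hash.
    return f"data/{digest[0:2]}/{digest[2:4]}/{digest[4:6]}/{digest}/data"


print(content_path(
    "hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa"
))
```

Applying this helper to each line of eml-hashes.txt would give the locations of the EML files without invoking Preston.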

    The intended use of this archive is to facilitate meta-analysis of the DataONE dataset network. 

    [1] DataONE, https://www.dataone.org
    [2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543 . DataONE was crawled via Preston with "preston update -u https://dataone.org".
    [3] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' '\n' | grep -v "hash://" | sort | uniq | wc -l
    [4] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' '\n' | grep "hash://" | sort | uniq | wc -l
    [5] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep  "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\t' '\n' | grep -v "hash://" | sort | uniq | wc -l

    This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.

  3. A biodiversity dataset graph: UCSB-IZC

    The intended use of this archive is to facilitate (meta-)analysis of the UC Santa Barbara Invertebrate Zoology Collection (UCSB-IZC). UCSB-IZC is a natural history collection of invertebrate zoology at the Cheadle Center for Biodiversity and Ecological Restoration, University of California Santa Barbara.

    This dataset provides versioned snapshots of the UCSB-IZC network as tracked by Preston [2,3] between 2021-10-08 and 2021-11-04 using [preston track "https://api.gbif.org/v1/occurrence/search/?datasetKey=d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0"].

    This archive contains 14349 images related to 32533 occurrence/specimen records. See the included sample-image.jpg and its associated metadata, sample-image.json [4].

    The images were counted using:

    $ preston cat hash://sha256/80c0f5fc598be1446d23c95141e87880c9e53773cb2e0b5b54cb57a8ea00b20c\
     | grep -o -P ".*depict"\
     | sort\
     | uniq\
     | wc -l

    And the occurrences were counted using:

    $ preston cat hash://sha256/80c0f5fc598be1446d23c95141e87880c9e53773cb2e0b5b54cb57a8ea00b20c\
     | grep -o -P "occurrence/([0-9])+"\
     | sort\
     | uniq\
     | wc -l
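The two counting pipelines can also be sketched in Python, operating on text that has already been obtained (for example, the output of the `preston cat` command saved to a file). The regular expressions are the ones used in the grep calls above; the function names and sample lines are illustrative and not taken from the archive.

```python
import re


def count_unique_occurrences(text: str) -> int:
    """Equivalent of: grep -o -P "occurrence/([0-9])+" | sort | uniq | wc -l"""
    return len({match.group(0) for match in re.finditer(r"occurrence/[0-9]+", text)})


def count_unique_images(text: str) -> int:
    """Equivalent of: grep -o -P ".*depict" | sort | uniq | wc -l

    grep works line by line, and the greedy ".*" yields at most one match per line.
    """
    prefixes = set()
    for line in text.splitlines():
        match = re.search(r".*depict", line)
        if match:
            prefixes.add(match.group(0))
    return len(prefixes)


# Synthetic sample for illustration only.
sample = (
    "<imgA> depicts <https://example.org/occurrence/12>\n"
    "<imgA> depicts <https://example.org/occurrence/12>\n"
    "<imgB> depicts <https://example.org/occurrence/7>\n"
)
print(count_unique_images(sample), count_unique_occurrences(sample))
```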

    The archive consists of 256 individual parts (e.g., preston-00.tar.gz, preston-01.tar.gz, ...) to allow for parallel file downloads. The archive contains three types of files: index files, provenance files and data files. Only two index and provenance files are included; these have also been included individually in this dataset publication. Index files provide a way to link provenance files in time to establish a versioning mechanism.

    To retrieve and verify the downloaded UCSB-IZC biodiversity dataset graph, first download preston-*.tar.gz. Then, extract the archives into a "data" folder. Alternatively, you can use the Preston [2,3] command-line tool to "clone" this dataset using:

    $ java -jar preston.jar clone --remote https://archive.org/download/preston-ucsb-izc/data.zip/,https://zenodo.org/record/5557670/files,https://zenodo.org/record/5557670/files/5660088

    After that, verify the index of the archive by reproducing the following provenance log history:

    $ java -jar preston.jar history
    <urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/d5eb492d3e0304afadcc85f968de1e23042479ad670a5819cee00f2c2c277f36> .
    <hash://sha256/80c0f5fc598be1446d23c95141e87880c9e53773cb2e0b5b54cb57a8ea00b20c> <http://purl.org/pav/previousVersion> <hash://sha256/d5eb492d3e0304afadcc85f968de1e23042479ad670a5819cee00f2c2c277f36> .

    To check the integrity of the extracted archive, confirm that each line produced by the command "preston verify" resembles the lines shown below, with each line including "CONTENT_PRESENT_VALID_HASH". Depending on hardware capacity, this may take a while.

    $ java -jar preston.jar verify
    hash://sha256/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c    file:/home/jhpoelen/ucsb-izc/data/ce/1d/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c    OK    CONTENT_PRESENT_VALID_HASH    66438    hash://sha256/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c
    hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844    file:/home/jhpoelen/ucsb-izc/data/f6/8d/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844    OK    CONTENT_PRESENT_VALID_HASH    4093    hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844
    hash://sha256/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef    file:/home/jhpoelen/ucsb-izc/data/3e/70/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef    OK    CONTENT_PRESENT_VALID_HASH    5746    hash://sha256/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef
    hash://sha256/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b    file:/home/jhpoelen/ucsb-izc/data/99/58/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b    OK    CONTENT_PRESENT_VALID_HASH    6147    hash://sha256/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b
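The check described above can be sketched in Python in two independent ways: scanning saved "preston verify" output for the expected status, and recomputing a SHA-256 hash to compare with the name of a data file. This is a minimal sketch rather than a replacement for Preston; the function names are hypothetical, and it assumes (as the paths in the sample output suggest) that each data file is named after its own content hash.

```python
import hashlib
from pathlib import Path


def all_lines_valid(verify_output: str) -> bool:
    """True if every non-empty line of saved 'preston verify' output reports a valid hash."""
    lines = [line for line in verify_output.splitlines() if line.strip()]
    return bool(lines) and all("CONTENT_PRESENT_VALID_HASH" in line for line in lines)


def sha256_uri_of_bytes(data: bytes) -> str:
    """Return the hash://sha256/... identifier for a block of bytes."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def file_matches_name(path: str) -> bool:
    """Recompute a file's SHA-256 and compare it with the file name (assumed to be the hash)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large image files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == Path(path).name
```

For example, `file_matches_name("data/ce/1d/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c")` would be expected to return True for an intact copy of the first file listed above.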

    Note that a copy of the Java program "preston", preston.jar, is included in this publication. The program runs on a Java 8+ virtual machine using "java -jar preston.jar", or in short "preston".

    Files in this data publication:

    --- start of file descriptions ---

    -- description of archive and its contents (this file) --
    README

    -- executable java jar containing preston [2,3] v0.3.1. --
    preston.jar

    -- preston archive containing UCSB-IZC (meta-)data/image files, associated provenance logs and a provenance index --
    preston-[00-ff].tar.gz

    -- individual provenance index files --
    2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a

    -- example image and meta-data --
    sample-image.jpg (with hash://sha256/916ba5dc6ad37a3c16634e1a0e3d2a09969f2527bb207220e3dbdbcf4d6b810c)
    sample-image.json (with hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844)

    --- end of file descriptions ---


    References

    [1] Cheadle Center for Biodiversity and Ecological Restoration (2021). University of California Santa Barbara Invertebrate Zoology Collection. Occurrence dataset https://doi.org/10.15468/w6hvhv accessed via GBIF.org on 2021-11-04 as indexed by the Global Biodiversity Information Facility (GBIF) with provenance hash://sha256/d5eb492d3e0304afadcc85f968de1e23042479ad670a5819cee00f2c2c277f36 hash://sha256/80c0f5fc598be1446d23c95141e87880c9e53773cb2e0b5b54cb57a8ea00b20c.
    [2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543 .
    [3] MJ Elliott, JH Poelen, JAB Fortes (2020). Toward Reliable Biodiversity Dataset References. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2020.101132
    [4] Cheadle Center for Biodiversity and Ecological Restoration (2021). University of California Santa Barbara Invertebrate Zoology Collection. Occurrence dataset https://doi.org/10.15468/w6hvhv accessed via GBIF.org on 2021-10-08. https://www.gbif.org/occurrence/3323647301 . hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844 hash://sha256/916ba5dc6ad37a3c16634e1a0e3d2a09969f2527bb207220e3dbdbcf4d6b810c

    This work is funded in part by grant NSF OAC 1839201 and NSF DBI 2102006 from the National Science Foundation. 
  4. The PoseASL dataset consists of color and depth videos collected from ASL signers at the Linguistic and Assistive Technologies Laboratory under the direction of Matt Huenerfauth, as part of a collaborative research project with researchers at the Rochester Institute of Technology, Boston University, and the University of Pennsylvania.

    Access: After becoming an authorized user of Databrary, please contact Matt Huenerfauth if you have difficulty accessing this volume.

    We have collected a new dataset consisting of color and depth videos of fluent American Sign Language signers performing sequences of ASL signs and sentences. Given interest among sign-recognition and other computer-vision researchers in red-green-blue-depth (RGBD) video, we release this dataset for use by the research community. In addition to the video files, we share depth data files from a Kinect v2 sensor, as well as additional motion-tracking files produced through post-processing of this data.

    Organization of the Dataset: The dataset is organized into sub-folders, with codenames such as "P01" or "P16" etc. These codenames refer to specific human signers who were recorded in this dataset. Please note that there was no participant P11 nor P14; those numbers were accidentally skipped during the process of making appointments to collect video stimuli.

    Task: During the recording session, the participant was met by a member of our research team who was a native ASL signer. No other individuals were present during the data collection session. After signing the informed consent and video release document, participants responded to a demographic questionnaire. Next, the data-collection session consisted of English word stimuli and cartoon videos. The recording session began with showing participants stimuli consisting of slides that displayed English words and photos of items, and participants were asked to produce the sign for each (PDF included in materials subfolder).

    Next, participants viewed three videos of short animated cartoons, which they were asked to recount in ASL:
    - Canary Row, Warner Brothers Merrie Melodies 1950 (the 7-minute video divided into seven parts)
    - Mr. Koumal Flies Like a Bird, Studio Animovaneho Filmu 1969
    - Mr. Koumal Battles his Conscience, Studio Animovaneho Filmu 1971

    The word list and cartoons were selected as they are identical to the stimuli used in the collection of the Nicaraguan Sign Language video corpora - see: Senghas, A. (1995). Children’s Contribution to the Birth of Nicaraguan Sign Language. Doctoral dissertation, Department of Brain and Cognitive Sciences, MIT.

    Demographics: All 14 of our participants were fluent ASL signers. As screening, we asked our participants: Did you use ASL at home growing up, or did you attend a school as a very young child where you used ASL? All the participants responded affirmatively to this question. A total of 14 DHH participants were recruited on the Rochester Institute of Technology campus. Participants included 7 men and 7 women, aged 21 to 35 (median = 23.5). All of our participants reported that they began using ASL when they were 5 years old or younger, with 8 reporting ASL use since birth, and 3 others reporting ASL use since age 18 months.

    Filetypes:

    *.avi, *_dep.bin: The PoseASL dataset has been captured by using a Kinect 2.0 RGBD camera. The output of this camera system includes multiple channels which include RGB, depth, skeleton joints (25 joints for every video frame), and HD face (1,347 points). The video resolution is 1920 x 1080 pixels for the RGB channel and 512 x 424 pixels for the depth channel. Due to limitations in the acceptable filetypes for sharing on Databrary, it was not permitted to share binary *_dep.bin files directly produced by the Kinect v2 camera system on the Databrary platform. If your research requires the original binary *_dep.bin files, then please contact Matt Huenerfauth.

    *_face.txt, *_HDface.txt, *_skl.txt: To make it easier for future researchers to make use of this dataset, we have also performed some post-processing of the Kinect data. To extract the skeleton coordinates of the RGB videos, we used the Openpose system, which is capable of detecting body, hand, facial, and foot keypoints of multiple people on single images in real time. The output of Openpose includes estimation of 70 keypoints for the face including eyes, eyebrows, nose, mouth and face contour. The software also estimates 21 keypoints for each of the hands (Simon et al, 2017), including 3 keypoints for each finger, as shown in Figure 2. Additionally, there are 25 keypoints estimated for the body pose (and feet) (Cao et al, 2017; Wei et al, 2016).

    Reporting Bugs or Errors: Please contact Matt Huenerfauth to report any bugs or errors that you identify in the corpus. We appreciate your help in improving the quality of the corpus over time by identifying any errors.

    Acknowledgement: This material is based upon work supported by the National Science Foundation under award 1749376: "Collaborative Research: Multimethod Investigation of Articulatory and Perceptual Constraints on Natural Language Evolution."
  5. A biodiversity dataset graph: UCSB-IZC

    The intended use of this archive is to facilitate (meta-)analysis of the UC Santa Barbara Invertebrate Zoology Collection (UCSB-IZC). UCSB-IZC is a natural history collection of invertebrate zoology at the Cheadle Center for Biodiversity and Ecological Restoration, University of California Santa Barbara.

    This dataset provides versioned snapshots of the UCSB-IZC network as tracked by Preston [2,3] on 2021-10-08 using [preston track "https://api.gbif.org/v1/occurrence/search/?datasetKey=d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0"].

    This archive contains 14137 images related to 33730 occurrence/specimen records. See the included sample-image.jpg and its associated metadata, sample-image.json [4].

    The archive consists of 256 individual parts (e.g., preston-00.tar.gz, preston-01.tar.gz, ...) to allow for parallel file downloads. The archive contains three types of files: index files, provenance files and data files. Only two index and provenance files are included; these have also been included individually in this dataset publication. Index files provide a way to link provenance files in time to establish a versioning mechanism.

    To retrieve and verify the downloaded UCSB-IZC biodiversity dataset graph, first download preston-*.tar.gz. Then, extract the archives into a "data" folder. Alternatively, you can use the Preston [2,3] command-line tool to "clone" this dataset using:

    $ java -jar preston.jar clone --remote https://archive.org/download/preston-ucsb-izc/data.zip/,https://zenodo.org/record/5557670/files

    After that, verify the index of the archive by reproducing the following provenance log history:

    $ java -jar preston.jar history
    <urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/d5eb492d3e0304afadcc85f968de1e23042479ad670a5819cee00f2c2c277f36> .

    To check the integrity of the extracted archive, confirm that each line produced by the command "preston verify" resembles the lines shown below, with each line including "CONTENT_PRESENT_VALID_HASH". Depending on hardware capacity, this may take a while.

    $ java -jar preston.jar verify
    hash://sha256/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c    file:/home/jhpoelen/ucsb-izc/data/ce/1d/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c    OK    CONTENT_PRESENT_VALID_HASH    66438    hash://sha256/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c
    hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844    file:/home/jhpoelen/ucsb-izc/data/f6/8d/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844    OK    CONTENT_PRESENT_VALID_HASH    4093    hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844
    hash://sha256/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef    file:/home/jhpoelen/ucsb-izc/data/3e/70/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef    OK    CONTENT_PRESENT_VALID_HASH    5746    hash://sha256/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef
    hash://sha256/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b    file:/home/jhpoelen/ucsb-izc/data/99/58/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b    OK    CONTENT_PRESENT_VALID_HASH    6147    hash://sha256/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b

    Note that a copy of the Java program "preston", preston.jar, is included in this publication. The program runs on a Java 8+ virtual machine using "java -jar preston.jar", or in short "preston".

    Files in this data publication:

    --- start of file descriptions ---

    -- description of archive and its contents (this file) --
    README

    -- executable java jar containing preston [2,3] v0.3.1. --
    preston.jar

    -- preston archive containing UCSB-IZC (meta-)data/image files, associated provenance logs and a provenance index --
    preston-[00-ff].tar.gz

    -- individual provenance index files --
    2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a

    -- example image and meta-data --
    sample-image.jpg (with hash://sha256/916ba5dc6ad37a3c16634e1a0e3d2a09969f2527bb207220e3dbdbcf4d6b810c)
    sample-image.json (with hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844)

    --- end of file descriptions ---


    References

    [1] Cheadle Center for Biodiversity and Ecological Restoration (2021). University of California Santa Barbara Invertebrate Zoology Collection. Occurrence dataset https://doi.org/10.15468/w6hvhv accessed via GBIF.org on 2021-10-08 as indexed by the Global Biodiversity Information Facility (GBIF) with provenance hash://sha256/d5eb492d3e0304afadcc85f968de1e23042479ad670a5819cee00f2c2c277f36.
    [2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543 .
    [3] MJ Elliott, JH Poelen, JAB Fortes (2020). Toward Reliable Biodiversity Dataset References. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2020.101132
    [4] Cheadle Center for Biodiversity and Ecological Restoration (2021). University of California Santa Barbara Invertebrate Zoology Collection. Occurrence dataset https://doi.org/10.15468/w6hvhv accessed via GBIF.org on 2021-10-08. https://www.gbif.org/occurrence/3323647301 . hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844 hash://sha256/916ba5dc6ad37a3c16634e1a0e3d2a09969f2527bb207220e3dbdbcf4d6b810c

    This work is funded in part by grant NSF OAC 1839201 and NSF DBI 2102006 from the National Science Foundation. 