

Title: Attack data from electronic logging device for medium and heavy duty vehicles
{"Abstract":["Modern commercial vehicles are required by law to be equipped with\n Electronic Logging Devices (ELDs) in an effort to make it easier to track,\n manage, and share records of duty status (RODS) data. Research has shown\n that integration of third-party ELDs into commercial trucks can introduce\n a significant cybersecurity risk. This includes the ability of nefarious\n actors to modify firmware on ELDs to gain the ability to arbitrarily write\n messages to the Controller Area Network (CAN) within the vehicle.\n Additionally, a proof-of-concept self-propagating truck-to-truck worm has\n been demonstrated.  This dataset was collected during controlled\n testing on a Kenworth T270 Class 6 truck with a commercially available\n ELD, during which the firmware on the ELD was replaced remotely over a\n Wi-Fi connection from an adjacently driving passenger vehicle. The\n compromised ELD then gained the ability to perform arbitrary CAN message\n writes of the attacker’s choice. The dataset contains CAN traffic in the\n `candump` format collected using the Linux `socketcan` tool. \n After taking control of the ELD, the attacker writes Torque Speed control\n messages onto the CAN network, impersonating the Transmission Control\n Module (TCM). These messages command the Engine Control Module (ECM) to\n request 0% torque output, effectively disabling the driver’s control of\n the accelerator and forcing the truck to idle."],"TechnicalInfo":["## Attack data for electronic logging device vulnerability for medium and\n heavy duty vehicles ## Dataset Overview This dataset contains Controller\n Area Network (CAN) logs captured using `candump` from the **SocketCAN**\n framework during a remote drive-by attack on an electronic logging device\n (ELD). The attack is detailed as a public advisory through CISA at:\n [https://www.cisa.gov/news-events/ics-advisories/icsa-24-093-01](https://www.cisa.gov/news-events/ics-advisories/icsa-24-093-01). 
The logs are in a traditional `.log` format, preserving raw CAN messages, timestamps, and metadata. This dataset is intended for research, forensic analysis, anomaly detection, and reverse engineering of vehicular communication networks. ## File Format Each `.log` file follows the standard `candump` output format: ``` (1623847291.123456) can0 0CF00400 [8] FF FF FF FF FF FF FF FF ``` ### Explanation: * **Timestamps** (`(1623847291.123456)`) – Epoch time with microsecond precision. * **CAN Interface** (`can0`) – The name of the CAN bus interface used for capturing. * **CAN ID** (`0CF00400`) – The hexadecimal identifier of the CAN frame. * **DLC** (`[8]`) – Data Length Code, indicating the number of bytes in the data field. * **Data** (`FF FF FF FF FF FF FF FF`) – The payload transmitted in the CAN message. ## Dataset Contents * `Wireless_Pedal_Jam.log` – Raw CAN logs collected on a specific date. ## Capture Environment * **Hardware Used**: SocketCAN * **Software Used**: `candump` from the `can-utils` package on Linux. * **Vehicle/System**: 2014 Kenworth T270 * **Bus Type**: J1939 ## Usage To analyze the dataset, you can use the following tools: * **`candump`** (for live monitoring) * **`canplayer`** (to replay logs) * **`can-utils`** (`cansniffer`, `canbusload`, `canlogserver`, etc.) 
* **Python with `python-can`** (for programmatic parsing) * **Wireshark** (for visualization) ### Example Commands #### Replaying the Log File ``` canplayer -I dataset_YYYYMMDD.log ``` #### Filtering Messages by CAN ID: ``` cat dataset_YYYYMMDD.log | grep "0CF00400" ``` #### Converting Logs to CSV **Using Python:** ``` import pandas as pd log_file = "dataset_YYYYMMDD.log" data = [] with open(log_file, "r") as f: for line in f: parts = line.strip().split() if len(parts) >= 5: timestamp = parts[0].strip("()") interface = parts[1] can_id = parts[2] dlc = parts[3].strip("[]") data_bytes = " ".join(parts[4:]) data.append([timestamp, interface, can_id, dlc, data_bytes]) df = pd.DataFrame(data, columns=["Timestamp", "Interface", "CAN_ID", "DLC", "Data"]) df.to_csv("dataset.csv", index=False) ```"]}  more » « less
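Since the bus is SAE J1939, analysis usually starts by splitting each 29-bit CAN ID into its J1939 fields. The following is a minimal sketch based on the standard SAE J1939 ID layout; the helper name and sample line are illustrative, not part of the dataset tooling:

```python
# Decode a 29-bit extended CAN ID into the standard SAE J1939 fields:
# priority, parameter group number (PGN), and source address.
def decode_j1939_id(can_id: int):
    priority = (can_id >> 26) & 0x7
    pdu_format = (can_id >> 16) & 0xFF
    pdu_specific = (can_id >> 8) & 0xFF
    source_address = can_id & 0xFF
    if pdu_format < 240:
        pgn = pdu_format << 8                    # PDU1: destination-specific
    else:
        pgn = (pdu_format << 8) | pdu_specific   # PDU2: broadcast
    return priority, pgn, source_address

line = "(1623847291.123456) can0 0CF00400 [8] FF FF FF FF FF FF FF FF"
can_id = int(line.split()[2], 16)
print(decode_j1939_id(can_id))  # (3, 61444, 0): PGN 61444 is EEC1
```

Filtering the log on PGN 0 (Torque/Speed Control 1, TSC1) should isolate the injected torque-request frames described in the abstract.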
Award ID(s):
2213735
PAR ID:
10660219
Author(s) / Creator(s):
Publisher / Repository:
Dryad
Date Published:
Edition / Version:
10
Subject(s) / Keyword(s):
FOS: Engineering and technology; Vehicular Cyber Security; Electronic Logging Devices; Attack Datasets; J1939; Medium and Heavy Duty Vehicles; Commercial Vehicles; CAN Data
Format(s):
Medium: X Size: 13876725 bytes
Size(s):
13876725 bytes
Sponsoring Org:
National Science Foundation
More Like this
  1. Revision: This revision includes four independent trajectory values of the ensemble averages of the mean squared radii of gyration and their standard deviations, which can be used to compute statistical measures such as the standard error. This distribution provides access to 18,450 configurations of coarse-grained polymers. The data are provided as serialized objects created with the `pickle` Python module and in CSV format, and were compiled using Python version 3.8.

References: The specific applications and analyses of the data are described in: Jiang, S.; Webb, M.A. "Physics-Guided Neural Networks for Transferable Prediction of Polymer Properties".

Data: There are seven .pickle files that contain serialized Python objects:
- pattern_graph_data_*_*_rg_new.pickle: squared radii of gyration distributions from MD simulation; the numbers indicate the molecular-weight range.
- rg2_baseline_*_new.pickle: squared radii of gyration distributions from the Gaussian-chain theoretical prediction.
- delta_data_v0314.pickle: torch_geometric training data.

Usage: To access the data in a .pickle file, users can execute the following:

    import os
    import pickle

    # LOAD SIMULATION DATA
    DATA_DIR = "your/custom/dir/"
    mw = 40  # or 90, 190 molecular-weight ranges

    filename = os.path.join(DATA_DIR, f"pattern_graph_data_{mw}_{mw+20}_rg_new.pickle")
    with open(filename, "rb") as handle:
        graph = pickle.load(handle)
        label = pickle.load(handle)
        desc = pickle.load(handle)
        meta = pickle.load(handle)
        mode = pickle.load(handle)
        rg2_mean = pickle.load(handle)
        rg2_std = pickle.load(handle) ** 0.5  # stored as variance

    # combine asymmetric and symmetric star polymers
    label[label == 'stara'] = 'star'
    # combine bottlebrush and other comb polymers
    label[label == 'bottlebrush'] = 'comb'

    # LOAD GAUSSIAN CHAIN THEORETICAL DATA
    with open(os.path.join(DATA_DIR, f"rg2_baseline_{mw}_new.pickle"), "rb") as handle:
        rg2_mean_theo = pickle.load(handle)[:, 0]
        rg2_std_theo = pickle.load(handle)[:, 0]

graph: NetworkX graph representations of polymers.
label: Architectural classes of polymers (e.g., linear, cyclic, star, branch, comb, dendrimer).
desc: Topological descriptors (optional).
meta: Identifiers for unique architectures (optional).
mode: Identifiers for unique chemical patterns (optional).
rg2_mean: Mean squared radii of gyration from simulations.
rg2_std: Corresponding standard deviations from simulations.
rg2_mean_theo: Mean squared radii of gyration from theoretical models.
rg2_std_theo: Corresponding standard deviations from theoretical models.

Help, Suggestions, Corrections? If you need help, have suggestions, identify issues, or have corrections, please send your comments to Shengli Jiang at sj0161@princeton.edu.

GitHub: Additional data and code relevant to this study are accessible at https://github.com/webbtheosim/gcgnn
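As a worked illustration of the revision note above (turning four independent trajectory averages into a standard error), here is a minimal stdlib-only sketch; the numeric values are invented placeholders, not values from the dataset:

```python
import statistics

# Hypothetical ensemble averages of Rg^2 from four independent trajectories
# (placeholder values; the real numbers are in the .pickle/csv files).
rg2_trajs = [1.92, 2.05, 1.98, 2.01]

mean = statistics.fmean(rg2_trajs)
# standard error of the mean: sample standard deviation / sqrt(n)
sem = statistics.stdev(rg2_trajs) / len(rg2_trajs) ** 0.5
print(f"Rg^2 = {mean:.3f} +/- {sem:.3f}")  # Rg^2 = 1.990 +/- 0.027
```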
  2. {"Abstract":["This data set for the manuscript entitled "Design of Peptides that Fold and Self-Assemble on Graphite" includes all files needed to run and analyze the simulations described in the this manuscript in the molecular dynamics software NAMD, as well as the output of the simulations. The files are organized into directories corresponding to the figures of the main text and supporting information. They include molecular model structure files (NAMD psf or Amber prmtop format), force field parameter files (in CHARMM format), initial atomic coordinates (pdb format), NAMD configuration files, Colvars configuration files, NAMD log files, and NAMD output including restart files (in binary NAMD format) and trajectories in dcd format (downsampled to 10 ns per frame). Analysis is controlled by shell scripts (Bash-compatible) that call VMD Tcl scripts or python scripts. These scripts and their output are also included.<\/p>\n\nVersion: 2.0<\/p>\n\nChanges versus version 1.0 are the addition of the free energy of folding, adsorption, and pairing calculations (Sim_Figure-7) and shifting of the figure numbers to accommodate this addition.<\/p>\n\n\nConventions Used in These Files\n===============================<\/p>\n\nStructure Files\n----------------\n- graph_*.psf or sol_*.psf (original NAMD (XPLOR?) 
format psf file including atom details (type, charge, mass), as well as definitions of bonds, angles, dihedrals, and impropers for each dipeptide.)<\/p>\n\n- graph_*.pdb or sol_*.pdb (initial coordinates before equilibration)\n- repart_*.psf (same as the above psf files, but the masses of non-water hydrogen atoms have been repartitioned by VMD script repartitionMass.tcl)\n- freeTop_*.pdb (same as the above pdb files, but the carbons of the lower graphene layer have been placed at a single z value and marked for restraints in NAMD)\n- amber_*.prmtop (combined topology and parameter files for Amber force field simulations)\n- repart_amber_*.prmtop (same as the above prmtop files, but the masses of non-water hydrogen atoms have been repartitioned by ParmEd)<\/p>\n\nForce Field Parameters\n----------------------\nCHARMM format parameter files:\n- par_all36m_prot.prm (CHARMM36m FF for proteins)\n- par_all36_cgenff_no_nbfix.prm (CGenFF v4.4 for graphene) The NBFIX parameters are commented out since they are only needed for aromatic halogens and we use only the CG2R61 type for graphene.\n- toppar_water_ions_prot_cgenff.str (CHARMM water and ions with NBFIX parameters needed for protein and CGenFF included and others commented out)<\/p>\n\nTemplate NAMD Configuration Files\n---------------------------------\nThese contain the most commonly used simulation parameters. 
They are called by the other NAMD configuration files (which are in the namd/ subdirectory):\n- template_min.namd (minimization)\n- template_eq.namd (NPT equilibration with lower graphene fixed)\n- template_abf.namd (for adaptive biasing force)<\/p>\n\nMinimization\n-------------\n- namd/min_*.0.namd<\/p>\n\nEquilibration\n-------------\n- namd/eq_*.0.namd<\/p>\n\nAdaptive biasing force calculations\n-----------------------------------\n- namd/eabfZRest7_graph_chp1404.0.namd\n- namd/eabfZRest7_graph_chp1404.1.namd (continuation of eabfZRest7_graph_chp1404.0.namd)<\/p>\n\nLog Files\n---------\nFor each NAMD configuration file given in the last two sections, there is a log file with the same prefix, which gives the text output of NAMD. For instance, the output of namd/eabfZRest7_graph_chp1404.0.namd is eabfZRest7_graph_chp1404.0.log.<\/p>\n\nSimulation Output\n-----------------\nThe simulation output files (which match the names of the NAMD configuration files) are in the output/ directory. Files with the extensions .coor, .vel, and .xsc are coordinates in NAMD binary format, velocities in NAMD binary format, and extended system information (including cell size) in text format. Files with the extension .dcd give the trajectory of the atomic coorinates over time (and also include system cell information). Due to storage limitations, large DCD files have been omitted or replaced with new DCD files having the prefix stride50_ including only every 50 frames. The time between frames in these files is 50 * 50000 steps/frame * 4 fs/step = 10 ns. The system cell trajectory is also included for the NPT runs are output/eq_*.xst.<\/p>\n\nScripts\n-------\nFiles with the .sh extension can be found throughout. These usually provide the highest level control for submission of simulations and analysis. Look to these as a guide to what is happening. 
If there are scripts with step1_*.sh and step2_*.sh, they are intended to be run in order, with step1_*.sh first.<\/p>\n\n\nCONTENTS\n========<\/p>\n\nThe directory contents are as follows. The directories Sim_Figure-1 and Sim_Figure-8 include README.txt files that describe the files and naming conventions used throughout this data set.<\/p>\n\nSim_Figure-1: Simulations of N-acetylated C-amidated amino acids (Ac-X-NHMe) at the graphite\u2013water interface.<\/p>\n\nSim_Figure-2: Simulations of different peptide designs (including acyclic, disulfide cyclized, and N-to-C cyclized) at the graphite\u2013water interface.<\/p>\n\nSim_Figure-3: MM-GBSA calculations of different peptide sequences for a folded conformation and 5 misfolded/unfolded conformations.<\/p>\n\nSim_Figure-4: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite\u2013water interface at 370 K.<\/p>\n\nSim_Figure-5: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite\u2013water interface at 295 K.<\/p>\n\nSim_Figure-5_replica: Temperature replica exchange molecular dynamics simulations for the peptide cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) with 20 replicas for temperatures from 295 to 454 K.<\/p>\n\nSim_Figure-6: Simulation of the peptide molecule cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) in free solution (no graphite).<\/p>\n\nSim_Figure-7: Free energy calculations for folding, adsorption, and pairing for the peptide CHP1404 (sequence: cyc(GTGSGTG-GPGG-GCGTGTG-SGPG)). For folding, we calculate the PMF as function of RMSD by replica-exchange umbrella sampling (in the subdirectory Folding_CHP1404_Graphene/). We make the same calculation in solution, which required 3 seperate replica-exchange umbrella sampling calculations (in the subdirectory Folding_CHP1404_Solution/). Both PMF of RMSD calculations for the scrambled peptide are in Folding_scram1404/. 
For adsorption, calculation of the PMF for the orientational restraints and the calculation of the PMF along z (the distance between the graphene sheet and the center of mass of the peptide) are in Adsorption_CHP1404/ and Adsorption_scram1404/. The actual calculation of the free energy is done by a shell script ("doRestraintEnergyError.sh") in the 1_free_energy/ subsubdirectory. Processing of the PMFs must be done first in the 0_pmf/ subsubdirectory. Finally, files for free energy calculations of pair formation for CHP1404 are found in the Pair/ subdirectory.<\/p>\n\nSim_Figure-8: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) where the peptides are far above the graphene\u2013water interface in the initial configuration.<\/p>\n\nSim_Figure-9: Two replicates of a simulation of nine peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite\u2013water interface at 370 K.<\/p>\n\nSim_Figure-9_scrambled: Two replicates of a simulation of nine peptide molecules with the control sequence cyc(GGTPTTGGGGGGSGGPSGTGGC) at the graphite\u2013water interface at 370 K.<\/p>\n\nSim_Figure-10: Adaptive biasing for calculation of the free energy of the folded peptide as a function of the angle between its long axis and the zigzag directions of the underlying graphene sheet.<\/p>\n\n <\/p>"],"Other":["This material is based upon work supported by the US National Science Foundation under grant no. DMR-1945589. A majority of the computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by NSF grants CHE-1726332, CNS-1006860, EPS-1006860, and EPS-0919443. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562, through allocation BIO200030."]} 
  3. This dataset holds 1036 ternary phase diagrams and records how points on each diagram phase-separate, if they do. The data are provided as a serialized object created with the `pickle` Python module and were compiled using Python version 3.8.

References: The specific applications and analyses of the data are described in: Dhamankar, S.; Jiang, S.; Webb, M.A. "Accelerating Multicomponent Phase-Coexistence Calculations with Physics-informed Neural Networks".

Usage: To access the data in the .pickle file, users can execute the following:

    import os
    import pickle

    # LOAD SIMULATION DATA
    DATA_DIR = "your/custom/dir/"

    filename = os.path.join(DATA_DIR, "data_clean.pickle")
    with open(filename, "rb") as handle:
        (x, y_c, y_r, phase_idx, num_phase, max_phase) = pickle.load(handle)

x: Input x = (χ_AB, χ_BC, χ_AC, v_A, v_B, v_C, φ_A, φ_B) ∈ ℝ^8.
y_c: Output one-hot encoded classification vector y_c ∈ ℝ^3.
y_r: Output equilibrium composition and abundance vector y_r = (φ_A^α, φ_B^α, φ_A^β, φ_B^β, φ_A^γ, φ_B^γ, w^α, w^β, w^γ) ∈ ℝ^9.
phase_idx: A single integer indicating which unique phase system the input belongs to.
num_phase: A single integer indicating the number of equilibrium phases the input splits into.
max_phase: A single integer indicating the maximum number of equilibrium phases the system splits into.

Help, Suggestions, Corrections? If you need help, have suggestions, identify issues, or have corrections, please send your comments to Shengli Jiang at sj0161@princeton.edu.

GitHub: Additional data and code relevant to this study are accessible at https://github.com/webbtheosim/ml-ternary-phase
  4. {"Abstract":["A biodiversity dataset graph: UCSB-IZC<\/p>\n\nThe intended use of this archive is to facilitate (meta-)analysis of the UC Santa Barbara Invertebrate Zoology Collection (UCSB-IZC). UCSB-IZC is a natural history collection of invertebrate zoology at Cheadle Center of Biodiversity and Ecological Restoration, University of California Santa Barbara.<\/p>\n\nThis dataset provides versioned snapshots of the UCSB-IZC network as tracked by Preston [2,3] on 2021-10-08 using [preston track "https://api.gbif.org/v1/occurrence/search/?datasetKey=d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0"].<\/p>\n\nThis archive contains 14137 images related to 33730 occurrence/specimen records. See included sample-image.jpg and their associated meta-data sample-image.json [4].<\/p>\n\nThe archive consists of 256 individual parts (e.g., preston-00.tar.gz, preston-01.tar.gz, ...) to allow for parallel file downloads. The archive contains three types of files: index files, provenance files and data files. Only two index and provenance files are included and have been individually included in this dataset publication. Index files provide a way to links provenance files in time to establish a versioning mechanism.<\/p>\n\nTo retrieve and verify the downloaded UCSB-IZC biodiversity dataset graph, first download preston-*.tar.gz. Then, extract the archives into a "data" folder. 
Alternatively, you can use the Preston [2,3] command-line tool to "clone" this dataset using:<\/p>\n\n$$ java -jar preston.jar clone --remote https://archive.org/download/preston-ucsb-izc/data.zip/,https://zenodo.org/record/5557670/files<\/p>\n\nAfter that, verify the index of the archive by reproducing the following provenance log history:<\/p>\n\n$$ java -jar preston.jar history\n<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/d5eb492d3e0304afadcc85f968de1e23042479ad670a5819cee00f2c2c277f36> .<\/p>\n\nTo check the integrity of the extracted archive, confirm that each line produce by the command "preston verify" produces lines as shown below, with each line including "CONTENT_PRESENT_VALID_HASH". Depending on hardware capacity, this may take a while.<\/p>\n\n$ java -jar preston.jar verify\nhash://sha256/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c    file:/home/jhpoelen/ucsb-izc/data/ce/1d/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c    OK    CONTENT_PRESENT_VALID_HASH    66438    hash://sha256/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c\nhash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844    file:/home/jhpoelen/ucsb-izc/data/f6/8d/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844    OK    CONTENT_PRESENT_VALID_HASH    4093    hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844\nhash://sha256/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef    file:/home/jhpoelen/ucsb-izc/data/3e/70/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef    OK    CONTENT_PRESENT_VALID_HASH    5746    hash://sha256/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef\nhash://sha256/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b    file:/home/jhpoelen/ucsb-izc/data/99/58/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b    OK    
CONTENT_PRESENT_VALID_HASH    6147    hash://sha256/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b<\/p>\n\nNote that a copy of the java program "preston", preston.jar, is included in this publication. The program runs on java 8+ virtual machine using "java -jar preston.jar", or in short "preston".<\/p>\n\nFiles in this data publication:<\/p>\n\n--- start of file descriptions ---<\/p>\n\n-- description of archive and its contents (this file) --\nREADME<\/p>\n\n-- executable java jar containing preston [2,3] v0.3.1. --\npreston.jar<\/p>\n\n-- preston archive containing UCSB-IZC (meta-)data/image files, associated provenance logs and a provenance index --\npreston-[00-ff].tar.gz<\/p>\n\n-- individual provenance index files --\n2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a<\/p>\n\n-- example image and meta-data --\nsample-image.jpg (with hash://sha256/916ba5dc6ad37a3c16634e1a0e3d2a09969f2527bb207220e3dbdbcf4d6b810c)\nsample-image.json (with hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844)<\/p>\n\n--- end of file descriptions ---<\/p>\n\n\nReferences<\/p>\n\n[1] Cheadle Center for Biodiversity and Ecological Restoration (2021). University of California Santa Barbara Invertebrate Zoology Collection. Occurrence dataset https://doi.org/10.15468/w6hvhv accessed via GBIF.org on 2021-10-08 as indexed by the Global Biodiversity Informatics Facility (GBIF) with provenance hash://sha256/d5eb492d3e0304afadcc85f968de1e23042479ad670a5819cee00f2c2c277f36.\n[2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543 .\n[3] MJ Elliott, JH Poelen, JAB Fortes (2020). Toward Reliable Biodiversity Dataset References. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2020.101132\n[4] Cheadle Center for Biodiversity and Ecological Restoration (2021). University of California Santa Barbara Invertebrate Zoology Collection. Occurrence dataset https://doi.org/10.15468/w6hvhv accessed via GBIF.org on 2021-10-08. 
https://www.gbif.org/occurrence/3323647301 . hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844 hash://sha256/916ba5dc6ad37a3c16634e1a0e3d2a09969f2527bb207220e3dbdbcf4d6b810c<\/p>"],"Other":["This work is funded in part by grant NSF OAC 1839201 and NSF DBI 2102006 from the National Science Foundation."]} 
  5. {"Abstract":["The intended use of this archive is to facilitate meta-analysis of the Data Observation Network for Earth (DataONE, [1]). <\/p>\n\nDataONE is a distributed infrastructure that provides information about earth observation data. This dataset was derived from the DataONE network using Preston [2] between 17 October 2018 and 6 November 2018, resolving 335,213 urls at an average retrieval rate of about 5 seconds per url, or 720 files per hour, resulting in a data gzip compressed tar archive of 837.3 MB .  <\/p>\n\nThe archive associates 325,757 unique metadata urls [3] to 202,063 unique ecological metadata files [4]. Also, the DataONE search index was captured to establish provenance of how the dataset descriptors were found and acquired. During the creation of the snapshot (or crawl), 15,389 urls [5], or 4.7% of urls, did not successfully resolve. <\/p>\n\nTo facilitate discovery, the record of the Preston snapshot crawl is included in the preston-ls-* files . There files are derived from the rdf/nquad file with hash://sha256/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f . This file can also be found in the data.tar.gz at data/8c/67/e0/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f/data . For more information about concepts and format, please see [2]. 
<\/p>\n\nTo extract all EML files from the included Preston archive, first extract the hashes assocated with EML files using:<\/p>\n\ncat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\\t' '\\n' | grep "hash://" | sort | uniq > eml-hashes.txt<\/p>\n\nextract data.tar.gz using:<\/p>\n\n~/preston-archive$$ tar xzf data.tar.gz <\/p>\n\nthen use Preston to extract each hash using something like:<\/p>\n\n~/preston-archive$$ preston get hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa\n<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml_1.1" packageId="doi:10.18739/A24P9Q" system="https://arcticdata.io" scope="system" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 ~/development/eml/eml.xsd">\n  <dataset>\n    <alternateIdentifier>urn:x-wmo:md:org.aoncadis.www::d76bc3b5-7b19-11e4-8526-00c0f03d5b7c</alternateIdentifier>\n    <alternateIdentifier>d76bc3b5-7b19-11e4-8526-00c0f03d5b7c</alternateIdentifier>\n    <title>Airglow Image Data 2011 4 of 5</title>\n...<\/p>\n\nAlternatively, without using Preston, you can extract the data using the naming convention:<\/p>\n\ndata/[x]/[y]/[z]/[hash]/data<\/p>\n\nwhere x is the first 2 characters of the hash, y the second 2 characters, z the third 2 characters, and hash the full sha256 content hash of the EML file.<\/p>\n\nFor example, the hash hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa can be found in the file: data/00/00/2d/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa/data . For more information, see [2].<\/p>\n\nThe intended use of this archive is to facilitate meta-analysis of the DataONE dataset network. <\/p>\n\n[1] DataONE, https://www.dataone.org\n[2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543 . 
DataONE was crawled via Preston with "preston update -u https://dataone.org".\n[3] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\\t' '\\n' | grep -v "hash://" | sort | uniq | wc -l\n[4] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\\t' '\\n' | grep "hash://" | sort | uniq | wc -l\n[5] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep  "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\\t' '\\n' | grep -v "hash://" | sort | uniq | wc -l<\/p>\n\nThis work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.<\/p>"]} 
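The data/[x]/[y]/[z]/[hash]/data naming convention described in the record maps directly to a small helper; a minimal sketch (the function is our own illustration, assuming that layout):

```python
# Map a Preston sha256 content hash to its location in the extracted
# data/ tree, following the data/[x]/[y]/[z]/[hash]/data convention.
def hash_to_path(hash_uri: str) -> str:
    h = hash_uri.rsplit("/", 1)[-1]  # strip the "hash://sha256/" prefix
    return f"data/{h[0:2]}/{h[2:4]}/{h[4:6]}/{h}/data"

print(hash_to_path(
    "hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa"))
# data/00/00/2d/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa/data
```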