Title: Replication Data for: Multi-Point Nanoindentation Method to Determine Mechanical Anisotropy in Nanofibrillar Thin Films
Raw data from scanning electron microscopy (SEM), atomic force microscopy (AFM), force spectroscopy, optical microscopy, and finite element simulations (FEA), together with the data analysis and plotting codes, for our manuscript.

File Formats
- AFM raw data is provided in Gwyddion format, which can be viewed using the Gwyddion AFM viewer. Gwyddion is released under the GNU General Public License (GPLv3) and can be downloaded free of charge at http://gwyddion.net/
- Optical microscopy data is provided in JPEG format
- SEM raw data is provided in TIFF format
- Data analysis codes were written in MATLAB (https://www.mathworks.com/products/matlab) and stored as *.m files
- Raw data imported into MATLAB and saved MATLAB data were stored as MATLAB multidimensional arrays (MATLAB “struct” data format, *.mat files)
- FEA results were saved as text files (*.txt files)

Data (Folder Structure)
The data in the dataverse is best viewed in Tree mode.

- Read me file.docx Further explanations of the analysis (docx format)

Figure 1
- Figure 1 - panel b.jpg (5.5 MB) Optical micrograph (JPEG format)
- Figure 1 - panel c - AFM Raw Data.gwy (8.0 MB) AFM raw data (Gwyddion format)
- Figure 1 - panel e - P0_Force-curve_raw_data.txt (3 KB) Raw force-displacement data at P0 (text format)
- Figure 1 - panel e - Px_Force-curve_raw_data.txt (3 KB) Raw force-displacement data at Px (text format)
- Figure 1 - panel e - Py_Force-curve_raw_data.txt (3 KB) Raw force-displacement data at Py (text format)
- Figure 1 - panel e - P0_simulation_raw_data.txt (12 KB) FEA simulated force-distance data at P0 (text format)
- Figure 1 - panel e - Px_simulation_raw_data.txt (12 KB) FEA simulated force-distance data at Px (text format)
- Figure 1 - panel e - Py_simulation_raw_data.txt (12 KB) FEA simulated force-distance data at Py (text format)
- Figure 1 - panel e - FCfindc.m (2 KB) MATLAB code to calculate the inverse optical lever sensitivity (InverseOLS) of the AFM cantilever (MATLAB .m format)
- Figure 1 - panel e - FreqFindANoise_new.m (2 KB) MATLAB code to calculate the white noise constant, A (explained in the Read me file) (MATLAB .m format)
- Figure 1 - panel e - FreqFindQ_new.m (4 KB) MATLAB code to calculate the Q factor of the AFM cantilever (MATLAB .m format)
- Figure 1 - panel e - FCkeff.m (2 KB) MATLAB code to calculate the effective spring constant k of the AFM cantilever (MATLAB .m format)
- Figure 1 - panel e - FCimport.m (7 KB) MATLAB code to import raw force-displacement data into MATLAB (MATLAB .m format)
- Figure 1 - panel e - FCForceDist.m (2 KB) MATLAB code to convert raw force-displacement data into force-distance data (MATLAB .m format)
- Figure 1 - panel e - Figure 1- Panel e - data.mat (6 KB) MATLAB struct data file for calibrated force-distance data at all indentation points (MATLAB .mat format)
- Figure 1 - panel e - Panel_e_MatlabCode.m (6 KB) MATLAB code for plotting experimental and simulated force curves in panel e (MATLAB .m format)
- Figure 1 - panel e - Read me file - force curve calibration.docx (14 KB) Explains force curve calibration (.docx format)
- Figure 1 - panel e - Read me file - lever spring constant calibration.docx (14 KB) Explains AFM lever spring constant calibration (.docx format)

Figure 2
- Figure 2 - panel a - MATLAB data.mat (2.6 KB) MATLAB data file for simulated data (MATLAB .mat format)
- Figure 2 - panel b - MATLAB data.mat (2.4 KB) MATLAB data file for simulated data (MATLAB .mat format)
- Figure 2 - panel a - simulation raw data.txt (5.0 KB) Raw simulation data: xyz coordinates of the nodes of the deformed FEA mesh (text format)
- Figure 2 - panel b - simulation raw data.txt (5.0 KB) Raw simulation data: xyz coordinates of the nodes of the deformed FEA mesh (text format)
- Figure 2 - panel ab - MATLABcode.m (1.0 KB) MATLAB code for plotting the panel a and b figures (MATLAB .m format)
- Figure 2 - panel c - Degree of Anisotropy datacode.m (1.0 KB) MATLAB code for plotting the panel c graph (MATLAB .m format)

Figure 3
- Figure 3 - panel a - App_curve_1_raw_data.txt (35 KB) Raw force-displacement data of approach curve 1 (text format)
- Figure 3 - panel a - App_curve_2_raw_data.txt (34 KB) Raw force-displacement data of approach curve 2 (text format)
- Figure 3 - panel a - App_curve_3_raw_data.txt (34 KB) Raw force-displacement data of approach curve 3 (text format)
- Figure 3 - panel a - App_curve_4_raw_data.txt (34 KB) Raw force-displacement data of approach curve 4 (text format)
- Figure 3 - panel a - Ret_curve_1_raw_data.txt (35 KB) Raw force-displacement data of retract curve 1 (text format)
- Figure 3 - panel a - Ret_curve_2_raw_data.txt (35 KB) Raw force-displacement data of retract curve 2 (text format)
- Figure 3 - panel a - Ret_curve_3_raw_data.txt (35 KB) Raw force-displacement data of retract curve 3 (text format)
- Figure 3 - panel a - Simulation_raw_data-part 1.txt (43 KB) Simulated force-displacement data, part 1 (text format)
- Figure 3 - panel a - Simulation_raw_data-part 2.txt (43 KB) Simulated force-displacement data, part 2 (text format)
- Figure 3 - panel a - FCfindc.m (2 KB) MATLAB code to calculate the inverse optical lever sensitivity (InverseOLS) of the AFM cantilever (MATLAB .m format)
- Figure 3 - panel a - FreqFindANoise_new.m (2 KB) MATLAB code to calculate the white noise constant, A (explained in the Read me file) (MATLAB .m format)
- Figure 3 - panel a - FreqFindQ_new.m (4 KB) MATLAB code to calculate the Q factor of the AFM cantilever (MATLAB .m format)
- Figure 3 - panel a - FCkeff.m (2 KB) MATLAB code to calculate the effective spring constant k of the AFM cantilever (MATLAB .m format)
- Figure 3 - panel a - FCimport.m (7 KB) MATLAB code to import raw force-displacement data into MATLAB (MATLAB .m format)
- Figure 3 - panel a - FCForceDist.m (2 KB) MATLAB code to convert raw force-displacement data into force-distance data (MATLAB .m format)
- Figure 1 - panel e - Read me file - force curve calibration.docx (14 KB) Explains force curve calibration (.docx format)
- Figure 1 - panel e - Read me file - lever spring constant calibration.docx (14 KB) Explains AFM lever spring constant calibration (.docx format)
- Figure 3 - panel b - SEM Raw Data.tiff (9 KB) SEM raw image of a silk membrane broken by extreme indentation (.tiff format)
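
The listing names an inverse optical lever sensitivity (FCfindc.m), an effective spring constant (FCkeff.m), and a conversion from force-displacement to force-distance data (FCForceDist.m). The sketch below shows how those three pieces are commonly combined in the textbook AFM conversion. It is written in Python rather than MATLAB, it is not taken from the authors' .m files, and the function name, argument names, sign convention, and numbers are all assumptions made for illustration.

```python
def to_force_distance(z_nm, signal_v, invols_nm_per_v, k_n_per_m):
    """Convert raw force-displacement samples into force-distance samples.

    Assumed convention (not taken from FCForceDist.m):
      z_nm grows as the cantilever base moves toward the sample,
      deflection [nm] = InvOLS [nm/V] * photodiode signal [V],
      force [nN]      = k [N/m] * deflection [nm],
      distance [nm]   = z - deflection (the tip travel, which equals the
                        indentation depth once the tip is in contact).
    """
    if len(z_nm) != len(signal_v):
        raise ValueError("z and signal must have the same number of samples")
    distance_nm = []
    force_nn = []
    for z, v in zip(z_nm, signal_v):
        deflection_nm = invols_nm_per_v * v
        force_nn.append(k_n_per_m * deflection_nm)  # N/m * nm = nN
        distance_nm.append(z - deflection_nm)
    return distance_nm, force_nn


# Hypothetical values: two samples out of contact, one in contact.
z = [0.0, 50.0, 100.0]
signal = [0.0, 0.0, 0.2]
distance, force = to_force_distance(z, signal, invols_nm_per_v=50.0, k_n_per_m=0.1)
print(distance, force)
```

The column layout of the raw .txt files is not described in this record, so the sketch works on in-memory lists; the Read me files on force curve calibration remain the authoritative description of the actual procedure.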
Award ID(s):
1905902 2105158
NSF-PAR ID:
10383123
Publisher / Repository:
Harvard Dataverse
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Raw data of optical microscopy, scanning electron microscopy (SEM), atomic force microscopy (AFM), and diameter measurements of the exfoliated and self-assembled nanofibrils for our manuscript.

    File Formats
    - AFM raw data is provided in Gwyddion format, which can be viewed using the Gwyddion AFM viewer. Gwyddion is released under the GNU General Public License (GPLv3) and can be downloaded free of charge at http://gwyddion.net/
    - Optical microscopy data is provided in JPEG format
    - SEM raw data is provided in TIFF format
    - Data analysis codes were written in MATLAB (https://www.mathworks.com/products/matlab) and stored as *.m files
    - Data analysis results were stored as MATLAB multidimensional arrays (MATLAB “struct” data format, *.mat files)

    Data (Folder Structure)
    The data in the dataverse is best viewed in Tree mode.

    - ReadMe.md This description in Markdown format.

    Figure 2 - Microscopy Raw Data
    - Figure 2 - panel a.jpg (7.2 MB) Optical micrograph (JPEG format)
    - Figure 2 - panel b.jpg (6.1 MB) Optical micrograph (JPEG format)
    - Figure 2 - panel c f.tif (1.2 MB) SEM raw data (TIFF format)
    - Figure 2 - panel d.tif (1.2 MB) SEM raw data (TIFF format)
    - Figure 2 - panel e - Exfoliated Fibrils.gwy (32.0 MB) AFM raw data (Gwyddion format)

    Figure 3 - AFM Raw Data
    - Figure 3 - Panel a - Exfoliated fibrils.gwy (81.5 MB) AFM raw data (Gwyddion format)
    - Figure 3 - Panel c - Self-assembled fibrils.gwy (24.0 MB) AFM raw data (Gwyddion format)

    Figure 3 - Diameter Measurements
    Figure 3a and Figure 3c show the AFM images of exfoliated and self-assembled nanofibrils, respectively. Because the AFM tip broadens the lateral dimensions of small features such as nanofibrils, their diameters are generally overestimated in AFM images. Hence, the diameters of the nanofibrils were estimated as the full width at half maximum (FWHM) of line scans taken over the nanofibrils perpendicular to their axial direction.

    Line profiles were taken at multiple locations using Gwyddion, and the raw data were stored in MATLAB struct files (lineProfileData_Exfoliated.mat and lineProfileData_Self-Assembled.mat). These data files can be directly imported into MATLAB and will appear as “DataExf” and “DataSA” in the MATLAB workspace. For instance, “DataExf.x{i}” contains the x-axis data of the i-th line profile, and “DataExf.y{i}” contains the y-axis data of the i-th line profile. The MATLAB codes MainCode_Exf.m and MainCode_SA.m fit a Gaussian curve to each line profile and calculate the FWHM. The *.m files for the functions gaussian.m and createFit.m must be in the same folder as the file for the main code. The main code generates a figure for each line profile containing the raw line profile, the related Gaussian fit, and the FWHM. These FWHM values are taken as the diameters of the fibrils and stored in variables called “Exf_Dia” and “SA_Dia”. Finally, the code plots these values in a histogram and calculates statistics such as the mean and the standard deviation.

    Exfoliated
    - createFit.m (1.1 KB) MATLAB code file (see above)
    - gaussian.m (134 B) MATLAB code file (see above)
    - lineProfileData_Exfoliated.mat (11.7 KB) Line profiles for exfoliated nanofibrils (MATLAB struct format)
    - MainCode_Exf.m (1.8 KB) MATLAB code file (see above)
    - Line profile raw data - Exfoliated Folder with all corresponding cross-section raw data in ASCII format

    Self Assembled
    - createFit.m (1.1 KB) MATLAB code file (see above)
    - gaussian.m (134 B) MATLAB code file (see above)
    - lineProfileData_Self-Assembled.mat (9.9 KB) Line profiles for self-assembled nanofibrils (MATLAB struct format)
    - MainCode_SA.m (1.8 KB) MATLAB code file (see above)
    - Line profile raw data - SelfAssembled Folder with all corresponding cross-section raw data in ASCII format

    Figure 4 - AFM Raw Data
    - Figure 4 - Panal a.gwy (73.4 MB) AFM raw data (Gwyddion format)
    - Figure 4 - Panel e.gwy (42.0 MB) AFM raw data (Gwyddion format)
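
    The diameter estimate described above for Figure 3 (fit a Gaussian to each line profile, then report its FWHM) can be illustrated with a short sketch. This is not the authors' createFit.m or MainCode_*.m: it is a Python example that assumes a baseline-subtracted, strictly positive profile and swaps in a log-parabola least-squares fit, for which the quadratic coefficient c gives sigma = sqrt(-1/(2c)) and FWHM = 2*sqrt(2 ln 2)*sigma. The function name and the synthetic profile are hypothetical.

```python
import math


def _det3(q):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (q[0][0] * (q[1][1] * q[2][2] - q[1][2] * q[2][1])
            - q[0][1] * (q[1][0] * q[2][2] - q[1][2] * q[2][0])
            + q[0][2] * (q[1][0] * q[2][1] - q[1][1] * q[2][0]))


def gaussian_fwhm(x, y):
    """Estimate the FWHM of a Gaussian-shaped profile.

    Fits ln(y) = a + b*x + c*x**2 by least squares. For a Gaussian with no
    baseline offset, c = -1/(2*sigma**2). Points with y <= 0 are skipped.
    """
    pts = [(xi, math.log(yi)) for xi, yi in zip(x, y) if yi > 0]
    if len(pts) < 3:
        raise ValueError("need at least three positive samples")
    # Centre x so the normal equations stay well conditioned; a shift in x
    # does not change the quadratic coefficient c.
    x0 = sum(p[0] for p in pts) / len(pts)
    s = [sum((xi - x0) ** k for xi, _ in pts) for k in range(5)]
    t = [sum((xi - x0) ** k * ly for xi, ly in pts) for k in range(3)]
    m = [[s[0], s[1], s[2]],
         [s[1], s[2], s[3]],
         [s[2], s[3], s[4]]]
    # Cramer's rule for the third unknown (c): replace the third column by t.
    mc = [[m[r][0], m[r][1], t[r]] for r in range(3)]
    c = _det3(mc) / _det3(m)
    if c >= 0:
        raise ValueError("profile is not peak-shaped")
    sigma = math.sqrt(-1.0 / (2.0 * c))
    return 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma


# Synthetic, noise-free profile with sigma = 2 (arbitrary units).
xs = [0.5 * i for i in range(41)]
ys = [3.0 * math.exp(-((xi - 10.0) ** 2) / (2.0 * 2.0 ** 2)) for xi in xs]
print(gaussian_fwhm(xs, ys))
```

    An unweighted fit in log space gives the noisy tails of a measured profile as much weight as the peak, so for real line scans a nonlinear fit such as the one in the dataset's MATLAB code is the safer choice.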
  2. The raw data for the associated manuscript is organized here into three categories: 1) data relating to the measurement and analysis of the native recluse spider's loop junctions, 2) raw images found in the figures throughout the manuscript, and 3) data relating to the experiments testing the effect that junction angle has on the strength of two intersecting tapes. It is recommended to browse the data files in Tree mode, which will make the files appear in folders reflecting this organization.

    1) Loxosceles Loop Junction Images and Analysis
    The folder titled SEM Raw Images has all of the scanning electron microscopy (SEM) images taken of the native recluse loop junctions. Some images are close-ups of individual junctions and others take a broader perspective (macro) of many loop junctions in series. Where possible, several close-up images of the individual junctions are accompanied by a macro image. These images were imported into ImageJ, where the junction angle was measured. The measurements for all 41 loop junctions observed are in the folder titled Raw Data Files, in the file titled Loxosceles Loop Junction Angle Measurements.txt. In addition to the angle measurements, the folder titled Raw Data Files contains the raw data for analyzing the strength of individual loop junctions. The data is in native MATLAB data format. These datasets include the complete tensile data and the cross-sectional area data for each spider's silk. The MATLAB code titled Figure_2A_2B_code processes the raw tensile data from the natural recluse spider's loop junctions. This data is plotted as two representative curves in Figure 2A and as a complete set as a histogram in Figure 2B. The MATLAB code titled Figure_7_code processes and plots the loop junction data found in Loxosceles Loop Junction Angle Measurements.txt and executes the model of a random set of recluse loops. This code can be executed to generate Figure 7. The folder titled Raw Data Files must be open in MATLAB to run this code. This code uses the MATLAB function areacalculation to calculate the junction area for a given junction angle.

    2) Raw Images
    This folder is organized by the respective figure in the manuscript where each image can be found. Additional metadata accompanies each image.

    3) Tensile Data and Analysis
    This folder contains the raw tensile data for all tape-tape junction experiments conducted. All of the tensile data is in the folder titled Raw Data Test Files. Within this folder is a .txt file for each sample tested. The file names are critical to the figure codes working properly because they contain the information for the junction angle and iterations. The file names are in the format year-month-day_trialnumber_junctionangle.txt. Also in the Raw Data Test Files folder are two functions used within some of the figure codes: fbfill and areacalculation. The figure codes use these functions to analyze the data. To generate any figure using the MATLAB code in this folder, first open the code in MATLAB. Then, within MATLAB, open the folder Raw Data Test Files. Only with this folder open in MATLAB will the code be able to find the correct raw data .txt files. The rest of the contents of this folder are MATLAB codes for specific figures in the manuscript. The only exception is the code titled surfaceenergy_code, which is executed to calculate the phenomenological surface energy for the tapes used in these experiments.
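
    Because the figure codes read the junction angle and trial number from each file name, the naming scheme year-month-day_trialnumber_junctionangle.txt can be unpacked as in the sketch below. It is a Python illustration, not the dataset's MATLAB code; the function name, the regular expression, and the example file name are hypothetical, and the trial number and angle are kept as strings because the record does not state how they are written.

```python
import re

# year-month-day_trialnumber_junctionangle.txt
_NAME = re.compile(r"^(\d+)-(\d+)-(\d+)_([^_]+)_([^_]+)\.txt$")


def parse_tensile_name(filename):
    """Split a tensile-test file name into its date, trial, and angle fields."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError("unexpected file name: " + filename)
    year, month, day, trial, angle = match.groups()
    return {
        "year": int(year),
        "month": int(month),
        "day": int(day),
        "trial": trial,           # kept as text; exact format is not documented
        "junction_angle": angle,  # kept as text for the same reason
    }


# Hypothetical file name that follows the documented pattern.
print(parse_tensile_name("2020-01-15_3_45.txt"))
```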
  3. This dataset contains raw data, processed data, and the codes used for data processing in our manuscript from our Fourier-transform infrared (FTIR) spectroscopy, nuclear magnetic resonance (NMR), Raman spectroscopy, and X-ray diffraction (XRD) experiments. The data and codes for the fits of our unpolarized Raman spectra to polypeptide spectra are also included. The following explains the folder structure of the data provided in this dataset, which is also explained in the file ReadMe.txt. Browsing the data in Tree view is recommended.

    Folder contents

    Codes
    Raman Data Processing: The MATLAB script file RamanDecomposition.m contains the code to decompose the sub-peaks across different polarized Raman spectra (XX, XZ, ZX, ZZ, and YY), considering a set of pre-determined restrictions. The helper functions used in RamanDecomposition.m are included in the Helpers folder. RamanDecomposition.pdf is a PDF printout of the MATLAB code and output.
    P Value Simulation:
    - 31_helix.ipynb and a_helix.ipynb: These two Jupyter Notebook files contain the intrinsic P value simulation for the 31-helix and alpha-helix structures. The simulation results were used to prepare Supplementary Table 4. See the comments in the files for more details.
    - Vector.py, Atom.py, Amino.py, and Helpers.py: These Python files contain the class definitions used in 31_helix.ipynb and a_helix.ipynb. See the comments in the files for more details.

    FTIR
    - FTIR Raw Transmission.opj: This Origin data file contains the raw transmission data measured on a single silk strand and used for FTIR spectra analysis.
    - FTIR Deconvoluted Oscillators.opj: This Origin data file was generated from the data contained in the previous file using W-VASE software from J. A. Woollam, Inc.
    - FTIR Unpolarized MultiStrand Raw Transmission.opj: This Origin data file contains the raw transmission data measured on multiple silk strands.
    The datasets contained in the first two files above were used to plot Figure 2a-b, the FTIR data points in Figure 4a, and Supplementary Figure 6. The datasets contained in the third file above were used to plot Supplementary Figure 3a.

    NMR
    Raw data files of the 13C MAS NMR spectra:
    - ascii-spec_CP.txt: cross-polarized spectrum
    - ascii-spec_DP.txt: direct-polarized spectrum
    Data is in ASCII format (comma separated values) using the following columns: data point number, intensity, frequency [Hz], frequency [ppm].

    Polypeptide Spectrum Fits
    MATLAB scripts (.m files) and Helpers: The MATLAB script files Raman_Fitting_Process_Part_1.m and Raman_Fitting_Process_Part_2.m contain the step-by-step instructions to perform the fitting process of our calculated unpolarized Raman spectrum, using digitized model polypeptide Raman spectra. The Helper folder contains two helper functions used by the above scripts. See the scripts for further instruction and information.
    Data:
    - aPA.csv, bPA.csv, GlyI.csv, GlyII.csv: These csv files contain the digitized Raman spectra of poly-alanine, beta-alanine, poly-glycine-I, and poly-glycine-II.
    - Raman_Exp_Data.mat: This MATLAB data file contains the processed, polarized Raman spectra obtained from our experiments. The variable freq holds the wavenumber information of each collected spectrum. The variables xx, yy, zz, xz, and zx hold the polarized Raman spectra collected. These variables are used to calculate the unpolarized Raman spectrum in Raman_Fitting_Process_Part_2.m. See the scripts for further instruction and information.

    Raman
    - Raman Raw Data.mat: This MATLAB data file contains all the raw data used for Raman spectra analysis. All variables are of MATLAB structure data type. Each variable has fields called Freq and Raw, where Freq contains the wavenumber information of the measured spectra and Raw contains 5 measured Raman signal strengths. The variables XX, XZ, ZX, ZZ, and YY were used for plotting and sub-peak analysis for Figure 2c-d, the Raman data points in Figure 4a, Figure 5b, Supplementary Figure 2, and Supplementary Figure 7. The variable WideRange was used to plot and identify the peaks for Supplementary Figure 3b.

    X-Ray
    - X-Ray.mat: This MATLAB data file contains the raw X-ray data used for the diffraction analysis in Supplementary Figure 5.
  4. This data set for the manuscript entitled "Design of Peptides that Fold and Self-Assemble on Graphite" includes all files needed to run and analyze the simulations described in this manuscript in the molecular dynamics software NAMD, as well as the output of the simulations. The files are organized into directories corresponding to the figures of the main text and supporting information. They include molecular model structure files (NAMD psf or Amber prmtop format), force field parameter files (in CHARMM format), initial atomic coordinates (pdb format), NAMD configuration files, Colvars configuration files, NAMD log files, and NAMD output including restart files (in binary NAMD format) and trajectories in dcd format (downsampled to 10 ns per frame). Analysis is controlled by shell scripts (Bash-compatible) that call VMD Tcl scripts or Python scripts. These scripts and their output are also included.

    Version: 2.0

    Changes versus version 1.0 are the addition of the free energy of folding, adsorption, and pairing calculations (Sim_Figure-7) and shifting of the figure numbers to accommodate this addition.


    Conventions Used in These Files
    ===============================

    Structure Files
    ----------------
    - graph_*.psf or sol_*.psf (original NAMD (XPLOR?) format psf file including atom details (type, charge, mass), as well as definitions of bonds, angles, dihedrals, and impropers for each dipeptide.)

    - graph_*.pdb or sol_*.pdb (initial coordinates before equilibration)
    - repart_*.psf (same as the above psf files, but the masses of non-water hydrogen atoms have been repartitioned by VMD script repartitionMass.tcl)
    - freeTop_*.pdb (same as the above pdb files, but the carbons of the lower graphene layer have been placed at a single z value and marked for restraints in NAMD)
    - amber_*.prmtop (combined topology and parameter files for Amber force field simulations)
    - repart_amber_*.prmtop (same as the above prmtop files, but the masses of non-water hydrogen atoms have been repartitioned by ParmEd)

    Force Field Parameters
    ----------------------
    CHARMM format parameter files:
    - par_all36m_prot.prm (CHARMM36m FF for proteins)
    - par_all36_cgenff_no_nbfix.prm (CGenFF v4.4 for graphene) The NBFIX parameters are commented out since they are only needed for aromatic halogens and we use only the CG2R61 type for graphene.
    - toppar_water_ions_prot_cgenff.str (CHARMM water and ions with NBFIX parameters needed for protein and CGenFF included and others commented out)

    Template NAMD Configuration Files
    ---------------------------------
    These contain the most commonly used simulation parameters. They are called by the other NAMD configuration files (which are in the namd/ subdirectory):
    - template_min.namd (minimization)
    - template_eq.namd (NPT equilibration with lower graphene fixed)
    - template_abf.namd (for adaptive biasing force)

    Minimization
    -------------
    - namd/min_*.0.namd

    Equilibration
    -------------
    - namd/eq_*.0.namd

    Adaptive biasing force calculations
    -----------------------------------
    - namd/eabfZRest7_graph_chp1404.0.namd
    - namd/eabfZRest7_graph_chp1404.1.namd (continuation of eabfZRest7_graph_chp1404.0.namd)

    Log Files
    ---------
    For each NAMD configuration file given in the last two sections, there is a log file with the same prefix, which gives the text output of NAMD. For instance, the output of namd/eabfZRest7_graph_chp1404.0.namd is eabfZRest7_graph_chp1404.0.log.

    Simulation Output
    -----------------
    The simulation output files (which match the names of the NAMD configuration files) are in the output/ directory. Files with the extensions .coor, .vel, and .xsc are coordinates in NAMD binary format, velocities in NAMD binary format, and extended system information (including cell size) in text format. Files with the extension .dcd give the trajectory of the atomic coordinates over time (and also include system cell information). Due to storage limitations, large DCD files have been omitted or replaced with new DCD files having the prefix stride50_, which include only every 50th frame. The time between frames in these files is 50 * 50000 steps/frame * 4 fs/step = 10 ns. The system cell trajectory for the NPT runs is also included as output/eq_*.xst.
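
    The frame spacing quoted above follows from a unit conversion that is easy to check. The Python sketch below reproduces it; the function and argument names are hypothetical, and the three input values are the ones stated in this section.

```python
FS_PER_NS = 1_000_000  # femtoseconds in one nanosecond


def frame_interval_ns(stride, steps_per_frame, timestep_fs):
    """Time between saved frames after keeping every `stride`-th frame."""
    return stride * steps_per_frame * timestep_fs / FS_PER_NS


# stride50_ files: every 50th frame, 50000 steps per frame, 4 fs per step.
print(frame_interval_ns(50, 50000, 4))  # prints 10.0
```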

    Scripts
    -------
    Files with the .sh extension can be found throughout. These usually provide the highest level control for submission of simulations and analysis. Look to these as a guide to what is happening. If there are scripts with step1_*.sh and step2_*.sh, they are intended to be run in order, with step1_*.sh first.


    CONTENTS
    ========

    The directory contents are as follows. The directories Sim_Figure-1 and Sim_Figure-8 include README.txt files that describe the files and naming conventions used throughout this data set.

    Sim_Figure-1: Simulations of N-acetylated C-amidated amino acids (Ac-X-NHMe) at the graphite–water interface.

    Sim_Figure-2: Simulations of different peptide designs (including acyclic, disulfide cyclized, and N-to-C cyclized) at the graphite–water interface.

    Sim_Figure-3: MM-GBSA calculations of different peptide sequences for a folded conformation and 5 misfolded/unfolded conformations.

    Sim_Figure-4: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite–water interface at 370 K.

    Sim_Figure-5: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite–water interface at 295 K.

    Sim_Figure-5_replica: Temperature replica exchange molecular dynamics simulations for the peptide cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) with 20 replicas for temperatures from 295 to 454 K.

    Sim_Figure-6: Simulation of the peptide molecule cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) in free solution (no graphite).

    Sim_Figure-7: Free energy calculations for folding, adsorption, and pairing for the peptide CHP1404 (sequence: cyc(GTGSGTG-GPGG-GCGTGTG-SGPG)). For folding, we calculate the PMF as a function of RMSD by replica-exchange umbrella sampling (in the subdirectory Folding_CHP1404_Graphene/). We make the same calculation in solution, which required 3 separate replica-exchange umbrella sampling calculations (in the subdirectory Folding_CHP1404_Solution/). Both PMF of RMSD calculations for the scrambled peptide are in Folding_scram1404/. For adsorption, calculation of the PMF for the orientational restraints and the calculation of the PMF along z (the distance between the graphene sheet and the center of mass of the peptide) are in Adsorption_CHP1404/ and Adsorption_scram1404/. The actual calculation of the free energy is done by a shell script ("doRestraintEnergyError.sh") in the 1_free_energy/ subsubdirectory. Processing of the PMFs must be done first in the 0_pmf/ subsubdirectory. Finally, files for free energy calculations of pair formation for CHP1404 are found in the Pair/ subdirectory.

    Sim_Figure-8: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) where the peptides are far above the graphene–water interface in the initial configuration.

    Sim_Figure-9: Two replicates of a simulation of nine peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite–water interface at 370 K.

    Sim_Figure-9_scrambled: Two replicates of a simulation of nine peptide molecules with the control sequence cyc(GGTPTTGGGGGGSGGPSGTGGC) at the graphite–water interface at 370 K.

    Sim_Figure-10: Adaptive biasing for calculation of the free energy of the folded peptide as a function of the angle between its long axis and the zigzag directions of the underlying graphene sheet.

     

    This material is based upon work supported by the US National Science Foundation under grant no. DMR-1945589. A majority of the computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by NSF grants CHE-1726332, CNS-1006860, EPS-1006860, and EPS-0919443. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562, through allocation BIO200030. 
  5. Obeid, Iyad ; Picone, Joseph ; Selesnick, Ivan (Ed.)
    The Neural Engineering Data Consortium (NEDC) is developing a large open source database of high-resolution digital pathology images known as the Temple University Digital Pathology Corpus (TUDP) [1]. Our long-term goal is to release one million images. We expect to release the first 100,000 image corpus by December 2020. The data is being acquired at the Department of Pathology at Temple University Hospital (TUH) using a Leica Biosystems Aperio AT2 scanner [2] and consists entirely of clinical pathology images. More information about the data and the project can be found in Shawki et al. [3]. We currently have a National Science Foundation (NSF) planning grant [4] to explore how best the community can leverage this resource. One goal of this poster presentation is to stimulate community-wide discussions about this project and determine how this valuable resource can best meet the needs of the public. The computing infrastructure required to support this database is extensive [5] and includes two HIPAA-secure computer networks, dual petabyte file servers, and Aperio’s eSlide Manager (eSM) software [6]. We currently have digitized over 50,000 slides from 2,846 patients and 2,942 clinical cases. There is an average of 12.4 slides per patient and 10.5 slides per case with one report per case. 
The data is organized by tissue type as shown below:

Filenames:
- tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_0a001_00123456_lvl0001_s000.svs
- tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_00123456.docx

Explanation:
- tudp: root directory of the corpus
- v1.0.0: version number of the release
- svs: the image data type
- gastro: the type of tissue
- 000001: six-digit sequence number used to control directory complexity
- 00123456: 8-digit patient MRN
- 2015_03_05: the date the specimen was captured
- 0s15_12345: the clinical case name
- 0s15_12345_0a001_00123456_lvl0001_s000.svs: the actual image filename consisting of a repeat of the case name, a site code (e.g., 0a001), the type and depth of the cut (e.g., lvl0001) and a token number (e.g., s000)
- 0s15_12345_00123456.docx: the filename for the corresponding case report

We currently recognize fifteen tissue types in the first installment of the corpus. The raw image data is stored in Aperio’s “.svs” format, which is a multi-layered compressed JPEG format [3,7]. Pathology reports containing a summary of how a pathologist interpreted the slide are also provided in a flat text file format. A more complete summary of the demographics of this pilot corpus will be presented at the conference. Another goal of this poster presentation is to share our experiences with the larger community since many of these details have not been adequately documented in scientific publications. There are quite a few obstacles in collecting this data that have slowed down the process and need to be discussed publicly. Our backlog of slides dates back to 1997, meaning there are a lot that need to be sifted through and discarded for peeling or cracking. Additionally, during scanning a slide can get stuck, stalling a scan session for hours, resulting in a significant loss of productivity.
Over the past two years, we have accumulated significant experience with how to scan a diverse inventory of slides using the Aperio AT2 high-volume scanner. We have been working closely with the vendor to resolve many problems associated with the use of this scanner for research purposes. This scanning project began in January of 2018 when the scanner was first installed. The scanning process was slow at first since there was a learning curve with how the scanner worked and how to obtain samples from the hospital. From its start date until May of 2019, ~20,000 slides were scanned. In the past 6 months, from May to November, we have tripled that number and now hold ~60,000 slides in our database. This dramatic increase in productivity was due to additional undergraduate staff members and an emphasis on efficient workflow. The Aperio AT2 scans 400 slides a day, requiring at least eight hours of scan time. The efficiency of these scans can vary greatly. When our team first started, approximately 5% of slides failed the scanning process due to focal point errors. We have been able to reduce that to 1% through a variety of means: (1) best practices regarding daily and monthly recalibrations, (2) tweaking the software such as the tissue finder parameter settings, and (3) experience with how to clean and prep slides so they scan properly. Nevertheless, this is not a completely automated process, making it very difficult to reach our production targets. With a staff of three undergraduate workers spending a total of 30 hours per week, we find it difficult to scan more than 2,000 slides per week using a single scanner (400 slides per night x 5 nights per week). The main limitation in achieving this level of production is the lack of a completely automated scanning process: it takes a couple of hours to sort, clean and load slides. We have streamlined all other aspects of the workflow required to database the scanned slides so that there are no additional bottlenecks.
To bridge the gap between hospital operations and research, we are using Aperio’s eSM software. Our goal is to provide pathologists access to high quality digital images of their patients’ slides. eSM is a secure website that holds the images with their metadata labels, patient report, and path to where the image is located on our file server. Although eSM includes significant infrastructure to import slides into the database using barcodes, TUH does not currently support barcode use. Therefore, we manage the data using a mixture of Python scripts and manual import functions available in eSM. The database and associated tools are based on proprietary formats developed by Aperio, making this another important point of community-wide discussion on how best to disseminate such information. Our near-term goal for the TUDP Corpus is to release 100,000 slides by December 2020. We hope to continue data collection over the next decade until we reach one million slides. We are creating two pilot corpora using the first 50,000 slides we have collected. The first corpus consists of 500 slides with a marker stain and another 500 without it. This set was designed to let people debug their basic deep learning processing flow on these high-resolution images. We discuss our preliminary experiments on this corpus and the challenges in processing these high-resolution images using deep learning in [3]. We are able to achieve a mean sensitivity of 99.0% for slides with pen marks, and 98.9% for slides without marks, using a multistage deep learning algorithm. While this dataset was very useful in initial debugging, we are in the midst of creating a new, more challenging pilot corpus using actual tissue samples annotated by experts. The task will be to detect ductal carcinoma in situ (DCIS) or invasive breast cancer tissue. There will be approximately 1,000 images per class in this corpus.
Based on the number of features annotated, we can train on a two-class problem of DCIS or benign, or increase the difficulty by adding classes such as stroma, pink tissue, and non-neoplastic tissue. Those interested in the corpus or in participating in community-wide discussions should join our listserv, nedc_tuh_dpath@googlegroups.com, to be kept informed of the latest developments in this project. You can learn more from our project website: https://www.isip.piconepress.com/projects/nsf_dpath.
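
The corpus path layout explained earlier in this abstract (root, version, data type, tissue, sequence number, MRN, date, case name, image file name) lends itself to a small parser. The Python sketch below is only an illustration built from the single example path given in the text; it is not one of the project's scripts, and the function name and dictionary keys are hypothetical.

```python
def parse_tudp_image_path(path):
    """Split a TUDP image path into the fields named in the abstract."""
    parts = path.split("/")
    if len(parts) != 9:
        raise ValueError("expected 9 path components, got %d" % len(parts))
    root, version, data_type, tissue, sequence, mrn, day, case, filename = parts
    stem, _, extension = filename.rpartition(".")
    # The image name repeats the case name, which itself contains "_",
    # so strip it as a prefix before splitting the remaining tokens.
    if not stem.startswith(case + "_"):
        raise ValueError("file name does not start with the case name")
    tokens = stem[len(case) + 1:].split("_")
    if len(tokens) != 4:
        raise ValueError("expected site, MRN, cut, and token fields")
    site, file_mrn, cut, token = tokens
    if file_mrn != mrn:
        raise ValueError("MRN in file name differs from MRN directory")
    return {
        "root": root, "version": version, "data_type": data_type,
        "tissue": tissue, "sequence": sequence, "mrn": mrn,
        "date": day, "case": case, "site": site, "cut": cut,
        "token": token, "extension": extension,
    }


example = ("tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/"
           "0s15_12345/0s15_12345_0a001_00123456_lvl0001_s000.svs")
print(parse_tudp_image_path(example))
```

The case report file (0s15_12345_00123456.docx) has fewer name fields, so this sketch rejects it rather than guessing at its structure.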