<p>This item contains <strong>version 5.0</strong> of the Madidi Project's full dataset. The zip file contains (1) raw data, which was downloaded from Tropicos (www.tropicos.org) on August 18, 2020; (2)
R scripts used to modify, correct, and clean the raw data; (3) clean data that are the output of the R scripts, and which are the point of departure for most uses of the Madidi Dataset; (4) post-cleaning scripts that obtain additional but non-essential information from the clean data (e.g. by extracting environmental data from rasters); and (5) a miscellaneous collection of additional non-essential information and figures. This item also includes the <strong>Data Use Policy</strong> for this dataset.</p> <p>The core dataset of the Madidi Project consists of a network of ~500 forest plots distributed in and around the Madidi National Park in Bolivia. This network contains 50 permanently marked large plots (1-ha), as well as >450 temporary small plots (0.1-ha). Within the large plots, all woody individuals with a dbh ≥10 cm have been mapped, tagged, measured, and identified. Some of these plots have also been re-visited, and information on mortality, recruitment, and growth exists. Within the small plots, all woody individuals with a dbh ≥2.5 cm have been measured and identified. Each plot has some edaphic and topographic information, and some large plots have information on various plant functional traits.</p> <p>The Madidi Project is a collaborative research effort to document and study plant biodiversity in the Amazonia and Tropical Andes of northwestern Bolivia. The project is currently led by the Missouri Botanical Garden (MBG), in collaboration with the Herbario Nacional de Bolivia. The management of the project is at MBG, where J. Sebastian Tello (sebastian.tello@mobot.org) is the scientific director. The director oversees the activities of a research team based in Bolivia. MBG works in collaboration with other data contributors (currently: Manuel J.
Macía [manuel.macia@uam.es], Gabriel Arellano [gabriel.arellano.torres@gmail.com] and Beatriz Nieto [sonneratia@gmail.com]), with a representative from the Herbario Nacional de Bolivia (LPB; Carla Maldonado [carla.maldonado1@gmail.com]), as well as with other closely associated researchers from various institutions. For more information regarding the organization and objectives of the Madidi Project, you can visit the project’s website (<strong>www.madidiproject.weebly.com</strong>).</p>
Other
The Madidi project has been supported by generous grants from the National Science Foundation (DEB 0101775, DEB 0743457, DEB 1836353), and the National Geographic Society (NGS 7754-04 and NGS 8047-06). Additional financial support for the Madidi Project has been provided by the Missouri Botanical Garden, the Comunidad de Madrid (Spain), the Universidad Autónoma de Madrid, and the Taylor and Davidson families.
Hunt, I.; Husain, S.; Simon, J.; Obeid, I.; Picone, J. (IEEE Signal Processing in Medicine and Biology Symposium (SPMB)). Obeid, Iyad; Picone, Joseph; Selesnick, Ivan (Eds.)
The Neural Engineering Data Consortium (NEDC) is developing a large open source database of high-resolution digital pathology images known as the Temple University Digital Pathology Corpus (TUDP) [1]. Our long-term goal is to release one million images. We expect to release the first 100,000 image corpus by December 2020. The data is being acquired at the Department of Pathology at Temple University Hospital (TUH) using a Leica Biosystems Aperio AT2 scanner [2] and consists entirely of clinical pathology images. More information about the data and the project can be found in Shawki et al. [3]. We currently have a National Science Foundation (NSF) planning grant [4] to explore how best the community can leverage this resource. One goal of this poster presentation is to stimulate community-wide discussions about this project and determine how this valuable resource can best meet the needs of the public. The computing infrastructure required to support this database is extensive [5] and includes two HIPAA-secure computer networks, dual petabyte file servers, and Aperio’s eSlide Manager (eSM) software [6]. We currently have digitized over 50,000 slides from 2,846 patients and 2,942 clinical cases. There is an average of 12.4 slides per patient and 10.5 slides per case, with one report per case. 
The data is organized by tissue type as shown below: Filenames: tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_0a001_00123456_lvl0001_s000.svs tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_00123456.docx Explanation: tudp: root directory of the corpus v1.0.0: version number of the release svs: the image data type gastro: the type of tissue 000001: six-digit sequence number used to control directory complexity 00123456: 8-digit patient MRN 2015_03_05: the date the specimen was captured 0s15_12345: the clinical case name 0s15_12345_0a001_00123456_lvl0001_s000.svs: the actual image filename consisting of a repeat of the case name, a site code (e.g., 0a001), the type and depth of the cut (e.g., lvl0001) and a token number (e.g., s000) 0s15_12345_00123456.docx: the filename for the corresponding case report We currently recognize fifteen tissue types in the first installment of the corpus. The raw image data is stored in Aperio’s “.svs” format, which is a multi-layered compressed JPEG format [3,7]. Pathology reports containing a summary of how a pathologist interpreted the slide are also provided in a flat text file format. A more complete summary of the demographics of this pilot corpus will be presented at the conference. Another goal of this poster presentation is to share our experiences with the larger community since many of these details have not been adequately documented in scientific publications. There are quite a few obstacles in collecting this data that have slowed down the process and need to be discussed publicly. Our backlog of slides dates back to 1997, meaning there are a lot that need to be sifted through and discarded for peeling or cracking. Additionally, during scanning a slide can get stuck, stalling a scan session for hours, resulting in a significant loss of productivity. 
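The filename convention above can be unpacked mechanically. A minimal sketch in Python (the regex and helper name are illustrative, not part of the TUDP release):

```python
import re

# Hypothetical helper that unpacks the TUDP image filename convention
# described above: <case>_<site>_<MRN>_<level>_<token>.svs
FILENAME_RE = re.compile(
    r"(?P<case>[0-9a-z]+_\d+)_"    # clinical case name, e.g. 0s15_12345
    r"(?P<site>0[a-z]\d{3})_"      # site code, e.g. 0a001
    r"(?P<mrn>\d{8})_"             # 8-digit patient MRN
    r"(?P<level>lvl\d{4})_"        # type and depth of the cut
    r"(?P<token>s\d{3})\.svs$"     # token number
)

def parse_tudp_filename(name: str) -> dict:
    """Split a TUDP .svs filename into its named components."""
    m = FILENAME_RE.search(name)
    if m is None:
        raise ValueError(f"not a TUDP image filename: {name}")
    return m.groupdict()

print(parse_tudp_filename("0s15_12345_0a001_00123456_lvl0001_s000.svs"))
```

The same components (patient MRN, case name) also appear in the directory path, so a parser like this can cross-check that a file sits in the directory it claims.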
Over the past two years, we have accumulated significant experience with how to scan a diverse inventory of slides using the Aperio AT2 high-volume scanner. We have been working closely with the vendor to resolve many problems associated with the use of this scanner for research purposes. This scanning project began in January of 2018 when the scanner was first installed. The scanning process was slow at first since there was a learning curve with how the scanner worked and how to obtain samples from the hospital. From its start date until May of 2019, ~20,000 slides were scanned. In the past 6 months from May to November we have tripled that number and now hold ~60,000 slides in our database. This dramatic increase in productivity was due to additional undergraduate staff members and an emphasis on efficient workflow. The Aperio AT2 scans 400 slides a day, requiring at least eight hours of scan time. The efficiency of these scans can vary greatly. When our team first started, approximately 5% of slides failed the scanning process due to focal point errors. We have been able to reduce that to 1% through a variety of means: (1) best practices regarding daily and monthly recalibrations, (2) tweaking the software such as the tissue finder parameter settings, and (3) experience with how to clean and prep slides so they scan properly. Nevertheless, this is not a completely automated process, making it very difficult to reach our production targets. With a staff of three undergraduate workers spending a total of 30 hours per week, we find it difficult to scan more than 2,000 slides per week using a single scanner (400 slides per night x 5 nights per week). The main limitation in achieving this level of production is the lack of a completely automated scanning process: it takes a couple of hours to sort, clean, and load slides. We have streamlined all other aspects of the workflow required to database the scanned slides so that there are no additional bottlenecks. 
To bridge the gap between hospital operations and research, we are using Aperio’s eSM software. Our goal is to provide pathologists access to high quality digital images of their patients’ slides. eSM is a secure website that holds the images with their metadata labels, patient report, and path to where the image is located on our file server. Although eSM includes significant infrastructure to import slides into the database using barcodes, TUH does not currently support barcode use. Therefore, we manage the data using a mixture of Python scripts and manual import functions available in eSM. The database and associated tools are based on proprietary formats developed by Aperio, making this another important point of community-wide discussion on how best to disseminate such information. Our near-term goal for the TUDP Corpus is to release 100,000 slides by December 2020. We hope to continue data collection over the next decade until we reach one million slides. We are creating two pilot corpora using the first 50,000 slides we have collected. The first corpus consists of 500 slides with a marker stain and another 500 without it. This set was designed to let people debug their basic deep learning processing flow on these high-resolution images. We discuss our preliminary experiments on this corpus and the challenges in processing these high-resolution images using deep learning in [3]. We are able to achieve a mean sensitivity of 99.0% for slides with pen marks, and 98.9% for slides without marks, using a multistage deep learning algorithm. While this dataset was very useful in initial debugging, we are in the midst of creating a new, more challenging pilot corpus using actual tissue samples annotated by experts. The task will be to detect ductal carcinoma in situ (DCIS) or invasive breast cancer tissue. There will be approximately 1,000 images per class in this corpus. 
Based on the number of features annotated, we can train on a two-class problem of DCIS or benign, or increase the difficulty by increasing the classes to include DCIS, benign, stroma, pink tissue, non-neoplastic, etc. Those interested in the corpus or in participating in community-wide discussions should join our listserv, nedc_tuh_dpath@googlegroups.com, to be kept informed of the latest developments in this project. You can learn more from our project website: https://www.isip.piconepress.com/projects/nsf_dpath.
This dataset incorporates Mexico City related essential data files associated with Beth Tellman's dissertation: Mapping and Modeling Illicit and Clandestine Drivers of Land Use Change: Urban Expansion in Mexico City and Deforestation in Central America. It contains spatio-temporal datasets covering three domains: i) urban expansion from 1992-2015, ii) district and section electoral records for 6 elections from 2000-2015, iii) land titling (regularization) data for informal settlements from 1997-2012 on private and ejido land. The urban expansion data includes 30m resolution urban land cover for 1992 and 2013 (methods published in Goldblatt et al 2018), and a shapefile of digitized urban informal expansion in conservation land from 2000-2015 using the Worldview-2 satellite. The electoral records include shapefiles with the geospatial boundaries of electoral districts and sections for each election, and .csv files of the number of votes per party for mayoral, delegate, and legislature candidates. The private land titling data includes the approximate (in coordinates) location and date of titles given by the city government (DGRT) extracted from public records (Diario Oficial) from 1997-2012. The titling data on ejido land includes a shapefile of georeferenced polygons taken from photos in the CORETT office of ejido land that has been expropriated
by the government, including an accompanying .csv from the National Agrarian Registry detailing the date and reason for expropriation from 1987-2007. Further details are provided in the dissertation and subsequent article publication (Tellman et al 2021). The Mexico City portion of these data were generated via a National Science Foundation sponsored project (No. 1657773, DDRI: Mapping and Modeling Clandestine Drivers of Urban Expansion in Mexico City). The project P.I. is Beth Tellman with collaborators at ASU (B.L Turner II and Hallie Eakin). Other collaborators include the National Autonomous University of Mexico (UNAM), at the Institute of Geography via Dr. Armando Peralta Higuera, who provided support for two students, Juan Alberto Guerra Moreno and Kimberly Mendez Gomez, for validating the Landsat urbanization algorithm. Fidel Serrano-Candela, at the UNAM Laboratory of the National Laboratory for Sustainability Sciences (LANCIS), also provided support for urbanization algorithm development and validation, and Rodrigo Garcia Herrera provided support for hosting data at LANCIS (at: http://patung.lancis.ecologia.unam.mx/tellman/). Additional collaborators include Enrique Castelán, who provided support for the informal urbanization data from SEDEMA (Ministry of the Environment of Mexico City). Electoral, land titling, and land zoning data were digitized with support from Juana Martinez, Natalia Hernandez, Alexia Macario Sanchez, and Enrique Ruiz Durazo, in collaboration with Felipe de Alba, at CESOP (Center of Social Studies and Public Opinion, at the Mexican Legislative Assembly). The data include geospatial time series data regarding changes in urban land cover, digitized electoral results, land titling, land zoning, and public housing. Additional funding for this work was provided by NSF under Grant No. 1414052, CNH: The Dynamics of Multiscalar Adaptation in Megacities (PI H. Eakin), and the NSF-CONACYT GROW fellowship NSF No. 
026257-001 and CONACYT number 291303 (PI Bojórquez). References: Tellman, B., Eakin, H., Janssen, M.A., de Alba, F., Turner II, B.L., 2021. The Role of Institutional Entrepreneurs and Informal Land Transactions in Mexico City’s Urban Expansion. World Dev. 140, 1–44. https://doi.org/10.1016/j.worlddev.2020.105374 Goldblatt, R., Stuhlmacher, M.F., Tellman, B., Clinton, N., Hanson, G., Georgescu, M., Wang, C., Serrano-Candela, F., Khandelwal, A.K., Cheng, W.-H., Balling, R.C., 2018. Using Landsat and nighttime lights for supervised pixel-based image classification of urban land cover. Remote Sens. Environ. 205, 253–275. https://doi.org/10.1016/j.rse.2017.11.026
<p>This data set for the manuscript entitled "Design of Peptides that Fold and Self-Assemble on Graphite" includes all files needed to run and analyze the simulations described in this manuscript in the molecular dynamics software NAMD, as well as the output of the simulations. The files are organized into directories corresponding to the figures of the main text and supporting information. They include molecular model structure files (NAMD psf or Amber prmtop format), force field parameter files (in CHARMM format), initial atomic coordinates (pdb format), NAMD configuration files, Colvars configuration files, NAMD log files, and NAMD output including restart files (in binary NAMD format) and trajectories in dcd format (downsampled to 10 ns per frame). Analysis is controlled by shell scripts (Bash-compatible) that call VMD Tcl scripts or python scripts. These scripts and their output are also included.</p> <p>Version: 2.0</p> <p>Changes versus version 1.0 are the addition of the free energy of folding, adsorption, and pairing calculations (Sim_Figure-7) and shifting of the figure numbers to accommodate this addition.</p> <p><br /> Conventions Used in These Files<br /> ===============================</p> <p>Structure Files<br /> ----------------<br /> - graph_*.psf or sol_*.psf (original NAMD (XPLOR?) format psf file including atom details (type, charge, mass),
as well as definitions of bonds, angles, dihedrals, and impropers for each dipeptide.)</p> <p>- graph_*.pdb or sol_*.pdb (initial coordinates before equilibration)<br /> - repart_*.psf (same as the above psf files, but the masses of non-water hydrogen atoms have been repartitioned by VMD script repartitionMass.tcl)<br /> - freeTop_*.pdb (same as the above pdb files, but the carbons of the lower graphene layer have been placed at a single z value and marked for restraints in NAMD)<br /> - amber_*.prmtop (combined topology and parameter files for Amber force field simulations)<br /> - repart_amber_*.prmtop (same as the above prmtop files, but the masses of non-water hydrogen atoms have been repartitioned by ParmEd)</p> <p>Force Field Parameters<br /> ----------------------<br /> CHARMM format parameter files:<br /> - par_all36m_prot.prm (CHARMM36m FF for proteins)<br /> - par_all36_cgenff_no_nbfix.prm (CGenFF v4.4 for graphene) The NBFIX parameters are commented out since they are only needed for aromatic halogens and we use only the CG2R61 type for graphene.<br /> - toppar_water_ions_prot_cgenff.str (CHARMM water and ions with NBFIX parameters needed for protein and CGenFF included and others commented out)</p> <p>Template NAMD Configuration Files<br /> ---------------------------------<br /> These contain the most commonly used simulation parameters. 
They are called by the other NAMD configuration files (which are in the namd/ subdirectory):<br /> - template_min.namd (minimization)<br /> - template_eq.namd (NPT equilibration with lower graphene fixed)<br /> - template_abf.namd (for adaptive biasing force)</p> <p>Minimization<br /> -------------<br /> - namd/min_*.0.namd</p> <p>Equilibration<br /> -------------<br /> - namd/eq_*.0.namd</p> <p>Adaptive biasing force calculations<br /> -----------------------------------<br /> - namd/eabfZRest7_graph_chp1404.0.namd<br /> - namd/eabfZRest7_graph_chp1404.1.namd (continuation of eabfZRest7_graph_chp1404.0.namd)</p> <p>Log Files<br /> ---------<br /> For each NAMD configuration file given in the last two sections, there is a log file with the same prefix, which gives the text output of NAMD. For instance, the output of namd/eabfZRest7_graph_chp1404.0.namd is eabfZRest7_graph_chp1404.0.log.</p> <p>Simulation Output<br /> -----------------<br /> The simulation output files (which match the names of the NAMD configuration files) are in the output/ directory. Files with the extensions .coor, .vel, and .xsc are coordinates in NAMD binary format, velocities in NAMD binary format, and extended system information (including cell size) in text format. Files with the extension .dcd give the trajectory of the atomic coordinates over time (and also include system cell information). Due to storage limitations, large DCD files have been omitted or replaced with new DCD files having the prefix stride50_ including only every 50 frames. The time between frames in these files is 50 * 50000 steps/frame * 4 fs/step = 10 ns. The system cell trajectories for the NPT runs are also included as output/eq_*.xst.</p> <p>Scripts<br /> -------<br /> Files with the .sh extension can be found throughout. These usually provide the highest level control for submission of simulations and analysis. Look to these as a guide to what is happening. 
If there are scripts with step1_*.sh and step2_*.sh, they are intended to be run in order, with step1_*.sh first.</p> <p><br /> CONTENTS<br /> ========</p> <p>The directory contents are as follows. The directories Sim_Figure-1 and Sim_Figure-8 include README.txt files that describe the files and naming conventions used throughout this data set.</p> <p>Sim_Figure-1: Simulations of N-acetylated C-amidated amino acids (Ac-X-NHMe) at the graphite–water interface.</p> <p>Sim_Figure-2: Simulations of different peptide designs (including acyclic, disulfide cyclized, and N-to-C cyclized) at the graphite–water interface.</p> <p>Sim_Figure-3: MM-GBSA calculations of different peptide sequences for a folded conformation and 5 misfolded/unfolded conformations.</p> <p>Sim_Figure-4: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite–water interface at 370 K.</p> <p>Sim_Figure-5: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite–water interface at 295 K.</p> <p>Sim_Figure-5_replica: Temperature replica exchange molecular dynamics simulations for the peptide cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) with 20 replicas for temperatures from 295 to 454 K.</p> <p>Sim_Figure-6: Simulation of the peptide molecule cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) in free solution (no graphite).</p> <p>Sim_Figure-7: Free energy calculations for folding, adsorption, and pairing for the peptide CHP1404 (sequence: cyc(GTGSGTG-GPGG-GCGTGTG-SGPG)). For folding, we calculate the PMF as function of RMSD by replica-exchange umbrella sampling (in the subdirectory Folding_CHP1404_Graphene/). We make the same calculation in solution, which required three separate replica-exchange umbrella sampling calculations (in the subdirectory Folding_CHP1404_Solution/). Both PMF of RMSD calculations for the scrambled peptide are in Folding_scram1404/. 
For adsorption, calculation of the PMF for the orientational restraints and the calculation of the PMF along z (the distance between the graphene sheet and the center of mass of the peptide) are in Adsorption_CHP1404/ and Adsorption_scram1404/. The actual calculation of the free energy is done by a shell script ("doRestraintEnergyError.sh") in the 1_free_energy/ subsubdirectory. Processing of the PMFs must be done first in the 0_pmf/ subsubdirectory. Finally, files for free energy calculations of pair formation for CHP1404 are found in the Pair/ subdirectory.</p> <p>Sim_Figure-8: Simulation of four peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) where the peptides are far above the graphene–water interface in the initial configuration.</p> <p>Sim_Figure-9: Two replicates of a simulation of nine peptide molecules with the sequence cyc(GTGSGTG-GPGG-GCGTGTG-SGPG) at the graphite–water interface at 370 K.</p> <p>Sim_Figure-9_scrambled: Two replicates of a simulation of nine peptide molecules with the control sequence cyc(GGTPTTGGGGGGSGGPSGTGGC) at the graphite–water interface at 370 K.</p> <p>Sim_Figure-10: Adaptive biasing for calculation of the free energy of the folded peptide as a function of the angle between its long axis and the zigzag directions of the underlying graphene sheet.</p> <p> </p>
Other
This material is based upon work supported by the US National Science Foundation under grant no. DMR-1945589. A majority of the computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by NSF grants CHE-1726332, CNS-1006860, EPS-1006860, and EPS-0919443. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562, through allocation BIO200030.
Between 2018 and 2021, PIs for National Science Foundation Awards # 1758781 and 1758814, EAGER: Collaborative Research: Developing and Testing an Incubator for Digital Entrepreneurship in Remote Communities, in partnership with the Tanana Chiefs Conference, the traditional tribal consortium of the 42 villages of Interior Alaska, jointly developed and conducted large-scale digital and in-person surveys of multiple Alaskan interior communities. The survey was distributed via a combination of in-person paper surveys, digital surveys, social media links, verbal in-person interviews and telephone-based responses. Analysis of this measure using SAS demonstrated the statistically significant need for enhanced digital infrastructure and reworked digital entrepreneurial and technological education in the Tanana Chiefs Conference region. 1. Two statistical measures were created during this research: Entrepreneurial Readiness (ER) and Digital Technology needs and skills (DT), both of which showed high measures of internal consistency (.89, .81). 2. The measures revealed entrepreneurial readiness challenges and evidence of specific addressable barriers that currently serve as hindrances to regional digital economic activity. The survey data showed statistically significant correlation with the mixed-methodological in-person focus groups and interview research conducted by the PIs and TCC collaborators in Hughes and Huslia, AK, which further corroborated stated barriers to
entrepreneurship development in the region. 3. Data generated by the survey and fieldwork is maintained by the Tanana Chiefs Conference under data sovereignty agreements. The survey and focus group data contains aggregated statistical/empirical data as well as qualitative/subjective detail that runs the risk of becoming personally identifiable, especially due to (but not limited to) concerns with exceedingly small Arctic community population sizes. 4. This metadata is being provided in order to serve as a record of the data collection and analysis conducted, and also to share some high-level findings that, while revealing no personal information, may be helpful for policymaking, regional planning and efforts towards educational curricular development and infrastructural investment. The sample demographics consist of 272 women, 79 men, and 4 with gender not indicated as a response. Barriers to Entrepreneurial Readiness were a component of the measure. Lack of education is the #1 barrier, followed closely by lack of access to childcare. Among women who participated in the survey measure, 30% with 2 or more children report lack of childcare to be a significant barrier to entrepreneurial and small business activity. For entrepreneurial readiness and digital economy, the scales perform well from a psychometric standpoint. The summary scores are roughly normally distributed. Cronbach’s alphas are greater than 0.80 for both. They are moderately correlated with each other (r = 0.48, p < .0001). Men and women do not differ significantly on either measure. Education is significantly related to the digital economy measure. The detail provided in the survey related to educational needs enabled optimized development of the Incubator for Digital Entrepreneurship in Remote Communities. 
Enhanced digital entrepreneurship training with clear cultural linkages to traditions and community needs, along with additional childcare opportunities, are two among several specific recommendations provided to the TCC. The project PIs are working closely with the TCC administration and community members on elements of culturally-aligned curricular development that respects tribal data sovereignty, local data management protocols, data anonymity and adherence to human subjects (IRB) protocols. While the survey data is currently embargoed and unable to be submitted publicly for reasons of anonymity, the project PIs are working with the NSF Arctic Data Center towards determining pathways for sharing personally-protected data with the larger scientific community. These approaches may consist of aggregating and digitally anonymizing sensitive data in ways that cannot be de-aggregated and that meet agency and scientific community needs (while also fully respecting and protecting participants’ rights and personal privacy). At present the data sensitivity protocols are not yet adapted to TCC requirements and the datasets will remain in their care.
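The internal-consistency figures quoted above (Cronbach's alphas of .89 and .81) follow the standard reliability formula. A minimal sketch in Python (a generic implementation, not the authors' SAS analysis):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix.

    Standard formula: alpha = k/(k-1) * (1 - sum of item variances
    / variance of respondents' total scores).
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return k / (k - 1) * (1.0 - item_vars / total_var)
```

Values above 0.80, as reported for both the ER and DT scales, are conventionally taken to indicate good internal consistency.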
<p>This dataset contains machine learning and volunteer classifications from the Gravity Spy project. It includes glitches from observing runs O1, O2, O3a and O3b that received at least one classification from a registered volunteer in the project. It also indicates glitches that are nominally retired from the project using our default set of retirement parameters, which are described below. See more details in the Gravity Spy Methods paper. </p> <p>When a particular subject in a citizen science project (in this case, glitches from the LIGO datastream) is deemed to be classified sufficiently it is "retired" from the project. For the Gravity Spy project, retirement depends on a combination of both volunteer and machine learning classifications, and a number of parameterizations affect how quickly glitches get retired. For this dataset, we use a default set of retirement parameters, the most important of which are: </p> <ol><li>A glitch must be classified by at least 2 registered volunteers</li><li>Based on both the initial machine learning classification and volunteer classifications, the glitch has more than a 90% probability of residing in a particular class</li><li>Each volunteer classification (weighted by that volunteer's confusion matrix) contains a weight equal to the initial machine learning score when determining the final probability</li></ol> <p>The choice of these and other
parameterizations will affect the accuracy of the retired dataset as well as the number of glitches that are retired, and will be explored in detail in an upcoming publication (Zevin et al. in prep). </p> <p>The dataset can be read in using e.g. Pandas: <br /> ```<br /> import pandas as pd<br /> dataset = pd.read_hdf('retired_fulldata_min2_max50_ret0p9.hdf5', key='image_db')<br /> ```<br /> Each row in the dataframe contains information about a particular glitch in the Gravity Spy dataset. </p> <p><strong>Description of series in dataframe</strong></p> <ul><li>['1080Lines', '1400Ripples', 'Air_Compressor', 'Blip', 'Chirp', 'Extremely_Loud', 'Helix', 'Koi_Fish', 'Light_Modulation', 'Low_Frequency_Burst', 'Low_Frequency_Lines', 'No_Glitch', 'None_of_the_Above', 'Paired_Doves', 'Power_Line', 'Repeating_Blips', 'Scattered_Light', 'Scratchy', 'Tomte', 'Violin_Mode', 'Wandering_Line', 'Whistle'] <ul><li>Machine learning scores for each glitch class in the trained model, which for a particular glitch will sum to unity</li></ul> </li><li>['ml_confidence', 'ml_label'] <ul><li>Highest machine learning confidence score across all classes for a particular glitch, and the class associated with this score</li></ul> </li><li>['gravityspy_id', 'id'] <ul><li>Unique identifier for each glitch on the Zooniverse platform ('gravityspy_id') and in the Gravity Spy project ('id'), which can be used to link a particular glitch to the full Gravity Spy dataset (which contains GPS times among many other descriptors)</li></ul> </li><li>['retired'] <ul><li>Marks whether the glitch is retired using our default set of retirement parameters (1=retired, 0=not retired)</li></ul> </li><li>['Nclassifications'] <ul><li>The total number of classifications performed by registered volunteers on this glitch</li></ul> </li><li>['final_score', 'final_label'] <ul><li>The final score (weighted combination of machine learning and volunteer classifications) and the most probable type of glitch</li></ul> 
</li><li>['tracks'] <ul><li>Array of classification weights that were added to each glitch category due to each volunteer's classification</li></ul> </li></ul> <p> </p> <p>For machine learning classifications on all glitches in O1, O2, O3a, and O3b, please see Gravity Spy Machine Learning Classifications on Zenodo.</p> <p>For the most recently uploaded training set used in Gravity Spy machine learning algorithms, please see Gravity Spy Training Set on Zenodo.</p> <p>For detailed information on the training set used for the original Gravity Spy machine learning paper, please see Machine learning for Gravity Spy: Glitch classification and dataset on Zenodo. </p>
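Once the dataframe is loaded, the series described above support typical queries such as selecting retired glitches or comparing the ML-only label with the final (ML + volunteer) label. A minimal sketch using a synthetic stand-in dataframe (the column names follow the description above; the values are made up for illustration):

```python
import pandas as pd

# Synthetic stand-in for retired_fulldata_min2_max50_ret0p9.hdf5;
# real data would come from pd.read_hdf(..., key='image_db').
df = pd.DataFrame({
    "gravityspy_id": ["g1", "g2", "g3"],
    "ml_label": ["Blip", "Koi_Fish", "Whistle"],
    "ml_confidence": [0.97, 0.88, 0.55],
    "final_label": ["Blip", "Koi_Fish", "No_Glitch"],
    "final_score": [0.99, 0.93, 0.61],
    "Nclassifications": [12, 5, 2],
    "retired": [1, 1, 0],
})

# Keep only glitches retired under the default parameters
# (>90% final probability, >=2 registered-volunteer classifications).
retired = df[df["retired"] == 1]

# Flag rows where the ML label and the final weighted label disagree.
disagree = retired[retired["ml_label"] != retired["final_label"]]
print(len(retired), len(disagree))
```

The same pattern applies directly to the full dataset read with `pd.read_hdf`.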
@article{osti_10377920,
place = {Country unknown/Code not available},
title = {Madidi Project Full Dataset},
url = {https://par.nsf.gov/biblio/10377920},
DOI = {10.5281/zenodo.5160379},
journal = {},
publisher = {Zenodo},
author = {Tello, J. Sebastian and Macia, Manuel J. and Arellano, Gabriel and Nieto-Ariza, Beatriz and Cayola, Leslie and Fuentes, Alfredo F.},
}