Abstract

The NeonTreeCrowns dataset is a set of individual level crown estimates for 100 million trees at 37 geographic sites across the United States surveyed by the National Ecological Observatory Network’s Airborne Observation Platform. Each rectangular bounding box crown prediction includes height, crown area, and spatial location.

How can I see the data?

A web server to look through predictions is available through idtrees.org.

Dataset Organization

The shapefiles.zip contains 11,000 shapefiles, each corresponding to a 1 km^2 RGB tile from NEON (ID: DP3.30010.001). For example, "2019_SOAP_4_302000_4100000_image.shp" contains the predictions from "2019_SOAP_4_302000_4100000_image.tif", available from the NEON data portal: https://data.neonscience.org/data-products/explore?search=camera. NEON's file convention refers to the year of data collection (2019), the four letter site code (SOAP), the sampling event (4), and the UTM coordinate of the top left corner (302000_4100000). For NEON site abbreviations and UTM zones see https://www.neonscience.org/field-sites/field-sites-map.

The predictions are also available as a single csv for each site and year, in which all available tiles for that site and year are combined into one large file. These data are not projected, but contain the UTM coordinates for each bounding box (left, bottom, right, top). For both file types the following fields are available:

Height: The crown height measured in meters. Crown height is defined as the 99th percentile of all canopy height pixels from a LiDAR height model (ID: DP3.30015.001).

Area: The crown area in m² of the rectangular bounding box.

Label: All data in this release are "Tree".

Score: The confidence score from the DeepForest deep learning algorithm. The score ranges from 0 (low confidence) to 1 (high confidence).

How were predictions made?

The DeepForest algorithm is available as a python package: https://deepforest.readthedocs.io/. Predictions were overlaid on the LiDAR-derived canopy height model. Predictions with heights less than 3 m were removed.

How were predictions validated?

Please see

Weinstein, B. G., Marconi, S., Bohlman, S. A., Zare, A., & White, E. P. (2020). Cross-site learning in deep learning RGB tree crown detection. Ecological Informatics, 56, 101061.

Weinstein, B., Marconi, S., Aubry-Kientz, M., Vincent, G., Senyondo, H., & White, E. (2020). DeepForest: A Python package for RGB deep learning tree crown delineation. bioRxiv.

Weinstein, Ben G., et al. "Individual tree-crown detection in RGB imagery using semi-supervised deep learning neural networks." Remote Sensing 11.11 (2019): 1309.

Were any sites removed?

Several sites were removed due to poor NEON data quality. GRSM and PUUM both had lower quality RGB data that made them unsuitable for prediction. NEON surveys are updated annually and we expect future flights to correct these errors. We removed the GUIL site in Puerto Rico due to its very steep topography and poor sun angle during data collection. The DeepForest algorithm performed poorly when predicting crowns in intensely shaded areas where there was very little sun penetration. We are happy to make these data available upon request.

# Contact

We welcome questions, ideas and general inquiries. The data can be used for many applications and we look forward to hearing from you. Contact ben.weinstein@weecology.org.

Other: Gordon and Betty Moore Foundation: GBMF4563
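The file naming convention described for the NeonTreeCrowns shapefiles can be unpacked programmatically. The following is a minimal sketch, assuming every tile name follows the same pattern as the single example given (year, four letter site code, sampling event, two UTM coordinates, then `_image`). `NeonTile` and `parse_tile_name` are hypothetical names introduced here for illustration; they are not part of DeepForest or any NEON tooling.

```python
import re
from typing import NamedTuple


class NeonTile(NamedTuple):
    """Fields encoded in a tile file name (hypothetical helper type)."""
    year: int
    site: str
    event: int
    easting: int
    northing: int


# Pattern inferred from the one documented example; other products may differ.
_TILE_RE = re.compile(
    r"^(\d{4})_([A-Z]{4})_(\d+)_(\d+)_(\d+)_image\.(?:shp|tif)$"
)


def parse_tile_name(filename: str) -> NeonTile:
    """Split a tile file name into year, site, event and UTM coordinates."""
    match = _TILE_RE.match(filename)
    if match is None:
        raise ValueError(f"not a recognised tile name: {filename!r}")
    year, site, event, easting, northing = match.groups()
    return NeonTile(int(year), site, int(event), int(easting), int(northing))


tile = parse_tile_name("2019_SOAP_4_302000_4100000_image.shp")
print(tile.site, tile.year, tile.easting, tile.northing)
```

Because the .shp and .tif names share the same stem, the parsed fields can be used to pair each prediction file with its source RGB tile.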
# Individual Tree Predictions for 100 million trees in the National Ecological Observatory Network

Preprint: https://www.biorxiv.org/content/10.1101/2023.10.25.563626v1

## Manuscript Abstract

The ecology of forest ecosystems depends on the composition of trees. Capturing fine-grained information on individual trees at broad scales allows an unprecedented view of forest ecosystems, forest restoration and responses to disturbance. To create detailed maps of tree species, airborne remote sensing can cover areas containing millions of trees at high spatial resolution. Individual tree data at wide extents promises to increase the scale of forest analysis, biogeographic research, and ecosystem monitoring without losing details on individual species composition and abundance. Computer vision using deep neural networks can convert raw sensor data into predictions of individual tree species using ground truthed data collected by field researchers. Using over 40,000 individual tree stems as training data, we create landscape-level species predictions for over 100 million individual trees for 24 sites in the National Ecological Observatory Network. Using hierarchical multi-temporal models fine-tuned for each geographic area, we produce open-source data available as 1 km^2 shapefiles with individual tree species prediction, as well as crown location, crown area and height of 81 canopy tree species. Site-specific models had an average performance of 79% accuracy covering an average of six species per site, ranging from 3 to 15 species. All predictions were uploaded to Google Earth Engine to benefit the ecology community and overlay with other remote sensing assets. These data can be used to study forest macro-ecology, functional ecology, and responses to anthropogenic change.

## Data Summary

Each NEON site is a single zip archive with tree predictions for all available data. For site abbreviations see: https://www.neonscience.org/field-sites/explore-field-sites. For each site, there is a .zip and a .csv. The .zip is a set of 1 km .shp tiles. The .csv contains all trees in a single file.

## Prediction metadata

*Geometry* A four pointed bounding box location in UTM coordinates.

*indiv_id* A unique crown identifier that combines the year, site and geoindex of the NEON airborne tile. The geoindex (e.g. 732000_4707000) is the UTM coordinate of the top left of the tile.

*sci_name* The full Latin name of the predicted species, aligned with NEON's taxonomic nomenclature.

*ens_score* The confidence score of the species prediction. This score is the output of the multi-temporal model for the ensemble hierarchical model.

*bleaf_taxa* Highest predicted category for the broadleaf submodel.

*bleaf_score* The confidence score for the broadleaf taxa submodel.

*oak_taxa* Highest predicted category for the oak model.

*dead_label* A two class alive/dead classification based on the RGB data. 0=Alive/1=Dead.

*dead_score* The confidence score of the Alive/Dead prediction.

*site_id* The four letter code for the NEON site. See https://www.neonscience.org/field-sites/explore-field-sites for site locations.

*conif_taxa* Highest predicted category for the conifer model.

*conif_score* The confidence score for the conifer taxa submodel.

*dom_taxa* Highest predicted category for the dominant taxa submodel.

*dom_score* The confidence score for the dominant taxa submodel.

## Training data

The crops.zip contains pre-cropped files. The 369 band hyperspectral files are numpy arrays. RGB crops are .tif files. Naming format is __; for example, "NEON.PLA.D07.GRSM.00583_2022_RGB.tif" is the RGB crop of the predicted crown of NEON data from Great Smoky Mountains National Park (GRSM), flown in 2022. Along with the crops are .csv files for the various train-test split experiments in the manuscript.

### Crop metadata

There are 30,042 individuals in the annotations.csv file. We keep all data, but we recommend a filtering step of at least 20 records per species to reduce the chance of taxonomic or data cleaning errors. This leaves 132 species.

*score* The DeepForest crown score for the crop.

*taxonID* Letter species code. See the NEON plant taxonomy for the scientific name: https://data.neonscience.org/taxonomic-lists

*individual* Unique individual identifier for a given field record and crown crop.

*siteID* The four letter code for the NEON site. See https://www.neonscience.org/field-sites/explore-field-sites for site locations.

*plotID* NEON plot ID within the site. For more information on NEON sampling see: https://www.neonscience.org/data-samples/data-collection/observational-sampling/site-level-sampling-design

*CHM_height* The LiDAR derived height for the field sampling point.

*image_path* Relative pathname for the hyperspectral array, which can be read by numpy.load. The array format is 369 bands × height × width.

*tile_year* Flight year of the sensor data.

*RGB_image_path* Relative pathname for the RGB array, which can be read by rasterio.open().

# Code repository

The predictions were made using the DeepTreeAttention repo: https://github.com/weecology/DeepTreeAttention

Key files include the model definition for a [single year model](https://github.com/weecology/DeepTreeAttention/blob/main/src/models/Hang2020.py) and [data preprocessing](https://github.com/weecology/DeepTreeAttention/blob/cae13f1e4271b5386e2379068f8239de3033ec40/src/utils.py#L59).
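The filtering step recommended under Crop metadata (keep only species with at least 20 records) can be sketched as follows. This is a minimal illustration using only the Python standard library, assuming annotations.csv has one row per individual with a `taxonID` column as listed in the metadata. The function name `filter_min_records` and the toy rows are hypothetical and stand in for the real file.

```python
import csv
import io
from collections import Counter


def filter_min_records(rows, min_records=20, key="taxonID"):
    """Keep only rows whose species code occurs at least `min_records` times."""
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] >= min_records]


# Toy stand-in for annotations.csv: 25 records of one code, 3 of another.
# The individual identifiers below are made up for the example.
lines = ["individual,taxonID"]
lines += [f"ind{i},ACRU" for i in range(25)]
lines += [f"ind{i},QURU" for i in range(25, 28)]
rows = list(csv.DictReader(io.StringIO("\n".join(lines))))

kept = filter_min_records(rows, min_records=20)
print(len(rows), "records before filtering,", len(kept), "after")
print("species kept:", sorted({row["taxonID"] for row in kept}))
```

To apply it to the real data, replace the toy `rows` with `list(csv.DictReader(open("annotations.csv", newline="")))`. Counting before filtering keeps the threshold tied to the full dataset rather than to any train-test split.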
Weinstein, Ben G; Marconi, Sergio; Zare, Alina; Bohlman, Stephanie A; Singh, Aditya; Graves, Sarah J; Magee, Lukas; Johnson, Daniel J; Record, Sydne; Rubio, Vanessa E; et al
(, PLOS Biology)
Tanentzap, Andrew J
(Ed.)
The ecology of forest ecosystems depends on the composition of trees. Capturing fine-grained information on individual trees at broad scales provides a unique perspective on forest ecosystems, forest restoration, and responses to disturbance. Individual tree data at wide extents promises to increase the scale of forest analysis, biogeographic research, and ecosystem monitoring without losing details on individual species composition and abundance. Computer vision using deep neural networks can convert raw sensor data into predictions of individual canopy tree species through labeled data collected by field researchers. Using over 40,000 individual tree stems as training data, we create landscape-level species predictions for over 100 million individual trees across 24 sites in the National Ecological Observatory Network (NEON). Using hierarchical multi-temporal models fine-tuned for each geographic area, we produce open-source data available as 1 km² shapefiles with individual tree species prediction, as well as crown location, crown area, and height of 81 canopy tree species. Site-specific models had an average performance of 79% accuracy covering an average of 6 species per site, ranging from 3 to 15 species per site. All predictions are openly archived and have been uploaded to Google Earth Engine to benefit the ecology community and overlay with other remote sensing assets. We outline the potential utility and limitations of these data in ecology and computer vision research, as well as strategies for improving predictions using targeted data sampling.
Joyce, Francis H; Zahawi, Rakan A; Holl, Karen D
(, Dryad)
Recovery of tree community composition in restored tropical forests relies on successful recruitment of later-successional species. However, the long-term effects of different restoration interventions on establishment success of arriving seeds are poorly understood. We evaluated the effects of three restoration treatments on the seed-to-seedling transition for later-successional tree species in a fragmented agricultural landscape in southern Costa Rica. Restoration plots (0.25 ha) were established in a block design nearly two decades prior and spanned a gradient of intervention intensity: natural regeneration (not planted), applied nucleation (planted tree clusters), and plantation (fully planted). We conducted seed addition experiments from 2021-2023 using eight species at seven replicate restoration sites and in four nearby remnant forests.

Study region

We conducted this study from June 2021 to September 2023 in southern Costa Rica near Las Cruces Biological Station. The forests in this region are at the boundary between Tropical Premontane Wet and Rain Forest zones (Holdridge et al., 1971), and the study sites range from ~1100 to 1200 m in elevation. Soils are volcanic in origin, mildly acidic (pH ~5.5), low in P, and high in organic matter. Mean annual temperature is ~21°C, and annual precipitation is between 2700 and 5000 mm, with a dry season from December to March. The landscape was largely deforested between 1960 and 1980 and is now a fragmented mosaic of remnant forest, secondary forest, cattle pastures and agricultural fields.

Design of long-term restoration experiment

We used seven restoration sites set up in 2004-2006 (Figure 1), a subset of the 15 sites described in Holl et al. (2020). All sites were previously used for ≥ 18 years for grazing or coffee production, and most are steeply sloped (15-35°). Minimum distance between sites is 700 m.
Each restoration site had three 50 × 50 m (0.25 ha) plots: natural regeneration, applied nucleation, and plantation. Natural regeneration plots were not planted, whereas applied nucleation and plantation plots were planted with seedlings of four tree species: Erythrina poeppigiana (Walp.) Skeels, Inga edulis Mart., Terminalia amazonia (J.F. Gmel.) Exell, and Vochysia guatemalensis Donn. Sm. E. poeppigiana and I. edulis are naturalized, fast-growing N-fixers commonly used in agricultural intercropping, whereas T. amazonia and V. guatemalensis are native timber species. In applied nucleation plots trees were planted in six nuclei or patches, two each of 4 × 4, 8 × 8, and 12 × 12 m, whereas plantation plots were planted uniformly throughout; the number of trees planted in applied nucleation plots was 27% of the total in plantation plots. Four sites also have paired remnant forest (2 to >300 ha) located 5-50 m from restoration plots, which serve as a reference system for comparing the conditions and ecological processes of restoration plots. On average, percent canopy cover was lower and cover of exotic pasture grasses was much greater in natural regeneration than planted treatments (Kulikowski et al., 2023).

Seed addition experiment

We selected eight tree species (Table 1; Figure S1) for seed addition experiments, based on: (a) their later-successional status, categorized by expert opinion as occurring exclusively in late-successional forest or both early- and late-successional forest (Schubert et al., 2025); (b) a continuum of seed sizes; and (c) regional availability of seeds (>700) for the experiment. Study species were all animal-dispersed and ranged from 2.4 to 24.8 mm seed width.
We collected seeds from at least three mother trees per species, encountered in conditions that differed by species: (1) mature fruits/nuts from the ground (Pseudolmedia mollis, Quercus benthamii, Trophis mexicana, and some Erythroxylum macrophyllum and Otoba novogranatensis), (2) seeds with pulp already removed by birds from the ground (Ocotea puberula and some O. novogranatensis), and (3) ripe fruit directly from trees (Lacistema aggregatum, Palicourea padifolia, some E. macrophyllum). We manually removed any fruit pulp from seeds and briefly submerged all seeds in water to check for insect damage. Seeds that floated were assumed to be non-viable and not used. We measured fresh mass and width for ≥50 seeds per species (Table 1; Figure S1). We thoroughly mixed conspecific seeds to avoid bias in seed source or quality among sites or habitat types (hereafter ‘treatments’). We conducted concurrent germination trials to confirm that no species completely failed to germinate in a nursery setting.

Within each restoration plot and remnant forest, we established four sets (hereafter ‘stations’) of three 1-m2 quadrats separated by ≥15 m. In natural regeneration, plantation, and reference forest plots, one station was located in each quarter of the plot (Figure 2a). In applied nucleation plots, stations were systematically distributed with one station at each of four positions relative to the initial planting design (Figure 2b): (1) within a large nucleus; (2) at the edge of a medium nucleus; (3) between two nuclei, and (4) far from any nucleus. Nuclei have spread considerably beyond the planted area due to crown expansion and natural recruitment; however, these positions spanned a representative range of microsite conditions. Two quadrats per station were used for seed additions to accommodate all eight species, and one non-sown control quadrat was surveyed to assess background abundance of naturally recruited seedlings of the study species (Figure 2c).
Each control quadrat was located 2 m from a seed addition quadrat and on the uphill side if on a slope. We placed quadrats to avoid large tree trunks but did not remove existing vegetation or leaf litter. At each station, we sequentially added seeds of three species (P. mollis, O. novogranatensis, and O. puberula) to one quadrat (26 total seeds m⁻²) and five species (Q. benthamii, E. macrophyllum, L. aggregatum, T. mexicana, and P. padifolia) to the second quadrat (47 total seeds m⁻²) as seeds of each species became available between July 2021 and August 2022. The number of seeds added per quadrat ranged from 7 to 11 seeds per species (Table 1) because of varying seed availability. Realized densities of total added individuals were lower than the cumulative densities because of (a) temporal staggering of additions by species and (b) seed mortality. We placed seeds on top of the litter or soil to simulate how seeds would naturally be deposited from primary dispersal, slightly pressing down to minimize rolling on steeper slopes. We secured a roll of fine mesh ~5 cm high on the downhill side of each seed addition quadrat to catch seeds washed downslope by runoff. Seeds were placed in rows with 10-20 cm minimum spacing and assigned an identifier based on grid position. For the five smallest-seeded species, locations were marked with popsicle sticks to facilitate monitoring.

We monitored seedling emergence, survival, and height (to nearest 0.5 cm) for each species at least four times. Census intervals were timed to (a) capture seedling emergence approximately 1-2 months post seed addition and every 2-4 months thereafter, and (b) measure the first-year survival percentage of emerged seedlings 12-15 months after seed addition. Because some seedlings may have emerged and died between censuses or prior to the first census, we may have underestimated emergence and overestimated seedling survival.
This uncertainty was mitigated by our observations of obviously dead seeds (indicating failure to emerge) or dead seedlings (indicating emergence but subsequent mortality). For the four larger-seeded species for which ungerminated seeds could be reliably re-found, missing seeds were interpreted as pre-emergence mortality. Assuming that missing seeds were predated was reasonable because a companion study found that almost all seeds removed by vertebrates were eventually predated (Joyce et al., 2024). Variable durations of the “first year” period are typical in tropical tree seed addition experiments involving multiple species with differing phenologies (e.g. Svenning & Wright, 2005).

One of the seven sites where we added seeds in 2021 was then cleared by the landowner, so in 2022 we added seeds of the remaining six species to an alternate site. The other six sites were used for the duration of the study. Accordingly, P. mollis and Q. benthamii were tested at six sites rather than seven. We did not tag seedlings but were able to distinguish between natural recruits and seedlings that recruited from added seeds based on the location of seedlings relative to the grid and the size of seedlings compared to the experimental cohort. We sampled the non-sown control quadrat at each station at a single timepoint (July 2023) for seedlings similarly sized to our added individuals.

References

Holdridge, L. R., Grenke, W. C., Hatheway, W. H., Liany, T., & Tosi Jr, J. A. (1971). Forest environments in tropical life zones: A pilot study. Pergamon Press.

Holl, K. D., Reid, J. L., Cole, R. J., Oviedo‐Brenes, F., Rosales, J. A., & Zahawi, R. A. (2020). Applied nucleation facilitates tropical forest recovery: Lessons learned from a 15-year study. Journal of Applied Ecology, 57(12). https://doi.org/10.1111/1365-2664.13684

Joyce, F. H., Ramos, B. M., Zahawi, R. A., & Holl, K. D. (2024).
Vertebrate seed predation can limit recruitment of later-successional species in tropical forest restoration. Biotropica, 56(6), e13381. https://doi.org/10.1111/btp.13381

Kulikowski, A. J., Zahawi, R. A., Werden, L. K., Zhu, K., & Holl, K. D. (2023). Restoration interventions mediate tropical tree recruitment dynamics over time. Philosophical Transactions of the Royal Society B: Biological Sciences, 378(1867), 20210077. https://doi.org/10.1098/rstb.2021.0077

Schubert, S. C., Zahawi, R. A., Oviedo-Brenes, F., Rosales, J. A., & Holl, K. D. (2025). Active restoration increases tree species richness and recruitment of large-seeded taxa after 16–18 years. Ecological Applications, 35(1), e3053. https://doi.org/10.1002/eap.3053

Svenning, J.-C., & Wright, S. J. (2005). Seed limitation in a Panamanian forest. Journal of Ecology, 93(5), 853–862. https://doi.org/10.1111/j.1365-2745.2005.01016.x

# Data from: Lower-intensity restoration interventions drive greater seedling establishment for later-successional tree species

Dataset DOI: [10.5061/dryad.cfxpnvxjp](10.5061/dryad.cfxpnvxjp)

## Description of the data and file structure

Data from a field experiment in which seeds of eight later-successional tree species were added to restoration plots and remnant forests in southern Costa Rica to assess barriers to seedling establishment.

### Files and variables

#### File: canopy\_cover.csv

**Description:** canopy cover above each seed addition quadrat, calculated from four densiometer measurements

##### Variables

* plot_id: (character) unique identifier for each plot, combining site and treatment (see below).
* site: (character) two-letter code identifying each study site
* treatment: (character) two-letter code for treatment. NR = natural regeneration, AN = applied nucleation, PL = plantation, RF = reference forest
* station: factor (integer 1-4) identifying the set of quadrats (“station”) within each plot.
* station_id: (character) unique identifier for each station; combination of site, treatment, and station.
* quadrat: character variable identifying the quadrat within each station
* quadrat_id: (character) unique identifier for each seed addition quadrat
* densi_1 to densi_4: four numeric variables with raw densiometer measurements (number of densiometer points not open sky, range 0-96).
* mean_densi: (numeric) mean of densi_1, densi_2, densi_3, and densi_4
* canopy_cover: mean_densi measurement converted to percent canopy cover (mean_densi/96*100), rounded to 1 decimal place

#### File: litter\_depth.csv

**Description:** leaf litter (or thatch) depth in each seed addition quadrat

##### Variables

* plot_id: (character) unique identifier for each plot, combining site and treatment (see below).
* station_id: (character) unique identifier for each station
* quadrat_id: (character) unique identifier for each seed addition quadrat
* site: (character) two-letter code identifying each study site
* treatment: (character) two-letter code for treatment. NR = natural regeneration, AN = applied nucleation, PL = plantation, RF = reference forest
* station: factor (integer 1-4) identifying the set of quadrats (“station”) within each plot.
* quadrat: character variable (A or B) identifying the quadrat within each station.
* date: date of measurement; format is YYYY-MM-DD
* pt1-pt5: litter depth in cm at each of 5 points laid out in an X-shaped pattern within each 1-m quadrat, measured using a marked pin-flag (nearest 0.5 cm).
* quadrat_mean_depth: mean of 5 depth measurements in each quadrat (pt1-pt5)

#### File: natural\_recruit\_frequency.csv

**Description:** Summary of naturally recruiting seedlings of study species recorded in non-seeded quadrats at each station. Each row is a combination of study species and treatment.

##### Variables

* treatment: (character) two-letter code for treatment. NR = natural regeneration, AN = applied nucleation, PL = plantation, RF = reference forest
* species: (character) 6-letter code for species
* quads_with_recruits: (integer) number of sampled non-seeded quadrats with natural recruits of the specified species
* n_quads: (integer) total number of non-seeded quadrats that were sampled for each species and treatment combination
* freq: relative frequency of natural recruits within sampled non-seeded quadrats. Equals quads_with_recruits divided by n_quads.

#### File: seedling\_emergence.csv

**Description:** summary of seedling emergence, where each row is a combination of study species and station

##### Variables

* species: (character) 6-letter code for species. Corresponding study species are as follows:
  * PAL_PAD = *Palicourea padifolia*
  * TRO_MEX = *Trophis mexicana*
  * LAC_AGG = *Lacistema aggregatum*
  * ERY_MAC = *Erythroxylum macrophyllum*
  * OCO_PUB = *Ocotea puberula*
  * OTO_NOV = *Otoba novogranatensis*
  * QUE_BEN = *Quercus benthamii*
  * PSE_MOL = *Pseudolmedia mollis*
* plot_id: (character) unique identifier for each plot, combining site and treatment (see below).
* site: (character) two-letter code identifying each study site
* treatment: (character) two-letter code for treatment. NR = natural regeneration, AN = applied nucleation, PL = plantation, RF = reference forest
* station: integer (1-4) identifying the set of quadrats (“station”) within each plot.
* station_id: (character) unique identifier for each station; combination of site, treatment, and station.
* added_seeds: (integer) number of conspecific seeds added to each quadrat
* live_stems: (integer) number of recorded seedlings that emerged within the quadrat
* failed: (integer) number of seeds that failed to emerge. Equal to the difference between added_seeds and live_stems.
* prop_emerged: proportion of added seeds that emerged. Equal to live_stems/added_seeds.

#### File: seedling\_heights.csv

**Description:** median seedling height for seedlings of each species surviving to one year in each station

##### Variables

* species: (character) 6-letter code for species
* site: (character) two-letter code identifying each study site
* treatment: (character) two-letter code for treatment. NR = natural regeneration, AN = applied nucleation, PL = plantation, RF = reference forest
* station: integer (1-4) identifying the set of quadrats (“station”) within each plot.
* station_id: (character) unique identifier for each station; combination of site, treatment, and station.
* median_height_cm: (numeric) median height of conspecific seedlings that survived to the end of the 1-year experimental period. NA = not available (used when sample_size = 0; i.e., when no seedlings survived).
* sample_size: (integer) number of seedlings that survived to the end of the 1-year period; n for calculation of median_height_cm.

#### File: seedling\_survival.csv

**Description:** proportion of seedlings surviving to one year for each combination of species and station.

##### Variables

* species: (character) 6-letter code for species.
* plot_id: (character) unique identifier for each plot, combining site and treatment (see below).
* site: (character) two-letter code identifying each study site
* treatment: (character) two-letter code for treatment. NR = natural regeneration, AN = applied nucleation, PL = plantation, RF = reference forest
* station: integer (1-4) identifying the set of quadrats (“station”) within each plot.
* station_id: (character) unique identifier for each station; combination of site, treatment, and station.
* germ_seeds: (integer) number of added seeds that emerged
* survived_seedlings: (integer) number of seedlings that survived to the end of the ~1-year experimental period.
* dead_seedlings: (integer) number of emerged seedlings that died
* prop_surv: proportion of emerged seedlings that survived to the end of the ~1-year experimental period, rounded to two decimal places. Equal to survived_seedlings/germ_seeds.

## Code/software

Any software capable of opening csv files.

## Access information

Other publicly accessible locations of the data:

* NA

Data was derived from the following sources:

* NA
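The derived columns in this README (canopy_cover, prop_emerged, prop_surv) follow directly from the raw counts, so they can be recomputed as a consistency check. The sketch below uses only the formulas stated in the variable descriptions. The function names and the example numbers are hypothetical. Two behaviours are assumptions, since the README does not specify them: an undefined proportion (zero emerged seeds) is returned as `None`, and dead_seedlings is taken to equal germ_seeds minus survived_seedlings.

```python
def canopy_cover(densi_readings):
    """Percent canopy cover from four densiometer counts (each 0-96)."""
    mean_densi = sum(densi_readings) / len(densi_readings)
    return round(mean_densi / 96 * 100, 1)


def emergence_summary(added_seeds, live_stems):
    """Recompute `failed` and `prop_emerged` for one species-station row."""
    return {
        "failed": added_seeds - live_stems,
        "prop_emerged": live_stems / added_seeds,
    }


def survival_summary(germ_seeds, survived_seedlings):
    """Recompute `dead_seedlings` and `prop_surv` for one species-station row."""
    if germ_seeds == 0:
        # Assumption: no emerged seedlings means survival is undefined.
        return {"dead_seedlings": 0, "prop_surv": None}
    return {
        "dead_seedlings": germ_seeds - survived_seedlings,
        "prop_surv": round(survived_seedlings / germ_seeds, 2),
    }


# Made-up example values, not taken from the dataset.
print(canopy_cover([90, 88, 92, 94]))
print(emergence_summary(added_seeds=10, live_stems=4))
print(survival_summary(germ_seeds=4, survived_seedlings=3))
```

Comparing these recomputed values with the stored columns is a quick way to catch rows that were altered or misread when the csv files were loaded.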
{"Abstract":["Human disturbance can have profound effects on biodiversity, including increasing hybridization between reproductively isolated species. One approach for understanding how human activity affects hybridization dynamics is to evaluate correlations between disturbance (e.g., urbanization, temperature change) and hybridization. Because variation in hybridization can also arise from historical factors unrelated to recent human disturbance, it is essential to account for population structure to avoid spurious correlations. Here, we combine environmental and high-coverage whole-genome resequencing data to investigate how human disturbance and population structure affect hybridization dynamics between a pair of pine sawflies adapted to different pines, Neodiprion lecontei and Neodiprion pinetum. We find that N. lecontei and N. pinetum exhibit strikingly different patterns of population structure, which we hypothesize stems from differences in host. We also find that recent admixture is both asymmetric and geographically variable. Linear regression analyses reveal that admixture proportion is predicted by indirect human disturbance (i.e., climate change) and not direct human disturbance (e.g., urbanization) in both N. lecontei and N. pinetum. Lastly, in N. pinetum, we find evidence of a spurious association between admixture and direct human disturbance that disappears when regression models account for population structure via inclusion of genetic principal component scores as covariates. Together, our data suggest that indirect human disturbance and population structure both contribute to geographic variation in admixture between N. lecontei and N. pinetum. 
Our study also highlights the importance of adequately controlling for population structure when attempting to identify environmental predictors (human disturbance-related or not) of hybridization."],"TechnicalInfo":["# Recent climate change and historical population structure predict spatial patterns of admixture between two host-specialized pine sawfly species Dataset DOI: [10.5061/dryad.kh18932jw](10.5061/dryad.kh18932jw) ## Description of the data and file structure To perform all analyses, run the scripts provided in the order indicated in the folder/script name (i.e., "01*" to "07*"). NOTE: all bash scripts (*.sh) will need to be modified to run on your cluster. However, the program commands should work as long as you are using the same version of each program (program versions are indicated in the Materials and Methods section of the manuscript). ### Files and variables #### File: Glover_Linnen_dryad.zip **Description:** Input files required to run analyses are provided: 1. "allsites_LecPineHetr_filtered_ExclIndvs_minQ20_AllImbHom_refilter_bcftools_norm.vcf.gz" - required for ABBA-BABA analysis ("06_ABBA-BABA" folder); vcf file contains variant and invariant sites for all *N. lecontei*, *N. pinetum*, and *N. hetricki* individuals included in analysis 2. "snps_LecPine_filtered_ExclIndvs_MAF05_minQ20_AllImbHom_refilter_pruned_bcftools_norm.vcf.gz" - required for ADMIXTURE ("03_ADMIXTURE" folder) and principal component analysis (PCA) for both species combined ("04_run_pcas.sh" script) analyses; vcf file contains linkage-pruned SNPs for all *N. lecontei* and *N. pinetum* individuals included in analyses (i.e., contaminated/putative haploid male individuals removed) 3. "snps_LecOnly_filtered_ExclIndvs_MAF05_minQ20_AllImbHom_refilter_pruned_bcftools_norm.vcf.gz" - required for PCA for *N. lecontei* only analysis ("04_run_pcas.sh" script); vcf file contains linkage-pruned SNPs for all *N. 
lecontei* individuals included in analysis (i.e., contaminated/putative haploid male individuals removed) 4. "snps_PineOnly_noHybrid_filtered_ExclIndvs_MAF05_minQ20_AllImbHom_refilter_pruned_bcftools_norm.vcf.gz" - required for PCA for *N. pinetum* only analysis ("04_run_pcas.sh" script); vcf file contains linkage-pruned SNPs for all *N. pinetum* individuals included in analysis (i.e., contaminated/putative haploid male individuals, and F1 hybrid called *N. pinetum* at time of collection due to being collected on white pine, removed) 5. "fulldata_LecPine.csv", "fulldata_LecOnly.csv", "fulldata_PineOnly.csv" - full/compiled data for all *N. lecontei* and *N. pinetum* individuals together, all *N. lecontei* individuals, and all *N. pinetum* individuals included in downstream analyses, respectively. These files are required to run the "05_plot_PCAs_ADMIXTURE.R" and "07_run_models.R" scripts. Description of columns in the *.csv files (#5) above (each row represents an individual sawfly): 1. "ind" = unique individual identifier assigned by the sequencing center (Admera Health) 2. "genPC1_*" = genetic PC1 score from PCA of both species combined ("genPC1_LecPine" in "fulldata_LecPine.csv" file), *N. lecontei* only ("genPC1_LecOnly" in "fulldata_LecOnly.csv" file), or *N. pinetum* only ("genPC1_PineOnly" in "fulldata_PineOnly.csv" file) 3. "genPC2_*" = genetic PC2 score from PCA of both species combined ("genPC2_LecPine" in "fulldata_LecPine.csv" file), *N. lecontei* only ("genPC2_LecOnly" in "fulldata_LecOnly.csv" file), or *N. pinetum* only ("genPC2_PineOnly" in "fulldata_PineOnly.csv" file) 4. "genPC3_*" = genetic PC3 score from PCA of both species combined ("genPC3_LecPine" in "fulldata_LecPine.csv" file), *N. lecontei* only ("genPC3_LecOnly" in "fulldata_LecOnly.csv" file), or *N. pinetum* only ("genPC3_PineOnly" in "fulldata_PineOnly.csv" file) 5. "AG_ID" = unique individual identifier assigned by the Linnen lab 6. "avg_perc_PP_reads_mapped" = percentage of properly paired reads mapped; averaged between the two sequencing batches 7. "MERGED_FINAL_read_count" = total read count across two sequencing batches 8. "species" = species identity 9. "city" = city individual was collected in 10. "state" = state individual was collected in 11. "area" = geographic area ("North" or "Central") that individual was assigned to based on ADMIXTURE ("03_ADMIXTURE" folder) and PCA ("04_run_pcas.sh" script) analyses 12. "latitude" = latitude individual was collected at 13. "longitude" = longitude individual was collected at 14. "host" = pine host individual was collected on 15. "stage" = life stage of individual at time of DNA extraction, either "larva" or adult female ("adultF") 16. "lcPC1" = PC1 score from PCA of land cover proportions (from analyses in "01_LandCover_data" folder) 17. "lcPC2" = PC2 score from PCA of land cover proportions (from analyses in "01_LandCover_data" folder) 18. "change_tmax" = change in average daily maximum temperature from the 1980s to the 2010s in degrees Celsius (from analyses in "02_temperature_data" folder) 19. "site_number" = arbitrary number assigned to sampling site; each unique sampling site (i.e., unique latitude/longitude GPS coordinates) was assigned an arbitrary number (1-189) 20. "grouped_site_number_5km" = grouped site number, where all sampling sites that were within 5km of each other were considered a “grouped site” and were assigned an arbitrary grouped site number 21. "allele_imbalance" = percentage of heterozygote calls displaying allele balance ratios less than 0.3 22. "avg_depth_coverage" = average depth coverage across all linkage-pruned SNPs 23. "missingness" = proportion of sites with a missing genotype 24. "heterozygosity" = proportion of sites with a heterozygous genotype 25. "pop" = population label for making ADMIXTURE plots ("05_plot_PCAs_ADMIXTURE.R" script); [species]_[sampling state] 26. 
"admix_prop_k2" = admixture proportion for K=2 (i.e., species-level admixture) 27. "lec_blue_k2" = final assignment probability for "blue" genetic group (color assigned by CLUMPAK) that corresponds to *N. lecontei* ancestry across all 100 ADMIXTURE runs for K=2 28. "pine_orange_k2" = final assignment probability for "orange" genetic group (color assigned by CLUMPAK) that corresponds to *N. pinetum* ancestry across all 100 ADMIXTURE runs for K=2 29. "lec_blue_k3" = final assignment probability for "blue" genetic group (color assigned by CLUMPAK) that corresponds to *N. lecontei* ancestry across all 100 ADMIXTURE runs for K=3 30. "pine_orange_k3" = final assignment probability for "orange" genetic group (color assigned by CLUMPAK) that corresponds to *N. pinetum* ancestry across all 100 ADMIXTURE runs for K=3 31. "lec_purple_k3" = final assignment probability for "purple" genetic group (color assigned by CLUMPAK) that corresponds to *N. lecontei* ancestry across all 100 ADMIXTURE runs for K=3 32. "lec_blue_k4" = final assignment probability for "blue" genetic group (color assigned by CLUMPAK) that corresponds to *N. lecontei* ancestry across all 100 ADMIXTURE runs for K=4 33. "pine_orange_k4" = final assignment probability for "orange" genetic group (color assigned by CLUMPAK) that corresponds to *N. pinetum* ancestry across all 100 ADMIXTURE runs for K=4 34. "lec_purple_k4" = final assignment probability for "purple" genetic group (color assigned by CLUMPAK) that corresponds to *N. lecontei* ancestry across all 100 ADMIXTURE runs for K=4 35. "lec_green_k4" = final assignment probability for "green" genetic group (color assigned by CLUMPAK) that corresponds to *N. lecontei* ancestry across all 100 ADMIXTURE runs for K=4 Perform analyses: The first folder ("01_LandCover_data") contains the R script to calculate land cover proportions within a 5km-radius buffer around each site's GPS coordinates ("get_LandCover_props.R"). 
The required input land cover raster files to run this script are available in the public domain at ([https://www.mrlc.gov/viewer/](https://www.mrlc.gov/viewer/)) for the United States and ([http://www.cec.org/north-american-environmental-atlas/land-cover-30m-2020/](http://www.cec.org/north-american-environmental-atlas/land-cover-30m-2020/)) for Canada. This folder also contains the input files ("for_lcPCA.csv", "latlong.csv") and script ("LandCover_PCA.R") necessary to complete the PCA of land cover proportions for each of the 189 sampling sites. Specifically, the "for_lcPCA.csv" file contains, for each sampling site, the proportions of each land cover class within a 5km-radius buffer around the GPS coordinates of the sampling site; the "latlong.csv" file contains the GPS coordinates (latitude and longitude) of each sampling site. The second folder ("02_temperature_data") contains the input file ("LatLong.csv") and script ("get_tmax.R") necessary to extract the average of daily maximum temperature for each month within each year (1980-1989 and 2010-2019) for the latitude/longitude at which each individual sawfly was collected. The required input temperature raster files to run this script are available in the public domain at ([https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=2131](https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=2131)) for the United States and Canada. The third folder ("03_ADMIXTURE") contains the script ("run_admixture.sh") to perform 100 ADMIXTURE runs for each K (1-10). The output file from CLUMPAK for K=2 is also provided ("ClumppIndFile.output"). In this text file, each row represents an individual; the proportions of the individual's ancestry originating from each of the two inferred ancestral populations are provided (last two columns), from which admixture proportions can be extracted. The script to perform the next analysis (PCA) is provided ("04_run_pcas.sh"). 
Next, the script to plot the results from the ADMIXTURE and PCA analyses above is provided ("05_plot_PCAs_ADMIXTURE.R"). The next folder ("06_ABBA-BABA") contains the scripts (".sh") and input files (".txt" and ".nwk") to perform the ABBA-BABA analysis for the "North" and "Central" geographic regions separately. For each region, the "*.txt" file indicates which of the four taxa each individual belongs to (i.e., P~1~, P~2~, P~3~, or Outgroup); individuals that are in the vcf file but excluded from the analysis are indicated with an "xxx". The ".nwk" file is a text file that represents the phylogenetic tree of the four taxa above in the Newick format (i.e., it uses parentheses and commas to show the relationships between taxa). Finally, the script to perform the multiple linear regression models and plot the model results is provided ("07_run_models.R"). ## Code/software All scripts to perform analyses are described above. All software and R package versions are described in the Materials and Methods section of the manuscript."]}
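The description of "ClumppIndFile.output" above says that the last two columns of each row hold an individual's ancestry proportions for the two inferred populations at K=2. A minimal sketch of reading those columns is given below. The function names are hypothetical, the sample row layout in the usage note is illustrative only, and treating the smaller of the two proportions as the admixture proportion is an assumption made for this sketch, not necessarily the definition used in "07_run_models.R".

```python
from typing import Iterable, List, Tuple


def parse_ancestry(line: str) -> Tuple[float, float]:
    """Return the two ancestry proportions held in the last two columns of a row."""
    fields = line.split()
    return float(fields[-2]), float(fields[-1])


def admixture_proportions(lines: Iterable[str]) -> List[float]:
    """Compute one admixture proportion per individual (one row per individual).

    Assumption: the admixture proportion is taken to be the minor ancestry
    component, i.e. the smaller of the two K=2 proportions.
    """
    proportions = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        q1, q2 = parse_ancestry(line)
        proportions.append(min(q1, q2))
    return proportions
```

For example, `admixture_proportions(["1 1 (0) 1 : 0.9990 0.0010"])` reads the two trailing values and keeps the smaller one. To process the real file, pass an open file handle for "ClumppIndFile.output" in place of the list.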
@article{osti_10453016,
place = {Country unknown/Code not available},
title = {NEON Tree Crowns Dataset},
url = {https://par.nsf.gov/biblio/10453016},
DOI = {10.5281/zenodo.3765871},
abstractNote = {{"Abstract":["Abstract\n\nThe NeonTreeCrowns dataset is a set of individual-level crown estimates for 100 million trees at 37 geographic sites across the United States surveyed by the National Ecological Observatory Network\u2019s Airborne Observation Platform. Each rectangular bounding box crown prediction includes height, crown area, and spatial location.\n\nHow can I see the data?\n\nA web server to look through predictions is available through idtrees.org\n\nDataset Organization\n\nThe shapefiles.zip contains 11,000 shapefiles, each corresponding to a 1km^2 RGB tile from NEON (ID: DP3.30010.001). For example, "2019_SOAP_4_302000_4100000_image.shp" contains the predictions from "2019_SOAP_4_302000_4100000_image.tif" available from the NEON data portal: https://data.neonscience.org/data-products/explore?search=camera. NEON's file convention refers to the year of data collection (2019), the four-letter site code (SOAP), the sampling event (4), and the UTM coordinate of the top left corner (302000_4100000). For NEON site abbreviations and UTM zones see https://www.neonscience.org/field-sites/field-sites-map.\n\nThe predictions are also available as a single csv for each site. All available tiles for that site and year are combined into one large file. These data are not projected, but contain the UTM coordinates for each bounding box (left, bottom, right, top). For both file types the following fields are available:\n\nHeight: The crown height measured in meters. Crown height is defined as the 99th percentile of all canopy height pixels from a LiDAR height model (ID: DP3.30015.001)\n\nArea: The crown area in m^2 of the rectangular bounding box.\n\nLabel: All data in this release are "Tree".\n\nScore: The confidence score from the DeepForest deep learning algorithm. 
The score ranges from 0 (low confidence) to 1 (high confidence)\n\nHow were predictions made?\n\nThe DeepForest algorithm is available as a Python package: https://deepforest.readthedocs.io/. Predictions were overlaid on the LiDAR-derived canopy height model. Predictions with heights less than 3m were removed.\n\nHow were predictions validated?\n\nPlease see\n\nWeinstein, B. G., Marconi, S., Bohlman, S. A., Zare, A., & White, E. P. (2020). Cross-site learning in deep learning RGB tree crown detection. Ecological Informatics, 56, 101061.\n\nWeinstein, B., Marconi, S., Aubry-Kientz, M., Vincent, G., Senyondo, H., & White, E. (2020). DeepForest: A Python package for RGB deep learning tree crown delineation. bioRxiv.\n\nWeinstein, B. G., et al. (2019). Individual tree-crown detection in RGB imagery using semi-supervised deep learning neural networks. Remote Sensing, 11(11), 1309.\n\nWere any sites removed?\n\nSeveral sites were removed due to poor NEON data quality. GRSM and PUUM both had lower quality RGB data that made them unsuitable for prediction. NEON surveys are updated annually and we expect future flights to correct these errors. We removed the GUIL Puerto Rico site due to its very steep topography and poor sun angle during data collection. The DeepForest algorithm performed poorly when predicting crowns in intensely shaded areas where there was very little sun penetration. We are happy to make these data available upon request.\n\nContact\n\nWe welcome questions, ideas and general inquiries. The data can be used for many applications and we look forward to hearing from you. Contact ben.weinstein@weecology.org."],"Other":["Gordon and Betty Moore Foundation: GBMF4563"]}},
journal = {},
publisher = {Zenodo},
author = {Weinstein, Ben and Marconi, Sergio and Zare, Alina and Bohlman, Stephanie and Graves, Sarah and Singh, Aditya and White, Ethan},
}
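The tile naming convention described in the abstract (year, four-letter site code, sampling event, and UTM corner coordinates) can be unpacked programmatically. The sketch below is one possible way to do that; the `NeonTile` type, the `parse_tile_name` function, and the regular expression are hypothetical names introduced for illustration, and the pattern assumes every file follows the `<year>_<site>_<event>_<easting>_<northing>_image.<shp|tif>` layout of the example file name.

```python
import re
from typing import NamedTuple


class NeonTile(NamedTuple):
    year: int      # year of data collection
    site: str      # four-letter NEON site code
    event: int     # sampling event
    easting: int   # first UTM corner coordinate, in meters
    northing: int  # second UTM corner coordinate, in meters


# Assumed layout, based on the single example name given in the abstract.
TILE_PATTERN = re.compile(
    r"^(\d{4})_([A-Z0-9]{4})_(\d+)_(\d+)_(\d+)_image\.(?:shp|tif)$"
)


def parse_tile_name(filename: str) -> NeonTile:
    """Split a tile file name into its components, or raise ValueError."""
    match = TILE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized tile name: {filename!r}")
    year, site, event, easting, northing = match.groups()
    return NeonTile(int(year), site, int(event), int(easting), int(northing))
```

Calling `parse_tile_name("2019_SOAP_4_302000_4100000_image.shp")` yields the year 2019, site SOAP, sampling event 4, and the coordinate pair 302000 and 4100000. Because the shapefile and its source image share the same stem, the same function can be used to pair each ".shp" with its ".tif".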