Title: iNaturalist is an open science resource for ecological genomics by enabling rapid and tractable records of initial observations of sequenced biological samples
The rapidly growing body of publicly available sequencing data for rare species and/or wild-caught samples is accelerating the need for detailed records of the samples used to generate datasets. Many already published datasets are unlikely to ever be reused, not due to problems with the data themselves, but due to their questionable or unverifiable origins. In this paper, I present iNaturalist, a pre-existing citizen science platform that allows people to post photo observations of organisms in nature, as a tool that allows genomics researchers to rapidly publish observations of samples used to generate sequencing datasets. This practice aligns with the values of the open science movement, and I also discuss how iNaturalist, along with other online resources, can be used to create an open genomics pipeline that enables future replication studies and ensures the value of genomics datasets to future research.
Award ID(s):
2010772
PAR ID:
10507123
Author(s) / Creator(s):
Publisher / Repository:
Biology Letters
Date Published:
Journal Name:
Biology Letters
Volume:
19
Issue:
10
ISSN:
1744-957X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Over the past decade, museum genomics studies have focused on obtaining DNA of sufficient quality and quantity for sequencing from fluid-preserved natural history specimens, primarily to be used in systematic studies. While these studies have opened windows to evolutionary and biodiversity knowledge of many species worldwide, published works often focus on the success of these DNA sequencing efforts, which is undoubtedly less common than obtaining minimal or sometimes no DNA or unusable sequence data from specimens in natural history collections. Here, we attempt to obtain and sequence DNA extracts from 115 fresh and 41 degraded samples of homalopsid snakes, as well as from two degraded samples of a poorly known snake, Hydrablabes periops. Hydrablabes has been suggested to belong to at least two different families (Natricidae and Homalopsidae) and with no fresh tissues known to be available, intractable museum specimens currently provide the only opportunity to determine this snake’s taxonomic affinity. Although our aim was to generate a target-capture dataset for these samples, to be included in a broader phylogenetic study, results were less than ideal due to large amounts of missing data, especially using the same downstream methods as with standard, high-quality samples. However, rather than discount results entirely, we used mapping methods with references and pseudoreferences, along with phylogenetic analyses, to maximize any usable molecular data from our sequencing efforts, identify the taxonomic affinity of H. periops, and compare sequencing success between fresh and degraded tissue samples. This resulted in largely complete mitochondrial genomes for five specimens and hundreds to thousands of nuclear loci (ultra-conserved loci, anchored-hybrid enrichment loci, and a variety of loci frequently used in squamate phylogenetic studies) from fluid-preserved snakes, including a specimen of H. periops from the Field Museum of Natural History collection.
We combined our H. periops data with previously published genomic and Sanger-sequenced datasets to confirm the familial designation of this taxon, reject previous taxonomic hypotheses, and make biogeographic inferences for Hydrablabes. A second H. periops specimen, despite being seemingly similar for initial raw sequencing results and after being put through the same protocols, resulted in little usable molecular data. We discuss the successes and failures of using different pipelines and methods to maximize the products from these data and provide expectations for others who are looking to use DNA sequencing efforts on specimens that likely have degraded DNA. Life Science Identifier (Hydrablabes periops) urn:lsid:zoobank.org:pub:F2AA44E2-D2EF-4747-972A-652C34C2C09D.
  2. Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high resolution biological data. The community is rapidly heading toward the petascale in single investigator laboratory settings. As evidence, the single NCBI SRA central DNA sequence repository contains over 45 petabytes of biological data. Given the geometric growth of this and other genomics repositories, an exabyte of mineable biological data is imminent. The challenges of effectively utilizing these datasets are enormous as they are not only large in size but also stored in geographically distributed repositories such as National Center for Biotechnology Information (NCBI), DNA Data Bank of Japan (DDBJ), European Bioinformatics Institute (EBI), and NASA’s GeneLab. In this work, we first systematically point out the data-management challenges of the genomics community. We then introduce Named Data Networking (NDN), a novel but well-researched Internet architecture that is capable of solving these challenges at the network layer. NDN performs all operations such as forwarding requests to data sources, content discovery, access, and retrieval using content names (that are similar to traditional filenames or filepaths) and eliminates the need for a location layer (the IP address) for data management. Utilizing NDN for genomics workflows simplifies data discovery, speeds up data retrieval using in-network caching of popular datasets, and allows the community to create infrastructure that supports operations such as creating federation of content repositories, retrieval from multiple sources, remote data subsetting, and others. Name-based operations also streamline deployment and integration of workflows with various cloud platforms.
Our contributions in this work are as follows: 1) we enumerate the cyberinfrastructure challenges of the genomics community that NDN can alleviate, and 2) we describe our efforts in applying NDN for a contemporary genomics workflow (GEMmaker) and quantify the improvements. The preliminary evaluation shows a sixfold speed up in data insertion into the workflow. 3) As a pilot, we have used an NDN naming scheme (agreed upon by the community and discussed in Section 4) to publish data from broadly used data repositories including the NCBI SRA. We have loaded the NDN testbed with these pre-processed genomes that can be accessed over NDN and used by anyone interested in those datasets. Finally, we discuss our continued effort in integrating NDN with cloud computing platforms, such as the Pacific Research Platform (PRP). The reader should note that the goal of this paper is to introduce NDN to the genomics community and discuss NDN’s properties that can benefit the genomics community. We do not present an extensive performance evaluation of NDN; we are working on extending and evaluating our pilot deployment and will present systematic results in a future work.
  3. Microorganisms are ubiquitous in the biosphere, playing a crucial role in both biogeochemistry of the planet and human health. However, identifying these microorganisms and defining their function are challenging. Widely used approaches in comparative metagenomics, 16S amplicon sequencing and whole genome shotgun sequencing (WGS), have provided access to DNA sequencing analysis to identify microorganisms and evaluate diversity and abundance in various environments. However, advances in parallel high-throughput DNA sequencing in the past decade have introduced major hurdles, namely standardization of methods, data storage, reproducible interoperability of results, and data sharing. The National Ecological Observatory Network (NEON), established by the National Science Foundation, enables all researchers to address queries on a regional to continental scale around a variety of environmental challenges and provide high-quality, integrated, and standardized data from field sites across the U.S. As the amount of metagenomic data continues to grow, standardized procedures that allow results across projects to be assessed and compared are becoming increasingly important in the field of metagenomics. We demonstrate the feasibility of using publicly available NEON soil metagenomic sequencing datasets in combination with open access Metagenomics Rapid Annotation using the Subsystem Technology (MG-RAST) server to illustrate advantages of WGS compared to 16S amplicon sequencing. Four WGS and four 16S amplicon sequence datasets, from surface soil samples prepared by NEON investigators, were selected for comparison, using standardized protocols collected at the same locations in Colorado between April and July 2014. The dominant bacterial phyla detected across samples agreed between sequencing methodologies.
However, WGS yielded greater microbial resolution, increased accuracy, and allowed identification of more genera of bacteria, archaea, viruses, and eukaryota, and putative functional genes that would have gone undetected using 16S amplicon sequencing. NEON open data will be useful for future studies characterizing and quantifying complex ecological processes associated with changing aquatic and terrestrial ecosystems. 
  4. Qin, Hong (Ed.)
    iNaturalist has the potential to be an extremely rich source of organismal occurrence data. Launched in 2008, it contained over 150 million uploaded observations as of May 2023. Based on the findings of a limited number of past studies assessing the taxonomic accuracy of participatory science-driven sources of occurrence data such as iNaturalist, there has been concern that some portion of these records might be misidentified in certain taxonomic groups. In this case study, we compare Research Grade iNaturalist observations with digitized herbarium specimens, both of which are currently available for combined download from large data aggregators and are therefore the primary sources of occurrence data for large-scale biodiversity/biogeography studies. Our comparisons were confined regionally to the southeastern United States (Florida, Georgia, North Carolina, South Carolina, Texas, Tennessee, Kentucky, and Virginia). Occurrence records from ten plant families (Gentianaceae, Ericaceae, Melanthiaceae, Ulmaceae, Fabaceae, Asteraceae, Fagaceae, Cyperaceae, Juglandaceae, Apocynaceae) were downloaded and scored on taxonomic accuracy. We found a comparable and relatively low rate of misidentification among both digitized herbarium specimens and Research Grade iNaturalist observations within the study area. This finding illustrates the utility and high quality of iNaturalist data for future research in the region, but also points to key differences between data types, giving each a respective advantage, depending on applications of the data.
  5. Understanding the ranges of rare and endangered species is central to conserving biodiversity in the Anthropocene. Species distribution models (SDMs) have become a common and powerful tool for analyzing species–environment relationships across geographic space. Although evaluating the distribution of rare species is integral to their conservation, this can be difficult when limited distribution data are available. Community science platforms, such as iNaturalist, have emerged as alternative sources for species occurrence data. Although these observations are often thought to be of lower quality than those of natural history collections, they may have potential for improving SDMs for species with few occurrence records from collections. Here, we investigate the utility of iNaturalist data for developing SDMs for a rare high‐elevation plant, Telesonix jamesii. Because methods for modeling rare species are limited in the literature, five different modeling techniques were considered, including profile methods, statistical models, and machine learning algorithms. The inclusion of iNaturalist data doubled the number of usable records for T. jamesii. We found that a random forest (RF) model using ensemble training data performed best of any model (area under curve = 0.98). We then compared the performance of RF models that use only natural history training data and those that use a combination of natural history (herbarium specimens) and iNaturalist training data. All models heavily relied on climate data (mean temperature of driest quarter, and precipitation of the warmest quarter), indicating that this species is under threat as climate continues to change. Validation datasets affected model fits as well. Models using only herbarium data performed slightly worse when evaluated with cross‐validation than when validated externally with iNaturalist data. This study can serve as a model for future SDM studies of species with similar data limitations.