
Title: Citizen science, computing, and conservation: How can “Crowd AI” change the way we tackle large-scale ecological challenges?
Camera traps (remote cameras that capture images of passing wildlife) have become a ubiquitous tool in ecology and conservation. Systematic camera trap surveys generate ‘Big Data’ across broad spatial and temporal scales, providing valuable information on environmental and anthropogenic factors affecting vulnerable wildlife populations. However, the sheer number of images amassed can quickly outpace researchers’ ability to manually extract data from those images (e.g., species identities, counts, and behaviors) in timeframes useful for making scientifically guided conservation and management decisions. Here, we present ‘Snapshot Safari’ as a case study in merging citizen science and machine learning to rapidly generate highly accurate ecological Big Data from camera trap surveys. Snapshot Safari is a collaborative, cross-continental research and conservation effort with more than 1,500 cameras deployed across over 40 protected areas in eastern and southern Africa, generating millions of images per year. As one of the first and largest-scale camera-trapping initiatives, Snapshot Safari spearheaded innovative developments in citizen science and machine learning. We highlight the advances made and discuss the issues that arose in using each of these methods to annotate camera trap data. We end by describing how we combined human and machine classification methods (‘Crowd AI’) to create an efficient, integrated data pipeline. Ultimately, by using a feedback loop in which humans validate machine learning predictions and machine learning algorithms are iteratively retrained on new human classifications, we can capitalize on the strengths of both methods of classification while mitigating the weaknesses of each. Using Crowd AI to quickly and accurately ‘unlock’ ecological Big Data for use in science and conservation is revolutionizing the way we take on critical environmental issues in the Anthropocene era.
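To make the feedback loop concrete, here is a minimal Python sketch of the kind of Crowd AI routing the abstract describes: confident machine predictions are accepted, uncertain images are sent to volunteers for a consensus classification, and the model is retrained on the resulting human labels. The threshold, function names, and data structures are illustrative assumptions, not Snapshot Safari's actual implementation.

    from collections import Counter

    # Hypothetical acceptance threshold; the paper does not publish one.
    CONFIDENCE_THRESHOLD = 0.95

    def consensus_label(votes):
        """Majority vote over volunteer classifications for one image."""
        label, count = Counter(votes).most_common(1)[0]
        return label, count / len(votes)

    def crowd_ai_round(images, predict, ask_volunteers, retrain, model):
        """One round of the human/machine feedback loop: accept confident
        machine predictions, route uncertain images to the crowd, then
        retrain the model on the newly human-validated labels."""
        labels, uncertain = {}, []
        for image in images:
            species, confidence = predict(model, image)
            if confidence >= CONFIDENCE_THRESHOLD:
                labels[image] = species          # trust the machine
            else:
                uncertain.append(image)          # defer to volunteers
        for image in uncertain:
            species, agreement = consensus_label(ask_volunteers(image))
            labels[image] = species              # human consensus wins
        model = retrain(model, labels)           # close the loop
        return labels, model

Each round shrinks the share of images that must go to volunteers, since the retrained model grows more confident on the classes humans have already validated.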
Award ID(s):
1810586 1835530 1835272
NSF-PAR ID:
10298984
Journal Name:
Human Computation
Volume:
8
Issue:
2
Page Range or eLocation-ID:
54 to 75
ISSN:
2330-8001
Sponsoring Org:
National Science Foundation
More Like this
  1. The research data repository of the Environmental Data Initiative (EDI) is building on over 30 years of data curation research and experience in the National Science Foundation-funded US Long-Term Ecological Research (LTER) Network. It provides mature functionalities and well-established workflows, and it now publishes all ‘long-tail’ environmental data. High-quality scientific metadata are enforced through automatic checks against community-developed rules and the Ecological Metadata Language (EML) standard. Although the EDI repository is far along in making its data findable, accessible, interoperable, and reusable (FAIR), representatives from EDI and the LTER are developing best practices for the edge cases in environmental data publishing. One of these is the vast amount of imagery taken in the context of ecological research, ranging from wildlife camera traps to plankton imaging systems to aerial photography. Many images are used in biodiversity research for community analyses (e.g., individual counts, species cover, biovolume, productivity), while others are taken to study animal behavior and landscape-level change. Some examples from the LTER Network include using photos of a heron colony to measure provisioning rates for chicks (Clarkson and Erwin 2018) and identifying changes in plant cover and functional type through time (Peters et al. 2020). Multi-spectral images are employed to identify prairie species. Underwater photo quads are used to monitor changes in benthic biodiversity (Edmunds 2015). Sosik et al. (2020) used a continuous Imaging FlowCytobot to identify and measure phyto- and microzooplankton. Cameras at McMurdo Dry Valleys assess snow and ice cover on Antarctic lakes, allowing estimation of primary production (Myers 2019). It has been standard practice to publish numerical data extracted from images in EDI; however, the supporting imagery generally has not been made publicly available. Our goal in developing best practices for documenting and archiving these images is for them to be discovered and re-used. Our examples demonstrate several issues. The research questions, and hence the image subjects, are variable. Images frequently come in logical sets of time series. The size of such sets can be large, and only some images may be contributed to a dedicated specialized repository. Finally, these images are taken in a larger monitoring context where many other environmental data are collected at the same time and location. Currently, the typical approach to publishing image data in EDI is a package containing compressed (ZIP or tar) files with the images, a directory manifest with additional image-specific metadata, and a package-level EML metadata file. Images in the compressed archive may be organized within directories with filenames corresponding to treatments, locations, time periods, individuals, or other grouping attributes. Additionally, the directory manifest table has columns for each attribute. Package-level metadata include standard coverage elements (e.g., date, time, location) and sampling methods. This approach of archiving logical ‘sets’ of images reduces the effort of providing metadata for each image when most information would be repeated, but at the expense of not making every image individually searchable. The latter may be overcome if the provided manifest contains standard metadata that would allow searching and automatic integration with other images.
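As a concrete illustration of the package layout just described (compressed image archive plus a manifest table plus package-level EML), here is a short Python sketch. The directory convention, manifest column names, and file types are assumptions for the example; EDI does not mandate this exact layout.

    import csv
    import zipfile
    from pathlib import Path

    def build_image_package(image_dir, package_name):
        """Assemble an image data package in the style described above:
        a compressed archive of the images plus a manifest table with
        one row of grouping attributes per image. The directory
        convention ('site/date/treatment/*.jpg') is an illustrative
        assumption, not an EDI requirement."""
        image_dir = Path(image_dir)
        images = sorted(image_dir.rglob("*.jpg"))

        # Directory manifest: one row per image, so each image remains
        # individually searchable despite shipping in one archive.
        manifest = Path(f"{package_name}_manifest.csv")
        with manifest.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["filename", "site", "date", "treatment"])
            for img in images:
                site, date, treatment = img.relative_to(image_dir).parts[:3]
                writer.writerow([img.name, site, date, treatment])

        # Compressed archive preserving the grouping directory structure.
        with zipfile.ZipFile(f"{package_name}.zip", "w") as zf:
            for img in images:
                zf.write(img, arcname=str(img.relative_to(image_dir)))
            zf.write(manifest, arcname=manifest.name)
        # Package-level EML metadata (coverage, methods) is authored
        # separately, e.g., with the EML R package or EDI's ezEML tool.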
  2. Obeid, I. (Ed.)
    The Neural Engineering Data Consortium (NEDC) is developing the Temple University Digital Pathology Corpus (TUDP), an open-source database of high-resolution images from scanned pathology samples [1], as part of its National Science Foundation-funded Major Research Instrumentation grant titled “MRI: High Performance Digital Pathology Using Big Data and Machine Learning” [2]. The long-term goal of this project is to release one million images. We have currently scanned over 100,000 images and are in the process of annotating breast tissue data for our first official corpus release, v1.0.0. This release contains 3,505 annotated images of breast tissue, including 74 patients with cancerous diagnoses (out of a total of 296 patients). In this poster, we will present an analysis of this corpus and discuss the challenges we have faced in efficiently producing high-quality annotations of breast tissue. It is well known that state-of-the-art algorithms in machine learning require vast amounts of data. Fields such as speech recognition [3], image recognition [4], and text processing [5] are able to deliver impressive performance with complex deep learning models because they have developed large corpora to support training of extremely high-dimensional models (e.g., billions of parameters). Other fields that do not have access to such data resources must rely on techniques in which existing models can be adapted to new datasets [6]. A preliminary version of this breast corpus release was tested in a pilot study using a baseline machine learning system, ResNet18 [7], that leverages several open-source Python tools. The pilot corpus was divided into three sets: train, development, and evaluation. Portions of these slides were manually annotated [1] using the nine labels in Table 1 [8] to identify five to ten examples of pathological features on each slide. Not every pathological feature is annotated, meaning that areas excluded from annotation can still contain foci of these labels; such foci are not used for training. A summary of the number of patches within each label is given in Table 2. To maintain a balanced training set, 1,000 patches of each label were used to train the machine learning model. Throughout all sets, only annotated patches were involved in model development. The performance of this model in identifying all the patches in the evaluation set can be seen in the confusion matrix of classification accuracy in Table 3. The highest-performing labels were background, with 97% correct identification, and artifact, with 76% correct identification. A correlation exists between labels with more than 6,000 development patches and accurate performance on the evaluation set. Additionally, these results indicated a need to further refine the annotation of invasive ductal carcinoma (“indc”), inflammation (“infl”), nonneoplastic features (“nneo”), normal (“norm”), and suspicious (“susp”). This pilot experiment motivated changes to the corpus that will be discussed in detail in this poster presentation. To increase the accuracy of the machine learning model, we modified how we addressed underperforming labels. One common source of error arose from how non-background labels were converted into patches. Large areas of background within other labels were isolated within a patch, resulting in connective tissue misrepresenting a non-background label. In response, the annotation overlay margins were revised to exclude benign connective tissue from non-background labels.
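For readers unfamiliar with the setup, the following Python (PyTorch) sketch shows the kind of balanced-training configuration described above: nine labels, 1,000 patches per label, and a ResNet18 whose classification head is resized to the label set. It is a minimal illustration under those stated assumptions, not the NEDC group's actual training code; data loading and the training loop are omitted.

    import random
    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    NUM_LABELS = 9            # the nine annotation labels of Table 1
    PATCHES_PER_LABEL = 1000  # balanced training set described above

    def balanced_subset(patches_by_label, n=PATCHES_PER_LABEL):
        """Draw the same number of patches per label so that frequent
        classes (e.g., background) do not dominate training. Assumes
        each label has at least n annotated patches available."""
        subset = []
        for label, patches in patches_by_label.items():
            subset.extend((p, label) for p in random.sample(patches, n))
        random.shuffle(subset)
        return subset

    # ResNet18 (torchvision >= 0.13 API) with its final layer resized
    # from 1,000 ImageNet classes to the nine patch labels.
    model = resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, NUM_LABELS)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)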
Corresponding patient reports and supporting immunohistochemical stains further guided annotation reviews. The microscopic diagnoses given by the primary pathologist in these reports detail the pathological findings within each tissue site, but not within each specific slide. The microscopic diagnoses informed revisions specifically targeting annotated regions classified as cancerous, ensuring that the labels “indc” and “dcis” were used only where the microscopic diagnosis confirmed them. Further differentiation of cancerous and precancerous labels, as well as the location of their foci on a slide, could be accomplished with supplemental immunohistochemically (IHC) stained slides. When distinguishing whether a focus is a nonneoplastic feature or a cancerous growth, pathologists apply antigen-targeting stains to the tissue in question to confirm the diagnosis. For example, a nonneoplastic feature of usual ductal hyperplasia will display diffuse staining for cytokeratin 5 (CK5) and no diffuse staining for estrogen receptor (ER), while a cancerous growth of ductal carcinoma in situ will have negative or focally positive staining for CK5 and diffuse staining for ER [9]. Many tissue samples contain cancerous and non-cancerous features with morphological overlaps that cause variability between annotators. The informative signal that IHC slides provide could play an integral role in machine-model pathology diagnostics. Following the revisions made to all the annotations, a second experiment was run using ResNet18. Compared to the pilot study, an increase in model prediction accuracy was seen for the labels indc, infl, nneo, norm, and null. This increase is correlated with an increase in annotated area and annotation accuracy. Model performance in identifying the suspicious label decreased by 25% due to a 57% decrease in the total annotated area described by this label. A summary of the model performance is given in Table 4, which shows the new prediction accuracy and the absolute change in error rate compared to Table 3. The breast tissue subset we are developing includes 3,505 annotated breast pathology slides from 296 patients. The average size of a scanned SVS file is 363 MB. The annotations are stored in an XML format. A CSV version of the annotation file is also available, which provides a flat, or simple, annotation that is easy for machine learning researchers to access and interface with their systems. Each patient is identified by an anonymized medical reference number. Within each patient’s directory, one or more sessions are identified, also anonymized to the first of the month in which the sample was taken. These sessions are broken into groupings of tissue taken on that date (in this case, breast tissue). A deidentified patient report stored as a flat text file is also available. Within these slides there are a total of 16,971 annotated regions, with an average of 4.84 annotations per slide. Among those annotations, 8,035 are non-cancerous (normal, background, null, and artifact), 6,222 are carcinogenic signs (inflammation, nonneoplastic, and suspicious), and 2,714 are cancerous labels (ductal carcinoma in situ and invasive ductal carcinoma). The individual patients are split into three sets: train, development, and evaluation. Of the 74 cancerous patients, 20 were allotted to each of the development and evaluation sets, while the remaining 34 were allotted to the training set.
The remaining 222 patients were split so as to preserve the overall distribution of labels within the corpus. This was done in the hope of creating control sets for comparable studies. Overall, the development and evaluation sets each have 80 patients, while the training set has 136 patients. In a related component of this project, slides from the Fox Chase Cancer Center (FCCC) Biosample Repository (https://www.foxchase.org/research/facilities/genetic-research-facilities/biosample-repository-facility) are being digitized in addition to slides provided by Temple University Hospital. These data include 18 different types of tissue, approximately 38.5% urinary tissue and 16.5% gynecological tissue. These slides and the metadata provided with them are already anonymized and include diagnoses in a spreadsheet with sample and patient IDs. We plan to release over 13,000 unannotated slides from the FCCC Corpus simultaneously with v1.0.0 of TUDP. Details of this release will also be discussed in this poster. Few digitally annotated databases of pathology samples like TUDP exist due to the extensive data collection and processing required. The breast corpus subset should be released by November 2021. By December 2021 we should also release the unannotated FCCC data. We are currently annotating urinary tract data as well, and we expect to release about 5,600 processed TUH slides in this subset. We have an additional 53,000 unprocessed TUH slides digitized. Corpora of this size will stimulate the development of a new generation of deep learning technology. In clinical settings where resources are limited, an assistive diagnosis model could support pathologists’ workloads and even help prioritize suspected cancerous cases. ACKNOWLEDGMENTS This material is supported by the National Science Foundation under grant nos. CNS-1726188 and 1925494. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. REFERENCES [1] N. Shawki et al., “The Temple University Digital Pathology Corpus,” in Signal Processing in Medicine and Biology: Emerging Trends in Research and Applications, 1st ed., I. Obeid, I. Selesnick, and J. Picone, Eds. New York City, New York, USA: Springer, 2020, pp. 67–104. https://www.springer.com/gp/book/9783030368432. [2] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning.” Major Research Instrumentation (MRI), Division of Computer and Network Systems, Award No. 1726188, January 1, 2018 – December 31, 2021. https://www.isip.piconepress.com/projects/nsf_dpath/. [3] A. Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2020, pp. 5036–5040. https://doi.org/10.21437/interspeech.2020-3015. [4] C.-J. Wu et al., “Machine Learning at Facebook: Understanding Inference at the Edge,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 331–344. https://ieeexplore.ieee.org/document/8675201. [5] I. Caswell and B. Liang, “Recent Advances in Google Translate,” Google AI Blog, 2020. [Online]. Available: https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html. [Accessed: 01-Aug-2021]. [6] V. Khalkhali, N. Shawki, V. Shah, M. Golmohammadi, I. Obeid, and J.
Picone, “Low Latency Real-Time Seizure Detection Using Transfer Deep Learning,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2021, pp. 1–7. https://www.isip.piconepress.com/publications/conference_proceedings/2021/ieee_spmb/eeg_transfer_learning/. [7] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning,” Philadelphia, Pennsylvania, USA, 2020. https://www.isip.piconepress.com/publications/reports/2020/nsf/mri_dpath/. [8] I. Hunt, S. Husain, J. Simons, I. Obeid, and J. Picone, “Recent Advances in the Temple University Digital Pathology Corpus,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2019, pp. 1–4. https://ieeexplore.ieee.org/document/9037859. [9] A. P. Martinez, C. Cohen, K. Z. Hanley, and X. (Bill) Li, “Estrogen Receptor and Cytokeratin 5 Are Reliable Markers to Separate Usual Ductal Hyperplasia From Atypical Ductal Hyperplasia and Low-Grade Ductal Carcinoma In Situ,” Arch. Pathol. Lab. Med., vol. 140, no. 7, pp. 686–689, Apr. 2016. https://doi.org/10.5858/arpa.2015-0238-OA.
  3. The CCERS partnership brings together collaborators from universities, foundations, education departments, community organizations, and cultural institutions to build a new curriculum. As reported in a study conducted by the RAND Corporation (2011), partnerships among districts, community-based organizations, government agencies, local funders, and others can strengthen learning programs. The curriculum merged project-based learning and Bybee’s 5E model (Note 1) to teach core STEM-C concepts to urban middle school students through restoration science. CCERS has five interrelated and complementary programmatic pillars (see details in the next section). The CCERS curriculum encourages urban middle school students to explore and participate in project-based learning activities restoring the oyster population in and around New York Harbor. As Melaville, Berg, and Blank state in Community-Based Learning (2001), “Education must connect subject matter with the places where students live and the issues that affect us all”. Lessons engage students and teachers in long-term restoration ecology and environmental monitoring projects with STEM professionals and citizen scientists. In brief, partners have created curricula for both in-school and out-of-school learning programs, an online platform for educators and students to collaborate, and exhibits with community partners to reinforce and extend both the educators’ and their students’ learning. Currently, CCERS implementation involves 78 middle schools, 127 teachers, 110 scientist volunteers, and over 5,000 K-12 students. In this report, we present summative findings from data collected via surveys among three cohorts of students whose teachers were trained in the project’s curriculum, along with findings from interviews among project leaders, to answer the following research questions: 1. Do the five programmatic pillars function independently and collectively as a system of interrelated STEM-C content delivery vehicles that also effectively change students’ and educators’ disposition towards STEM-C learning and environmental restoration and stewardship? 2. What comprises the “curriculum plus community enterprise” local model? 3. What are the mechanisms for creating sustainability and scalability of the model locally during and beyond its five-year implementation? 4. What core aspects of the model are replicable? Findings suggest the program improved students’ knowledge in the life sciences but did not have a significant effect on students’ intent to become a scientist or affinity for science. Interviews with project staff indicated that the key factors in the model were its conservation mission, partnerships, and the local nature of the issues involved. The primary mechanisms for sustainability and scalability beyond the five-year implementation were the digital platform, the curriculum itself, and dissemination (with over 450 articles related to the project published in the media and academic journals). The core replicable aspects identified were the digital platform and adoption in other keystone species contexts.
  4. 1. Camera trap technology has galvanized the study of predator-prey ecology in wild animal communities by expanding the scale and diversity of predator-prey interactions that can be analyzed. While observational data from systematic camera arrays have informed inferences on the spatiotemporal outcomes of predator-prey interactions, the capacity for observational studies to identify mechanistic drivers of species interactions is limited. 2. Experimental study designs that utilize camera traps uniquely allow for testing hypothesized mechanisms that drive predator and prey behavior, incorporating environmental realism not possible in the lab while benefiting from the distinct capacity of camera traps to generate large data sets from multiple species with minimal observer interference. However, such pairings of camera traps with experimental methods remain underutilized. 3. We review recent advances in the experimental application of camera traps to investigate fundamental mechanisms underlying predator-prey ecology and present a conceptual guide for designing experimental camera trap studies. 4. Only 9% of camera trap studies on predator-prey ecology in our review mention experimental methods, but the application of experimental approaches is increasing. To illustrate the utility of camera trap-based experiments using a case study, we propose a study design that integrates observational and experimental techniques to test a perennial question in predator-prey ecology: how prey balance foraging and safety, as formalized by the risk allocation hypothesis. We discuss applications of camera trap-based experiments to evaluate the diversity of anthropogenic influences on wildlife communities globally. Finally, we review challenges to conducting experimental camera trap studies. 5. Experimental camera trap studies have already begun to play an important role in understanding the predator-prey ecology of free-living animals, and such methods will become increasingly critical to quantifying drivers of community interactions in a rapidly changing world. We recommend increased application of experimental methods in the study of predator and prey responses to humans, synanthropic and invasive species, and other anthropogenic disturbances.
  5. Citizen science has helped astronomers comb through large data sets to identify patterns and objects that are not easily found through automated processes. The Milky Way Project (MWP), a citizen science initiative on the Zooniverse platform, presents internet users with infrared (IR) images from Spitzer Space Telescope Galactic plane surveys. MWP volunteers make classification drawings on the images to identify targeted classes of astronomical objects. We present the MWP second data release (DR2) and an updated data reduction pipeline written in python. We aggregate ∼3 million classifications made by MWP volunteers during the years 2012–2017 to produce the DR2 catalogue, which contains 2600 IR bubbles and 599 candidate bow shock driving stars. The reliability of bubble identifications, as assessed by comparison to visual identifications by trained experts and scoring by a machine-learning algorithm, is found to be a significant improvement over DR1. We assess the reliability of IR bow shocks via comparison to expert identifications and the colours of candidate bow shock driving stars in the 2MASS point-source catalogue. We hence identify highly reliable subsets of 1394 DR2 bubbles and 453 bow shock driving stars. Uncertainties on object coordinates and bubble size/shape parameters are included in the DR2 catalogue. Compared with DR1, the DR2 bubbles catalogue provides more accurate shapes and sizes. The DR2 catalogue identifies 311 new bow shock driving star candidates, including three associated with the giant H ii regions NGC 3603 and RCW 49.
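To illustrate the aggregation step such a pipeline performs, here is a minimal Python sketch that clusters volunteer bubble drawings by position and reduces each cluster to a consensus object with positional and size uncertainties. The clustering method (DBSCAN) and the eps_deg/min_votes values are assumptions for illustration, not the actual parameters of the MWP DR2 pipeline.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def consensus_objects(drawings, eps_deg=0.01, min_votes=5):
        """Cluster volunteer drawings by sky position and reduce each
        cluster to one candidate object. `drawings` is an (N, 3) array
        of (glon, glat, radius) rows; eps_deg and min_votes are
        illustrative tuning values only."""
        drawings = np.asarray(drawings, dtype=float)
        labels = DBSCAN(eps=eps_deg, min_samples=min_votes).fit_predict(
            drawings[:, :2])            # cluster on (glon, glat) only
        catalogue = []
        for k in set(labels) - {-1}:    # label -1 marks unclustered noise
            members = drawings[labels == k]
            centre = members.mean(axis=0)   # consensus position and size
            spread = members.std(axis=0)    # uncertainty estimates
            catalogue.append((centre, spread, len(members)))
        return catalogue

Requiring a minimum number of votes per cluster is what filters out stray individual drawings, and the per-cluster scatter provides the kind of coordinate and size/shape uncertainties the DR2 catalogue reports.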