

Title: Using demographics toward efficient data classification in citizen science: a Bayesian approach

Public participation in scientific activities, often called citizen science, makes it possible to collect and analyze an unprecedentedly large amount of data. However, the diversity of volunteers makes it challenging to obtain accurate information when these data are aggregated. To overcome this problem, we propose a classification algorithm using Bayesian inference that harnesses the diversity of volunteers to improve data accuracy. In the algorithm, each volunteer is assigned to a distinct class based on a survey regarding either their level of education or their motivation for participating in citizen science. We obtained the behavior of each class through a training set, which was then used as prior information to estimate the performance of new volunteers. By applying this approach to an existing citizen science dataset to classify images into categories, we demonstrate an improvement in data accuracy compared to traditional majority voting. Our algorithm offers a simple, yet powerful, way to improve data accuracy under limited volunteer effort by predicting the behavior of a class of individuals, rather than attempting a granular description of each of them.
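The contrast between majority voting and class-based Bayesian aggregation can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the group names ("reliable", "noisy"), their confusion matrices, the uniform prior, and the function names are all hypothetical, and the paper's actual model may differ.

```python
from collections import Counter

import numpy as np


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def class_based_posterior(answers, groups, confusion, prior):
    """Posterior over the true category given each volunteer's answer and group.

    confusion[g][t, a] is the assumed probability that a volunteer in group g
    answers a when the true category is t (e.g. estimated from a training set).
    """
    log_post = np.log(np.asarray(prior, dtype=float))
    for answer, group in zip(answers, groups):
        # Each answer contributes the likelihood column for that group.
        log_post = log_post + np.log(confusion[group][:, answer])
    post = np.exp(log_post - log_post.max())  # subtract max for stability
    return post / post.sum()


# Hypothetical per-group behavior for a two-category task
# (rows: true category, columns: reported answer).
confusion = {
    "reliable": np.array([[0.90, 0.10], [0.10, 0.90]]),
    "noisy": np.array([[0.55, 0.45], [0.45, 0.55]]),
}
answers = [1, 1, 0]
groups = ["noisy", "noisy", "reliable"]

vote = majority_vote(answers)
posterior = class_based_posterior(answers, groups, confusion, prior=[0.5, 0.5])
print("majority vote:", vote)
print("Bayesian posterior:", posterior.round(3))
```

With these invented numbers, majority voting selects category 1, while the posterior favors category 0 with probability of about 0.86, because the single answer from the more reliable group outweighs two answers from the noisier group.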

 
NSF-PAR ID:
10125047
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
PeerJ
Date Published:
Journal Name:
PeerJ Computer Science
Volume:
5
ISSN:
2376-5992
Page Range / eLocation ID:
Article No. e239
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This dataset contains machine learning and volunteer classifications from the Gravity Spy project. It includes glitches from observing runs O1, O2, O3a and O3b that received at least one classification from a registered volunteer in the project. It also indicates glitches that are nominally retired from the project using our default set of retirement parameters, which are described below. See more details in the Gravity Spy Methods paper. 

    When a particular subject in a citizen science project (in this case, a glitch from the LIGO datastream) is deemed to be sufficiently classified, it is "retired" from the project. For the Gravity Spy project, retirement depends on a combination of both volunteer and machine learning classifications, and a number of parameterizations affect how quickly glitches get retired. For this dataset, we use a default set of retirement parameters, the most important of which are: 

    1. A glitch must be classified by at least 2 registered volunteers
    2. Based on both the initial machine learning classification and volunteer classifications, the glitch has more than a 90% probability of residing in a particular class
    3. Each volunteer classification (weighted by that volunteer's confusion matrix) contains a weight equal to the initial machine learning score when determining the final probability

    The choice of these and other parameterizations will affect the accuracy of the retired dataset as well as the number of glitches that are retired, and will be explored in detail in an upcoming publication (Zevin et al. in prep). 
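One possible reading of the first two retirement rules is a sequential Bayesian update that starts from the machine learning scores and folds in each volunteer vote through that volunteer's confusion matrix. The sketch below is an assumption-laden illustration, not the Gravity Spy implementation: the function names, the confusion matrix values, and the update rule are hypothetical, and the weighting described in rule 3 is not modeled.

```python
import numpy as np


def update_posterior(ml_scores, votes, confusions):
    """Combine ML scores with volunteer votes (assumed update rule).

    confusions[i][t, v] is the assumed probability that volunteer i votes
    for class v when the true class is t.
    """
    post = np.asarray(ml_scores, dtype=float).copy()
    for vote, confusion in zip(votes, confusions):
        post = post * confusion[:, vote]  # likelihood of this vote per class
        post = post / post.sum()          # renormalize to a probability
    return post


def is_retired(ml_scores, votes, confusions, min_volunteers=2, threshold=0.9):
    """Apply rules 1 and 2: enough volunteers and a confident top class."""
    if len(votes) < min_volunteers:
        return False
    post = update_posterior(ml_scores, votes, confusions)
    return bool(post.max() > threshold)


# Invented example: three classes, volunteers who are right 80% of the time.
ml_scores = [0.6, 0.3, 0.1]
volunteer = np.full((3, 3), 0.1) + 0.7 * np.eye(3)

one_vote = is_retired(ml_scores, [0], [volunteer])
two_votes = is_retired(ml_scores, [0, 0], [volunteer, volunteer])
print("retired after one vote:", one_vote)
print("retired after two votes:", two_votes)
```

In this toy case the glitch is not retired after a single vote, because the minimum volunteer count is not met, and it is retired after two agreeing votes, when the top-class probability exceeds the 0.9 threshold.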

    The dataset can be read in using e.g. Pandas: 
    ```
    import pandas as pd

    # read_hdf relies on the optional PyTables ("tables") package being installed
    dataset = pd.read_hdf('retired_fulldata_min2_max50_ret0p9.hdf5', key='image_db')
    ```
    Each row in the dataframe contains information about a particular glitch in the Gravity Spy dataset. 

    Description of series in dataframe

    • ['1080Lines', '1400Ripples', 'Air_Compressor', 'Blip', 'Chirp', 'Extremely_Loud', 'Helix', 'Koi_Fish', 'Light_Modulation', 'Low_Frequency_Burst', 'Low_Frequency_Lines', 'No_Glitch', 'None_of_the_Above', 'Paired_Doves', 'Power_Line', 'Repeating_Blips', 'Scattered_Light', 'Scratchy', 'Tomte', 'Violin_Mode', 'Wandering_Line', 'Whistle']
      • Machine learning scores for each glitch class in the trained model, which for a particular glitch will sum to unity
    • ['ml_confidence', 'ml_label']
      • Highest machine learning confidence score across all classes for a particular glitch, and the class associated with this score
    • ['gravityspy_id', 'id']
      • Unique identifier for each glitch on the Zooniverse platform ('gravityspy_id') and in the Gravity Spy project ('id'), which can be used to link a particular glitch to the full Gravity Spy dataset (which contains GPS times among many other descriptors)
    • ['retired']
      • Marks whether the glitch is retired using our default set of retirement parameters (1=retired, 0=not retired)
    • ['Nclassifications']
      • The total number of classifications performed by registered volunteers on this glitch
    • ['final_score', 'final_label']
      • The final score (weighted combination of machine learning and volunteer classifications) and the most probable type of glitch
    • ['tracks']
      • Array of classification weights that were added to each glitch category due to each volunteer's classification
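The column descriptions above suggest a few simple consistency checks. The sketch below runs them on a small stand-in dataframe whose values are invented for illustration; only the column names come from the description, only three of the glitch classes are included, and the `summarize` helper is hypothetical.

```python
import pandas as pd

# Toy stand-in for the real dataframe; all values are invented.
toy = pd.DataFrame({
    "Blip": [0.7, 0.2, 0.1],
    "Whistle": [0.2, 0.7, 0.3],
    "Koi_Fish": [0.1, 0.1, 0.6],
    "ml_label": ["Blip", "Whistle", "Koi_Fish"],
    "ml_confidence": [0.7, 0.7, 0.6],
    "final_label": ["Blip", "Whistle", "Blip"],
    "final_score": [0.97, 0.95, 0.55],
    "retired": [1, 1, 0],
    "Nclassifications": [4, 3, 2],
})
score_columns = ["Blip", "Whistle", "Koi_Fish"]


def summarize(df, score_columns):
    """Check the ML score columns and count retired glitches."""
    scores = df[score_columns]
    retired = df[df["retired"] == 1]
    return {
        # ML scores for one glitch should sum to unity.
        "scores_sum_to_one": bool(((scores.sum(axis=1) - 1.0).abs() < 1e-6).all()),
        # ml_label should be the class with the highest ML score.
        "ml_label_is_argmax": bool((scores.idxmax(axis=1) == df["ml_label"]).all()),
        "n_retired": int(len(retired)),
        # Retired glitches whose final label differs from the ML label.
        "n_relabelled": int((retired["final_label"] != retired["ml_label"]).sum()),
    }


summary = summarize(toy, score_columns)
print(summary)
```

On the real dataset, `score_columns` would be the full list of glitch class names given in the first bullet.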

     

    For machine learning classifications on all glitches in O1, O2, O3a, and O3b, please see Gravity Spy Machine Learning Classifications on Zenodo.

    For the most recently uploaded training set used in Gravity Spy machine learning algorithms, please see Gravity Spy Training Set on Zenodo.

    For detailed information on the training set used for the original Gravity Spy machine learning paper, please see Machine learning for Gravity Spy: Glitch classification and dataset on Zenodo. 

     
  2. Abstract

    We present analysis using a citizen science campaign to improve the cosmological measures from the Hobby–Eberly Telescope Dark Energy Experiment (HETDEX). The goal of HETDEX is to measure the Hubble expansion rate, H(z), and angular diameter distance, D_A(z), at z = 2.4, each to percent-level accuracy. This accuracy is determined primarily from the total number of detected Lyα emitters (LAEs), the false positive rate due to noise, and the contamination due to [O II] emitting galaxies. This paper presents the citizen science project, Dark Energy Explorers (https://www.zooniverse.org/projects/erinmc/dark-energy-explorers), with the goal of increasing the number of LAEs and decreasing the number of false positives due to noise and the [O II] galaxies. Initial analysis shows that citizen science is an efficient and effective tool for classification most accurately done by the human eye, especially in combination with unsupervised machine learning. Three aspects from the citizen science campaign that have the most impact are (1) identifying individual problems with detections, (2) providing a clean sample with 100% visual identification above a signal-to-noise cut, and (3) providing labels for machine-learning efforts. Since the end of 2022, Dark Energy Explorers has collected over three and a half million classifications by 11,000 volunteers in over 85 different countries around the world. By incorporating the results of the Dark Energy Explorers, we expect to improve the accuracy on the D_A(z) and H(z) parameters at z = 2.4 by 10%–30%. While the primary goal is to improve on HETDEX, Dark Energy Explorers has already proven to be a uniquely powerful tool for science advancement and increasing accessibility to science worldwide.

     
  3. This paper explores the assumptions that citizen science (CS) project leaders had about their volunteers’ overall science inquiry skill proficiency, and then examines volunteers’ actual proficiency in one specific skill, scientific observation, because it is fundamental to and shared by many projects. This work shares findings from interviews with 10 project leaders related to two common assumptions leaders have about their volunteers’ skill proficiency: one, that volunteers can perform the necessary skills to participate at the start of a CS project, and therefore may not need training; and two, that volunteer skill proficiency improves over time through involvement in the CS project. In order to answer questions about the degree of accuracy to which volunteers can perform the necessary skills and about differences in their skill proficiency based on experience and data collection procedures, we analyzed data from seven CS projects that used two shared embedded assessment tools, each focused on skills within the context of scientific observation in natural settings: notice relevant features for taxonomic identification and record standard observations. This across-project and cross-sectional study found that the majority of citizen science volunteers (n = 176) had the necessary skill proficiency to collect accurate scientific observations, but proficiency varied based on volunteer experience and project data collection procedures.

     
  4. Abstract We present the first results from Citizen ASAS-SN, a citizen science project for the All-Sky Automated Survey for Supernovae (ASAS-SN) hosted on the Zooniverse platform. Citizen ASAS-SN utilizes the newer, deeper, higher cadence ASAS-SN g-band data and tasks volunteers to classify periodic variable star candidates based on their phased light curves. We started from 40,640 new variable candidates from an input list of ∼7.4 million stars with δ < −60° and the volunteers identified 10,420 new discoveries which they classified as 4234 pulsating variables, 3132 rotational variables, 2923 eclipsing binaries, and 131 variables flagged as Unknown. They classified known variable stars with an accuracy of 89% for pulsating variables, 81% for eclipsing binaries, and 49% for rotational variables. We examine user performance, agreement between users, and compare the citizen science classifications with our machine learning classifier updated for the g-band light curves. In general, user activity correlates with higher classification accuracy and higher user agreement. We used the users’ “Junk” classifications to develop an effective machine learning classifier to separate real from false variables, and there is a clear path for using this “Junk” training set to significantly improve our primary machine learning classifier. We also illustrate the value of Citizen ASAS-SN for identifying unusual variables with several examples. 
  5. Community and citizen science on climate change-influenced topics offers a way for participants to actively engage in understanding the changes and documenting the impacts. As in broader climate change education, a focus on the negative impacts can often leave participants feeling a sense of powerlessness. In large-scale projects where participation is primarily limited to data collection, it is often difficult for volunteers to see how the data can inform decision making that can help create a positive future. In this paper, we propose and test a method of linking community and citizen science engagement to thinking about and planning for the future through scenario stories development using the data collected by the volunteers. We used a youth-focused wild berry monitoring program that spanned urban and rural Alaska to test this method across diverse age levels and learning settings. Using qualitative analysis of educator interviews and youth work samples, we found that using a scenario stories development mini-workshop allowed the youth to use their own data and the data from other sites to imagine the future and possible actions to sustain berry resources for their communities. This process allowed youth to exercise key cognitive skills for sustainability, including systems thinking, futures thinking, and strategic thinking. The analysis suggested that youth would benefit from further practicing the skill of envisioning oneself as an agent of change in the environment. Educators valued working with lead scientists on the project and the experience for youth to participate in the interdisciplinary program. They also identified the combination of the berry data collection, analysis, and scenario stories activities as a teaching practice that allowed the youth to situate their citizen science participation in a personal, local, and cultural context. The majority of the youth groups pursued some level of stewardship action following the activity. The most common actions included collecting additional years of berry data, communicating results to a broader community, and joining other community and citizen science projects. A few groups actually pursued solutions illustrated in the scenario stories. The pairing of community and citizen science with scenario stories development provides a promising method to connect data to action for a sustainable and resilient future. 