Title: People make mistakes: Obtaining accurate ground truth from continuous annotations of subjective constructs
Abstract: Accurately representing changes in mental states over time is crucial for understanding their complex dynamics. However, there is little methodological research on the validity and reliability of human-produced continuous-time annotation of these states. We present a psychometric perspective on valid and reliable construct assessment, examine the robustness of interval-scale (e.g., values between zero and one) continuous-time annotation, and identify three major threats to validity and reliability in current approaches. We then propose a novel ground truth generation pipeline that combines emerging techniques for improving validity and robustness. We demonstrate its effectiveness in a case study involving crowd-sourced annotation of perceived violence in movies, where our pipeline achieves a .95 Spearman correlation in summarized ratings compared to a .15 baseline. These results suggest that highly accurate ground truth signals can be produced from continuous annotations using additional comparative annotation (e.g., a versus b) to correct structured errors, highlighting the need for a paradigm shift in robust construct measurement over time.
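The headline metric above is rank-based: Spearman correlation compares the rank order of summarized annotation values against reference values, so it is invariant to any monotone rescaling of the annotations. A minimal pure-Python sketch of the computation (function names are ours for illustration, not from the paper; ties receive average ranks, and inputs are assumed non-constant):

```python
def average_ranks(xs):
    """Return 1-based ranks of xs, with tied values given their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5
```

Because only ranks matter, a perfectly monotone distortion of the annotations still yields a correlation of 1.0, which is why structured (nonlinear but order-preserving) annotator errors need not hurt this metric, while order-scrambling errors do.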
Award ID(s):
2204942
PAR ID:
10629189
Author(s) / Creator(s):
Publisher / Repository:
Springer
Date Published:
Journal Name:
Behavior Research Methods
Volume:
56
Issue:
8
ISSN:
1554-3528
Page Range / eLocation ID:
8784 to 8800
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Recent advancements in two-photon calcium imaging have enabled scientists to record the activity of thousands of neurons with cellular resolution. This scope of data collection is crucial to understanding the next generation of neuroscience questions, but analyzing these large recordings requires automated methods for neuron segmentation. Supervised methods for neuron segmentation achieve state-of-the-art accuracy and speed but currently require large amounts of manually generated ground truth training labels. We reduced the required number of training labels by designing a semi-supervised pipeline. Our pipeline used neural network ensembling to generate pseudolabels to train a single shallow U-Net. We tested our method on three publicly available datasets and compared our performance to three widely used segmentation methods. Our method outperformed other methods when trained on a small number of ground truth labels and could achieve state-of-the-art accuracy after training on approximately a quarter of the number of ground truth labels as supervised methods. When trained on many ground truth labels, our pipeline attained higher accuracy than that of state-of-the-art methods. Overall, our work will help researchers accurately process large neural recordings while minimizing the time and effort needed to generate manual labels.
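The pseudolabel-generation step in a pipeline like this can be sketched as pixel-wise voting over the binary masks produced by an ensemble of models. This is an illustrative sketch under our own assumptions (list-of-lists masks, a configurable agreement threshold), not the authors' implementation:

```python
def ensemble_pseudolabels(masks, vote_fraction=0.5):
    """Combine binary segmentation masks from an ensemble into one pseudolabel mask.

    masks: list of equally sized 2-D binary masks (lists of lists of 0/1).
    A pixel is labeled 1 when at least vote_fraction of the models agree;
    the result can then serve as a training target for a single small network.
    """
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [1 if sum(m[r][c] for m in masks) >= vote_fraction * n else 0
         for c in range(cols)]
        for r in range(rows)
    ]
```

Raising vote_fraction trades recall for precision in the pseudolabels: only pixels most models agree on survive to train the student network.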
  2. Embeddedness is a construct with potential explanatory power in studies of student persistence, retention, and success. The goal of this study was to address a need for a brief embeddedness measure for use in small college undergraduate environments. Three measures of embeddedness were developed: an initial 43-item scale, a reduced 35-item scale, and a final brief 12-item version of the scale. A total of 450 undergraduate students at a small private liberal arts college were included in the study. Internal consistency reliability was assessed with both McDonald's omega and Cronbach's alpha, test–retest reliability was observed over a one-year period, factor analysis was used in item analysis, structural validity was examined with confirmatory factor analysis, and criterion-related validity was assessed with convergent and discriminant validity correlations. The results provide acceptable initial estimates of reliability and validity. 
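Internal consistency indices such as Cronbach's alpha follow directly from item-level and total-score variances. A minimal pure-Python sketch of the standard formula (illustrative only, not tied to this study's data; sample variances use the n-1 denominator):

```python
def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    """Cronbach's alpha: alpha = k/(k-1) * (1 - sum(item variances) / total variance).

    items: list of k item-score lists, each with one score per respondent,
    all in the same respondent order.
    """
    k = len(items)
    n = len(items[0])
    totals = [sum(item[i] for item in items) for i in range(n)]
    item_var_sum = sum(sample_variance(item) for item in items)
    return k / (k - 1) * (1 - item_var_sum / sample_variance(totals))
```

Perfectly redundant items give alpha = 1; as items covary less, the summed item variances approach the total-score variance and alpha shrinks toward 0.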
  3. Wilson, Timothy (Ed.)
    This article presents key findings from a research project that evaluated the validity and probative value of cartridge-case comparisons under field-based conditions. Decisions provided by 228 trained firearm examiners across the US showed that forensic cartridge-case comparison is characterized by low error rates. However, inconclusive decisions constituted over one-fifth of all decisions rendered, complicating evaluation of the technique’s ability to yield unambiguously correct decisions. Specifically, restricting evaluation to only the conclusive decisions of identification and elimination yielded true-positive and true-negative rates exceeding 99%, but incorporating inconclusives caused these values to drop to 93.4% and 63.5%, respectively. The asymmetric effect on the two rates occurred because inconclusive decisions were rendered six times more frequently for different-source than same-source comparisons. Considering probative value, which is a decision’s usefulness for determining a comparison’s ground-truth state, conclusive decisions predicted their corresponding ground-truth states with near perfection. Likelihood ratios (LRs) further showed that conclusive decisions greatly increase the odds of a comparison’s ground-truth state matching the ground-truth state asserted by the decision. Inconclusive decisions also possessed probative value, predicting different-source status and having a LR indicating that they increase the odds of different-source status. The study also manipulated comparison difficulty by using two firearm models that produce dissimilar cartridge-case markings. The model chosen for being more difficult received more inconclusive decisions for same-source comparisons, resulting in a lower true-positive rate compared to the less difficult model. Relatedly, inconclusive decisions for the less difficult model exhibited more probative value, being more strongly predictive of different-source status. 
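The likelihood-ratio logic described above reduces to a ratio of conditional decision rates under the two ground-truth states. A toy sketch with hypothetical counts (the numbers below are invented for illustration, not the study's data):

```python
def likelihood_ratio(same_source_counts, diff_source_counts, decision):
    """LR = P(decision | same source) / P(decision | different source).

    Each argument maps decision labels to counts from comparisons whose
    ground-truth state is known. LR > 1 means the decision raises the odds
    of same-source status; LR < 1 means it raises the odds of
    different-source status.
    """
    p_same = same_source_counts[decision] / sum(same_source_counts.values())
    p_diff = diff_source_counts[decision] / sum(diff_source_counts.values())
    return p_same / p_diff
```

With hypothetical counts where inconclusives occur far more often among different-source comparisons, the LR for an inconclusive decision falls below 1, matching the paper's point that inconclusives carry probative value toward different-source status.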
  4. Given significant concerns about fairness and bias in the use of artificial intelligence (AI) and machine learning (ML) for psychological assessment, we provide a conceptual framework for investigating and mitigating machine-learning measurement bias (MLMB) from a psychometric perspective. MLMB is defined as differential functioning of the trained ML model between subgroups. MLMB manifests empirically when a trained ML model produces different predicted score levels for different subgroups (e.g., race, gender) despite them having the same ground-truth levels for the underlying construct of interest (e.g., personality) and/or when the model yields differential predictive accuracies across the subgroups. Because the development of ML models involves both data and algorithms, both biased data and algorithm-training bias are potential sources of MLMB. Data bias can occur in the form of nonequivalence between subgroups in the ground truth, platform-based construct, behavioral expression, and/or feature computing. Algorithm-training bias can occur when algorithms are developed with nonequivalence in the relation between extracted features and ground truth (i.e., algorithm features are differentially used, weighted, or transformed between subgroups). We explain how these potential sources of bias may manifest during ML model development and share initial ideas for mitigating them, including recognizing that new statistical and algorithmic procedures need to be developed. We also discuss how this framework clarifies MLMB but does not reduce the complexity of the issue. 
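One empirical symptom of MLMB named above, differential predictive accuracy across subgroups, can be checked by grouping prediction errors by subgroup and comparing the per-group averages. A minimal sketch with our own (hypothetical) function and argument names:

```python
from collections import defaultdict

def subgroup_mae(predictions, ground_truth, subgroups):
    """Mean absolute error per subgroup.

    Large gaps in MAE between subgroups (despite equal ground-truth levels)
    are one empirical signature of machine-learning measurement bias.
    """
    errors = defaultdict(list)
    for pred, truth, group in zip(predictions, ground_truth, subgroups):
        errors[group].append(abs(pred - truth))
    return {g: sum(e) / len(e) for g, e in errors.items()}
```

A gap in per-group MAE flags only one manifestation of MLMB; equal accuracies with systematically shifted score levels for one subgroup would require a separate check on per-group mean predicted scores.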
  5. Despite the recent popularity of neural network-based solvers for optimal transport (OT), there is no standard quantitative way to evaluate their performance. In this paper, we address this issue for quadratic-cost transport---specifically, computation of the Wasserstein-2 distance, a commonly used formulation of optimal transport in machine learning. To overcome the challenge of computing ground truth transport maps between continuous measures needed to assess these solvers, we use input-convex neural networks (ICNN) to construct pairs of measures whose ground truth OT maps can be obtained analytically. This strategy yields pairs of continuous benchmark measures in high-dimensional spaces such as spaces of images. We thoroughly evaluate existing optimal transport solvers using these benchmark measures. Even though these solvers perform well in downstream tasks, many do not faithfully recover optimal transport maps. To investigate the cause of this discrepancy, we further test the solvers in a setting of image generation. Our study reveals crucial limitations of existing solvers and shows that increased OT accuracy does not necessarily translate to better results downstream.
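The core idea, benchmarking solvers against measure pairs whose OT maps are known in closed form, is easiest to see in one dimension, where the Wasserstein-2 distance between Gaussians is analytic. This sketch uses the classical Gaussian case for illustration (the paper itself constructs its benchmark pairs with ICNNs, not Gaussians):

```python
def w2_gaussians_1d(m1, s1, m2, s2):
    """Closed-form Wasserstein-2 distance between N(m1, s1^2) and N(m2, s2^2).

    In 1-D the optimal transport map is the monotone affine rescaling
        T(x) = m2 + (s2 / s1) * (x - m1),
    and the distance is  W2 = sqrt((m1 - m2)^2 + (s1 - s2)^2).
    A solver's recovered map and cost can be checked against these values.
    """
    return ((m1 - m2) ** 2 + (s1 - s2) ** 2) ** 0.5
```

Having both the exact map T and the exact cost available is what makes such pairs usable as ground truth: a solver can be scored on map recovery and on cost accuracy separately, which is exactly the distinction the study exploits.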