Title: How Many Raters Can Be Enough: G Theory Applied to Assessment and Measurement of L2 Speech Perception
This paper extends the use of Generalizability Theory to the measurement of extemporaneous L2 speech through the lens of speech perception. Using six datasets from previous studies, it reports on G studies (which decompose measurement variance into its sources) and D studies (which predict how reliability changes as the number of raters, items, or other facets is modified), helping the field adopt measurement designs for comprehensibility, accentedness, and intelligibility. When data from a single audio sample per learner were subjected to D studies, both semantic differential and rubric scales for comprehensibility reached reliability at the .90 level with about 15 trained raters or 50 untrained crowdsourced raters. Empirically informed recommendations for generalizable and dependable evaluations are given, including considerations for the number of speech samples rated and the granularity of the scales for various assessment and research purposes.
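
To make the G study / D study mechanics concrete, here is a minimal sketch of a one-facet (persons x raters) design in Python. It estimates variance components from a complete ratings matrix via the standard ANOVA expected mean squares and then projects the dependability coefficient (Phi) for different rater counts, in the spirit of the D studies reported above. The simulated data, function names, and the choice of Phi over the relative G coefficient are illustrative assumptions, not the paper's actual datasets or analysis code.

```python
# Hedged sketch: one-facet (persons x raters) G study and D study
# projection, assuming a complete ratings matrix with no missing cells.
import numpy as np

def g_and_d_study(ratings: np.ndarray, rater_counts=(1, 5, 15, 50)):
    """ratings: shape (n_persons, n_raters) of comprehensibility scores."""
    n_p, n_r = ratings.shape
    grand = ratings.mean()
    person_means = ratings.mean(axis=1)
    rater_means = ratings.mean(axis=0)

    # Mean squares from the two-way ANOVA decomposition.
    ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
    resid = ratings - person_means[:, None] - rater_means[None, :] + grand
    ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

    # G-study variance components (negative estimates clipped to zero).
    var_pr = ms_pr
    var_p = max((ms_p - ms_pr) / n_r, 0.0)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)

    # D study: project dependability (Phi, for absolute decisions)
    # as the number of raters varies.
    for n in rater_counts:
        phi = var_p / (var_p + (var_r + var_pr) / n)
        print(f"{n:3d} raters -> Phi = {phi:.3f}")

# Example with simulated data standing in for one audio sample per learner.
rng = np.random.default_rng(0)
true_scores = rng.normal(5, 1.2, size=(40, 1))   # learner (person) effect
rater_bias = rng.normal(0, 0.5, size=(1, 20))    # rater severity effect
noise = rng.normal(0, 1.0, size=(40, 20))        # residual
g_and_d_study(true_scores + rater_bias + noise)
```
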
Award ID(s):
2140469
PAR ID:
10531380
Author(s) / Creator(s):
;
Publisher / Repository:
European Knowledge Development Institute
Date Published:
Journal Name:
Language Teaching Research Quarterly
Volume:
37
ISSN:
2667-6753
Page Range / eLocation ID:
213 to 230
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. As US society continues to diversify and calls for better measurements of racialized appearance increase, survey researchers need guidance about effective strategies for assessing skin color in field research. This study examined the consistency, comparability, and meaningfulness of the two most widely used skin tone rating scales (Massey–Martin and PERLA) and two portable, inexpensive handheld devices for skin color measurement (the Nix colorimeter and the Labby spectrophotometer). We collected data in person with all four instruments from forty-six college students selected to reflect a wide range of skin tones across four racial-ethnic groups (Asian, Black, Latinx, White). These students, five study staff, and 459 adults from an online sample also rated forty stock photos, again selected for skin tone diversity. Our results, based on data collected under controlled conditions, demonstrate high consistency across raters and readings. The Massey–Martin and PERLA scale scores were highly linearly related to each other, although PERLA better differentiated among people with the lightest skin tones. The Nix and Labby darkness-to-lightness (L*) readings were likewise linearly related to each other and to the Massey–Martin and PERLA scores, and they showed the expected variation within and between race-ethnicities. Darker Massey–Martin and PERLA ratings also correlated with online raters' expectations that a photographed person experienced greater discrimination. In contrast, the redness (a*) and yellowness (b*) undertones were highest in the mid-range of the rating scale scores and overlapped more across race-ethnicities. Overall, each instrument showed sufficient consistency, comparability, and meaningfulness for use in field surveys when implemented soundly (e.g., not requiring memorization). However, PERLA might be preferred over Massey–Martin in studies representing individuals with the lightest skin tones, and handheld devices may be preferred over rating scales to reduce measurement error when a study can gather only a single rating.
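
As a rough illustration of the comparability checks this study describes (linear relations between the two scales and the two devices' L* readings), here is a hedged sketch; the simulated values, column names, and effect sizes are assumptions, not the study's data.

```python
# Hedged sketch: correlating two skin tone rating scales with two
# devices' L* (lightness) readings. All values below are simulated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 46
latent = rng.uniform(0, 1, n)  # underlying skin tone, 0 = lightest

data = pd.DataFrame({
    "massey_martin": np.clip(np.round(1 + 9 * latent + rng.normal(0, .5, n)), 1, 10),
    "perla":         np.clip(np.round(1 + 10 * latent + rng.normal(0, .5, n)), 1, 11),
    "nix_L":         90 - 60 * latent + rng.normal(0, 2, n),
    "labby_L":       90 - 60 * latent + rng.normal(0, 2, n),
})

# Darker scale ratings should track lower (darker) L* readings, so the
# scale-vs-device correlations are expected to be strongly negative.
print(data.corr().round(2))
```
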
  2. The advancement of Speech Emotion Recognition (SER) depends significantly on the quality of the emotional speech corpora used for model training. SER researchers have developed various corpora, adjusting design parameters to enhance the reliability of the training source. This study focuses on the communication mode of collection, specifically analyzing spontaneous emotional speech gathered during conversation versus monologue. While conversations are acknowledged as effective for eliciting authentic emotional expressions, systematic analyses are needed to confirm their reliability as the better source of emotional speech data. We investigate this question through the perceptual differences and acoustic variability present in both types of emotional speech. Our analyses of multilingual corpora show, first, that raters exhibit higher consistency for conversation recordings when evaluating categorical emotions and, second, that the perceptions and acoustic patterns observed in conversational samples align more closely with trends expected from the emotion literature. We further examine the impact of these differences on SER modeling and show that a more robust and stable SER model can be trained using conversation data. This work provides comprehensive evidence that conversation may offer a better source than monologue for developing an SER model.
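
A minimal sketch of the kind of rater-consistency comparison described above, using Fleiss' kappa on simulated categorical emotion labels; the agreement levels, category set, and rater counts are illustrative assumptions, not the corpora or analysis actually used.

```python
# Hedged sketch: comparing rater consistency on categorical emotion
# labels for conversation vs. monologue clips via Fleiss' kappa.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts: (n_items, n_categories) matrix; each row sums to the
    number of raters per item (assumed constant)."""
    n = counts.sum(axis=1)[0]
    p_j = counts.sum(axis=0) / counts.sum()             # category prevalence
    p_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))
    p_bar, p_e = p_i.mean(), np.sum(p_j ** 2)
    return (p_bar - p_e) / (1 - p_e)

rng = np.random.default_rng(2)

def simulate(agreement: float, n_items=200, n_raters=6, n_cats=4):
    """Each item has a true category; each rater matches it with
    probability `agreement`, otherwise picks uniformly at random."""
    true = rng.integers(n_cats, size=n_items)
    counts = np.zeros((n_items, n_cats), dtype=int)
    for i, t in enumerate(true):
        hit = rng.random(n_raters) < agreement
        labels = np.where(hit, t, rng.integers(n_cats, size=n_raters))
        counts[i] = np.bincount(labels, minlength=n_cats)
    return counts

# Higher simulated agreement stands in for the conversation condition.
print("conversation kappa:", round(fleiss_kappa(simulate(0.8)), 3))
print("monologue    kappa:", round(fleiss_kappa(simulate(0.6)), 3))
```
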
  3. Pairwise comparison models are an important type of latent attribute measurement model with broad applications in the social and behavioural sciences. Current pairwise comparison models are typically unidimensional. The existing multidimensional pairwise comparison models tend to be difficult to interpret, and they cannot identify groups of raters that share the same rater-specific parameters. To fill this gap, we propose a new multidimensional pairwise comparison model with enhanced interpretability, which explicitly models how object attributes on different dimensions are differentially perceived by raters. Moreover, we place a Dirichlet process prior on the rater-specific parameters, which allows us to flexibly cluster raters into groups with similar perceptual orientations. Simulation studies show that the new model recovers the true latent variable values from observed binary choice data. We use the new model to analyse original survey data on the perceived truthfulness of statements about COVID-19, collected in the summer of 2020. Leveraging the strengths of the new model, we find that the partisanship of the speaker and the partisanship of the respondent account for the majority of the variation in perceived truthfulness, with statements made by co-partisans viewed as more truthful.
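
The core likelihood this abstract describes can be sketched as a multidimensional Bradley-Terry-style model in which each rater weights the object dimensions differently. The Dirichlet process clustering prior is omitted here, and all parameter values and names are illustrative assumptions rather than the paper's notation or estimation procedure.

```python
# Hedged sketch: multidimensional pairwise comparisons with
# rater-specific dimension weights (DP clustering prior omitted).
import numpy as np

rng = np.random.default_rng(3)
n_objects, n_raters, n_dims = 30, 50, 2

theta = rng.normal(size=(n_objects, n_dims))        # latent object attributes
w = rng.gamma(2.0, 1.0, size=(n_raters, n_dims))    # rater dimension weights

def p_prefers(r: int, i: int, j: int) -> float:
    """Probability that rater r chooses object i over object j."""
    utility_gap = w[r] @ (theta[i] - theta[j])
    return 1.0 / (1.0 + np.exp(-utility_gap))

# Simulate binary choice data for random (rater, pair) triples.
choices = []
for _ in range(2000):
    r = rng.integers(n_raters)
    i, j = rng.choice(n_objects, 2, replace=False)
    choices.append((r, i, j, rng.random() < p_prefers(r, i, j)))

# Log-likelihood under the generating parameters; a real analysis would
# maximize or sample this posterior with a DP prior over the w vectors.
ll = sum(np.log(p if y else 1 - p)
         for r, i, j, y in choices
         for p in [p_prefers(r, i, j)])
print("log-likelihood:", round(ll, 1))
```
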
  4. Speech Emotion Recognition (SER) faces a challenge distinct from other speech-related tasks because its annotations reflect the subjective emotional perceptions of different annotators. Previous SER studies often treat this subjectivity as noise, using the majority rule or plurality rule to obtain consensus labels. These standard approaches, however, discard the valuable information in labels that disagree with the consensus and make the test set artificially easier. Under realistic conditions, emotion perception can involve co-occurring emotions, so disagreement between raters need not be regarded as noise. To recast SER as a multi-label task, we introduced an "all-inclusive rule" (AR), which treats all available data, ratings, and distributional labels as multi-label targets and retains a complete test set. We demonstrated that models trained with multi-label targets generated by the proposed AR outperform conventional single-label methods across both incomplete and complete test sets.
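
A minimal sketch of the contrast between majority-vote labels and distributional multi-label targets in the spirit of the "all-inclusive rule"; the emotion set and the exact aggregation (plain vote shares, no thresholding) are illustrative assumptions that may differ from the paper's procedure.

```python
# Hedged sketch: majority-vote consensus labels vs. distributional
# multi-label targets built from all available ratings.
import numpy as np

EMOTIONS = ["angry", "happy", "sad", "neutral"]

def majority_label(annotations):
    """Conventional consensus label; minority votes are discarded."""
    labels, counts = np.unique(annotations, return_counts=True)
    return labels[counts.argmax()]

def all_inclusive_target(annotations):
    """Soft multi-label target: per-class vote share across all raters."""
    target = np.zeros(len(EMOTIONS))
    for a in annotations:
        target[EMOTIONS.index(a)] += 1
    return target / target.sum()

clip = ["happy", "happy", "neutral", "sad", "happy"]
print("majority:     ", majority_label(clip))        # 'happy' only
print("all-inclusive:", all_inclusive_target(clip))  # keeps disagreement
```
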
  5. An important task in human-computer interaction is to rank speech samples according to their expressive content. A preference learning framework is appropriate for obtaining an emotional rank over a set of speech samples, but obtaining reliable labels for training such a framework is challenging. Most existing databases provide sentence-level absolute attribute scores annotated by multiple raters, which must be transformed to obtain preference labels. Previous studies have shown that evaluators anchor their absolute assessments on previously annotated samples. Hence, this study proposes a novel formulation for obtaining preference learning labels by considering only the annotation trends a rater assigns to consecutive samples within an evaluation session. Experiments show that models trained with the proposed anchor-based ordinal labels perform significantly better than models trained with existing alternative labels.
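
A hedged sketch of deriving preference pairs from the rating trend between consecutive samples within one rater's session, as this abstract describes; the margin threshold and the handling of small score changes are illustrative assumptions.

```python
# Hedged sketch: anchor-based ordinal labels from consecutive
# absolute ratings within a single rater's annotation session.
def anchor_based_pairs(session, margin=0.5):
    """session: list of (sample_id, absolute_score) in annotation order.
    Returns (preferred, dispreferred) pairs from consecutive samples
    whose scores moved by more than `margin`."""
    pairs = []
    for (id_a, s_a), (id_b, s_b) in zip(session, session[1:]):
        if s_b - s_a > margin:        # rating went up: b preferred over a
            pairs.append((id_b, id_a))
        elif s_a - s_b > margin:      # rating went down: a preferred over b
            pairs.append((id_a, id_b))
        # small moves yield no pair: the trend is too weak to trust
    return pairs

session = [("s1", 3.0), ("s2", 4.2), ("s3", 4.1), ("s4", 2.5)]
print(anchor_based_pairs(session))    # [('s2', 's1'), ('s3', 's4')]
```
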