This content will become publicly available on April 1, 2025

Title: An exploratory study on dialect density estimation for children and adult's African American English
This paper evaluates an innovative framework for spoken dialect density prediction on children's and adults' African American English. A speaker's dialect density is defined as the frequency with which dialect-specific language characteristics occur in their speech. Rather than treating the presence or absence of a target dialect in a user's speech as a binary decision, a classifier is trained to predict the level of dialect density, providing a higher degree of specificity for downstream tasks. To this end, self-supervised learning representations from HuBERT, handcrafted grammar-based features extracted from ASR transcripts, prosodic features, and other feature sets are experimented with as input to an XGBoost classifier, which is trained to assign dialect density labels to short recorded utterances. High dialect density level classification accuracy is achieved for both child and adult speech, with robust performance across age and regional varieties of the dialect. Additionally, this work serves as a basis for analyzing which acoustic and grammatical cues affect machine perception of dialect.
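A minimal sketch of the classification setup described above, on synthetic stand-in data: each utterance is represented by a fixed-size feature vector (in the paper, pooled HuBERT representations plus grammar and prosody features), and a gradient boosting classifier assigns one of several dialect density levels. The paper uses XGBoost; scikit-learn's GradientBoostingClassifier serves as a stand-in here, and all sizes, features, and labels are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_utts, n_dims = 400, 32                        # toy sizes, not from the paper

# Stand-in pooled utterance features (e.g. mean-pooled HuBERT frames
# concatenated with grammar/prosody features).
X = rng.normal(size=(n_utts, n_dims))
# Make the density level weakly recoverable from a couple of dimensions
# so training is meaningful; four ordinal levels, 0 (low) to 3 (high).
y = np.digitize(X[:, 0] + 0.3 * X[:, 1], bins=[-1.0, 0.0, 1.0])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out density-level accuracy: {acc:.2f}")
```

Treating density as an ordinal level rather than a binary dialect/no-dialect label is what gives downstream consumers the extra specificity the abstract mentions.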
Award ID(s):
2202585 2202049
NSF-PAR ID:
10506584
Author(s) / Creator(s):
Publisher / Repository:
Acoustical Society of America
Date Published:
Journal Name:
The Journal of the Acoustical Society of America
Volume:
155
Issue:
4
ISSN:
0001-4966
Page Range / eLocation ID:
2836 to 2848
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. ISCA (Ed.)
    In this paper, we explore automatic prediction of dialect density for the African American English (AAE) dialect, where dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect. We investigate several acoustic and language modeling features, including the commonly used X-vector representation and ComParE feature set, in addition to information extracted from ASR transcripts of the audio files and prosodic information. To address issues of limited labeled data, we use a weakly supervised model to project prosodic and X-vector features into low-dimensional task-relevant representations. An XGBoost model is then used to predict the speaker's dialect density from these features and to show which are most significant during inference. We evaluate the utility of these features both alone and in combination for the given task. This work, which does not rely on hand-labeled transcripts, is performed on audio segments from the CORAAL database. We show a significant correlation between our predicted and ground-truth dialect density measures for AAE speech in this database and propose this work as a tool for explaining and mitigating bias in speech technology.
  2. IEEE SIGNAL PROCESSING SOCIETY (Ed.)
    This paper presents a novel system which utilizes acoustic, phonological, morphosyntactic, and prosodic information for binary automatic dialect detection of African American English. We train this system on adult speech data and then evaluate on both children's and adults' speech under unmatched training and testing scenarios. The proposed system combines novel and state-of-the-art architectures, including a multi-source transformer language model pre-trained on Twitter text data and fine-tuned on ASR transcripts, as well as an LSTM acoustic model trained on self-supervised learning representations, in order to learn a comprehensive view of dialect. We show robust, explainable performance across recording conditions for different features for adult speech, but fusing multiple features is important for good results on children's speech.
    Multimodal depression classification has gained immense popularity in recent years. We develop a multimodal depression classification system using articulatory coordination features extracted from vocal tract variables and text transcriptions obtained from an automatic speech recognition tool, yielding improvements in area under the receiver operating characteristic curve over unimodal classifiers (7.5% and 13.7% for audio and text, respectively). We show that in the case of limited training data, a segment-level classifier can first be trained and then used to obtain a session-wise prediction without hindering performance, using a multi-stage convolutional recurrent neural network. A text model is trained using a Hierarchical Attention Network (HAN). The multimodal system is developed by combining embeddings from the session-level audio model and the HAN text model.
  4. This exploratory study examined the simultaneous interactions and relative contributions of bottom-up social information (regional dialect, speaking style), top-down contextual information (semantic predictability), and the internal dynamics of the lexicon (neighborhood density, lexical frequency) to lexical access and word recognition. Cross-modal matching and intelligibility in noise tasks were conducted with a community sample of adults at a local science museum. Each task featured one condition in which keywords were presented in isolation and one condition in which they were presented within a multiword phrase. Lexical processing was slower and more accurate when keywords were presented in their phrasal context, and was both faster and more accurate for auditory stimuli produced in the local Midland dialect. In both tasks, interactions were observed among stimulus dialect, speaking style, semantic predictability, phonological neighborhood density, and lexical frequency. These interactions revealed that bottom-up social information and top-down contextual information contribute more to speech processing than the internal dynamics of the lexicon. Moreover, the relatively stronger bottom-up social effects were observed in both the isolated word and multiword phrase conditions, suggesting that social variation is central to speech processing, even in non-interactive laboratory tasks. At the same time, the specific interactions observed differed between the two experiments, reflecting task-specific demands related to processing time constraints and signal degradation.
  5. We investigate how annotators’ insensitivity to differences in dialect can lead to racial bias in automatic hate speech detection models, potentially amplifying harm against minority populations. We first uncover unexpected correlations between surface markers of African American English (AAE) and ratings of toxicity in several widely used hate speech datasets. Then, we show that models trained on these corpora acquire and propagate these biases, such that AAE tweets and tweets by self-identified African Americans are up to two times more likely to be labelled as offensive compared to others. Finally, we propose dialect and race priming as ways to reduce the racial bias in annotation, showing that when annotators are made explicitly aware of an AAE tweet’s dialect they are significantly less likely to label the tweet as offensive. 