skip to main content


Title: Automatic Dialect Density Estimation for African American English
In this paper, we explore automatic prediction of dialect density of the African American English (AAE) dialect, where dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect. We investigate several acoustic and language modeling features, including the commonly used X-vector representation and ComParE feature set, in addition to information extracted from ASR transcripts of the audio files and prosodic information. To address issues of limited labeled data, we use a weakly supervised model to project prosodic and X-vector features into low-dimensional task-relevant representations. An XGBoost model is then used to predict the speaker's dialect density from these features and show which are most significant during inference. We evaluate the utility of these features both alone and in combination for the given task. This work, which does not rely on hand-labeled transcripts, is performed on audio segments from the CORAAL database. We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database and propose this work as a tool for explaining and mitigating bias in speech technology.  more » « less
Award ID(s):
2202585
NSF-PAR ID:
10426185
Author(s) / Creator(s):
; ; ; ; ;
Editor(s):
ISCA
Date Published:
Journal Name:
INTERSPEECH 2023
Page Range / eLocation ID:
1283-1287
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper evaluates an innovative framework for spoken dialect density prediction on children's and adults' African American English. A speaker's dialect density is defined as the frequency with which dialect-specific language characteristics occur in their speech. Rather than treating the presence or absence of a target dialect in a user's speech as a binary decision, instead, a classifier is trained to predict the level of dialect density to provide a higher degree of specificity in downstream tasks. For this, self-supervised learning representations from HuBERT, handcrafted grammar-based features extracted from ASR transcripts, prosodic features, and other feature sets are experimented with as the input to an XGBoost classifier. Then, the classifier is trained to assign dialect density labels to short recorded utterances. High dialect density level classification accuracy is achieved for child and adult speech and demonstrated robust performance across age and regional varieties of dialect. Additionally, this work is used as a basis for analyzing which acoustic and grammatical cues affect machine perception of dialect. 
    more » « less
  2. IEEE SIGNAL PROCESSING SOCIETY (Ed.)
    This paper 1 presents a novel system which utilizes acoustic, phonological, morphosyntactic, and prosodic information for binary automatic dialect detection of African American English. We train this system utilizing adult speech data and then evaluate on both children’s and adults’ speech with unmatched training and testing scenarios. The proposed system combines novel and state-of-the-art architectures, including a multi-source transformer language model pre-trained on Twitter text data and fine-tuned on ASR transcripts as well as an LSTM acoustic model trained on self-supervised learning representations, in order to learn a comprehensive view of dialect. We show robust, explainable performance across recording conditions for different features for adult speech, but fusing multiple features is important for good results on children’s speech. 
    more » « less
  3. Abstract Research has suggested that children who speak African American English (AAE) have difficulty using features produced in Mainstream American English (MAE) but not AAE, to comprehend sentences in MAE. However, past studies mainly examined dialect features, such as verbal -s , that are produced as final consonants with shorter durations when produced in conversation which impacts their phonetic saliency. Therefore, it is unclear if previous results are due to the phonetic saliency of the feature or how AAE speakers process MAE dialect features more generally. This study evaluated if there were group differences in how AAE- and MAE-speaking children used the auxiliary verbs was and were, a dialect feature with increased phonetic saliency but produced differently between the dialects, to interpret sentences in MAE. Participants aged 6, 5–10, and 0 years, who spoke MAE or AAE, completed the DELV-ST, a vocabulary measure (PVT), and a sentence comprehension task. In the sentence comprehension task, participants heard sentences in MAE that had either unambiguous or ambiguous subjects. Sentences with ambiguous subjects were used to evaluate group differences in sentence comprehension. AAE-speaking children were less likely than MAE-speaking children to use the auxiliary verbs was and were to interpret sentences in MAE. Furthermore, dialect density was predictive of Black participant’s sensitivity to the auxiliary verb. This finding is consistent with how the auxiliary verb is produced between the two dialects: was is used to mark both singular and plural subjects in AAE, while MAE uses was for singular and were for plural subjects. This study demonstrated that even when the dialect feature is more phonetically salient, differences between how verb morphology is produced in AAE and MAE impact how AAE-speaking children comprehend MAE sentences. 
    more » « less
  4. We investigate how annotators’ insensitivity to differences in dialect can lead to racial bias in automatic hate speech detection models, potentially amplifying harm against minority populations. We first uncover unexpected correlations between surface markers of African American English (AAE) and ratings of toxicity in several widely used hate speech datasets. Then, we show that models trained on these corpora acquire and propagate these biases, such that AAE tweets and tweets by self-identified African Americans are up to two times more likely to be labelled as offensive compared to others. Finally, we propose dialect and race priming as ways to reduce the racial bias in annotation, showing that when annotators are made explicitly aware of an AAE tweet’s dialect they are significantly less likely to label the tweet as offensive. 
    more » « less
  5. Recent studies find existing self-supervised speech encoders contain primarily acoustic rather than semantic information. As a result, pipelined supervised automatic speech recognition (ASR) to large language model (LLM) systems achieve state-of-the-art results on semantic spoken language tasks by utilizing rich semantic representations from the LLM. These systems come at the cost of labeled audio transcriptions, which is expensive and time-consuming to obtain. We propose a taskagnostic unsupervised way of incorporating semantic information from LLMs into selfsupervised speech encoders without labeled audio transcriptions. By introducing semantics, we improve existing speech encoder spoken language understanding (SLU) performance by over 5% on intent classification (IC), with modest gains in named entity resolution (NER) and slot filling (SF), and spoken question answering (SQA) FF1 score by over 2%. Our approach, which uses no ASR data, achieves similar performance as methods trained on over 100 hours of labeled audio transcripts, demonstrating the feasibility of unsupervised semantic augmentations to existing speech encoders. 
    more » « less