Title: Automatic Dialect Density Estimation for African American English
In this paper, we explore automatic prediction of dialect density of the African American English (AAE) dialect, where dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect. We investigate several acoustic and language modeling features, including the commonly used X-vector representation and ComParE feature set, in addition to information extracted from ASR transcripts of the audio files and prosodic information. To address issues of limited labeled data, we use a weakly supervised model to project prosodic and X-vector features into low-dimensional task-relevant representations. An XGBoost model is then used to predict the speaker's dialect density from these features and show which are most significant during inference. We evaluate the utility of these features both alone and in combination for the given task. This work, which does not rely on hand-labeled transcripts, is performed on audio segments from the CORAAL database. We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database and propose this work as a tool for explaining and mitigating bias in speech technology.
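As a rough illustration of the pipeline described above (not the authors' implementation), the sketch below computes the word-level density measure and fits an XGBoost regressor on hypothetical pre-projected features; the dimensions, the random data, and the is_dialect_feature predicate are assumptions for demonstration only.

```python
# Illustrative sketch only: dialect density as defined above, plus an
# XGBoost regressor over (hypothetical) low-dimensional features.
import numpy as np
import xgboost as xgb
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split

def dialect_density(tokens, is_dialect_feature):
    """Fraction of words in an utterance carrying a dialect characteristic."""
    return sum(is_dialect_feature(t) for t in tokens) / max(len(tokens), 1)

# Toy predicate (an assumption, not a real AAE feature detector).
print(dialect_density(["he", "done", "gone", "home"], lambda t: t == "done"))

# Stand-ins for utterance-level features already projected into a
# low-dimensional task-relevant space (e.g., from X-vectors and prosody).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))        # 500 utterances, 32-dim features
y = rng.uniform(0.0, 1.0, size=500)   # ground-truth density labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)

r, _ = pearsonr(model.predict(X_te), y_te)  # correlation with ground truth
print(f"Pearson r = {r:.3f}")
# Feature importances indicate which inputs matter most at inference.
print(model.feature_importances_[:5])
```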
Johnson, Alexander; Shetty, Vishwas; Ostendorf, Mari; Alwan, Abeer. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE Signal Processing Society (Ed.)
This paper presents a novel system that utilizes acoustic, phonological, morphosyntactic, and prosodic information for binary automatic dialect detection of African American English. We train this system on adult speech data and then evaluate it on both children's and adults' speech under unmatched training and testing scenarios. The proposed system combines novel and state-of-the-art architectures, including a multi-source transformer language model pre-trained on Twitter text data and fine-tuned on ASR transcripts, as well as an LSTM acoustic model trained on self-supervised learning representations, in order to learn a comprehensive view of dialect. We show robust, explainable performance across recording conditions for different features on adult speech, but fusing multiple features is important for good results on children's speech.
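The abstract does not detail how the language-model and acoustic-model outputs are combined; the following sketch assumes a simple score-level fusion of two per-utterance posteriors, purely to make the fusion idea concrete, and is not the paper's exact architecture.

```python
# Hedged sketch: score-level fusion of a text-based dialect classifier
# (e.g., a transformer LM over ASR transcripts) and an acoustic model
# (e.g., an LSTM over self-supervised representations). The fusion
# weight and both input scores are illustrative assumptions.

def fuse_scores(p_text: float, p_acoustic: float, alpha: float = 0.5) -> float:
    """Weighted average of per-utterance dialect posteriors in [0, 1]."""
    return alpha * p_text + (1.0 - alpha) * p_acoustic

def detect_dialect(p_text: float, p_acoustic: float, threshold: float = 0.5) -> bool:
    """Binary dialect/non-dialect decision from the fused posterior."""
    return fuse_scores(p_text, p_acoustic) >= threshold

# Example: the two modalities disagree; fusion arbitrates.
print(detect_dialect(p_text=0.8, p_acoustic=0.3))  # True (fused = 0.55)
```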
Research has suggested that children who speak African American English (AAE) have difficulty using features that are produced in Mainstream American English (MAE) but not in AAE to comprehend sentences in MAE. However, past studies mainly examined dialect features, such as verbal -s, that are produced as final consonants with shorter durations in conversation, which impacts their phonetic saliency. Therefore, it is unclear whether previous results are due to the phonetic saliency of the feature or to how AAE speakers process MAE dialect features more generally. This study evaluated whether there were group differences in how AAE- and MAE-speaking children used the auxiliary verbs was and were, a dialect feature with increased phonetic saliency that is produced differently between the dialects, to interpret sentences in MAE. Participants aged 6;5–10;0 (years;months), who spoke MAE or AAE, completed the DELV-ST, a vocabulary measure (PVT), and a sentence comprehension task. In the sentence comprehension task, participants heard sentences in MAE that had either unambiguous or ambiguous subjects. Sentences with ambiguous subjects were used to evaluate group differences in sentence comprehension. AAE-speaking children were less likely than MAE-speaking children to use the auxiliary verbs was and were to interpret sentences in MAE. Furthermore, dialect density was predictive of Black participants' sensitivity to the auxiliary verb. This finding is consistent with how the auxiliary verb is produced in the two dialects: was is used to mark both singular and plural subjects in AAE, while MAE uses was for singular and were for plural subjects. This study demonstrated that even when the dialect feature is more phonetically salient, differences in how verb morphology is produced in AAE and MAE impact how AAE-speaking children comprehend MAE sentences.
Chandra-Shekar, M.M.; Hansen, J.H.L. NASA Human Research Program Investigators Conference.
INTRODUCTION: Apollo-11 (A-11) was the first manned space mission to successfully bring astronauts to the moon and return them safely. Effective team-based communication is required for mission specialists to work collaboratively to learn, engage, and solve complex problems. As part of NASA's goal of assessing team and mission success, all vital speech communications between these personnel were recorded using the multi-track SoundScriber system onto analog tapes, preserving their contribution to the success of one of the greatest achievements in human history. More than 400 personnel served as mission specialists/support who communicated across 30 audio loops, resulting in more than 9,000 hours of data for A-11. To ensure the success of this mission, it was necessary for teams to communicate, learn, and address problems in a timely manner. Previous research has found that the compatibility of individual personalities within teams is important for effective team collaboration. Hence, it is essential to identify each speaker's role during an Apollo mission and analyze group communications for knowledge exchange and problem solving toward a common goal. Assessing and analyzing speaker roles during the mission allows for exploring engagement analysis in multi-party speaker situations.
METHOD: The UTDallas Fearless Steps Apollo data comprises 19,000 hours (A-11, A-13, A-1) and poses unique and multiple challenges: the audio is characterized by severe noise and degradation, as well as overlapping speech across the 30 channels. For our study, we selected a subset of 100 hours manually transcribed by professional annotators for speaker labels. The 100 hours are drawn from three mission-critical events: 1. Lift-Off (25 hours), 2. Lunar-Landing (50 hours), 3. Lunar-Walking (25 hours). Five channels of interest were selected out of the 30 channels as those with the most speech activity; the primary speakers operating these five channels are the commanders/owners of these channels. For our analysis, we select five speaker roles: Flight Director (FD), Capsule Communicator (CAPCOM), Guidance, Navigation, and Control (GNC), Electrical, Environmental, and Consumables Manager (EECOM), and Network (NTWK). To track and tag individual speakers across our Fearless Steps audio dataset, we use the concept of "Where's Waldo" to identify all instances of our speakers-of-interest among a cluster of other speakers. To understand the roles of our speakers-of-interest, we use the speaking duration of the primary versus secondary speakers and the number of speaker turns as metrics to determine each speaker's role and responsibility during the three critical phases of the mission (a minimal sketch of these metrics follows below). This enables a content-linking capability as well as a pathway to analyzing group engagement, the group dynamics of people working together in an enclosed space, psychological effects, and cognition in such individuals.
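A minimal sketch of the two metrics named above, assuming a hypothetical (speaker, start_sec, end_sec) segment format; the segment values are invented for illustration.

```python
# Illustrative sketch: speaking duration and turn counts from a
# diarization-style segment list. The segment format and values
# are assumptions for demonstration, not Fearless Steps data.
from collections import defaultdict
from itertools import groupby

segments = [
    ("FD", 0.0, 4.2), ("CAPCOM", 4.5, 7.0), ("FD", 7.3, 12.1),
    ("EECOM", 12.4, 13.0), ("FD", 13.2, 20.0),
]

# Total speaking time per speaker; the channel's primary speaker is
# the one who dominates its speech activity.
durations = defaultdict(float)
for spk, start, end in segments:
    durations[spk] += end - start
primary = max(durations, key=durations.get)

# Speaker turns: collapse consecutive segments by the same speaker,
# then count how often each speaker takes the floor.
turn_order = [spk for spk, _ in groupby(s for s, _, _ in segments)]
turn_counts = {spk: turn_order.count(spk) for spk in durations}

print("primary speaker:", primary)        # FD
print("durations (s):", dict(durations))  # FD dominates the channel
print("turn counts:", turn_counts)        # FD: 3, CAPCOM: 1, EECOM: 1
```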
IMPACT: NASA's Apollo Program stands as one of the most significant contributions to humankind. This collection opens new research options for recognizing team communication, group dynamics, and human engagement/psychology for future deep space missions. Analyzing team communications toward such goals would allow for the formulation of educational and training technologies for the assessment of STEM knowledge, task learning, and educational feedback. Identifying these personnel can also help pay tribute and yield personal recognition to the hundreds of notable engineers and scientists who made this feat possible.
ILLUSTRATION: In this work, we propose to illustrate how a pre-trained speech/language network can be used to obtain powerful speaker embeddings for speaker diarization. This framework builds on these learned embeddings to label unique speakers over sustained audio streams. To train and test our system, we make use of the Fearless Steps Apollo corpus, allowing us to effectively leverage a limited labeled resource (100 hours of labeled data out of more than 9,000 hours). Furthermore, we use the concept of "Finding Waldo" to identify key speakers of interest (SOI) throughout the Apollo-11 mission audio across multiple channel audio streams.
Xu, Derek; Dong, Shuyan; Wang, Changhan; Kim, Suyoun; Lin, Zhaojiang; Liu, Bing; Shrivastava, Akshat; Li, Shang-Wen; Tseng, Liang-Hsuan; Lin, Guan-Ting; et al. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.
Recent studies find existing self-supervised speech encoders contain primarily acoustic rather than semantic information. As a result, pipelined supervised automatic speech recognition (ASR) to large language model (LLM) systems achieve state-of-the-art results on semantic spoken language tasks by utilizing rich semantic representations from the LLM. These systems come at the cost of labeled audio transcriptions, which are expensive and time-consuming to obtain. We propose a task-agnostic unsupervised way of incorporating semantic information from LLMs into self-supervised speech encoders without labeled audio transcriptions. By introducing semantics, we improve existing speech encoder spoken language understanding (SLU) performance by over 5% on intent classification (IC), with modest gains in named entity resolution (NER) and slot filling (SF), and spoken question answering (SQA) FF1 score by over 2%. Our approach, which uses no ASR data, achieves similar performance as methods trained on over 100 hours of labeled audio transcripts, demonstrating the feasibility of unsupervised semantic augmentations to existing speech encoders.
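The abstract leaves the mechanism unspecified; the sketch below shows one plausible, assumed setup (distilling LLM sentence embeddings into a speech encoder's pooled outputs with a cosine loss) and should not be read as the paper's actual method.

```python
# Hedged sketch of one way to inject LLM semantics into a speech
# encoder without transcripts: align the encoder's pooled output with
# LLM sentence embeddings via a cosine distillation loss. All shapes
# and the loss choice are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticProjector(nn.Module):
    """Maps speech-encoder states into the LLM embedding space."""
    def __init__(self, speech_dim: int = 768, llm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, speech_states):       # (batch, time, speech_dim)
        pooled = speech_states.mean(dim=1)  # utterance-level pooling
        return self.proj(pooled)            # (batch, llm_dim)

def distill_loss(speech_emb, llm_emb):
    """Pull projected speech embeddings toward LLM sentence embeddings."""
    return 1.0 - F.cosine_similarity(speech_emb, llm_emb, dim=-1).mean()

# Dummy tensors standing in for real encoder / LLM outputs.
speech_states = torch.randn(8, 200, 768)  # e.g., HuBERT-like frame features
llm_emb = torch.randn(8, 1024)            # e.g., LLM sentence embeddings
proj = SemanticProjector()
loss = distill_loss(proj(speech_states), llm_emb)
loss.backward()                            # trains only the projector here
print(loss.item())
```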
Tao, Fei; Busso, Carlos. IEEE International Conference on Multimedia and Expo (ICME).
Speech activity detection (SAD) is a key pre-processing step for a speech-based system. The performance of conventional audio-only SAD (A-SAD) systems is impaired by acoustic noise when they are used in practical applications. An alternative approach to address this problem is to include visual information, creating audiovisual speech activity detection (AV-SAD) solutions. In our previous work, we proposed to build an AV-SAD system using a bimodal recurrent neural network (BRNN). This framework was able to capture the task-related characteristics in the audio and visual inputs, and model the temporal information within and across modalities. The approach relied on long short-term memory (LSTM) units. Although LSTMs can model longer temporal dependencies with their cells, the effective memory of the units is limited to a few frames, since the recurrent connection only considers the previous frame. For SAD systems, it is important to model longer temporal dependencies to capture the semi-periodic nature of speech conveyed in acoustic and orofacial features. This study proposes to implement a BRNN-based AV-SAD system with advanced LSTMs (A-LSTMs), which overcome this limitation by including multiple connections to frames in the past. The results show that the proposed framework can significantly outperform the BRNN system trained with the original LSTM layers.
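A minimal sketch of the core idea, assuming a learned softmax weighting over the last K hidden states in place of the single previous frame; the exact A-LSTM formulation in the paper may differ.

```python
# Hedged sketch: a recurrence that sees several past frames, not just
# the previous one. A learned softmax mixture over the last K hidden
# states feeds a standard LSTM cell; this is an illustrative variant,
# not the paper's exact A-LSTM equations.
import torch
import torch.nn as nn

class MultiFrameLSTM(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, k_past: int = 5):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim, hidden_dim)
        self.k_past = k_past
        self.mix = nn.Parameter(torch.zeros(k_past))  # weights over past frames

    def forward(self, x):                              # x: (batch, time, input_dim)
        b, t, _ = x.shape
        h = x.new_zeros(b, self.cell.hidden_size)
        c = x.new_zeros(b, self.cell.hidden_size)
        history, outputs = [h], []
        for step in range(t):
            past = torch.stack(history[-self.k_past:], dim=0)  # (<=K, b, H)
            w = torch.softmax(self.mix[: past.size(0)], dim=0)
            h_mix = (w[:, None, None] * past).sum(dim=0)       # weighted past
            h, c = self.cell(x[:, step], (h_mix, c))
            history.append(h)
            outputs.append(h)
        return torch.stack(outputs, dim=1)             # (batch, time, hidden_dim)

frames = torch.randn(4, 50, 40)              # e.g., audiovisual feature frames
print(MultiFrameLSTM(40, 64)(frames).shape)  # torch.Size([4, 50, 64])
```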
@article{osti_10426185,
title = {Automatic Dialect Density Estimation for African American English},
url = {https://par.nsf.gov/biblio/10426185},
DOI = {10.21437/Interspeech.2022-796},
abstractNote = {In this paper, we explore automatic prediction of dialect density of the African American English (AAE) dialect, where dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect. We investigate several acoustic and language modeling features, including the commonly used X-vector representation and ComParE feature set, in addition to information extracted from ASR transcripts of the audio files and prosodic information. To address issues of limited labeled data, we use a weakly supervised model to project prosodic and X-vector features into low-dimensional task-relevant representations. An XGBoost model is then used to predict the speaker's dialect density from these features and show which are most significant during inference. We evaluate the utility of these features both alone and in combination for the given task. This work, which does not rely on hand-labeled transcripts, is performed on audio segments from the CORAAL database. We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database and propose this work as a tool for explaining and mitigating bias in speech technology.},
journal = {INTERSPEECH 2022},
author = {Johnson, Alexander and Everson, Kevin and Ravi, Vijay and Gladney, Anissa and Ostendorf, Mari and Alwan, Abeer},
editor = {ISCA}
}