skip to main content

Title: CORAAL QA: A Dataset and Framework for Open Domain Spontaneous Speech Question Answering from Long Audio Files
This paper presents a novel dataset (CORAAL QA) and framework for audio question-answering from long audio recordings contain- ing spontaneous speech. The dataset introduced here provides sets of questions that can be factually answered from short spans of a long audio files (typically 30min to 1hr) from the Corpus of Re- gional African American Language. Using this dataset, we divide the audio recordings into 60 second segments, automatically tran- scribe each segment, and use PLDA scoring of BERT-based seman- tic embeddings to rank the relevance of ASR transcript segments in answering the target question. In order to improve this framework through data augmentation, we use large language models including ChatGPT and Llama 2 to automatically generate further training ex- amples and show how prompt engineering can be optimized for this process. By creatively leveraging knowledge from large-language models, we achieve state-of-the-art question-answering performance in this information retrieval task.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Date Published:
Journal Name:
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing
Page Range / eLocation ID:
13371 to 13375
Medium: X
Seoul, Korea, Republic of
Sponsoring Org:
National Science Foundation
More Like this
  1. Analyzing the quality of classroom talk is central to educational research and improvement efforts. In particular, the presence of authentic teacher questions, where answers are not predetermined by the teacher, helps constitute and serves as a marker of productive classroom discourse. Further, authentic questions can be cultivated to improve teaching effectiveness and consequently student achievement. Unfortunately, current methods to measure question authenticity do not scale because they rely on human observations or coding of teacher discourse. To address this challenge, we set out to use automatic speech recognition, natural language processing, and machine learning to train computers to detect authentic questions in real-world classrooms automatically. Our methods were iteratively refined using classroom audio and human coded observational data from two sources: (a) a large archival database of text transcripts of 451 observations from 112 classrooms; and (b) a newly collected sample of 132 high-quality audio recordings from 27 classrooms, obtained under technical constraints that anticipate large-scale automated data collection and analysis. Correlations between human coded and computer-coded authenticity at the classroom level were sufficiently high (r = .602 for archival transcripts and .687 for audio recordings) to provide a valuable complement to human coding in research efforts. 
    more » « less
  2. In the financial sphere, there is a wealth of accumulated unstructured financial data, such as the textual disclosure documents that companies submit on a regular basis to regulatory agencies, such as the Securities and Exchange Commission (SEC). These documents are typically very long and tend to contain valuable soft information about a company’s performance that is not present in quantitative predictors. It is therefore of great interest to learn predictive models from these long textual documents, especially for forecasting numerical key performance indicators (KPIs). In recent years, there has been a great progress in natural language processing via pre-trained language models (LMs) learned from large corpora of textual data. This prompts the important question of whether they can be used effectively to produce representations for long documents, as well as how we can evaluate the effectiveness of representations produced by various LMs. Our work focuses on answering this critical question, namely the evaluation of the efficacy of various LMs in extracting useful soft information from long textual documents for prediction tasks. In this paper, we propose and implement a deep learning evaluation framework that utilizes a sequential chunking approach combined with an attention mechanism. We perform an extensive set of experiments on a collection of 10-K reports submitted annually by US banks, and another dataset of reports submitted by US companies, in order to investigate thoroughly the performance of different types of language models. Overall, our framework using LMs outperforms strong baseline methods for textual modeling as well as for numerical regression. Our work provides better insights into how utilizing pre-trained domain-specific and fine-tuned long-input LMs for representing long documents can improve the quality of representation of textual data, and therefore, help in improving predictive analyses.

    more » « less
  3. While day-to-day questions come with a variety of answer types, the current question-answering (QA) literature has failed to adequately address the answer diversity of questions. To this end, we present GooAQ, a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google’s responses to our collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections. We benchmark T5 models on GooAQ and observe that: (a) in line with recent work, LM’s strong performance on GooAQ’s short-answer questions heavily benefit from annotated data; however, (b) their quality in generating coherent and accurate responses for questions requiring long responses (such as ‘how’ and ‘why’ questions) is less reliant on observing annotated data and mainly supported by their pre-training. We release GooAQ to facilitate further research on improving QA with diverse response types. 
    more » « less
  4. Automatic discourse processing is bottlenecked by data: current discourse formalisms pose highly demanding annotation tasks involving large taxonomies of discourse relations, making them inaccessible to lay annotators. This work instead adopts the linguistic framework of Questions Under Discussion (QUD) for discourse analysis and seeks to derive QUD structures automatically. QUD views each sentence as an answer to a question triggered in prior context; thus, we characterize relationships between sentences as free-form questions, in contrast to exhaustive fine-grained taxonomies. We develop the first-of-its-kind QUD parser that derives a dependency structure of questions over full documents, trained using a large, crowdsourced question-answering dataset DCQA (Ko et al., 2022). Human evaluation results show that QUD dependency parsing is possible for language models trained with this crowdsourced, generalizable annotation scheme. We illustrate how our QUD structure is distinct from RST trees, and demonstrate the utility of QUD analysis in the context of document simplification. Our findings show that QUD parsing is an appealing alternative for automatic discourse processing. 
    more » « less
  5. null (Ed.)
    Speech and language development in children is crucial for ensuring optimal outcomes in their long term development and life-long educational journey. A child’s vocabulary size at the time of kindergarten entry is an early indicator of learning to read and potential long-term success in school. The preschool classroom is thus a promising venue for monitoring growth in young children by measuring their interactions with teachers and classmates. Automatic Speech Recognition (ASR) technologies provide the ability for ‘Early Childhood’ researchers for automatically analyzing naturalistic recordings in these settings. For this purpose, data are collected in a high-quality childcare center in the United States using Language Environment Analysis (LENA) devices worn by the preschool children. A preliminary task for ASR of daylong audio recordings would involve diarization, i.e., segmenting speech into smaller parts for identifying ‘who spoke when.’ This study investigates a Deep Learning-based diarization system for classroom interactions of 3-5-year-old children. However, the focus is on ’speaker group’ diarization, which includes classifying speech segments as being from adults or children from across multiple classrooms. SincNet based diarization systems achieve utterance level Diarization Error Rate of 19.1%. Utterance level speaker group confusion matrices also show promising, balanced results. These diarization systems have potential applications in developing metrics for adult-to-child or child-to-child rapid conversational turns in a naturalistic noisy early childhood setting. Such technical advancements will also help teachers better and more efficiently quantify and understand their interactions with children, make changes as needed, and monitor the impact of those changes. 
    more » « less