

Search for: All records

Creators/Authors contains: "Hansen, J.H.L."



  1. (Submitted; in review for IEEE ICASSP-2024)
    The Fearless Steps Apollo (FS-APOLLO) resource is a collection of over 150,000 hours of audio, associated metadata, and a supplemental technological toolkit intended to benefit (i) the speech processing technology community, (ii) the communication science, team-based psychology, and history communities, and (iii) the education/STEM and preservation/archival communities. The FS-APOLLO initiative, which started in 2014, has since resulted in the preservation of over 75,000 hours of NASA Apollo Missions audio. Systems created for this audio collection have led to the emergence of several new Speech and Language Technologies (SLT). This paper seeks to provide an overview of the latest advancements in the FS-APOLLO effort and explore upcoming strategies in big-data deployment, outreach, and novel avenues of K-12 and STEM education facilitated through this resource.
    Free, publicly-accessible full text available April 16, 2025
  2. Fearless Steps (FS) APOLLO is a 50,000+ hour audio resource established by CRSS-UTDallas capturing all communications between NASA-MCC personnel, backroom staff, and astronauts across manned Apollo Missions. Without metadata, such a massive unlabeled audio corpus provides limited benefit for communities outside Speech and Language Technology (SLT). Supplementing this audio with rich metadata, developed using robust automated mechanisms to transcribe and highlight naturalistic communications, can facilitate open research opportunities for the SLT, speech sciences, education, and historical archival communities. In this study, we focus on customizing keyword spotting (KWS) and topic detection systems as an initial step towards conversational understanding. Extensive research in automatic speech recognition (ASR), speech activity detection, and speaker diarization using the manually transcribed 125-hour FS Challenge corpus has demonstrated the need for robust domain-specific model development. A major challenge in training KWS systems and topic detection models is the availability of word-level annotations. Forced alignment schemes evaluated using state-of-the-art ASR show significant degradation in segmentation performance. This study explores challenges in extracting accurate keyword segments using existing sentence-level transcriptions and proposes domain-specific KWS-based solutions to detect conversational topics in audio streams.
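As a rough illustration of the word-level annotation problem described above: once word boundaries exist (e.g., from forced alignment or word-level ASR), keyword segments can be pulled out by simple token matching. A minimal sketch, assuming a hypothetical (word, start, end) tuple format that is not the actual FS-APOLLO transcript schema:

```python
def find_keyword_segments(words, keyword):
    """Return (start_s, end_s) spans for every occurrence of `keyword`.

    `words` is a time-ordered list of (token, start_s, end_s) tuples,
    e.g. from a forced-alignment or word-level ASR pass. This tuple
    schema is an illustrative assumption, not the corpus format.
    """
    keyword = keyword.lower()
    return [(start, end) for token, start, end in words
            if token.lower() == keyword]

# Toy transcript with "launch" spoken twice.
transcript = [
    ("go", 0.00, 0.21), ("for", 0.21, 0.35), ("launch", 0.35, 0.90),
    ("roger", 1.10, 1.52),
    ("go", 2.00, 2.18), ("for", 2.18, 2.30), ("launch", 2.30, 2.85),
]
print(find_keyword_segments(transcript, "launch"))
```

With only sentence-level transcriptions, the start/end fields above are exactly what is missing, which is why degraded forced-alignment output hurts KWS training.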
  3. INTRODUCTION: Apollo-11 (A-11) was the first manned space mission to successfully bring astronauts to the moon and return them safely. Effective team-based communication is required for mission specialists to work collaboratively to learn, engage, and solve complex problems. As part of NASA’s goal in assessing team and mission success, all vital speech communications between these personnel were recorded using the multi-track SoundScriber system onto analog tapes, preserving their contribution to the success of one of the greatest achievements in human history. More than 400 personnel served as mission specialists/support staff, communicating across 30 audio loops and resulting in 9,000+ hours of data for A-11. To ensure the success of this mission, it was necessary for teams to communicate, learn, and address problems in a timely manner. Previous research has found that compatibility of individual personalities within teams is important for effective team collaboration. Hence, it is essential to identify each speaker’s role during an Apollo mission and analyze group communications for knowledge exchange and problem solving toward a common goal. Assessing and analyzing speaker roles during the mission allows for exploring engagement analysis in multi-party speaker situations. METHOD: The UTDallas Fearless Steps Apollo data comprises 19,000 hours (A-11, A-13, A-1) and poses multiple unique challenges, as it is characterized by severe noise and degradation as well as overlap instances over the 30 channels. For our study, we selected a subset of 100 hours manually transcribed by professional annotators for speaker labels. The 100 hours are obtained from three mission-critical events: (1) Lift-Off (25 hours), (2) Lunar-Landing (50 hours), and (3) Lunar-Walking (25 hours).
Five channels of interest, out of the 30, were selected based on having the most speech activity; the primary speakers operating these five channels are the commanders/owners of those channels. For our analysis, we select five speaker roles: Flight Director (FD), Capsule Communicator (CAPCOM), Guidance, Navigation, and Control (GNC), Electrical, Environmental, and Consumables Manager (EECOM), and Network (NTWK). To track and tag individual speakers across our Fearless Steps audio dataset, we use the concept of ‘Where’s Waldo’ to identify all instances of our speakers-of-interest across a cluster of other speakers. Also, to understand the roles of our speakers-of-interest, we use the speaking duration of primary vs. secondary speakers and speaker turns as our metrics to determine each speaker's role and to understand their responsibility during the three critical phases of the mission. This enables a content-linking capability as well as provides a pathway to analyzing group engagement, the group dynamics of people working together in an enclosed space, psychological effects, and cognitive analysis of such individuals. IMPACT: NASA’s Apollo Program stands as one of the most significant contributions to humankind. This collection opens new research options for recognizing team communication, group dynamics, and human engagement/psychology for future deep space missions. Analyzing team communications toward such goals would allow for the formulation of educational and training technologies for assessment of STEM knowledge, task learning, and educational feedback. Also, identifying these personnel can help pay tribute and yield personal recognition to the hundreds of notable engineers and scientists who made this feat possible. ILLUSTRATION: In this work, we propose to illustrate how a pre-trained speech/language network can be used to obtain powerful speaker embeddings needed for speaker diarization.
This framework is used to build these learned embeddings to label unique speakers over sustained audio streams. To train and test our system, we make use of the Fearless Steps Apollo corpus, allowing us to effectively leverage a limited labeling resource (100 hours of labeled data out of 9,000+ hours). Furthermore, we use the concept of 'Finding Waldo' to identify key speakers of interest (SOI) throughout the Apollo-11 mission audio across multiple channel audio streams.
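The 'Finding Waldo' idea above, matching each audio segment against a small set of enrolled speakers-of-interest, can be sketched as nearest-neighbor scoring over speaker embeddings. A minimal illustration with hypothetical 3-dimensional embeddings and an assumed cosine-similarity threshold (real systems use embeddings of a few hundred dimensions from a pre-trained network):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def tag_speaker(segment_emb, enrolled, threshold=0.6):
    """Return the best-matching enrolled speaker, or None if no
    enrolled embedding scores above `threshold` (i.e., the segment
    belongs to the cluster of 'other' speakers)."""
    best, score = None, threshold
    for name, emb in enrolled.items():
        s = cosine(segment_emb, emb)
        if s > score:
            best, score = name, s
    return best

# Toy enrollment: one embedding per speaker role (illustrative values).
enrolled = {"FD": [1.0, 0.0, 0.0], "CAPCOM": [0.0, 1.0, 0.0]}
print(tag_speaker([0.9, 0.1, 0.0], enrolled))
```

A segment whose embedding lies far from every enrolled speaker is left untagged, which is how the sketch separates speakers-of-interest from the hundreds of other mission personnel.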
  4. Recent developments in deep learning strategies have revolutionized Speech and Language Technologies (SLT). Deep learning models often rely on massive naturalistic datasets to produce the complexity required for superior performance. However, most massive SLT datasets are not publicly available, limiting the potential for academic research. Through this work, we showcase the CRSS-UTDallas-led efforts to recover, digitize, and openly distribute over 50,000 hours of speech data recorded during the 12 NASA Apollo manned missions, and outline our continuing efforts to digitize and create metadata through diarization of the remaining 100,000 hours. We present novel deep learning-based speech processing solutions developed to extract high-level information from this massive dataset. The Fearless Steps APOLLO resource is a 50,000-hour audio collection from 30-track analog tapes originally used to document Apollo missions 1, 7, 8, 10, 11, and 13. A customized tape read-head developed to digitize all 30 channels simultaneously has been deployed to expedite digitization of the remaining mission tapes. Diarized transcripts for these unlabeled audio communications have also been generated to facilitate open research from the speech sciences, historical archives, education, and speech technology communities. Robust technologies developed to generate human-readable transcripts include: (i) speaker diarization, (ii) speaker tracking, and (iii) text output from speech recognition systems.
  5. INTRODUCTION: CRSS-UTDallas initiated and oversaw the efforts to recover APOLLO mission communications by re-engineering the NASA SoundScriber playback system and digitizing 30-channel analog audio tapes, covering the entire Apollo-11, Apollo-13, and Gemini-8 missions during 2011-17 [1,6]. This vast data resource was made publicly available along with supplemental speech and language technology metadata based on CRSS pipeline diarization transcripts and conversational speaker time-stamps for the Apollo team at the NASA Mission Control Center [2,4]. Renewed efforts over the past year (2021) have resulted in the digitization of an additional 50,000+ hours of audio from the Apollo 7, 8, 9, 10, and 12 missions, along with the remaining A-13 tapes. Cumulative digitization efforts have enabled the development of the largest publicly available speech data resource with unprompted, real conversations recorded in naturalistic environments. Deployment of this massive corpus has inspired multiple collaborative initiatives, such as the web resources ExploreApollo (https://app.exploreapollo.org) and LanguageARC (https://languagearc.com/projects/21) [3]. ExploreApollo.org serves as the visualization and playback tool, while LanguageARC is the crowd-sourced subject-content tagging resource developed by undergraduate/graduate students, intended as an educational resource for K-12 students and STEM/Apollo enthusiasts. Significant algorithmic advancements have included advanced deep learning models that are now able to improve automatic transcript generation quality and even extract high-level knowledge such as labels of the topics being spoken across different mission stages. Efficient transcript generation and topic extraction tools for this naturalistic audio have wide applications including content archival and retrieval, speaker indexing, education, and group dynamics and team cohesion analysis. Some of these applications have been deployed in our online portals to provide a more immersive experience for students and researchers.
Continued worldwide outreach in the form of the Fearless Steps Challenges has proven successful with the most recent Phase-4 of the Challenge series. This challenge has motivated research in low-level tasks such as speaker diarization and high-level tasks like topic identification. IMPACT: Distribution and visualization of the Apollo audio corpus through the above-mentioned online portals and Fearless Steps Challenges have produced significant impact as a STEM education resource for K-12 students as well as an SLT development resource with real-world applications for research organizations globally. The speech technologies developed by CRSS-UTDallas using the Fearless Steps Apollo corpus have improved previous benchmarks on multiple tasks [1,5]. The continued initiative will extend the current digitization efforts to include over 150,000 hours of audio recorded during all Apollo missions. ILLUSTRATION: We will demonstrate the ExploreApollo and LanguageARC online portals with newly digitized audio playback, in addition to improved SLT baseline systems and results from ASR and Topic Identification systems, including research performed on the conversational corpus. Performance analysis visualizations will also be illustrated, and we will display results from past challenges and their state-of-the-art system improvements.
  6. Speech and language development are early indicators of overall analytical and learning ability in children. The preschool classroom is a rich language environment for monitoring and ensuring growth in young children by measuring their vocal interactions with teachers and classmates. Early childhood researchers are naturally interested in analyzing naturalistic vs. controlled lab recordings to measure both the quality and quantity of such interactions. Unfortunately, present-day speech technologies are not capable of addressing the wide dynamic scenario of early childhood classroom settings. Due to the diversity of acoustic events/conditions in such daylong audio streams, automated speaker diarization technology would need to be advanced to address this challenging domain for segmenting audio as well as information extraction. This study investigates an alternate deep learning-based diarization solution for segmenting classroom interactions of 3-5 year-old children with teachers. In this context, the focus is on speech-type diarization, which classifies speech segments as being either from adults or children, partitioned across multiple classrooms. Our proposed ResNet model achieves a best F1-score of ∼71.0% on data from two classrooms, based on dev and test sets of each classroom. Additionally, F1-scores are obtained for individual segments with corresponding speaker tags (e.g., adult vs. child), which provide knowledge for educators on child engagement through naturalistic communications. The study demonstrates the prospects of addressing educational assessment needs through communication audio stream analysis, while maintaining both the security and privacy of all children and adults. The resulting child communication metrics have been used for broad-based feedback for teachers with the help of visualizations.
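The F1-score reported above for adult/child speech-type classification is the harmonic mean of precision and recall over labeled segments. A minimal sketch of per-class F1 on toy segment labels (illustrative data, not the study's):

```python
def f1_score(preds, refs, label):
    """Per-class F1 over parallel lists of predicted and reference
    segment labels (e.g., 'adult' vs. 'child')."""
    tp = sum(1 for p, r in zip(preds, refs) if p == label and r == label)
    fp = sum(1 for p, r in zip(preds, refs) if p == label and r != label)
    fn = sum(1 for p, r in zip(preds, refs) if p != label and r == label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy segment labels: one child segment is misclassified as adult.
refs  = ["adult", "child", "child", "adult", "child"]
preds = ["adult", "child", "adult", "adult", "child"]
print(round(f1_score(preds, refs, "child"), 2))
```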
  7. The use of wh-words, including wh-questions and wh-clauses, can be linguistically, conceptually, and interactively challenging to preschoolers. Young children develop mastery of wh-words as they formulate and hear these words during daily interactions in contexts such as preschool classrooms. Observational approaches limit researchers' ability to comprehensively capture classroom conversations, including wh-words. In the current study, we report the results of the first study using an automated speech recognition (ASR) system coupled with location sensors designed to quantify teachers' wh-words in the literacy activity areas of a preschool classroom. We found that the ASR system is a viable solution for automatically quantifying the number of adult wh-words used in preschool classrooms. Our findings demonstrated that the most frequently used adult wh-word type was "what." Classroom adults used more wh-words during time point 1 compared to time point 2. Lastly, a child at risk for developmental delays heard more wh-words per minute than a typically developing child. Future research is warranted to further improve these efforts.
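Once ASR transcripts exist, quantifying wh-words per minute as above reduces to token matching against a small wh-word list. A hedged sketch with made-up transcript text (the wh-word inventory shown is a common convention, not necessarily the study's exact list):

```python
# Illustrative wh-word inventory.
WH_WORDS = {"what", "where", "when", "who", "why", "which", "how"}

def wh_rate_per_minute(transcript, duration_minutes):
    """Count wh-word tokens in a transcript string and normalize
    by the recording duration in minutes."""
    tokens = transcript.lower().split()
    count = sum(1 for t in tokens if t.strip(".,?!") in WH_WORDS)
    return count / duration_minutes

sample = "What do you see? Where does the water go? That one!"
print(wh_rate_per_minute(sample, 0.5))
```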
  8. Adult-child interaction is an important component of language development in young children. Teachers responsible for the language acquisition of their students have a vested interest in improving such conversation in their classrooms. Advancements in speech technology and natural language processing can be used as effective tools by teachers in preschool classrooms to acquire large amounts of conversational data, receive feedback from automated conversational analysis, and amend their teaching methods. Measuring engagement among preschool children and teachers is a challenging task and not well defined. In this study, we focus on developing criteria to measure conversational turn-taking and topic initiation during adult-child interactions in preschool environments. However, counting conversational turns, conversation initiations, or vocabulary alone is not enough to judge the quality of a conversation and track language acquisition. It is necessary to use a combination of the three and include a measurement of vocabulary complexity. The next iteration of this work is to deploy various solutions from speech and language processing technology to automate these measurements. * (2022 ASEE Best Student Paper Award Winner)
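Conversational turn-taking, one of the criteria above, can be counted from a time-ordered utterance stream: a turn is registered whenever the speaker changes between consecutive utterances. A minimal sketch, assuming a hypothetical (speaker, text) pair format rather than the study's actual data schema:

```python
def count_turns(utterances):
    """Count conversational turns in a time-ordered list of
    (speaker, text) pairs: each change of speaker between
    consecutive utterances counts as one turn."""
    turns = 0
    prev = None
    for speaker, _text in utterances:
        if prev is not None and speaker != prev:
            turns += 1
        prev = speaker
    return turns

# Toy adult-child exchange; consecutive teacher utterances are one turn.
dialog = [
    ("teacher", "What is this shape?"),
    ("child", "A circle!"),
    ("teacher", "Right, and this one?"),
    ("teacher", "Take your time."),
    ("child", "A square."),
]
print(count_turns(dialog))
```

As the abstract notes, turn counts alone say nothing about vocabulary complexity, which is why the proposed criteria combine several measurements.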
  9.
    Most current speech technology systems are designed to operate well even in the presence of multiple active speakers. However, most solutions assume that the number of concurrent speakers is known. Unfortunately, this information might not always be available in real-world applications. In this study, we propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech. The proposed system extracts higher-level information from the speech spectral content using a CNN model. Next, the attention mechanism summarizes the extracted information into a compact feature vector without losing critical information. Finally, the number of active speakers is classified using a fully connected network. Experiments on simulated overlapping speech using the WSJ corpus show that the attention mechanism improves performance by almost 3% absolute over conventional temporal average pooling. The proposed attention-guided CNN achieves 76.15% for both Weighted Accuracy and average Recall, and 75.80% Precision on speech segments as short as 20 frames (i.e., 200 ms). All classification metrics exceed 92% for the attention-guided model in offline scenarios where the input signal is more than 100 frames long (i.e., 1 s).
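The attention pooling contrasted with temporal average pooling above can be sketched as a learned weighted sum over per-frame CNN features: frames scoring higher against a learned vector contribute more to the pooled summary vector. A toy illustration, with a fixed vector `w` standing in for trained attention parameters:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_pool(frames, w):
    """Attention pooling: score each frame vector against `w`,
    softmax the scores into weights, and return the weighted sum.
    `w` is a stand-in for learned attention parameters."""
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    return [sum(a * f[d] for a, f in zip(alphas, frames))
            for d in range(dim)]

frames = [[1.0, 0.0], [0.0, 1.0], [4.0, 0.0]]  # 3 frames, 2-dim features
pooled = attention_pool(frames, w=[1.0, 0.0])
print(pooled)
```

With a flat score (e.g., `w = [0, 0]`) the weights become uniform and the result reduces to temporal average pooling, which is exactly the baseline the attention model is compared against.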
  10. Speech and language development in children are crucial for ensuring effective skills in their long-term learning ability. A child’s vocabulary size at the time of entry into kindergarten is an early indicator of their ability to learn to read and of potential long-term success in school. The preschool classroom is thus a promising venue for assessing growth in young children by measuring their interactions with teachers as well as classmates. However, to date limited studies have explored such naturalistic audio communications. Automatic Speech Recognition (ASR) technologies provide an opportunity for ’Early Childhood’ researchers to obtain knowledge through automatic analysis of naturalistic classroom recordings in measuring such interactions. For this purpose, 208 hours of audio recordings across 48 daylong sessions were collected in a childcare learning center in the United States using Language Environment Analysis (LENA) devices worn by the preschool children. Approximately 29 hours of adult speech and 26 hours of child speech are segmented using manual transcriptions provided by the CRSS transcription team. Traditional as well as end-to-end ASR models are trained on the adult/child speech data subsets. A factorized time-delay neural network provides the best Word Error Rate (WER) of 35.05% on the adult subset of the test set. End-to-end transformer models achieve 63.5% WER on the child subset of the test data. Next, bar plots demonstrating the frequency of WH-question words in the Science vs. Reading activity areas of the preschool are presented for sessions in the test set. It is suggested that learning spaces could be configured to encourage greater adult-child conversational engagement given such speech/audio assessment strategies.
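The Word Error Rate figures above are edit-distance based: WER is the minimum number of word substitutions, insertions, and deletions needed to turn the ASR hypothesis into the reference transcript, divided by the reference word count. A minimal sketch:

```python
def wer(ref, hyp):
    """Word Error Rate via Levenshtein distance over word tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(r)

# Two deleted words ("on", "the") out of a 6-word reference.
print(wer("the cat sat on the mat", "the cat sat mat"))
```

A 35.05% WER thus means roughly one word error for every three reference words, which is why child speech at 63.5% WER remains substantially harder.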