Title: 9.8 A 25mm² SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET
Award ID(s):
1718160 1704834
Journal Name:
International Solid-State Circuits Conference
Sponsoring Org:
National Science Foundation
More Like this
  1. A critical issue in current speech-based sequence-to-one learning tasks, such as speech emotion recognition (SER), is the dynamic temporal modeling of speech sentences with different durations. The goal is to extract an informative representation vector of the sentence from acoustic feature sequences of varying length. Traditional methods rely on static descriptions such as statistical functions or a universal background model (UBM), which are not capable of characterizing dynamic temporal changes. Recent advances in deep learning architectures provide promising results, directly extracting sentence-level representations from frame-level features. However, conventional cropping and padding techniques for handling variable-length sequences are not optimal, since they truncate or artificially add sentence-level information. Therefore, we propose a novel dynamic chunking approach, which maps the original sequences of different lengths into a fixed number of chunks that have the same duration by adjusting their overlap. This simple chunking procedure creates a flexible framework that can incorporate different feature extractions and sentence-level temporal aggregation approaches to cope, in a principled way, with different sequence-to-one tasks. Our experimental results based on three databases demonstrate that the proposed framework provides: 1) improved recognition accuracy, 2) robustness toward different temporal length predictions, and 3) high computational efficiency.
  2. Prior work in speech processing indicates that listening tasks with multiple speakers (as opposed to a single speaker) result in slower and less accurate processing. Notably, the trial-to-trial cognitive demands of switching between speakers or switching between accents have yet to be examined. We used pupillometry, a physiological index of cognitive load, to examine the demands of processing first (L1) and second (L2) language-accented speech when listening to sentences produced by the same speaker consecutively (no switch), a novel speaker of the same accent (within-accent switch), and a novel speaker with a different accent (across-accent switch). Inspired by research on sequential adjustments in cognitive control, we aimed to identify the cognitive demands of accommodating a novel speaker and accent by examining the trial-to-trial changes in pupil dilation during speech processing. Our results indicate that switching between speakers was more cognitively demanding than listening to the same speaker consecutively. Additionally, switching to a novel speaker with a different accent was more cognitively demanding than switching between speakers of the same accent. However, there was an asymmetry for across-accent switches, such that switching from an L1 to an L2 accent was more demanding than vice versa. Findings from the present study align with work examining multi-talker processing costs, and provide novel evidence that listeners dynamically adjust cognitive processing to accommodate speaker and accent variability. We discuss these novel findings in the context of an active control model and auditory streaming framework of speech processing.
  3. Herein, we demonstrate that homopolymerization and statistical copolymerization of 2-ethylhexyl thiophene-3-carboxylate and 2-ethylhexyl selenophene-3-carboxylate monomers are possible via Suzuki–Miyaura cross-coupling. A commercially available palladium catalyst ([1,3-bis(2,6-di-3-pentylphenyl)imidazol-2-ylidene](3-chloropyridyl)dichloropalladium(II) or PEPPSI-IPent) was employed to prepare regioregular conjugated polymers with high molecular weights (∼20–30 kg mol⁻¹) and relatively narrow molecular weight distributions. The optical bandgap in the copolymer series could be reduced by increasing the concentration of selenophene-3-carboxylate in the material. Configurational triads were observed in the ¹H NMR spectra of the statistical copolymers, which were assigned using a combination of 2D NMR techniques. The use of a ¹H–⁷⁷Se HSQC spectrum to further examine sequence distribution in the statistical copolymers revealed how ⁷⁷Se NMR can be used as a tool to examine the microstructure of Se-containing conjugated polymers.
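
The dynamic chunking idea in item 1 (a fixed number of equal-duration chunks, with the overlap adjusted to the sequence length) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dynamic_chunks` and its parameters are hypothetical, and spreading the chunk start positions evenly is an assumption about how the overlap might be adjusted.

```python
def dynamic_chunks(num_frames, chunk_len, num_chunks):
    """Return (start, end) frame ranges for a fixed number of equal-length chunks.

    Hypothetical helper: chunk starts are spread evenly over the sequence, so
    neighbouring chunks overlap more for short sequences and less for long ones.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    if num_frames < chunk_len:
        raise ValueError("sequence is shorter than one chunk")
    if num_chunks == 1:
        return [(0, chunk_len)]
    # Distance between consecutive chunk starts (may be fractional).
    # If step exceeds chunk_len the chunks leave gaps, so the caller
    # would need to choose num_chunks large enough for the longest input.
    step = (num_frames - chunk_len) / (num_chunks - 1)
    starts = [round(i * step) for i in range(num_chunks)]
    return [(s, s + chunk_len) for s in starts]


# A short and a long sequence both map to 4 chunks of 40 frames each.
short_seq = dynamic_chunks(num_frames=55, chunk_len=40, num_chunks=4)
long_seq = dynamic_chunks(num_frames=100, chunk_len=40, num_chunks=4)
```

In this sketch the 55-frame sequence yields chunks that overlap by 35 frames, while the 100-frame sequence yields chunks that overlap by 20 frames; in both cases the chunk count and chunk duration stay the same, and no frames are cropped or padded.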