The manner in which acoustic features contribute to perceiving speaker identity remains unclear. In an attempt to better understand speaker perception, we investigated human and machine speaker discrimination with utterances shorter than 2 seconds. Sixty-five listeners performed a same vs. different task. Machine performance was estimated with i-vector/PLDA-based automatic speaker verification systems, one using mel-frequency cepstral coefficients (MFCCs) and the other using voice quality features (VQual2) inspired by a psychoacoustic model of voice quality. Machine performance was measured in terms of the detection and log-likelihood-ratio cost functions. Humans showed higher confidence for correct target decisions compared to correct non-target decisions, suggesting that they rely on different features and/or decision making strategies when identifying a single speaker compared to when distinguishing between speakers. For non-target trials, responses were highly correlated between humans and the VQual2-based system, especially when speakers were perceptually marked. Fusing human responses with an MFCC-based system improved performance over human-only or MFCC-only results, while fusing with the VQual2-based system did not. The study is a step towards understanding human speaker discrimination strategies and suggests that automatic systems might be able to supplement human decisions especially when speakers are marked.
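As a hedged illustration of the machine metrics named above, the sketch below computes a detection cost and the log-likelihood-ratio cost (Cllr) from verification scores. The cost weights, target prior, and the assumption that scores are calibrated log-likelihood ratios are illustrative choices, not parameters taken from the study.

```python
import numpy as np

def cllr(tar_llrs, non_llrs):
    """Log-likelihood-ratio cost; assumes scores are calibrated LLRs (natural log)."""
    tar = np.asarray(tar_llrs, dtype=float)
    non = np.asarray(non_llrs, dtype=float)
    c_tar = np.mean(np.log2(1.0 + np.exp(-tar)))  # penalty on target trials
    c_non = np.mean(np.log2(1.0 + np.exp(non)))   # penalty on non-target trials
    return 0.5 * (c_tar + c_non)

def detection_cost(tar_scores, non_scores, threshold,
                   p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Detection cost at a fixed threshold; p_target, c_miss, c_fa are placeholders."""
    p_miss = np.mean(np.asarray(tar_scores) < threshold)   # missed targets
    p_fa = np.mean(np.asarray(non_scores) >= threshold)    # false alarms
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
```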
Effectiveness of Voice Quality Features in Detecting Depression
Automatic assessment of depression from speech signals is affected by variability in acoustic content and across speakers. In this study, we focused on addressing these sources of variability. We used a database of recorded interviews from a large number of female speakers: 735 individuals suffering from depressive disorders (dysthymia and major depression) and anxiety disorders (generalized anxiety disorder, panic disorder with or without agoraphobia), and 953 healthy individuals. Leveraging this unique and extensive database, we built an i-vector framework. To capture various aspects of speech signals, we used voice quality features in addition to conventional cepstral features. The features (F0, F1, F2, F3, H1-H2, H2-H4, H4-H2k, A1, A2, A3, and CPP) were inspired by a psychoacoustic model of voice quality [1]. Two i-vector-based systems were developed, one using mel-frequency cepstral coefficients (MFCCs) and the other using voice quality features. The voice quality features performed as well as MFCCs. Score-level fusion was then used to combine the two systems, yielding a 6% relative improvement in accuracy over the i-vector system based on MFCCs alone. The system remained robust even when the duration of the utterances was shortened to 10 seconds.
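A minimal sketch of the score-level fusion described above, assuming per-trial scores from the MFCC and voice-quality systems are available as arrays. The z-normalization and the equal default weight are assumptions for illustration, not the paper's exact fusion recipe; in practice the weight would be tuned on a development set.

```python
import numpy as np

def fuse_scores(mfcc_scores, vq_scores, weight=0.5):
    """Weighted score-level fusion of two systems' trial scores.

    `weight` is a placeholder; it would normally be tuned on held-out data.
    """
    mfcc = np.asarray(mfcc_scores, dtype=float)
    vq = np.asarray(vq_scores, dtype=float)
    # z-normalize each system so scores are on a comparable scale before mixing
    mfcc = (mfcc - mfcc.mean()) / mfcc.std()
    vq = (vq - vq.mean()) / vq.std()
    return weight * mfcc + (1.0 - weight) * vq
```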
- Award ID(s): 1704167
- PAR ID: 10098305
- Date Published:
- Journal Name: Interspeech 2018
- Page Range / eLocation ID: 1676 to 1680
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Although social anxiety and depression are common, they are often underdiagnosed and undertreated, in part due to difficulties identifying and accessing individuals in need of services. Current assessments rely on client self-report and clinician judgment, which are vulnerable to social desirability and other subjective biases. Identifying objective, non-burdensome markers of these mental health problems, such as features of speech, could help advance assessment, prevention, and treatment approaches. Prior research on speech-based detection has focused on fully supervised learning approaches employing strongly labeled data. However, strong labeling of individuals high in symptoms or state affect in speech audio data is impractical, in part because it is not possible to identify with high confidence which regions of a long speech recording indicate the person's symptoms or affective state. We propose a weakly supervised learning framework for detecting social anxiety and depression from long audio clips. Specifically, we present a novel feature modeling technique named NN2Vec that identifies and exploits the inherent relationship between speakers' vocal states and symptoms/affective states. Detecting speakers high in social anxiety or depression symptoms using NN2Vec features achieves F-1 scores 17% and 13% higher than those of the best available baselines. In addition, we present a new multiple-instance learning adaptation of a BLSTM classifier, named BLSTM-MIL. Our framework of using NN2Vec features with the BLSTM-MIL classifier achieves F-1 scores of 90.1% and 85.44% in detecting speakers high in social anxiety and depression symptoms.
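A rough, hypothetical sketch of the multiple-instance learning idea behind a BLSTM-MIL classifier: a long clip is cut into a bag of segments, each segment gets an instance-level score, and the bag label is pooled from the instances. The feature dimension, layer sizes, and max-pooling choice are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class BagBLSTM(nn.Module):
    """Schematic MIL classifier: a bag of speech segments -> one bag-level logit."""
    def __init__(self, feat_dim=40, hidden=64):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.instance_logit = nn.Linear(2 * hidden, 1)

    def forward(self, bag):
        # bag: (n_instances, seq_len, feat_dim) -- segments from one long clip
        h, _ = self.blstm(bag)                     # (n, seq_len, 2*hidden)
        logits = self.instance_logit(h[:, -1, :])  # one logit per instance
        return logits.max()                        # max-pooling: any positive instance flags the bag
```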
- Background: Individuals affected by disasters are at risk for adverse mental health sequelae. Individuals living in the US Gulf Coast have experienced many recent major disasters, but few studies have explored the cumulative burden of experiencing multiple disasters on mental health. Objective: The objective of this study was to evaluate the relationship between disaster burden and mental health. Methods: We used data from 9278 Gulf Long-term Follow-up Study participants who completed questionnaires on perceived stress, anxiety, depression, and post-traumatic stress disorder (PTSD) in 2011-2013. We linked 2005-2010 county-level data from the Spatial Hazard Events and Losses Database for the United States, a database of loss-causing events, to participants' home addresses. Exposure measures included the total count of loss events as well as severity, quantified as property/crop losses per capita from all hazards. We used multilevel modeling to estimate odds ratios (OR) and 95% confidence intervals (CI) for each exposure-outcome relationship. Results: The total count of loss events was positively associated with perceived stress (OR Q4: 1.40, 95% CI: 1.21-1.61) and inversely associated with PTSD (OR Q4: 0.66, 95% CI: 0.45-0.96). Total duration of exposure was also associated with stress (OR Q4: 1.16, 95% CI: 1.01-1.33) but not with other outcomes. Severity based on cumulative fatalities/injuries was associated with anxiety (OR Q4: 1.31, 95% CI: 1.05-1.63) and stress (OR Q4: 1.34, 95% CI: 1.15-1.57), and severity based on cumulative property/crop losses was associated with anxiety (OR Q4: 1.42, 95% CI: 1.12-1.81), depression (OR Q4: 1.22, 95% CI: 0.95-1.57), and PTSD (OR Q4: 1.99, 95% CI: 1.44-2.76). Keywords: PTSD; disasters; mental health; natural hazards.
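For readers unfamiliar with the reported statistics: an odds ratio and its Wald confidence interval come from exponentiating a logistic-regression coefficient. A small sketch follows; the coefficient and standard error are back-calculated from the first reported estimate purely for illustration and do not come from the study's data.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """OR = exp(beta); 95% Wald CI = exp(beta +/- 1.96 * SE)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# A coefficient of ~0.336 with SE ~0.073 reproduces the stress estimate above:
# OR ~= 1.40, 95% CI ~= (1.21, 1.61)
print(odds_ratio_with_ci(0.336, 0.073))
```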
- Vocal fry or creaky voice refers to a voice quality characterized by irregular glottal opening and low pitch. It occurs in diverse languages and is prevalent in American English, where it is used not only to mark phrase finality but also sociolinguistic factors and affect. Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems, particularly for languages where creak is frequently used. This paper proposes a deep learning model to detect creaky voice in fluent speech. The model is composed of an encoder and a classifier trained together. The encoder takes the raw waveform and learns a representation using a convolutional neural network. The classifier is implemented as a multi-headed fully connected network trained to detect creaky voice, voicing, and pitch, where the last two are used to refine creak prediction. The model is trained and tested on speech of American English speakers, annotated for creak by trained phoneticians. We evaluated the performance of our system using two encoders: one tailored for the task and the other based on a state-of-the-art unsupervised representation. Results suggest our best-performing system improves recall and F1 scores over previous methods on unseen data.
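A schematic sketch of the encoder-plus-multi-head design described above: a CNN encoder over raw samples feeding separate creak, voicing, and pitch heads. Filter sizes, strides, and channel counts are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CreakDetector(nn.Module):
    """Raw waveform -> per-frame creak/voicing/pitch predictions (illustrative sizes)."""
    def __init__(self, channels=128):
        super().__init__()
        self.encoder = nn.Sequential(  # CNN encoder over raw audio samples
            nn.Conv1d(1, 64, kernel_size=320, stride=160), nn.ReLU(),
            nn.Conv1d(64, channels, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.creak_head = nn.Linear(channels, 1)    # main creak logit
        self.voicing_head = nn.Linear(channels, 1)  # auxiliary voicing head
        self.pitch_head = nn.Linear(channels, 1)    # auxiliary pitch head

    def forward(self, wav):                    # wav: (batch, 1, n_samples)
        z = self.encoder(wav).transpose(1, 2)  # (batch, n_frames, channels)
        return self.creak_head(z), self.voicing_head(z), self.pitch_head(z)
```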
- This study investigates how individuals with visual disabilities and their sighted counterparts perceive user experiences with smart speakers. A sample of 79 participants, including 41 with visual disabilities and 38 sighted individuals, used Amazon Echo 4th Gen smart speakers. After participants used the smart speakers for one week in their daily lives, exit interviews were administered and analyzed, yielding themes of accessibility, effectiveness, enjoyment, efficiency, and privacy. Findings revealed that the voice user interfaces of smart speakers significantly enhanced accessibility and user satisfaction for those with visual disabilities, while the voice assistant Alexa contributed to fostering emotional connections. Sighted participants, while benefiting from the smart speakers' multifunctionality and efficiency, faced challenges with initial setup and advanced features. Individuals with visual disabilities raised privacy concerns. This study underscores the need for inclusive design improvements to address the diverse needs of all users. To improve user experience, future enhancements should focus on refining voice command accuracy, integrating predictive features, optimizing onboarding processes, and strengthening privacy controls.