An important task in human-computer interaction is to rank speech samples according to their expressive content. A preference learning framework is appropriate for obtaining an emotional rank for a set of speech samples. However, obtaining reliable labels for training a preference learning framework is a challenging task. Most existing databases provide sentence-level absolute attribute scores annotated by multiple raters, which have to be transformed to obtain preference labels. Previous studies have shown that evaluators anchor their absolute assessments on previously annotated samples. Hence, this study proposes a novel formulation for obtaining preference learning labels by only considering annotation trends assigned by a rater to consecutive samples within an evaluation session. The experiments show that the use of the proposed anchor-based ordinal labels leads to significantly better performance than models trained using existing alternative labels.
Free, publicly accessible full text available August 20, 2024.
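The idea of reading preference labels off a rater's consecutive annotations can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `consecutive_preferences`, the `margin` parameter, and the example scores are all hypothetical, and the abstract does not specify how ties or small score changes are handled. The sketch assumes one rater's session is available as an ordered list of (sample id, absolute score) pairs.

```python
from typing import List, Tuple


def consecutive_preferences(
    session: List[Tuple[str, float]], margin: float = 0.0
) -> List[Tuple[str, str]]:
    """Return (preferred, other) pairs from one rater's ordered session.

    Only consecutive samples are compared, so each score is judged
    relative to the sample the rater annotated just before it.
    `margin` is a hypothetical threshold: score changes whose absolute
    value does not exceed it are treated as "no clear trend" and skipped.
    """
    pairs = []
    for (prev_id, prev_score), (cur_id, cur_score) in zip(session, session[1:]):
        diff = cur_score - prev_score
        if diff > margin:
            # Score went up: the current sample is preferred.
            pairs.append((cur_id, prev_id))
        elif diff < -margin:
            # Score went down: the previous sample is preferred.
            pairs.append((prev_id, cur_id))
    return pairs


# Made-up scores for one rater, in annotation order.
session = [("utt1", 3.0), ("utt2", 4.0), ("utt3", 4.0), ("utt4", 2.0)]
print(consecutive_preferences(session))
```

In this sketch, pairs from different raters or sessions would be generated separately and pooled afterwards, so that a score is never compared against one given by a different rater.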
The performance of facial expression recognition (FER) systems has improved with recent advances in machine learning. While studies have reported impressive accuracies in detecting emotion from posed expressions in static images, there are still important challenges in developing FER systems for videos, especially in the presence of speech. Speech articulation modulates the orofacial area, changing the facial appearance. These facial movements induced by speech introduce noise, reducing the performance of an FER system. Solving this problem is important if we aim to study more naturalistic environments or applications in the wild. We propose a novel approach to compensate for lexical information that does not require phonetic information during inference. The approach relies on a style extractor model, which creates emotional-to-neutral transformations. The transformed facial representations are spatially contrasted with the original faces, highlighting the emotional information conveyed in the video. The results demonstrate that adding the proposed style extractor model to a dynamic FER system improves the performance by 7% (absolute) compared to a similar model with no style extractor. This novel feature representation also improves the generalization of the model.
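The contrast step can be pictured with a small sketch. It makes strong assumptions: the abstract does not say how the spatial contrast is computed, so an element-wise absolute difference is used here purely for illustration, and the neutral-transformed map is taken as a given input rather than produced by a style extractor. The function name `spatial_contrast` and the toy values are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def spatial_contrast(original: Grid, neutral: Grid) -> Grid:
    """Element-wise absolute difference between two equally sized 2-D maps.

    `original` stands for a facial representation of a frame, and `neutral`
    for its emotional-to-neutral transformation (assumed to be provided).
    Regions where the two maps differ are the ones the contrast highlights.
    The absolute difference is an assumption, not the paper's stated method.
    """
    if len(original) != len(neutral) or any(
        len(ro) != len(rn) for ro, rn in zip(original, neutral)
    ):
        raise ValueError("maps must have the same shape")
    return [
        [abs(o - n) for o, n in zip(ro, rn)]
        for ro, rn in zip(original, neutral)
    ]


# Toy 2x3 maps with made-up values.
original = [[0.5, 1.0, 0.25], [0.0, 0.75, 0.5]]
neutral = [[0.5, 0.5, 0.25], [0.0, 0.25, 0.5]]
contrast = spatial_contrast(original, neutral)
print(contrast)
```

In a real system the contrast would more likely be computed on learned feature maps inside the network than on raw grids like these.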