This content will become publicly available on September 1, 2025

Title: From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs
Training emotion recognition models has relied heavily on human-annotated data, which presents diversity, quality, and cost challenges. In this paper, we explore the potential of Large Language Models (LLMs), specifically GPT-4, in automating or assisting emotion annotation. We compare GPT-4 with supervised models and/or humans in three aspects: agreement with human annotations, alignment with human perception, and impact on model training. We find that common metrics that use aggregated human annotations as ground truth can underestimate GPT-4's performance, and our human evaluation experiment reveals a consistent preference for GPT-4 annotations over human annotations across multiple datasets and evaluators. Further, we investigate the impact of using GPT-4 as an annotation filtering process to improve model training. Together, our findings highlight the great potential of LLMs in emotion annotation tasks and underscore the need for refined evaluation methodologies.
Award ID(s):
2230172
PAR ID:
10536399
Author(s) / Creator(s):
; ;
Publisher / Repository:
Interspeech
Date Published:
Format(s):
Medium: X
Location:
Kos, Greece
Sponsoring Org:
National Science Foundation
More Like this
  1. Jiang, Jing ; Reitter, David ; Deng, Shumin (Ed.)
    This paper explores utilizing Large Language Models (LLMs) to perform Cross-Document Event Coreference Resolution (CDEC) annotations and evaluates how they fare against human annotators with different levels of training. Specifically, we formulate CDEC as a multi-category classification problem on pairs of events that are represented as decontextualized sentences, and compare the predictions of GPT-4 with the judgment of fully trained annotators and crowdworkers on the same data set. Our study indicates that GPT-4 with zero-shot learning outperformed crowdworkers by a large margin and exhibits a level of performance comparable to trained annotators. Upon closer analysis, GPT-4 also exhibits tendencies of being overly confident and forcing annotation decisions even when such decisions are not warranted due to insufficient information. Our results have implications for how to perform complicated annotations such as CDEC in the age of LLMs. They show that the best way to acquire such annotations might be to combine the strengths of LLMs and trained human annotators in the annotation process, and that using untrained or undertrained crowdworkers is no longer a viable option for acquiring high-quality data to advance the state of the art for such problems.
  2. Situations and events evoke emotions in humans, but to what extent do they inform the predictions of emotion detection models? This work investigates how well human-annotated emotion triggers correlate with the features that models deem salient in their prediction of emotions. First, we introduce a novel dataset, EmoTrigger, consisting of 900 social media posts sourced from three different datasets; these were annotated by experts for emotion triggers with high agreement. Using EmoTrigger, we evaluate the ability of large language models (LLMs) to identify emotion triggers, and conduct a comparative analysis of the features considered important for these tasks between LLMs and fine-tuned models. Our analysis reveals that emotion triggers are largely not considered salient features by emotion prediction models; instead, there is an intricate interplay between various features and the task of emotion detection.
  3. Bonial, Claire ; Bonn, Julia ; Hwang, Jena D (Ed.)
    We explore using LLMs, GPT-4 specifically, to generate draft sentence-level Chinese Uniform Meaning Representations (UMRs) that human annotators can revise to speed up the UMR annotation process. In this study, we use few-shot learning and Think-Aloud prompting to guide GPT-4 to generate sentence-level graphs of UMR. Our experimental results show that compared with annotating UMRs from scratch, using LLMs as a preprocessing step reduces the annotation time by two thirds on average. This indicates that there is great potential for integrating LLMs into the pipeline for complicated semantic annotation tasks. 
  4. Human-conducted rating tasks are resource-intensive and demand significant time and financial commitments. As Large Language Models (LLMs) like GPT emerge and exhibit prowess across various domains, their potential in automating such evaluation tasks becomes evident. In this research, we leveraged four prominent LLMs: GPT-4, GPT-3.5, Vicuna, and PaLM 2, to scrutinize their aptitude in evaluating teacher-authored mathematical explanations. We utilized a detailed rubric that encompassed accuracy, explanation clarity, the correctness of mathematical notation, and the efficacy of problem-solving strategies. During our investigation, we unexpectedly discerned the influence of HTML formatting on these evaluations. Notably, GPT-4 consistently favored explanations formatted with HTML, whereas the other models displayed mixed inclinations. When gauging Inter-Rater Reliability (IRR) among these models, only Vicuna and PaLM 2 demonstrated high IRR using the conventional Cohen’s Kappa metric for explanations formatted with HTML. Intriguingly, when a more relaxed version of the metric was applied, all model pairings showcased robust agreement. These revelations not only underscore the potential of LLMs in providing feedback on student-generated content but also illuminate new avenues, such as reinforcement learning, which can harness the consistent feedback from these models. 