Title: Comparing Approaches to Language Understanding for Human-Robot Dialogue: An Error Taxonomy and Analysis
In this paper, we compare two different approaches to language understanding for a human-robot interaction domain in which a human commander gives navigation instructions to a robot. We contrast a relevance-based classifier with a GPT-2 model, using about 2,000 input-output examples as training data. With this level of training data, the relevance-based model outperforms the GPT-2-based model 79% to 8%. We also present a taxonomy of the types of errors made by each model, indicating that they have somewhat different strengths and weaknesses, so we also examine the potential for a combined model.
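The abstract contrasts a retrieval-style relevance classifier with a generative GPT-2 model. Below is a minimal sketch of what a relevance-style baseline can look like, assuming TF-IDF features and nearest-neighbour retrieval over instruction-action pairs; the data, features, and action syntax are illustrative stand-ins, not the paper's implementation.

```python
# Hypothetical sketch: a retrieval-style "relevance" classifier that maps a
# commander's instruction to the closest training example and returns that
# example's robot action. Data and action syntax are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the ~2,000 instruction -> action training pairs.
train_inputs = [
    "move forward three feet",
    "turn left ninety degrees",
    "take a picture of the doorway",
]
train_outputs = [
    "MOVE(distance=3ft)",
    "ROTATE(angle=-90)",
    "PHOTO(target=doorway)",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
train_matrix = vectorizer.fit_transform(train_inputs)

def predict(instruction: str) -> str:
    """Return the output paired with the most similar training input."""
    sims = cosine_similarity(vectorizer.transform([instruction]), train_matrix)
    return train_outputs[sims.argmax()]

print(predict("please turn to your left"))  # -> ROTATE(angle=-90)
```

A generative alternative would instead fine-tune GPT-2 to emit the action string directly; with only a few thousand examples, the abstract reports the retrieval approach winning by a wide margin.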
Award ID(s): 1925576
PAR ID: 10496421
Author(s) / Creator(s):
Publisher / Repository: ELRA
Date Published:
Journal Name: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Page Range / eLocation ID: 5813-5820
Format(s): Medium: X
Location: https://aclanthology.org/2022.lrec-1.625/
Sponsoring Org: National Science Foundation
More Like This
1. This paper compares different methods of using a large language model (GPT-3.5) to create synthetic training data for a retrieval-based conversational character. The training data take the form of linked questions and answers, which allow a classifier to retrieve a pre-recorded answer to an unseen question; the intuition is that a large language model could predict what human users might ask, saving the effort of collecting real user questions as training data. Results show small improvements in test performance for all synthetic datasets. However, a classifier trained on only a small amount of collected user data achieved a higher F-score than classifiers trained on much larger amounts of synthetic data generated with GPT-3.5. Based on these results, we see potential in using large language models to generate training data, but at this point it is not as valuable as collecting actual user data for training.
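A minimal sketch of the synthetic-data idea described above: prompt GPT-3.5 to predict questions users might ask for a given pre-recorded answer, then link each generated question to that answer as a training pair. The prompt wording, model name, and data format are assumptions, not the authors' pipeline.

```python
# Hypothetical sketch: using GPT-3.5 to generate synthetic user questions
# linked to an existing pre-recorded answer, producing training pairs for a
# retrieval classifier. Prompt wording and data format are assumptions.
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

client = OpenAI()

def synthetic_questions(answer: str, n: int = 5) -> list[str]:
    """Ask the model to predict questions a user might ask for this answer."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} different questions a visitor might ask a "
                f"conversational character whose answer is:\n{answer}\n"
                "Return one question per line."
            ),
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    return [q.strip("-0123456789. ") for q in lines if q.strip()]

# Each synthetic question becomes a (question, answer) training pair.
answer = "I was built in 2021."
pairs = [(q, answer) for q in synthetic_questions(answer)]
```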
2. Training emotion recognition models has relied heavily on human-annotated data, which presents diversity, quality, and cost challenges. In this paper, we explore the potential of Large Language Models (LLMs), specifically GPT-4, in automating or assisting emotion annotation. We compare GPT-4 with supervised models and/or humans in three aspects: agreement with human annotations, alignment with human perception, and impact on model training. We find that common metrics that use aggregated human annotations as ground truth can underestimate GPT-4's performance, and our human evaluation experiment reveals a consistent preference for GPT-4 annotations over human ones across multiple datasets and evaluators. Further, we investigate the impact of using GPT-4 as an annotation filter to improve model training. Together, our findings highlight the great potential of LLMs in emotion annotation tasks and underscore the need for refined evaluation methodologies.
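A minimal sketch of GPT-4-as-annotator along the lines described above: the model labels each utterance with one emotion category, and agreement with human annotations is scored with Cohen's kappa. The label set, prompt, and example data are illustrative assumptions.

```python
# Hypothetical sketch: querying GPT-4 for a categorical emotion label and
# measuring agreement against human annotations with Cohen's kappa.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()
LABELS = ["angry", "happy", "neutral", "sad"]  # illustrative label set

def gpt4_label(utterance: str) -> str:
    """Return a single emotion label for the utterance."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Label the emotion of this utterance as one of "
                       f"{LABELS}. Reply with the label only.\n\n{utterance}",
        }],
    )
    return response.choices[0].message.content.strip().lower()

utterances = ["That is wonderful news!", "Leave me alone."]
human = ["happy", "angry"]  # e.g., majority vote over several annotators
model = [gpt4_label(u) for u in utterances]
print(cohen_kappa_score(human, model, labels=LABELS))
```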
3. This study explores the use of Large Language Models (LLMs), specifically GPT, for different safety management applications in the construction industry. Many studies have explored the integration of GPT in construction safety for various applications; their primary focus has been on the feasibility of such integration, often applying GPT models to specific tasks rather than thoroughly evaluating GPT's limitations and capabilities. In contrast, this study aims to provide a comprehensive assessment of GPT's performance based on established key criteria. Using structured use cases, this study explores GPT's strengths and weaknesses in four construction safety areas: (1) delivering personalized safety training and educational content tailored to individual learner needs; (2) automatically analyzing post-accident reports to identify root causes and suggest preventive measures; (3) generating customized safety guidelines and checklists to support site compliance; and (4) providing real-time assistance for managing daily safety tasks and decision-making on construction sites. LLMs and NLP have already been employed for improvement in each of these four areas, making them suitable areas for further investigation. GPT demonstrated acceptable performance in delivering evidence-based, regulation-aligned responses, making it valuable for scaling personalized training, automating accident analyses, and developing safety protocols. Additionally, it provided real-time safety support through interactive dialogues. However, the model showed limitations in deeper critical analysis, extrapolating information, and adapting to dynamic environments. The study concludes that while GPT holds significant promise for enhancing construction safety, further refinement is necessary. This includes fine-tuning for more relevant safety-specific outcomes, integrating real-time data for contextual awareness, and developing a nuanced understanding of safety risks. These improvements, coupled with human oversight, could make GPT a robust tool for safety management.
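As an illustration of use case (2) above, here is a hedged sketch of prompting a GPT model to extract root causes and preventive measures from a post-accident report. The prompt, JSON schema, and model choice are assumptions, not the study's protocol.

```python
# Hypothetical sketch: asking a GPT model to structure a post-accident report
# into root causes and preventive measures. Schema and model are illustrative.
import json
from openai import OpenAI

client = OpenAI()

def analyze_report(report_text: str) -> dict:
    """Return {'root_causes': [...], 'preventive_measures': [...]}."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice; any JSON-mode model works
        messages=[{
            "role": "user",
            "content": (
                "You are assisting a construction safety manager. From the "
                "accident report below, return JSON with keys 'root_causes' "
                "and 'preventive_measures', each a list of short strings.\n\n"
                + report_text
            ),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

print(analyze_report("Worker slipped on unsecured scaffold plank in rain."))
```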
4. We created GNQA, a generative pre-trained transformer (GPT) knowledge base driven by a performant retrieval-augmented generation (RAG) system with a focus on aging, dementia, Alzheimer's, and diabetes. We uploaded a corpus of three thousand peer-reviewed publications on these topics into the RAG. To address concerns about inaccurate responses and GPT 'hallucinations', we implemented a context provenance tracking mechanism that enables researchers to validate responses against the original material and to get references to the original papers. To assess the effectiveness of contextual information, we collected evaluations and feedback from both domain-expert users and 'citizen scientists' on the relevance of GPT responses. A key innovation of our study is automated evaluation by way of a RAG assessment system (RAGAS). RAGAS combines human expert assessment with AI-driven evaluation to measure the effectiveness of RAG systems. When evaluating the responses to their questions, human respondents give a "thumbs-up" 76% of the time. Meanwhile, RAGAS scores 90% on answer relevance for questions posed by experts, and when GPT generates questions, RAGAS scores 74% on answer relevance. With RAGAS we created a benchmark that can be used to continuously assess the performance of our knowledge base. Full GNQA functionality is embedded in the free GeneNetwork.org web service, an open-source system containing over 25 years of experimental data on model organisms and humans. The code developed for this study is published under a free and open-source software license at https://git.genenetwork.org/gn-ai/tree/README.md.
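A minimal sketch of the provenance idea described above: each retrieved passage carries a reference to its source publication, so generated answers can be checked against the original papers. The embedding model, corpus entries, and references are illustrative stand-ins, not the GNQA implementation.

```python
# Hypothetical sketch: RAG retrieval with context provenance, where every
# retrieved passage keeps a pointer to its source paper. Corpus entries and
# references are invented placeholders for illustration only.
from sentence_transformers import SentenceTransformer, util

corpus = [
    {"ref": "[hypothetical paper A]",
     "text": "APOE4 is a major genetic risk factor for Alzheimer's disease."},
    {"ref": "[hypothetical paper B]",
     "text": "Caloric restriction extends lifespan in several model organisms."},
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([d["text"] for d in corpus], convert_to_tensor=True)

def retrieve(question: str, k: int = 1):
    """Return top-k passages together with their provenance references."""
    q = model.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q, embeddings, top_k=k)[0]
    return [(corpus[h["corpus_id"]]["text"], corpus[h["corpus_id"]]["ref"])
            for h in hits]

for text, ref in retrieve("What gene raises Alzheimer's risk?"):
    print(f"{text}  [source: {ref}]")
```

The retrieved passages and their references would then be fed to the generator as context, so the final answer can cite the original material.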
5. This study explores the potential of the large language model GPT-4 as an automated tool for qualitative data analysis by educational researchers, exploring which techniques are most successful for different types of constructs. Specifically, we assess three different prompt engineering strategies (Zero-shot, Few-shot, and Few-shot with contextual information) as well as the use of embeddings. We do so in the context of qualitatively coding three distinct educational datasets: Algebra I semi-personalized tutoring session transcripts, student observations in a game-based learning environment, and debugging behaviours in an introductory programming course. We evaluated the performance of each approach based on its inter-rater agreement with human coders and explored how different methods vary in effectiveness depending on a construct's degree of clarity, concreteness, objectivity, granularity, and specificity. Our findings suggest that while GPT-4 can code a broad range of constructs, no single method consistently outperforms the others, and the selection of a particular method should be tailored to the specific properties of the construct and context being analyzed. We also found that GPT-4 has the most difficulty with the same constructs that human coders find difficult to reach inter-rater reliability on.
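A minimal sketch contrasting two of the prompting strategies above (Zero-shot and Few-shot) for applying a qualitative code to transcript lines, with agreement against a human coder measured by Cohen's kappa. The construct, examples, and prompt wording are illustrative assumptions, not the study's materials.

```python
# Hypothetical sketch: Zero-shot vs. Few-shot prompting for qualitative
# coding, scored against a human coder with Cohen's kappa. All data invented.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()
CONSTRUCT = "off-task behaviour"  # illustrative construct being coded

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower()

def zero_shot(line: str) -> str:
    """Code a line with no examples in the prompt."""
    return ask(f"Does this tutoring transcript line show {CONSTRUCT}? "
               f"Answer 'yes' or 'no' only.\n\n{line}")

def few_shot(line: str, examples) -> str:
    """Code a line with a handful of coded examples in the prompt."""
    shots = "\n".join(f"Line: {t}\nCode: {c}" for t, c in examples)
    return ask(f"Code each line for {CONSTRUCT} ('yes' or 'no').\n"
               f"{shots}\nLine: {line}\nCode:")

examples = [("Let's try problem 3.", "no"), ("Did you see the game?", "yes")]
lines = ["I think x equals 4.", "What's for lunch?"]
human = ["no", "yes"]  # human coder's labels for the same lines
for strategy in (zero_shot, lambda l: few_shot(l, examples)):
    preds = [strategy(l) for l in lines]
    print(cohen_kappa_score(human, preds, labels=["yes", "no"]))
```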