Title: Complex Conversations: LLM vs. Knowledge Engineering Conversation-based Assessment
This paper explores the differences between two types of natural language conversations between a student and pedagogical agent(s). Both types of conversations were created for formative assessment purposes. The first type is conversation-based assessment created via knowledge engineering, which requires a large amount of human effort. The second type, which is less costly to produce, uses prompt engineering for LLMs based on Evidence-Centered Design to create these conversations and glean evidence about students' knowledge, skills, and abilities. The current work compares linguistic features of the artificial agent(s)' discourse moves in natural language conversations created by the two methodologies. Results indicate that more complex conversations are created by the prompt engineering method, which may be more adaptive than the knowledge engineering approach. However, the affordances of prompt-engineered, LLM-generated conversation-based assessment may create more challenges for scoring than the original knowledge-engineered conversations. Limitations and implications are discussed.
Award ID(s):
2229612
PAR ID:
10591727
Author(s) / Creator(s):
; ; ;
Editor(s):
Benjamin, Paaßen; Carrie, Demmans Epp
Publisher / Repository:
International Educational Data Mining Society
Date Published:
Journal Name:
Journal of educational data mining
ISSN:
2157-2100
Page Range / eLocation ID:
868–871
Format(s):
Medium: X
Right(s):
Creative Commons Attribution 4.0 International
Sponsoring Org:
National Science Foundation
More Like this
  1. The current paper discusses conversation-based assessments (CBAs) created with prompt engineering for LLMs based on Evidence-Centered Design (ECD). Conversation-based assessments give students the opportunity to discuss a given topic with artificial agent(s). These conversations elicit evidence of students' knowledge, skills, and abilities that may not be uncovered by traditional tests. We discuss our previous method of creating such conversations with regular expressions and latent semantic analysis, an expensive methodology requiring substantial time and varied expertise. In this novel work, we therefore created a prompt-engineered version of CBAs based on Evidence-Centered Design that stays on the domain topic throughout the conversation and provides evidence of student knowledge at lower cost. We present the methodology for creating these prompts, compare responses to various student speech acts between the previous version and the prompt-engineered version, and discuss the evidence gleaned from the conversation and based on the prompt. Finally, limitations, conclusions, and implications of this work are discussed.
  2. Abstract: In an information-seeking conversation, a user may ask questions that are under-specified or unanswerable. An ideal agent would interact by initiating different response types according to the available knowledge sources. However, most current studies either fail to incorporate such agent-side initiative or incorporate it artificially. This work presents InSCIt, a dataset for Information-Seeking Conversations with mixed-initiative Interactions. It contains 4.7K user-agent turns from 805 human-human conversations in which the agent searches over Wikipedia and either directly answers, asks for clarification, or provides relevant information to address user queries. The data supports two subtasks, evidence passage identification and response generation, as well as a human evaluation protocol to assess model performance. We report results of two systems based on state-of-the-art models of conversational knowledge identification and open-domain question answering. Both systems significantly underperform humans, suggesting ample room for improvement in future studies.
  3. Simulations are widely used to teach science in grade schools. These simulations are often augmented with a conversational artificial intelligence (AI) agent to provide real-time scaffolding support for students conducting experiments using the simulations. AI agents are highly tailored for each simulation, with a predesigned set of Instructional Goals (IGs). This makes it difficult for teachers to adjust IGs, as the agent may no longer align with the revised IGs. Additionally, teachers are hesitant to adopt new third-party simulations for the same reasons. In this research, we introduce SimPal, a Large Language Model (LLM) based meta-conversational agent, to solve this misalignment between a pre-trained conversational AI agent and the constantly evolving pedagogy of instructors. Through natural conversation with SimPal, teachers first explain their desired IGs, based on which SimPal identifies a set of relevant physical variables and their relationships to create symbolic representations of the desired IGs. These symbolic representations can then be leveraged to design prompts for the original AI agent that yield better alignment with the desired IGs. We empirically evaluated SimPal using two LLMs, ChatGPT-3.5 and PaLM 2, on 63 physics simulations from PhET and Golabz. Additionally, we examined the impact of different prompting techniques on LLM performance by utilizing the TELeR taxonomy to identify relevant physical variables for the IGs. Our findings showed that SimPal can perform this task with a high degree of accuracy when provided with a well-defined prompt.
  4. Yvette Graham, Matthew Purver (Ed.)
    Understanding the dynamics of counseling conversations is an important task, yet it remains a challenging NLP problem despite recent advances in Transformer-based pre-trained language models. This paper proposes a systematic approach to examine the efficacy of domain knowledge and large language models (LLMs) in better representing conversations between a crisis counselor and a help seeker. We empirically show that state-of-the-art language models such as Transformer-based models and GPT models fail to predict the conversation outcome. To provide richer context to conversations, we incorporate human-annotated domain knowledge and LLM-generated features; simple integration of domain knowledge and LLM features improves model performance by approximately 15%. We argue that both domain knowledge and LLM-generated features can be exploited to better characterize counseling conversations when they are used as additional context.
  5. It has long been observed that humans in interaction tend to adapt their behavior to become more similar to their interlocutors. Yet the reasons why such entrainment arises in some conversations and not others remain poorly understood. Early work suggests an influence of different personality traits on the degree of entrainment speakers engage in. However, some of these results have never been replicated to test their generalizability. Moreover, a recent finding calls into question whether these and other effects are strong enough to create differences between speakers that persist across multiple conversations. We investigate a variety of personality traits for their influence on a local form of acoustic-prosodic entrainment in two kinds of spontaneous conversation. To our knowledge, this is the first attempt to detect such effects across several interactions per subject, and the first attempt to replicate some influential early work in a more natural context. We find virtually no impact of personality, suggesting that prior results may not generalize to more natural contexts or to entrainment on other linguistic variables.