Title: Human-Guided Generative AI for Proactive STEM Student Support
Student retention in undergraduate STEM programs is frequently undermined not by cognitive deficits, but by non-cognitive factors such as declining confidence, performance-related worry, and negative self-evaluation. Traditional early-warning systems, which rely on lagging indicators such as grades, often fail to intervene before disengagement solidifies. This work-in-progress paper presents a human-guided generative AI intervention framework that leverages large language models (LLMs) in two components combined with human expertise: (1) an LLM-based forecasting model that predicts weekly student engagement states from longitudinal self-report data, and (2) a personalization model (GPT-4o) that contextualizes human-expert-crafted intervention messages. A counseling psychology expert designs messages based on the forecasted engagement states, and GPT-4o personalizes these messages while preserving the expert's core content. We deploy this system with 41 engineering students and compare outcomes against a historical baseline cohort (N=96). The intervention group maintains significantly higher Performance Self-Evaluation throughout the semester (75.4% vs. 56.5% positive responses), with a large effect size (Cohen's d = 0.90, 95% CI [11.72, 26.14], t(91) = 5.22, p < 0.001). These preliminary findings suggest that human-guided generative AI systems hold promise for scaling personalized support in large-enrollment STEM courses.
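The effect-size statistics reported above (Cohen's d, a 95% CI on the group mean difference, a t value) follow the standard pooled-variance formulas for two independent groups. A minimal sketch of those formulas, using made-up toy numbers rather than the study's data:

```python
import math

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Pooled-SD Cohen's d for two independent groups."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

def mean_diff_ci(mean1, mean2, sd1, sd2, n1, n2, t_crit):
    """CI for the raw mean difference, given the critical t value
    for the desired confidence level and n1 + n2 - 2 df."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff - t_crit * se, diff + t_crit * se
```

With equal group SDs the pooled variance reduces to that common variance, which makes the functions easy to sanity-check by hand.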
Award ID(s):
2142558
PAR ID:
10666445
Publisher / Repository:
American Society For Engineering Education
Date Published:
Format(s):
Medium: X
Location:
Charlotte, North Carolina
Sponsoring Org:
National Science Foundation
More Like this
  2. Artificial Intelligence (AI) is a transformative force in communication and messaging strategy, with the potential to disrupt traditional approaches. Large language models (LLMs), a form of AI, are capable of generating high-quality, humanlike text. We investigate the persuasive quality of AI-generated messages to understand how AI could impact public health messaging. Specifically, through a series of studies designed to characterize and evaluate generative AI in developing public health messages, we analyze COVID-19 pro-vaccination messages generated by GPT-3, a state-of-the-art instantiation of a large language model. Study 1 is a systematic evaluation of GPT-3's ability to generate pro-vaccination messages. Study 2 then examined people's perceptions of curated GPT-3-generated messages compared to human-authored messages released by the CDC (Centers for Disease Control and Prevention), finding that GPT-3 messages were perceived as more effective and as stronger arguments, and evoked more positive attitudes, than CDC messages. Finally, Study 3 assessed the role of source labels on perceived quality, finding that while participants preferred AI-generated messages, they disfavored messages that were labeled as AI-generated. The results suggest that, with human supervision, AI can be used to create effective public health messages, but that individuals prefer their public health messages to come from human institutions rather than AI sources. We propose best practices for assessing generative outputs of large language models in future social science research, and ways health professionals can use AI systems to augment public health messaging.
  3. As large language models (LLMs) show great promise in generating a wide spectrum of educational materials, robust yet cost-effective assessment of the quality and effectiveness of such materials becomes an important challenge. Traditional approaches, including expert-based quality assessment and student-centered evaluation, are resource-consuming and do not scale efficiently. In this work, we explored the use of pre-existing student learning data as a promising approach to evaluating LLM-generated learning materials. Specifically, we used a dataset in which students completed program construction challenges by picking the correct answers from among human-authored distractors, and used it to evaluate the quality of LLM-generated distractors for the same challenges. The dataset included responses from 1,071 students across 22 classes taught from Fall 2017 to Spring 2023. We evaluated five prominent LLMs (OpenAI-o1, GPT-4, GPT-4o, GPT-4o-mini, and Llama-3.1-8b) across three different prompts to see which combinations result in more effective distractors, i.e., those that are plausible (often picked by students) and potentially based on common misconceptions. Our results suggest that GPT-4o was the most effective model, matching close to 50% of the functional distractors originally authored by humans. At the same time, all of the evaluated LLMs generated many novel distractors, i.e., ones that did not match the pre-existing human-authored distractors. Our preliminary analysis suggests that these novel distractors are promising; establishing their effectiveness in real-world classroom settings is left for future work.
  4. The burgeoning sophistication of Artificial Intelligence (AI) has catalyzed the rapid proliferation of Large Language Models (LLMs) within software development. These models are increasingly employed to automate the generation of functionally correct code, address complex computational problems, and facilitate the debugging of existing software systems. However, LLM-generated code often suffers from inherent inefficiencies, including redundant logical structures, factually inconsistent content (hallucinations), and programming errors. To address these issues, our research rigorously evaluated the computational efficiency of Python code generated by three prominent LLMs: GPT-4o-Mini, GPT-3.5-Turbo, and GPT-4-Turbo. The evaluation metrics encompass execution time, memory utilization, and peak memory consumption, while maintaining the functional correctness of the generated code. Leveraging the EffiBench benchmark datasets within the Google Vertex AI Workbench environment, across a spectrum of machine configurations, the study implemented a consistent seed parameter to ensure experimental reproducibility. Furthermore, we investigated the impact of two distinct optimization strategies: Chain-of-Thought (CoT) prompting and model fine-tuning. Our findings reveal a significant enhancement in efficiency metrics for GPT-4o-Mini and GPT-3.5-Turbo when employing CoT prompting; however, this trend was not observed for GPT-4-Turbo. Based on its promising performance with CoT prompting, we selected the GPT-4o-Mini model for subsequent fine-tuning, aiming to further enhance both its computational efficiency and accuracy. However, contrary to our expectations, fine-tuning the GPT-4o-Mini model led to a discernible degradation in both accuracy and computational efficiency.
In conclusion, this study provides empirical evidence that deploying high-CPU machine configurations, together with the GPT-4o-Mini model and CoT prompting techniques, yields demonstrably more efficient and accurate LLM-generated Python code, particularly in computationally intensive application scenarios.
  5. Although Large Language Models (LLMs) succeed in human-guided conversations such as instruction following and question answering, the potential of LLM-guided conversations, in which LLMs direct the discourse and steer the conversation's objectives, remains under-explored. In this study, we first characterize LLM-guided conversation by three fundamental components: (i) Goal Navigation, (ii) Context Management, and (iii) Empathetic Engagement, and propose GuideLLM as an instantiation. We then implement an interviewing environment for the evaluation of LLM-guided conversation. Specifically, various topics are involved in this environment for comprehensive interviewing evaluation, resulting in around 1.4k turns of utterances, 184k tokens, and over 200 events mentioned during the interviewing for each chatbot evaluation. We compare GuideLLM with 6 state-of-the-art LLMs, such as GPT-4o and Llama-3-70b-Instruct, in terms of interviewing quality and autobiography generation quality. For automatic evaluation, we derive user proxies from multiple autobiographies and employ LLM-as-a-judge to score LLM behaviors. We further conduct a human-involved experiment with 45 participants who chat with GuideLLM and the baselines, collecting human feedback, preferences, and ratings regarding the quality of the conversations and autobiographies. Experimental results indicate that GuideLLM significantly outperforms baseline LLMs in automatic evaluation and achieves consistently leading performance in human ratings.
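The distractor evaluation described in item 3 above hinges on matching LLM-generated distractors against pre-existing human-authored ones. A minimal sketch of one plausible matching scheme (normalized exact match; the study's actual matching criterion may differ):

```python
def normalize(answer: str) -> str:
    """Crude normalization: lowercase and collapse all whitespace."""
    return " ".join(answer.lower().split())

def match_rate(llm_distractors, human_distractors):
    """Fraction of LLM-generated distractors that match some
    human-authored distractor after normalization."""
    human = {normalize(d) for d in human_distractors}
    matched = sum(1 for d in llm_distractors if normalize(d) in human)
    return matched / len(llm_distractors) if llm_distractors else 0.0
```

Distractors that fail to match would be the "novel" ones the abstract mentions, and would need student response data or expert review to assess.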
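Item 4 above evaluates generated code on execution time, memory utilization, and peak memory consumption. EffiBench's exact harness is not shown here; a generic sketch of measuring those metrics for a candidate function with Python's standard library:

```python
import time
import tracemalloc

def profile(fn, *args):
    """Run fn(*args), returning (result, wall-clock seconds,
    peak Python heap bytes allocated during the call)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()  # (current, peak)
    tracemalloc.stop()
    return result, elapsed, peak

# Example: profile one candidate implementation.
result, elapsed, peak = profile(lambda n: sum(i * i for i in range(n)), 100_000)
```

Note that tracemalloc tracks only Python-level allocations; process-wide memory utilization would need an OS-level probe.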
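The LLM-as-a-judge evaluation in item 5 above typically works by prompting a strong LLM to score a transcript against a rubric. A minimal sketch of assembling such a judge prompt; the three criteria come from the abstract, but the prompt wording and 1-5 scale are illustrative assumptions, not the paper's protocol:

```python
# Criteria named in the abstract; scale and wording are assumptions.
RUBRIC = ["goal navigation", "context management", "empathetic engagement"]

def build_judge_prompt(transcript: str, rubric=RUBRIC) -> str:
    """Assemble an LLM-as-a-judge prompt requesting a 1-5 score
    per rubric criterion for an interview transcript."""
    criteria = "\n".join(f"- {c}" for c in rubric)
    return (
        "You are evaluating an interview conducted by a chatbot.\n"
        "Score the transcript from 1 (poor) to 5 (excellent) "
        "on each criterion:\n"
        f"{criteria}\n\nTranscript:\n{transcript}\n\n"
        "Return one line per criterion in the form <criterion>: <score>."
    )
```

The resulting string would be sent to the judge model, and the per-criterion scores parsed from its reply.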