

Search for: All records

Editors contains: "Liu, Z"


  1. Olney, AM ; Chounta, IA ; Liu, Z ; Santos, OC ; Bittencourt, II (Ed.)
    This work investigates how tutoring discourse interacts with students’ proximal knowledge to explain and predict students’ learning outcomes. Our work is conducted in the context of high-dosage human tutoring where 9th-grade students (N = 1080) attended small group tutorials and individually practiced problems on an Intelligent Tutoring System (ITS). We analyzed whether tutors’ talk moves and students’ performance on the ITS predicted scores on math learning assessments. We trained Random Forest Classifiers (RFCs) to distinguish high and low assessment scores based on tutor talk moves, students’ ITS performance metrics, and their combination. A decision tree was extracted from each RFC to yield an interpretable model. We found AUCs of 0.63 for talk moves, 0.66 for ITS, and 0.77 for their combination, suggesting interactivity between the two feature sources. Specifically, the best decision tree emerged from combining the tutor talk moves that encouraged rigorous thinking and students’ ITS mastery. In essence, tutor talk that encouraged mathematical reasoning predicted achievement for students who demonstrated high mastery on the ITS, whereas tutors’ revoicing of students’ mathematical ideas and contributions was predictive for students with low ITS mastery. Implications for practice are discussed.
    Free, publicly-accessible full text available July 8, 2025
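    A minimal sketch of the modeling pipeline described in the entry above, on made-up data (all feature names and values below are hypothetical, not the study's data): train a Random Forest Classifier on combined talk-move and ITS features, report AUC, and fit a shallow decision tree as an interpretable model. The abstract does not say how the decision tree was extracted from each RFC; fitting a surrogate tree to the forest's predictions is one common approach and is used here purely for illustration.

```python
# Hypothetical sketch: Random Forest on combined talk-move + ITS features,
# AUC evaluation, and a shallow surrogate decision tree for interpretability.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1080  # cohort size reported in the abstract
# Assumed features: counts of tutor talk moves plus ITS performance metrics.
X = rng.normal(size=(n, 6))
y = (X[:, 0] + X[:, 3] + rng.normal(scale=1.0, size=n) > 0).astype(int)  # high vs. low assessment score

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rfc = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, rfc.predict_proba(X_te)[:, 1]))

# Interpretable model: a shallow tree fit to the forest's predictions.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_tr, rfc.predict(X_tr))
print(export_text(surrogate, feature_names=[
    "revoicing", "pressing_for_accuracy", "encouraging_reasoning",  # illustrative talk-move features
    "its_mastery", "its_hints_used", "its_problems_completed"]))     # illustrative ITS features
```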
  2. Olney, AM ; Chounta, IA ; Liu, Z ; Santos, OC ; Bittencourt, II (Ed.)
    This work investigates how tutoring discourse interacts with students’ proximal knowledge to explain and predict students’ learning outcomes. Our work is conducted in the context of high-dosage human tutoring where 9th-grade students attended small group tutorials and individually practiced problems on an Intelligent Tutoring System (ITS). We analyzed whether tutors’ talk moves and students’ performance on the ITS predicted scores on math learning assessments. We trained Random Forest Classifiers (RFCs) to distinguish high and low assessment scores based on tutor talk moves, students’ ITS performance metrics, and their combination. A decision tree was extracted from each RFC to yield an interpretable model. We found AUCs of 0.63 for talk moves, 0.66 for ITS, and 0.77 for their combination, suggesting interactivity between the two feature sources. Specifically, the best decision tree emerged from combining the tutor talk moves that encouraged rigorous thinking and students’ ITS mastery. In essence, tutor talk that encouraged mathematical reasoning predicted achievement for students who demonstrated high mastery on the ITS, whereas tutors’ revoicing of students’ mathematical ideas and contributions was predictive for students with low ITS mastery. Implications for practice are discussed.
    Free, publicly-accessible full text available July 2, 2025
  3. Olney, AM ; Chounta, IA ; Liu, Z ; Santos, OC ; Bittencourt, II (Ed.)
    An advantage of Large Language Models (LLMs) is their contextualization capability: they can provide different responses based on student inputs, such as solution strategy or prior discussion, potentially engaging students better than standard feedback. We present a design and evaluation of a proof-of-concept LLM application that offers students dynamic and contextualized feedback. Specifically, we augment an Online Programming Exercise bot for a college-level Cloud Computing course with ChatGPT, which offers students contextualized reflection triggers during a collaborative query optimization task in database design. We demonstrate that LLMs can be used to generate highly situated reflection triggers that incorporate details of the collaborative discussion happening in context. We discuss in depth the exploration of the design space of the triggers and their correspondence with the learning objectives, as well as the impact on student learning in a pilot study with 34 students.
    Free, publicly-accessible full text available July 2, 2025
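    A minimal sketch of how contextualized reflection triggers like those in the entry above could be generated, assuming the OpenAI chat completions API; the model name, prompt wording, helper function, and example transcript are illustrative assumptions, not the authors' Online Programming Exercise bot integration.

```python
# Hypothetical sketch: generate a contextualized reflection trigger from the
# ongoing group discussion and the learning objective. Model name and prompt
# wording are assumptions, not the authors' actual configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def reflection_trigger(discussion: str, objective: str) -> str:
    """Ask the model for one short reflection question grounded in the discussion."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "You are a tutor for a college-level cloud computing course. "
                        "Write one short reflection question that references the "
                        "students' own discussion and the stated learning objective."},
            {"role": "user",
             "content": f"Learning objective: {objective}\n\nDiscussion so far:\n{discussion}"},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Example with a fabricated discussion snippet:
print(reflection_trigger(
    "A: I'd add an index on user_id. B: But the query filters on created_at too.",
    "Choose indexes that match the query's filter and join columns.",
))
```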
  4. Olney, AM ; Chounta, IA ; Liu, Z ; Santos, OC ; Bittencourt, II (Ed.)
    Middle school students learned about astronomy and STEM concepts while exploring Minecraft simulations of hypothetical Earths and exoplanets. Small groups (n = 24) were tasked with building feasible habitats on Mars. In this paper, we present a scoring scheme for habitat assessment that was used to build novel multi/mixed-input AI models. Using Spearman’s rank correlations, we found that our scoring scheme was reliable with regard to team size and face-to-face instruction time and was validated with self-explanation scores. We took an exploratory approach to analyzing image and block data to compare seven different input conditions. Using one-way ANOVAs, we found that the means of the conditions were not equal for the accuracy, precision, recall, and F1 metrics. A post hoc Tukey HSD test found that models built using images only were statistically significantly worse on these metrics than conditions that used block data. We also report the results of optimized models using block-only data on additional Mars bases (n = 57).
    Free, publicly-accessible full text available July 2, 2025
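    A short sketch of the statistical checks named in the entry above (Spearman's rank correlation, one-way ANOVA, and a post hoc Tukey HSD test), run on fabricated placeholder data; the variables, condition labels, and numbers are assumptions for illustration only, not the study's data.

```python
# Hypothetical sketch of the reported analyses on fabricated placeholder data.
import numpy as np
from scipy.stats import spearmanr, f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)

# Spearman's rank correlation, e.g. habitat score vs. face-to-face instruction time.
scores = rng.integers(0, 20, size=24)
instruction_minutes = scores * 3 + rng.integers(0, 10, size=24)
rho, p = spearmanr(scores, instruction_minutes)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")

# One-way ANOVA comparing F1 across input conditions (placeholder conditions).
f1_images = rng.normal(0.55, 0.05, size=10)
f1_blocks = rng.normal(0.70, 0.05, size=10)
f1_both = rng.normal(0.72, 0.05, size=10)
F, p = f_oneway(f1_images, f1_blocks, f1_both)
print(f"ANOVA F={F:.2f}, p={p:.3f}")

# Post hoc Tukey HSD to see which conditions differ pairwise.
values = np.concatenate([f1_images, f1_blocks, f1_both])
groups = ["images"] * 10 + ["blocks"] * 10 + ["both"] * 10
print(pairwise_tukeyhsd(values, groups))
```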
  5. Moore, S ; Stamper, J ; Cao, T ; Liu, Z ; Hu, X ; Lu, Y ; Liang, J ; Khosravi, H ; Denny, P ; Singh, A (Ed.)
    Multiple choice questions are traditionally expensive to produce. Recent advances in large language models (LLMs) have led to fine-tuned LLMs that generate questions competitive with human-authored questions. However, the relative capabilities of ChatGPT-family models have not yet been established for this task. We present a carefully controlled human evaluation of three conditions: a fine-tuned, augmented version of Macaw, instruction-tuned Bing Chat with zero-shot prompting, and human-authored questions from a college science textbook. Our results indicate that on six of seven measures tested, both LLMs’ performance was not significantly different from human performance. Analysis of LLM errors further suggests that Macaw and Bing Chat have different failure modes for this task: Macaw tends to repeat answer options, whereas Bing Chat tends not to include the specified answer in the answer options. For Macaw, removing error items from the analysis results in performance on par with humans for all metrics; for Bing Chat, removing error items improves performance but does not reach human-level performance.
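    The entry above reports per-measure significance comparisons against human-authored questions but does not name the test used. The sketch below is purely illustrative: it runs a Mann-Whitney U test per measure on fabricated ratings; the measure names, rating scale, and data are placeholders, not the study's instrument or results.

```python
# Hypothetical sketch: per-measure comparison of LLM-generated vs. human-authored
# questions. The statistical test, measures, and ratings are placeholders chosen
# for illustration only.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2)
measures = ["relevance", "clarity", "difficulty", "answerability",
            "distractor_quality", "grammar", "overall"]  # placeholder measure names

for m in measures:
    human = rng.integers(3, 6, size=30)  # placeholder 1-5 ratings
    llm = rng.integers(3, 6, size=30)
    stat, p = mannwhitneyu(human, llm)
    verdict = "no significant difference" if p > 0.05 else "differs"
    print(f"{m}: U={stat:.0f}, p={p:.3f} -> {verdict}")
```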