A key challenge in e-learning environments like Intelligent Tutoring Systems (ITSs) is to induce effective pedagogical policies efficiently. While Deep Reinforcement Learning (DRL) often suffers from sample inefficiency and the difficulty of designing a reward function, Apprenticeship Learning (AL) algorithms can overcome both. However, most AL algorithms cannot handle heterogeneity, as they assume all demonstrations are generated by a homogeneous policy driven by a single reward function. The AL algorithms that do consider heterogeneity often cannot generalize to large continuous state spaces and only work with discrete states. In this paper, we propose expectation-maximization (EM)-EDM, a general AL framework to induce effective pedagogical policies from given optimal or near-optimal demonstrations, which are assumed to be driven by heterogeneous reward functions. We compare the effectiveness of the policies induced by EM-EDM against four AL-based baselines and two policies induced by DRL on two different but related tasks that involve pedagogical action prediction. Our overall results show that, for both tasks, EM-EDM outperforms the four AL baselines across all performance metrics, as well as the two DRL baselines. This suggests that EM-EDM can effectively model complex student pedagogical decision-making processes, because it can manage a large, continuous state space and handle diverse, heterogeneous reward functions with very few given demonstrations.
“Improving Learning & Reducing Time: A Constrained Action-Based Reinforcement Learning Approach.”
Constrained action-based decision-making is one of the most challenging decision-making problems. It refers to a scenario where an agent acts in an environment not only to maximize the expected cumulative reward but also to satisfy certain action-based constraints; for example, an upper limit on the total number of times certain actions are carried out. In this work, we construct a general data-driven framework called Constrained Action-based Partially Observable Markov Decision Process (CAPOMDP) to induce effective pedagogical policies. Specifically, we induce two types of policies: CAPOMDP-LG, which uses learning gain as the reward with the goal of improving students’ learning performance, and CAPOMDP-Time, which uses time as the reward with the goal of reducing students’ time on task. The effectiveness of CAPOMDP-LG is compared against a random yet reasonable policy, and the effectiveness of CAPOMDP-Time is compared against both a Deep Reinforcement Learning induced policy and a random policy. Empirical results show that there is an Aptitude Treatment Interaction effect. Students are split into High vs. Low groups based on their incoming competence. While no significant difference is found among the High incoming competence groups, in the Low groups, students following CAPOMDP-Time spent significantly less time than those using the two baseline policies, and students following CAPOMDP-LG significantly outperformed their peers on both learning gain and learning efficiency.
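To make the notion of an action-based constraint concrete, the following is a minimal sketch of greedy action selection under an upper limit on how often one action may be used. It is not the CAPOMDP framework itself: the function name `constrained_greedy`, the per-step value dictionaries, and the action labels are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def constrained_greedy(
    step_values: List[Dict[str, float]],
    limited_action: str,
    max_uses: int,
) -> List[str]:
    """Pick the highest-valued action at each step while never using
    `limited_action` more than `max_uses` times in total.

    `step_values` holds one {action: estimated value} mapping per step.
    All names here are illustrative assumptions, not taken from the paper.
    """
    used = 0
    chosen: List[str] = []
    for values in step_values:
        # Mask out the limited action once its budget is exhausted.
        allowed = {
            action: value
            for action, value in values.items()
            if action != limited_action or used < max_uses
        }
        # Note: raises ValueError if no action remains available.
        best = max(allowed, key=allowed.get)
        if best == limited_action:
            used += 1
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    steps = [{"a": 1.0, "b": 0.5}] * 3
    print(constrained_greedy(steps, limited_action="a", max_uses=2))
```

With a budget of two, the sketch picks the higher-valued action twice and then falls back to the remaining action, which is the simplest form of the upper-limit constraint described above.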
- Award ID(s):
- 1726550
- PAR ID:
- 10065908
- Date Published:
- Journal Name:
- In Proceedings of 26th ACM International Conference on User Modeling, Adaptation, and Personalization Conference (ACM UMAP-2018)
- Page Range / eLocation ID:
- 43-51
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
The effectiveness of Intelligent Tutoring Systems (ITSs) often depends upon their pedagogical strategies, the policies used to decide what action to take next in the face of alternatives. We induce policies based on two general Reinforcement Learning (RL) frameworks, POMDP and MDP, given the limited feature space. We conduct an empirical study where the RL-induced policies are compared against a random yet reasonable policy. Results show that when the contents are controlled to be equal, the MDP-based policy can improve students’ learning significantly more than the random baseline, while the POMDP-based policy cannot outperform the latter. A possible reason is that the features selected for the MDP framework may not be the optimal feature space for POMDP.
-
An important goal in the design and development of Intelligent Tutoring Systems (ITSs) is to have a system that adaptively reacts to students’ behavior in the short term and effectively improves their learning performance in the long term. Inducing effective pedagogical strategies that accomplish this goal is an essential challenge. To address this challenge, we explore three aspects of a Markov Decision Process (MDP) framework through four experiments. The three aspects are: 1) reward function, detecting the impact of immediate and delayed reward on the effectiveness of the policies; 2) state representation, exploring ECR-based, correlation-based, and ensemble feature selection approaches for representing the MDP state space; and 3) policy execution, investigating the effectiveness of stochastic and deterministic policy executions on learning. The most important result of this work is that there exists an aptitude-treatment interaction (ATI) effect in our experiments: the policies have significantly different impacts on particular types of students as opposed to the entire population. We refer to the students who are sensitive to the policies as the Responsive group. All our following results are based on the Responsive group. First, we find that an immediate reward can facilitate a more effective induced policy than a delayed reward. Second, the MDP policies induced based on low correlation-based and ensemble feature selection approaches are more effective than a Random yet reasonable policy. Third, no significant improvement was found using stochastic policy execution due to a ceiling effect.
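The distinction between deterministic and stochastic policy execution mentioned in the abstract above can be sketched as follows. This is only one common reading, assumed here for illustration: deterministic execution takes the highest-valued action, and stochastic execution samples an action from a softmax over the values. The abstract does not specify the sampling scheme, and the function names and `temperature` parameter are hypothetical.

```python
import math
import random
from typing import Dict


def softmax_probs(values: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Turn action values into a probability distribution (assumed scheme)."""
    top = max(values.values())  # subtract the max for numerical stability
    exps = {a: math.exp((v - top) / temperature) for a, v in values.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def execute_deterministic(values: Dict[str, float]) -> str:
    """Always carry out the action with the highest estimated value."""
    return max(values, key=values.get)


def execute_stochastic(
    values: Dict[str, float], rng: random.Random, temperature: float = 1.0
) -> str:
    """Sample an action in proportion to its softmax probability."""
    probs = softmax_probs(values, temperature)
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


if __name__ == "__main__":
    q = {"action_1": 0.2, "action_2": 0.9}  # hypothetical action values
    rng = random.Random(0)
    print(execute_deterministic(q))
    print([execute_stochastic(q, rng) for _ in range(5)])
```

Under this reading, the deterministic executor returns the same action every time for a given state, whereas the stochastic executor still occasionally selects lower-valued actions.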

