<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Exploring Induced Pedagogical Strategies Through a Markov Decision Process Framework: Lessons Learned</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2018</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10136458</idno>
					<idno type="doi"></idno>
					<title level='j'>Journal of educational data mining</title>
<idno>2157-2100</idno>
<biblScope unit="volume">10</biblScope>
<biblScope unit="issue">3</biblScope>					

					<author>S. Shen</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[An important goal in the design and development of Intelligent Tutoring Systems (ITSs) is to have a system that adaptively reacts to students’ behavior in the short term and effectively improves their learning performance in the long term. Inducing effective pedagogical strategies that accomplish this goal is an essential challenge. To address this challenge, we explore three aspects of a Markov Decision Process (MDP) framework through four experiments. The three aspects are: 1) reward function, detecting the impact of immediate and delayed reward on effectiveness of the policies; 2) state representation, exploring ECR-based, correlation-based, and ensemble feature selection approaches for representing the MDP state space; and 3) policy execution, investigating the effectiveness of stochastic and deterministic policy executions on learning. The most important result of this work is that there exists an aptitude-treatment interaction (ATI) effect in our experiments: the policies have significantly different impacts on the particular types of students as opposed to the entire population. We refer the students who are sensitive to the policies as the Responsive group. All our following results are based on the Responsive group. First, we find that an immediate reward can facilitate a more effective induced policy than a delayed reward. Second, The MDP policies induced based on low correlation-based and ensemble feature selection approaches are more effective than a Random yet reasonable policy. Third, no significant improvement was found using stochastic policy execution due to a ceiling effect.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>In general, the effectiveness of an intelligent tutoring system (ITS) can greatly depend on the implemented pedagogical strategy, which decides the next action for the tutor to take at each step of a student's learning process among a set of alternatives. Using pedagogical strategies, ITSs are able to adaptively interact with students by taking different actions given various situations in the short term in order to improve their learning performance in the long term. However, inducing pedagogical strategies in an ITS is challenging. On one hand, the relation between a tutor's decisions and eventual outcomes cannot be immediately observed. On the other hand, each decision affects the student's subsequent actions and performance, which also has an impact on the tutor's next decision. Therefore, the effectiveness of the current decision depends upon the effectiveness of subsequent decisions, and this iterative process cannot be easily solved by directly optimizing an objective function. Similar to prior work <ref type="bibr">(Chi et al., 2011;</ref><ref type="bibr">Tetreault et al., 2007)</ref>, we apply a classic type of reinforcement learning (RL) framework, the Markov Decision Process (MDP), to address this challenge. In this work, we report our results from exploring the MDP framework from three aspects including reward function, state representation, and policy execution.</p><p>Reward Function. Real-world RL applications often contain two types of rewards: immediate reward, which is the immediate feedback after taking an action, and delayed reward, which is the reward received later after taking more than one action. The longer rewards are delayed, the harder it becomes to assign credit or blame to particular actions or decisions. Moreover, learning short-term performance boosts may not result in long-term learning gains. Thus, in this work we explore both immediate and delayed rewards in our policy induction, and empirically evaluate the impact of the induced policies on student learning. Our results show that using immediate rewards can be more effective than using delayed rewards.</p><p>State Representation. For RL, as with all machine learning tasks, success depends upon an effective state representation. When a student interacts with an ITS, there are many factors that might determine whether the student learns well from the ITS, yet many other factors are not well understood. To make the RL problem tractable, our approach is to begin with a large set of features to which a series of feature-selection methods are applied to reduce them to a tractable subset. In this work, we apply a series of correlation-based feature selection methods to RL: first we explored the option of selecting the next feature that is the most correlated (High option) to the currently selected feature set and then the option of selecting the least correlated (Low option). In general, the features that are most highly correlated with output labels are selected in supervised learning <ref type="bibr">(Yang and Pedersen, 1997;</ref><ref type="bibr">Lee and Lee, 2006;</ref><ref type="bibr">Chandrashekar and Sahin, 2014;</ref><ref type="bibr">Koprinska et al., 2015)</ref>. Since output labels are not present in reinforcement learning, we use correlation to the current feature set as a best approximation. Section 5.6 shows that the highcorrelation option indeed outperformed two baseline methods: the random baseline and also the best feature selection explored in our previous work <ref type="bibr">(Chi et al., 2011)</ref>. However, for our dataset, the high correlation-selected features tend to be homogeneous. Different from the supervised learning tasks, we hypothesize that it is more important to have heterogeneous features in RL that can grasp different aspects of learning environments. Therefore, we also explore the low correlation-based option for feature selection with a goal to increase the diversity of the selected feature set. To do so, we select the next feature that is the least correlated with the currently selected feature set and extend the feature set. Our results show that the low correlation-based option outperformed not only the high option but also the other two baselines.</p><p>Policy Execution. In most of the prior work with RL in ITSs, deterministic policy execution is used. That is, when evaluating the effectiveness of RL-induced policies, the system would strictly carry out the actions determined by the policies. In this work, we explore stochastic policy execution. We argue that stochastic execution can be more effective than deterministic execution because if the RL induced policy is sub-optimal, under the stochastic policy execution, it would still be possible for the system to carry out the optimal action; whereas if the induced policy is indeed optimal, our approach will make sure that when the decisions are crucial, the stochastic policy execution would behave like deterministic policy execution in that the optimal action will be carried out (see section 6.7 for details). We empirically evaluate the effectiveness of the stochastic policy execution but our results show that there is a ceiling effect.</p><p>Generally speaking, ITSs contain different types of tutorial decisions including what material to teach and how to teach it. In our work, we focus on applying RL to induce policy on the how part. To do so, we controlled the content (the what part) to be the same across all students and we mainly focus on one type of tutorial decision: whether to present a given problem as a problem solving (PS) or a worked example <ref type="bibr">(WE)</ref>. When providing a WE, the tutor presents an expert solution to a problem step-by-step so that the student sees both the answer and the solution procedure. During PS, students are required to complete the problem with tutor support. In a series of empirical experiments described below, we compare the RL-induced policies against a policy where the system randomly decides whether to present the next problem as WE or as PS. Because both PS and WE are always considered to be reasonable educational interventions in our learning context, we refer to such policy as a random yet reasonable policy or random.</p><p>Our results consistently suggest that there is an aptitude-treatment interaction (ATI) effect <ref type="bibr">(Cronbach and Snow, 1977;</ref><ref type="bibr">Snow, 1991)</ref>: certain students are less sensitive to the induced policies in that they achieve a similar learning performance regardless of policies employed, whereas other students are more sensitive in that their learning is highly dependant on the effectiveness of the policies. In the following, we refer to the former group as Unresponsive and latter group as Responsive respectively.</p><p>In short, we extensively explore applying the MDP framework for pedagogical policy induction on WE vs. PS and conduct extensive empirical experiments for investigating the effectiveness of induced RL policies. The effectiveness of the policies is evaluated by using students' in-class exam scores, referred to as transfer post-test score. Overall, our main contributions are summarized as follows:</p><p>&#8226; We found a consistent aptitude-treatment interaction effect across experiments: the performance of the Responsive group is significantly dependent on the implemented policies, whereas the Unresponsive group performs similarly regardless of policies.</p><p>&#8226; We induce RL policies based on immediate and delayed rewards respectively and detect the impact of reward on the effectiveness of policies.</p><p>&#8226; We propose correlation-based and ensemble feature selection approaches for state representation in the MDP framework and then empirically evaluate the RL-induced policies.</p><p>&#8226; We explore executing policies stochastically in contrast to previous research which mainly evaluates the RL-induced policies deterministically.</p><p>The rest of the paper is arranged as follows: Section 2 presents an overview of related work. Section 3 describes the reinforcement learning framework and Markov Decision Process. Section 4 describes the tutorial decisions, Deep Thought tutor, our training data, and state representation. Section 5 describes five correlation metrics and then introduces our proposed feature selection methods. Section 6 presents the overview of our four empirical studies and research questions. Section 7 reports experimental results for each of the four experiments. Section 8 presents our post-hoc comparison results. Finally, we summarize our conclusions, limitations and future work in Section 9.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">RELATED WORK</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">REINFORCEMENT LEARNING APPLICATIONS IN EDUCATION DOMAINS</head><p>Markov Decision Process (MDP; <ref type="bibr">Littman 1994;</ref><ref type="bibr">Sutton and Barto 1998</ref>) is a widely used reinforcement learning framework in educational applications. <ref type="bibr">Beck et al. (2000)</ref> investigated temporal difference learning to induce pedagogical policies that would minimize the time students spend on completing problems in AnimalWatch, an ITS that teaches arithmetic to grade school students. They used simulated students in the training phase of their study and used time as an immediate reward given that student's time can be assessed at each step. In the test phase, the new AnimalWatch with induced pedagogical policy was empirically compared with the original version. They found that the policy group spent significantly less time per problem than their non-policy peers.</p><p>Iglesias and her colleagues applied online Q-learning with time as the immediate reward to generate a policy in RLATES, an intelligent educational system that teaches students database design <ref type="bibr">(Iglesias et al., 2009a;</ref><ref type="bibr">Iglesias et al., 2009b;</ref><ref type="bibr">Iglesias et al., 2003)</ref>. The goal of inducing the policy was to provide students with direct navigation support through the system's content and to help them learn more efficiently. They also used simulated students in the training phase and evaluated the induced policy by comparing the performance of both simulated and real students using RLATES with that of other students using IGNATES, which provided indirect navigation support without RL. Their results showed that students using RLATES spent significantly less time than students using IGNATES, but there was no significant difference in students' level of knowledge evaluated by the exam. <ref type="bibr">Martin and Arroyo (2004)</ref> applied a model-based RL method with delayed reward to induce policies that would increase the efficiency of hint sequencing on Wayang Outpost, a web-based ITS. During the training phase, the authors used a student model to generate training data for inducing the policies. In the test phase, the induced RL policies were tested on a simulated student model and students' performance was evaluated by learning level, a customized score function. The results showed that students following RL policies achieved a significantly better learning level than the non-policy group.</p><p>Additionally, <ref type="bibr">Chi et al. (2011)</ref> applied a model-based RL method with delayed reward to induce pedagogical policies to improve the effectiveness of Cordillera, an intelligent natural language tutoring system that teaches students college physics. They collected an exploratory corpus by training human students on a version of the ITS that made random decisions. Their empirical evaluation showed the induced policies were significantly more effective than the previous policies based on students' normalized learning gain (NLG).</p><p>In short, most prior work on the application of MDP to ITSs has found no significant learning differences between the induced RL policies and baseline random policies. One potential explanation for this is that MDP relies on a small set of pre-defined state representations, which may not fully represent the real interactive learning environments.</p><p>Partially observable Markov Decision Process (POMDP; <ref type="bibr">Jaakkola et al. 1995;</ref><ref type="bibr">Koenig and Simmons 1998)</ref> is another widely used framework in educational domains. Different from the MDP framework where the state space is constructed by a set of observable features, the POMDP framework uses a belief state space to model the unobserved factors, such as students' knowledge level and proficiency. <ref type="bibr">Mandel et al. (2014)</ref> combined a feature compression approach that can handle a large range of state features with POMDP to induce policies for an educational game. The induced policies with the immediate reward outperformed both random and expertdesigned policies in both simulated and empirical evaluations. <ref type="bibr">Rafferty et al. (2016)</ref> applied POMDP to represent students' latent knowledge by combining embedded graphical models for concept learning with interpreted belief states in the domain of alphabet arithmetic. They applied POMDP to induce policies using time as the reward with a goal of reducing the expected time for learners to comprehend concepts. They evaluated policies using simulated and real-world studies and found that the POMDP-based policies significantly outperformed a random policy. <ref type="bibr">Clement et al. (2016)</ref> constructed models to track students' individual mastery of each knowledge component. They combined POMDP with the student models to induce teaching policies using learning gain as the immediate reward. The results of a series of simulated studies showed that the POMDP policies outperformed the learning theory-based policies in terms of students' knowledge levels. Similarly, <ref type="bibr">Whitehill and Movellan (2018)</ref> implemented POMDP to induce a teaching policy with the purpose of minimizing the expected time for foreign language learning in their ITS. The belief state of their POMDP was constructed based on a modified student model which hypothesized that students cannot always fully absorb the examples and so only partially update their belief state. They conducted a real-world study and verified that the POMDP policy performed favorably compared to two hand-crafted teaching policies.</p><p>Deep RL Framework is a subject of growing interest in inducing policies. Deep RL adds deep neural networks to RL frameworks such as POMDP for function approximation or state approximation <ref type="bibr">(Mnih et al., 2013;</ref><ref type="bibr">Mnih et al., 2015)</ref>. This enhancement makes the agent capable of achieving complicated tasks. <ref type="bibr">Wang et al. (2017)</ref> applied a deep RL framework for personalizing interactive narratives in an educational game called CRYSTAL ISLAND. They designed the immediate rewards based on normalized learning gain (NLG) and found that the students with the Deep RL policy achieved a higher NLG score than those following the linear RL model in simulation studies. Furthermore, <ref type="bibr">Narasimhan et al. (2015)</ref> implemented a Deep Q-Network (DQN) approach in text-based strategy games, constructed based on Evennia, which is an opensource library and toolkit for building multi-users online text-based games. In the DQN method, the state is represented by a Long Short-Term Memory network, the Q-value is approximated by a multi-layered neural network, and the immediate reward is designed based on the performance in the game. Using simulations, they found that the deep RL policy significantly outperformed the random policy in terms of quest completion.</p><p>Table <ref type="table">1</ref> summarizes the related work about the application of RL in the educational domain. While both POMDP and Deep RL have been shown to be highly effective in many real-world applications, they generally require a great deal of training data, especially Deep RL. More importantly, it is often hard to interpret the induced POMDP and Deep RL policies. Therefore, in this paper, we mainly focus on exploring the MDP framework, especially the tabular MDP framework. Compared with previous research, we explore three aspects of the MDP framework and evaluate the effectiveness of induced policies using a series of experiments conducted in real classroom settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">APTITUDE-TREATMENT INTERACTION (ATI) EFFECT</head><p>Previous work shows that the ATI effect commonly exists in many real-world studies. More formally, the ATI effect states that instructional treatments are more or less effective to individual Immediate Real Performance learners depending on their abilities <ref type="bibr">(Cronbach and Snow, 1977)</ref>. For example, <ref type="bibr">Kalyuga et al. (2003)</ref> empirically evaluated the effectiveness of worked example (WE) vs. problem solving (PS) on student learning in programmable logic. Their results show that WE is more effective for inexperienced students while PS is more effective for experienced learners. Moreover, D <ref type="bibr">'Mello et al. (2010)</ref> compared two versions of ITSs: one is an affect-sensitive tutor which selects the next problem based on students' affective and cognitive states combined, while the other is an original tutor which selects the next problem based on students' cognitive states alone. An empirical study shows that there is no significant difference between the two tutors for students with high prior knowledge. However, there is a significant difference for students with low prior knowledge: those who trained on the affect-sensitive tutor had significantly higher learning gain than their peers using the original tutor. <ref type="bibr">Chi and VanLehn (2010)</ref> investigated the ATI effect in the domain of probability and physics, and their results showed that high competence students can learn regardless of instructional interventions, while for students with low competence, those who follow the effective instructional interventions learned significantly more than those who did not. Our prior work consistently finds that for pedagogical decisions on WE vs. PS, certain learners are always less sensitive in that their learning is not affected, while others are more sensitive to variations in different policies. For example, <ref type="bibr">Shen and Chi (2016)</ref> trained students in an ITS for logic proofs, then divided students into the Fast and Slow groups based on time, and found that the Slow groups are more sensitive to the pedagogical strategies while the Fast groups are less sensitive.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">PEDAGOGICAL DECISIONS: WORKED EXAMPLES VS. PROBLEM SOLVING</head><p>In this study, we investigate RL-induced pedagogical strategies on one type of tutorial decision: worked examples (WE) vs. problem solving (PS). A great deal of research has investigated the impacts of WE and PS on student learning <ref type="bibr">(McLaren and Isotani, 2011;</ref><ref type="bibr">McLaren et al., 2014;</ref><ref type="bibr">Najar et al., 2014;</ref><ref type="bibr">Salden et al., 2010)</ref>. During PS, students are given a training problem which they must solve independently or with partial assistance, while during WE, students are shown a detailed solution to the problem. <ref type="bibr">McLaren et al. (2008)</ref> compared WE-PS pairs with PS-only, where every student was given the same 10 training problems. Students in the PS-only condition were required to solve every problem while students in the WE-PS condition were given 5 example-problem pairs. Each pair consists of an initial worked example problem followed by tutored problem solving. They found no significant difference in learning performance between the two conditions; however, the WE-PS group spent significantly less time on task than the PS group.</p><p>McLaren and Isotani (2011) found similar results in two subsequent studies, which compared learning gains and time on task for high school chemistry students given 10 identical problems in three conditions: WE, PS, and WE-PS pairs. There were no significant differences among the three groups in terms of learning gains, but the WE group spent significantly less time on task than the other two conditions, and no significant time on task difference was found between the PS and WE-PS conditions. A follow-up 2014 study compared four conditions: WE, tutored PS, untutored PS, and Erroneous Examples (EE) in high school stoichiometry <ref type="bibr">(McLaren et al., 2014)</ref>. Students in the EE condition were given incorrect worked examples containing between 1 and 4 errors and were tasked with correcting them. Again the authors found no significant differences among the conditions in terms of learning gains, and as before the WE students spent significantly less time than the other groups. More specifically, for time on task they found that: WE &lt; EE &lt; untutored PS &lt; tutored PS. WE students took only 30% of the total time of the tutored PS students.</p><p>The advantages of WE were also demonstrated in another study in the domain of electrical circuits <ref type="bibr">(Van Gog et al., 2011)</ref>. In that study, they compared four conditions: WE, WE-PS pairs, PS-WE pairs (problem-solving followed by an example problem), and PS only. Their results showed that the WE and WE-PS students significantly outperformed the other two groups, and no significant difference was found among four conditions in terms of time on task. Additionally, <ref type="bibr">Razzaq and Heffernan (2009)</ref> designed an experiment on comparing worked examples vs. problem solving in an ITS that teaches mathematics. They found that more proficient students benefit more from WE when controlling for time, while less proficient students benefit more from PS.</p><p>Some existing theories of learning suggest that when deciding whether to present PS or WE, a tutor should take into account several factors, including the students' current knowledge model. <ref type="bibr">Vygotsky (1978)</ref> coined the term "zone of proximal development" (ZPD) to describe the space between abilities that students may display independently and abilities that they may display with support. He hypothesized that the most learning occurs when students are assigned tasks within their ZPD. In other words the task should neither be so simple that they can achieve it independently or trivially, nor so difficult that they simply cannot make progress even with assistance. Based upon this theory, we expect that if students are somewhat competent in all the knowledge needed for solving a problem, the tutor should present the problem as a PS, and provide help only if the students fail so that they can practice their knowledge. However, if students are completely unfamiliar with the problem, then the tutor should present the problem as a WE. <ref type="bibr">Brown et al. (1989)</ref> describe a progression from WE to PS following their "model, scaffold &amp; fade" rubric. <ref type="bibr">Koedinger and Aleven (2007)</ref> by contrast defined an "assistance dimension", which includes PSs and WEs. The level of assistance a tutor should provide may be resolved differently for different students and should be adaptive to the learning environment, the domain materials used, the students' knowledge level, their affect state and so on. Typically, these theories are considerably more general than the specific decisions that ITS designers must make, which makes it difficult to tell if a specific pedagogical strategy is consistent with the theory. This is why we hope to derive pedagogical policy for PS/WE decision making directly from empirical data.</p><p>Finally, compared with all previous studies in which the PSs and WEs are generally designed by domain experts or expert-like system developers, in this work both PSs and WEs are constructed through a data-driven approach using previous students log files (more details in section 4). In short, prior research on WE and PS has shown that WE can be as or more effective than PS or alternating PS with WE, and the former can take significantly less time than the latter two <ref type="bibr">(McLaren et al., 2014;</ref><ref type="bibr">Renkl et al., 2002;</ref><ref type="bibr">McLaren and Isotani, 2011;</ref><ref type="bibr">Mostafavi et al., 2015)</ref>. As opposed to previous work, which involves the hard-coded rules for providing PS or WE, we induce a data-driven pedagogical strategy which explicitly indicates how to make decisions given the current state of students and the learning context.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">REINFORCEMENT LEARNING AND THE MDP FRAMEWORK</head><p>The Markov Decision Process (MDP) is one of the most widely used RL frameworks. In general, an MDP is defined as a 4-tuple S , A, T , R , where S denotes the observable state space, defined by a set of features that represent the interactive learning environment; A denotes the space of possible actions for the agent to execute; and T represents the transition probability where p(s, a, s ) is the probability of transiting from state s to state s by taking action a. Finally, the reward function R represents the immediate or delayed feedback where r(s, a, s ) denotes the expected reward of transitioning from state s to state s by taking action a. Since we apply the tabular MDP framework, the reward function R and transition probability table T can be easily estimated from the training corpus. The goal of MDP is to generate the deterministic policy &#960; : s &#8594; a that maps each state onto an action.</p><p>Once the tuple S , A, T , R is set, the optimal policy &#960; * for an MDP can be generated via dynamic programming approaches, such as Value Iteration. This algorithm operates by finding the optimal value for each state V * (s), which is the expected discounted reward that the agent will gain if it starts in s and follows the optimal policy to the goal. Generally speaking, V * (s) can be obtained by the optimal value function for each state-action pair Q * (s, a) which is defined as the expected discounted reward the agent will gain if it takes an action a in a state s and follows the optimal policy to the end. The optimal state value V * (s) and value function Q * (s, a) can be obtained by iteratively updating V (s) and Q(s, a) via equations 1 and 2 until they converge:</p><p>(1)</p><p>where 0 &#8804; &#947; &lt; 1 is a discount factor. When the process converges, the optimal policy &#960; * can be induced corresponding to the optimal Q-value function Q * (s, a), represented as:</p><p>where &#960; * is the deterministic policy that maps a given state into an action. In the context of an ITS, this induced policy represents the pedagogical strategy by specifying tutorial actions using the current state.</p><p>In the present work, the effectiveness of the MDP policy is estimated by Expected Cumulative Reward (ECR; <ref type="bibr">Tetreault and Litman 2008;</ref><ref type="bibr">Chi et al. 2011)</ref>. The ECR of a policy &#960; is </p><p>Where N denotes the number of trajectories in the training corpus, i.e. the total number of the initial states; and N i denotes the number of states S i as the initial states in the training corpus. In our case, the trajectories have a finite time horizon. Thus, ECR evaluates the expected reward of the initial states. The higher the ECR value of a policy, the better the policy is expected to perform.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">PEDAGOGICAL DECISIONS IN A LOGIC TUTOR: DEEP THOUGHT</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">OVERVIEW OF DEEP THOUGHT</head><p>Deep Thought (DT) is a data-driven ITS used in the undergraduate-level Discrete Mathematics course at North Carolina State University <ref type="bibr">(Behrooz and Tiffany, 2017)</ref>. DT provides students with a graph-based representation of logic proofs which allows students to solve problems by applying logic rules to derive new logical statements, represented as nodes. The system automatically verifies proofs and provides immediate feedback on rule application (but not strategy) errors. Every problem in DT can be presented in the form of either a worked example (WE) or problem solving (PS). In WE (shown in Figure <ref type="figure">1</ref>), students are given a detailed example showing the expert solution for the problem or were shown the best next step to take given their current solution state. In PS (shown in Figure <ref type="figure">2</ref>), by contrast, students are tasked with solving the same problem using the ITS or completing an individual problem-solving step. Focusing on the pedagogical decisions of choosing WE vs. PS allows us to strictly control the content to be equivalent for all students.</p><p>All of the hints that students receive for PS in DT are data-driven. Specifically, next-step hints for a PS are constructed by using previous successful student solutions which include the current proof state, and by matching current expressions in the proof. The hint presented at the current proof state guides the student to the most frequent next step that had resulted in successful completion of the proof given that proof state <ref type="bibr">(Stamper et al., 2013)</ref>, and is given to the student below the proof construction window on the left hand side of the tutor (shown in Figure <ref type="figure">2</ref>). The hints are in the format of "Use expression X and expression Y to derive expression Z using rule R". Students are given the opportunity to request hints on-demand by clicking the "Get Hint" button next to the dialogue box; however, if students stay in the current proof state for longer than the median step time of that problem or a maximum of 30 seconds, DT automatically presents the available hint. The WEs were constructed in a similar manner, where the most efficient (shortest-path) solution of the current proof from previous student solutions was used for a step-by-step presentation of the proof with procedurally constructed instructions given to the student below the proof window (Figure <ref type="figure">2</ref>). At each step, the instructions for constructing the next step are presented in the same format as the next-step hints until the conclusion is reached.</p><p>The problems in DT are organized into six strictly ordered levels with 3-4 problems per level. Level 1 functions as a pre-test in that all participants receive the same set of PS problems. In the five training levels 2-6, before the students proceed to a new problem, the system follows the corresponding RL-induced or random policies to decide whether to present the problem as PS or WE. The last question on each level is a PS without DT's help and thus functions as a quiz for evaluating students' knowledge of the concepts of that level. After completing the entire training in DT, students take an in-class exam, referred to as the transfer post-test. Given that the ultimate goal of the DT tutor is to improve students' performance on the real classroom exam, in the following the transfer post-test scores were used to evaluate students' learning performance and to investigate the effectiveness of pedagogical policies.</p><p>In the following, students' pre-test and transfer post-test scores are used for evaluations. We found that the pre-test scores can reflect students' incoming competence; a Pearson correlation test show that a significant correlation between students' pre-test and transfer post-test scores exists: r (239 ) = 0.17, p = .005. However, It is important to note that due to classroom constraints, the pre-test and transfer post-test covered different concepts and were collected at different times: the pre-test occurred in a single session before the policies were employed, while the transfer post-test scores were collected in the classroom after the entire training section is complete. Thus the two scores cannot be directly aligned. Additionally, the transfer post-test is the in-class written test that the RL policies aimed to improve. Therefore, we did not use learning gain to evaluate students' learning performance but rather compare their transfer post-test scores through the ANCOVA tests using pre-test score as the covariate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">TWO TRAINING DATASETS: DT-Imme AND DT-Delay</head><p>Our training dataset was collected in the Fall 2014 and Spring 2015 semesters, with a total of 306 students involved. All students were trained on DT where whether to present the next problem as a WE or a PS was randomly decided. The average number of problems solved by students was 23.7 and the average time that each student spent in the tutor was 5.29 hours. In addition, we calculated students' level scores based on their performance on the last problem in each of levels 1-6. For the sake of simplicity, level scores were normalized to [0, 100]. Note that when inducing RL policies using the training data set, reward functions are generated based on level scores because transfer post-test scores were not available for the two training datasets and the last problem on each level is designed to be very similar to problems in the transfer post-test. If the students quit the tutor during the training, we assigned a strong negative reward, -300 in this case, on the last problem they attempted. Furthermore, the immediate reward was defined as the difference between the current and previous level scores, and the delayed reward was defined as the difference of the level scores between level 1 and 6. From the interaction logs, we represent each observation using a high-dimensional feature space introduced in the following section. Combing observation with two types of rewards, we construct two different types of training datasets named DT-Imme and DT-Delay respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">STATE REPRESENTATION</head><p>A total of 133 state features, referred as to &#8486;, were extracted from the DT log files. More specifically, &#8486; includes 45 discrete or categorical features and 78 continuous features that can be grouped into five categories listed as follows:</p><p>1. Autonomy (AM). This category relates to the amount of student work done. For example, interaction denotes the cumulative number of student clickstream interactions and hintCount denotes the number of times a student clicked the hint button during problem solving. There are a total of 12 features in the AM category, including 8 categorical and 4 continuous features.</p><p>2. Temporal Situation (TS). This category encodes the time-related information about the work process. For example, avgTime denotes the average time taken per problem, and TotalPSTime denotes the total time for solving a particular problem. There are a total of 13 continuous features in the TS category.</p><p>3. Problem Solving (PS). This category encodes information about the current problemsolving context. For example, probDiff is the difficulty of the current solved problem; NewLevel indicates whether the current solved problem is in a new level in the tutor. There are a total of 30 features in the PS category, including 13 categorical and 17 continuous features.</p><p>4. Performance (PM). This category describes information about the student's performance during problem solving. For example, RightApp denotes the number of correct rule applications. There are a total of 36 features in the PM category, including 24 categorical and 12 continuous features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Student Action (SA)</head><p>. This category is a tutor-specific category for DT. It evaluates the statistical measurement of a student's behavior. For instance, actionCount denotes the number of non-empty-click actions that students take; AppCount denotes the number of clicks for the derivation of a logical expression. There are a total of 32 continuous features in the SA category.</p><p>Before feature selection and policy induction, we discretized all continuous features by exploring k-means clustering first and then a simple median split. The latter is conducted only if k-means failed to generate balanced bins. More specifically, the general discretization process is 1) for a given continuous feature, we start by using k-means with k = 5 to generate 5 bins; 2) if the sizes of the bins are not balanced, we reduce the value of k by 1 and repeat k-means until balanced bins are achieved; 3) otherwise, if k = 1, we use median split to discretize the feature.</p><p>In this work, we focus on applying different feature selection approaches to generate a small set of features to construct the state space in a tabular MDP framework. By doing so, we can shed some light on what the most important features are for deciding to apply PS vs. WE. Moreover, when applying RL in real-world scenarios, we may not always have the full computation power to track all of the features at once. Next, we describe the feature selection approaches in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">FEATURE SELECTION ON THE MDP FRAMEWORK</head><p>One of the biggest challenges of applying the tabular MDP framework into DT is the highdimensional feature space. Each state is a vector representation composed of a number of state features and thus the state space grows exponentially in the number of state features, which would cause a data sparsity problem (the available data is not enough to cover each state in the state space) and would exponentially increase the computational complexity. On the other hand, with respect to only including a small set of features, while existing learning literature and theories give helpful guidance on state representation, we argue that such guidance is often considerably more general than the specific state features chosen. For example, to describe a student's knowledge level, we can use "Percentage Correct" defined as the number of the correct student entries divided by the total number of the student entries, or "Number of Correct" defined as the number of the correct student entries, or "Number of Incorrect" defined as the number of the incorrect student entries and so on. When making specific decisions about including a feature of student knowledge level in the state, for example, it is often not clear which of these features should be included. Therefore a more general state representation approach is needed. To this end, this project began with a large set of features to which a series of feature-selection methods were applied to reduce them to a tractable subset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">RELATED WORK FOR FEATURE SELECTION IN RL</head><p>Much previous work on feature selection for RL mainly focused on model-free RL. Model-free algorithms learn a value function or policy directly from the experience while interacting with the agent. <ref type="bibr">Kolter and Ng (2009)</ref> applied Least-Squares Temporal Difference (LSTD) with Lasso regularized items to approximate the value function as well as to select an effective feature subset. Similarly, <ref type="bibr">Keller et al. (2006)</ref> applied LSTD to approximate a value function and select a feature subset by implementing Neighborhood Component Analysis to decompose approximation error, which can be used to evaluate the efficacy of the feature subset. <ref type="bibr">Bach (2009)</ref> explored the penalization of an approximation function by using multiple kernel learning. Additionally, <ref type="bibr">Wright et al. (2012)</ref> proposed the feature selection embedded in a neuro-evolutionary function which approximates the value function, and they selected each feature based on its contribution to the evolution of network topology.</p><p>For model-based RL, <ref type="bibr">Chi et al. (2011)</ref> previously investigated 10 feature selection methods, called RLpre-FS (Sec. 5.4). These methods were implemented to derive a set of various policies, where features are mostly selected based on the single feature's performance or covariance in training data. The results showed there was no consistent winner and in some particular cases these methods perform no better than the random baseline method.</p><p>Different from prior work, our features are selected based on the correlations through two steps: 1) a new feature is selected based on its correlation with the current "optimal" subset of features; 2) for different sets of state features, the same A, R and training data are used for estimating T when applying MDP to induce policies, and ECR is used to evaluate the induced policies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">FIVE CORRELATION METRICS</head><p>Our feature selection methods involve five correlation metrics. The first four are commonly used in supervised learning, and here we will investigate whether they can be effectively applied for feature selection in RL. We propose the fifth metric, called Weighted Information Gain (WIG), by combining the first four metrics and adapting them based on the characteristics of our data sets. More specifically, we have:</p><p>1. <ref type="bibr">Chi</ref> </p><p>where H(&#8226;) is the entropy function -measuring the uncertainty of a variable. IG(Y, X) evaluates how the uncertainty of a variable Y would change from knowing the variable X. To some extent, it can also be treated as a type of correlation between X and Y . Note that IG has the bias towards variables with a large number of distinct values.</p><p>3. Symmetrical uncertainty (SU; Yu and Liu 2003) is defined as:</p><p>SU evaluates the correlation between two variables Y and X by normalizing IG(Y, X). SU compensates for the weakness of IG by considering the uncertainty of both variables X and Y in the denominator.</p><p>4. Information gain ratio (IGR; <ref type="bibr">Kent 1983</ref>) is the ratio of information gain to the intrinsic information, which is the entropy of conditional information. IGR can be represented as:</p><p>Compared with SU, IGR only considers the uncertainty of variable X in the denominator.</p><p>5. Weighted Information gain (WIG) is proposed as:</p><p>WIG can be seen as a combination of IG, SU and IGR. Compared to SU, WIG sets more weight on X by multiplying H(X) in the denominator; while compared to IGR, WIG normalizes IG by considering the uncertainty of both variables X and Y .</p><p>In our application, each of the five correlation metrics is used for evaluating the correlation between the current selected feature set Y with a new feature X. For each metric we explore two options: The High option is to select the next feature that is most correlated to the currently selected feature set whereas the Low option is to select the least correlated feature. As described above, the high correlation-based option is commonly used for supervised learning where the features that are most highly correlated with the output labels are often selected <ref type="bibr">(Yang and Pedersen, 1997;</ref><ref type="bibr">Lee and Lee, 2006;</ref><ref type="bibr">Chandrashekar and Sahin, 2014;</ref><ref type="bibr">Koprinska et al., 2015)</ref>. However, for RL, the high option-selected features tend to be homogeneous. Different from the supervised learning tasks, we hypothesize that it is more important to have heterogeneous features in RL that can grasp different aspects of learning environments. Therefore, we also explore the low correlation-based option for feature selection with a goal to increase the diversity of the selected feature set. As a result, we have 10 correlation-based methods named: CHI-high, IG-high, SU-high, IGR-high, WIG-high, CHI-low, IG-low, SU-low, IGR-low, and WIG-low. Our goal is to investigate which option is better: high vs. low, and which of the five correlation metric performs the best.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">CORRELATION-BASED FEATURE SELECTION APPROACHES</head><p>Algorithm 1 shows the process of our correlation-based feature selection method. It contains three major parts. In the first part (lines 1-4), the algorithm constructs MDPs for every single feature in &#8486;, induces a single-feature policy and calculates its ECR (defined in Sect. 4). Then the feature with highest ECR is added to the current optimal feature set S * . In the second part (lines 6-9), the algorithm follows a forward step-wise feature selection procedure in that, given the currently selected feature set S * , it selects the next feature based on the five correlation metrics described above. More specifically, it first calculates the correlations between S * with each feature f i &#8712; &#8486; -S * using a specific correlation metric m, ranks the results, and then selects the top 5 features with the highest correlations for high-option or the bottom 5 lowest features for low options, decided by the Boolean variable reverse in line 9. These features are selected to form a feature pool F . In the third part (lines 10-13), the current S * is combined with each feature f i &#8712; F to induce a policy, and the Calculate-ECR function calculates the ECR of the induced policy. Then S * + f k , the combination that produces the policy with the highest ECR, will be the new S * for the next round. The algorithm will terminate when the size of an optimal feature set reaches maximum number N . </p><p>m refers to a correlation metrics 8:</p><p>end for 9:</p><p>F &#8592; SELECTTOP(C, 5, reverse) Select top or bottom 5 features based on metrics m 10:</p><p>for f i in F do 11:</p><p>end for 13:</p><p>Replace S * by S * + f k with highest ECR 14: end while 5.4. PRERL-FS APPROACH <ref type="bibr">Chi et al. (2011)</ref> developed a series of feature selection approaches, referred to as PreRL-FS in the following. They can be grouped into three categories: 1) four ECR-based methods, which use ECR, Upper-Bound of ECR, Lower-Bound of ECR, or Hedge value of the single-feature policy as the feature selection criteria where the Upper-Bound and Lower-Bound of ECR refer to the 95% confidence interval for ECR, and Hedge is defined as Hedge = ECR/(U pperBound-LowerBound); 2) two PCA-based methods, which select features that are highly correlated with principal components; and 3) four ECR &amp; PCA-based methods, the combination of the former two approaches. The results indicated that the four ECR-based methods outperformed the other two types of approaches in terms of ECR. </p><p>for Method k in M do 8:</p><p>end for 11:</p><p>end for 14:</p><p>Replace S * by S * + f k with highest ECR 15: end while</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.">ENSEMBLE APPROACH</head><p>Algorithm 2 shows the basic process of our ensemble feature selection procedure, which is similar to that of correlation-based methods. The major difference is in the second part (lines 6-10). Our ensemble approach explored a total of 12 feature selection methods that are referred to as M in Algorithm 2: the four ECR-based methods which are the better methods among the PreRL-FS approaches and the eight out of the 10 proposed correlation-based methods (WIGhigh and WIG-low were excluded here because they were not explored when we first explored the ensemble approach). More specifically, the ensemble approach integrates the features F k generated from each of feature selection method Method k in M and generates a relatively large feature pool F . The maximum size of F can be up to 70, but often much smaller because of the overlapping of feature sets generated from different methods. Note that the feature pool is still much larger than any of our 10 correlation-based methods, which is 5. After generating the feature pool, the ensemble method carries out the same procedure, the third part (lines 11-13), as the correlation-based methods described above. Although the ensemble method has a relatively high computational complexity, it has a wider exploration of the feature space by integrating different types of feature selection methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6.">COMPARISON RESULTS FOR FEATURE SELECTION APPROACHES</head><p>We explore three categories of feature selection approaches: PreRL-FS, ensemble, and highand low-correlation-based approaches and compare them against a random feature selection baseline. We use ECR to theoretically evaluate the effectiveness of the MDP policies, which indirectly verify the effectiveness of feature selection approaches. Note that ECR is calculated based on the induced MDP policies and the two training datasets: DT-Immed and DT-Delay (Section 4.2). High VS Low Correlation-based Approaches. Figure <ref type="figure">3</ref> and Figure <ref type="figure">4</ref> show the ECR values of 10 correlation-based methods on DT-Immed and DT-Delay respectively, where the y-axis represents the value of the ECR of the induced policy given the selected features, and the x-axis denotes the number of features (maximum is 8). Note that all feature selection methods start in the same place at x = 1 except the random method. This is because all methods will initially select the feature with the best ECR of single-feature policy. However, ECR values vary dramatically as the number of selected features increases. The solid line indicates the performance of the low correlation-based approaches and the dotted line denotes the performance of the high correlation-based version. In addition, the ECR value of policies using immediate reward is much higher than that of policies using the delayed reward.</p><p>The results show that for each of the five correlation metrics, the low correlation-based option significantly outperforms the high correlation-based option. For the DT-Immed dataset, the ECR of WIG-low is 143.16, while ECR of WIG-High is only 59.04. Similarly, the ECRs of CHI-low and CHI-High are 129.82 vs. 55.90. The average percent increase for the low correlation methods over the high correlation methods is 75.35%, the maximum percent increase is 142.48%, and the minimum percent increase is 17.24%. To summarize, our results show that low correlation is more suitable for the MDP framework than high correlation, and indirectly illustrate that the high variety of the feature space had a positive impact on the effectiveness of the induced policies. The same pattern was found in the DT-Delay dataset.</p><p>Five Correlation Metrics. Figures <ref type="figure">3</ref> and<ref type="figure">4</ref> show that WIG is the consistently highest performer in that it has the best ECR for both DT-Immed and DT-Delay datasets. CHI performed well in DT-Immed dataset while IGR performed well in DT-Delay dataset. In short, our proposed WIG performed best among all the five correlation metrics.</p><p>Overall Comparison. Figures <ref type="figure">5</ref> and<ref type="figure">6</ref> show the overall comparison among all methods on DT-Immed and DT-Delay data respectively. Particularly, with the purpose of simplicity, for both low and high correlation-based methods and the PreRL-FS methods, we selected the best method from each category. In other words, the figures present a comparison among the five methods including the best of five Low-correlations, the best of five High-correlations, ensemble, the best of PreRL-FS, and the random approach. Results show that, as expected, the random method performs worst across the two datasets. In addition, the best of the high correlation-based methods outperforms random and Best-RLPreviousFS approaches when the number of features is above 5. The best of the low correlation-based methods outperforms other methods. In general, the best low correlation-based method outperforms the best of PreRL-FS by an average of 43.87% and outperforms the ensemble method by an average of 9.05%. In addition, the ensemble method improves over the best of PreRL-FS by an average of 36.46%. The value of ECR does not always rise as the the number of features increases. The ECR of the low-correlation approach decreases a lot when increasing the number of features from 6 to 8. The ECR of the ensemble method seems to converge when the number of features is more than 6 for both two training datasets. The ECR of the best of PreRL-FS decreases when the number of features is more than 4.</p><p>In summary, based on ECR results we can rank five categories of methods as Low correlationbased &gt; Ensemble &gt; High correlation-based &#8776; PreRL-FS Random. In particular, the WIG-Low approach performs best among all implemented approaches. In this work, we investigate the effectiveness of RL-induced policies using the MDP framework from three aspects: state representation using different feature selections, reward function, and policy execution options. For each aspect, we have a corresponding research question and thus our three research questions are listed as follows:</p><p>&#8226; Q1 (State): Can effective feature selection methods empirically improve the effectiveness of the induced policy?</p><p>&#8226; Q2 (Reward): Does immediate reward facilitate the MDP framework to induce a more effective pedagogical policy than delayed reward?</p><p>&#8226; Q3 (Execution): Can stochastic policy execution be more effective than deterministic policy execution?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">FIVE REINFORCEMENT LEARNING POLICIES</head><p>Table <ref type="table">2</ref> lists the five RL policies induced for investigating the three research questions above. All five policies were induced using the MDP framework but involved different types of feature selection methods (the second column), reward function (the third column), and/or policy execution (the fourth column). The last column shows that the ECR of the RL-induced policies. More specifically, MDP-ECR is induced by using MDP with the best PreRL-FS feature selection approach; Ensemble-Imme and Ensemble-Delay are two policies induced with the ensemble feature selection approach using immediate and delayed reward respectively; and WIG-det and WIG-sto were both induced using WIG with the low-correlation option for feature selection, and the main difference is that the former is executed deterministically while the latter is executed stochastic. Note that because WIG-sto is a stochastic policy and because ECR can only be calculated for a deterministic policy, the ECR of WIG-sto is listed as "NA". </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">FOUR EXPERIMENTS</head><p>Four experiments, one per semester from the Spring of 2015 to the Fall of 2017, were conducted to empirically evaluate the impact of the three aspects on the effectiveness of the five RL-induced policies described above. In each experiment, we compared one or two RL policies against the Random yet reasonable baseline policy. Table <ref type="table">3</ref> shows the overview of the four experiments and the corresponding research questions. Overall, results across the four experiments consistently exhibit an ATI effect. That is, rather than all students, only certain students' learning is significantly affected by the pedagogical decisions on PS vs. WE. In the following, they are referred to as the Responsive group and by contrast, we refer to other students as the Unresponsive group. It is often not clear which students are more sensitive to the induced policy due in part to the fact that we do not fully understand why such differences exist. In this work, we split Responsive and Unresponsive groups based upon some measurement of incoming competence.</p><p>One common way to measure students' incoming competence is to use their pre-test scores. Across the four experiments, all of the students received the same initial training at Level 1 and our results showed that students' pre-test scores indeed reflect their incoming competence in that a significant positive correlation between students' pre-test scores and transfer post-test scores: r = 0.17, n = 241, p = .005. However, applying a median split on pre-test for all participants results in unbalanced splits within treatment groups. For example, in Experiment 1, a split using the median value of student's pre-test scores would divide the Random group into 16 High pretest group vs. 6 in the Low pre-test group. Similarly, in Experiment 3, the WIG-det group would divide into 31 in the High pre-test group and 14 in the Low pre-test group.</p><p>On the other hand, ever since the mid-1950s, response time has been used as a preferred dependent variable in cognitive psychology <ref type="bibr">(Luce et al., 1986)</ref>. It has often been used to assess student learning because response time can indicate how active and accessible student knowledge is. For example, it has been shown that response time reveals student proficiency <ref type="bibr">(Schnipke and Scrams, 2002)</ref> and that students' average response time and their final exam scores are negatively correlated <ref type="bibr">(Gonz&#225;lez-Espada and Bullock, 2007)</ref>. With the advent of computerized testing, more and more researchers have begun to use response-time as a learning performance measurement <ref type="bibr">(Schnipke and Scrams, 2002)</ref>. Inspired by this prior work, we use the average time in Level 1 (avgTime) to split students which consistently generated more balanced groups across all four experiments. Therefore, in the following studies, students were split using avgTime.</p><p>To summarize, in each of the following experiments, students are divided into Responsive and Unresponsive groups by a median split on their response time at Level 1. Since each experi-ment has a slightly different median value and criteria for splitting, there is no general definition for the Responsive and Unresponsive groups. In the post-hoc comparison, we combined all of the experiments and used a global median split to check whether our results would still hold.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5.">STATISTICAL ANALYSIS</head><p>In the following analyses, we run several different types of statistical tests to evaluate student performance with a focus on their transfer post-test scores. Although students' pre-test scores were not used to split students into Responsive and Unresponsive groups, they are used as the covariate in ANCOVA when comparing students' transfer post-test scores.</p><p>To confirm that the assumptions of ANCOVA were met, for each experiment ANOVA tests were performed and indicated that there is no significant difference on pre-test score among different treatment groups. In addition, two-way ANOVA tests for each experiment using group and pre-test as factors show that there is no significant interaction effect on transfer post-test score. These results indicate that the pre-test covariate and treatment group variable are independent and that the relationship between the covariate and treatment group variable is the same across groups. Thus, the assumptions of ANCOVA are met, and we report ANCOVA results for the transfer post-test scores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">FOUR EXPERIMENTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">EXPERIMENT 1: PRELIMINARY FEATURE SELECTION</head><p>Experiment 1 was conducted in the Spring of 2015. We compared two policies: an MDP policy and a Random baseline policy. Our research question in Experiment 1 is Q1 (State): Can effective feature selection methods empirically improve the effectiveness of the induced policy?</p><p>For Experiment 1, we only explored the PreRL-FS feature selection approaches, and among them, the ECR-based approach using the lower bound of ECR as the selection criteria performed the best. In the following, we refer to the induced policy as the MDP-ECR policy. Table <ref type="table">4</ref> shows the definition of the four selected features (left) and the corresponding policy (right). The row denotes the value of the first two features while the column denotes the value of the last two features. For example, when the four features f 1 :f 2 :f 3 :f 4 is 0:0:0:0 (the top-left cell), the decision is a PS (black cell). Overall, the MDP-ECR policy contains 11 pedagogical rules that propose a PS (black cells) and 5 rules that propose a WE (white cells).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.1.">Experiment 1: Participants &amp; Conditions</head><p>DT was assigned to 67 undergraduate students as one of their regular homework assignments. Completion of the tutor was required for full credit. Students were randomly assigned to the two conditions: Random (N = 22) and MDP-ECR (N = 45). Because all students followed the random policy when collecting our training data for RL in previous years, we assign more students to the MDP-ECR condition to evaluate the effectiveness of RL-induced policies.</p><p>Results of Experiment 1 show that there is no significant difference between the MDP-ECR and Random on either pre-test (F (1, 65) = 1.81, p = 0.18) or transfer post-test (F (1, 65) = 0.46, p = 0.50). However, once we did a median split on students based on the students' "average response time on level 1", our results show that students whose level1-avgstepTime &lt; 7.1 sec are more sensitive to the effectiveness of pedagogical strategies than their peers whose  Pearson's Chi-squared test showed that there was no significant difference on the distribution of Unresponsive vs. Responsive between the two policies, &#967; 2 (1, N = 67) = 0.27, p = 0.59. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.2.">Experiment 1: Results</head><p>Table <ref type="table">5</ref> presents the mean and SD for students' corresponding learning performance in Experiment 1. Despite the fact that students are randomly assigned, Random-Unres significantly outperforms all other groups on the pre-test according to results of ANOVA tests: F (1, 20) = 9.01, p = .007 for Random-Resp, F (1, 33) = 8.82, p = .006 for MDP-ECR-Unres, F (1, 34) = 6.37, p = .016 for MDP-ECR-Resp, probably due to the small sample size in the random groups. Despite Random-Unres out-performance, no significant difference is found on the pre-test score either between MDP-ECR and Random (two columns): F (1, 65) = 1.81, p = 0.18, or between Responsive and Unresponsive (two rows): F (1, 65) = 1.26, p = 0.27. Furthermore, a possible explanation for a high pre-test score of Random-Unres is that Random-Unres, considered as the high proficiency students, can always learn regardless of teaching policies and are less sensitive to the learning environment <ref type="bibr">(Cronbach and Snow, 1977;</ref><ref type="bibr">Chi et al., 2011)</ref>. Transfer Post-Test Score. A two-way ANCOVA test, using Policy and Type as the two factors and the pre-test score as covariate shows that there is a significant interaction effect on their transfer post-test scores: F (1, 62) = 5.39, p = .023, but no significant main effect of either Policy or Type. Figure <ref type="figure">7</ref> depicts the cross-over interaction between Policy and Type on the adjusted transfer post-test score, which is the transfer post-test score adjusted by the linear regression (ANCOVA) model built to describe the relation between the pre-and transfer post-test score.</p><p>Planned contrasts using Tukey's adjustment reveal a significant difference between the two Responsive groups in that MDP-ECR-Resp scored significantly higher adjusted transfer posttest than Random-Resp, t(62) = 2.26, p = .027, while there is no significant difference between two Unresponsive groups.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.3.">Experiment 1: Conclusion</head><p>In summary, we find a significant interaction effect in that MDP-ECR benefits the Responsive students significantly more than the Unresponsive students, while no such difference was found between the Responsive and Unresponsive groups under the Random policy. However, one important limitation of Experiment 1 is that the Random-Unres group has significant higher pretest score than all other groups. Therefore, in Experiment 2, we repeat the general procedure of Experiment 1 but explored correlation based and ensemble-based feature selection methods and also explore both Immediate and Delayed rewards.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">EXPERIMENT 2: ENSEMBLE FEATURE SELECTION &amp; IMMEDIATE VS. DELAYED REWARDS</head><p>Experiment 2 was conducted in the Fall of 2016 and investigated two research questions:</p><p>&#8226; Q1 (State): Can effective feature selection methods empirically improve the effectiveness of the induced policy?</p><p>&#8226; Q2 (Reward): Does immediate reward facilitate the MDP framework to induce a more effective pedagogical policy than delayed reward?</p><p>In Experiment 2, we applied ensemble feature selection (Section 5.5) to select a small subset of features from the original 133 features for inducing two policies, named Ensemble-Imme and Ensemble-Delay, from the two training datasets DT-Imme and DT-Delay respectively (Section 4.2). More specifically, Ensemble-Imme involves seven features and Ensemble-Delay policy involves six features. Table <ref type="table">6</ref> and 7 display the selected features as well as the corresponding policies. In the tables, black cells denote PS actions, white cells denote WE actions, and gray cells denote that no policy is induced due to the absence of the state in the training data. Generally speaking, the Ensemble-Imme policy prefers WE over PS as it contains 65 rules for WE vs. 21 rules for PS; while Ensemble-Delay policy prefers PS over WE as it has 48 rules for PS and 18 for WE. Additionally, while Figure <ref type="figure">5</ref> shows that the ensemble feature selection with eight features would result in a higher ECR policy than the policy with seven features, we still used the latter here because 1) the ECRs of the two policies are actually very close; and 2) the seven-feature policy is less complicated and has less "none-mapping" from state to action (the gray color cells) compared with the eight-feature policy. For similar reasons, we determined the Ensemble-Delay policy to be six features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.1.">Experiment 2: Participants and Conditions</head><p>A total of 106 students participated in Experiment 2 and were randomly assigned into three conditions: Random (N = 30), Ensemble-Imme (N = 38) and Ensemble-Delay (N = 38). 94 Last three features f I5 :f I6 :f I7 0:0:0 0:0:1 0:1:0 0:1:1 1:0:0 1:0:1 1:1:0 1:1:1 0:0:0:0 0:0:0:1 0:0:1:0 0:0:1:1 0:1:0:0 0:1:0:1 0:1:1:0 First four features f I1 :f</p><p>1:1:1 1:0:0:0 1:0:0:1 1:0:1:0 1:0:1:1 1:1:0:0 1:1:0:1 1:1:1:0 1:1:1:1 Last three features f D4 :f D5 :f D6 0:0:0 0:0:1 0:1:0 0:1:1 1:0:0 1:0:1 1:1:0 1:1:1 0:0:0 0:0:1 0:1:0 0:1:1</p><p>First three features f D1 :f D2 :f D3 0:2:0 0:2:1 1:0:0 1:0:1 1:1:0 1:1:1 1:2:0 1:2:1 students completed the assignment, distributed as Random (N = 27), Ensemble-Imme (N = 34) and Ensemble-Delay (N = 33). Pearson's chi-squared test yielded no significant relation between completion rate and condition, &#967; 2 (2, N = 106) = .012, p = .994.</p><p>The last row in Table <ref type="table">8</ref> (a) presents the mean and SD for students' corresponding learning performance in Experiment 2. No significant difference was found among the three policies on either pre-test (F (2, 91) = 0.04, p = 0.96) or transfer post-test (F (2, 91) = 1.33, p = 0.27). Furthermore, similar as Experiment 1, we use the median of "average response time on level 1" (median(level1-avgstepTime) = 8.01 sec) to split students in Experiment 2 . Different from Experiment 1, it was shown that students whose level1-avgstepTime &lt; 8.01 sec are less sensitive to the effectiveness of pedagogical strategies than those whose level1-avgstepTime &#8805; 8.01 sec, and thus we refer the former as the Unresponsive group and the latter as the Responsive group.</p><p>By combining Policy with Type {Responsive, Unresponsive}, we have a total of six groups including three Unresponsive groups: Random-Unres (N = 15), Ensemble-Imme-Unres (N = 16), Ensemble-Delay-Unres (N = 15); and three Responsive groups: Random-Resp (N = 12), Ensemble-Imme-Resp (N = 18), and Ensemble-Delay-Resp (N = 18). Pearson's chisquared test shows that there is no significant difference in the distribution of Unresponsive vs. Responsive among the three conditions, &#967; 2 (1, N = 94) = .681, p = .711.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.2.">Experiment 2: Results</head><p>Table <ref type="table">8</ref> presents the mean and SD for students' corresponding learning performance. One-way ANOVA tests show that there is no significant difference on the pre-test score either among the three policies {Ensemble-Imme, Ensemble-Delay, Random}, F (2, 91) = 0.04, p = 0.96, or among the three Unresponsive groups, F (2, 43) = 0.14, p = 0.87, or among the three Responsive groups, F (2, 45) = 0.65, p = 0.53. Additionally, there is a significant difference between Responsive and Unresponsive: the former scores significantly higher than the latter on the pre-test score, F (1, 92) = 7.33, p = .008.  &#8226; marginal significant at p &lt; 0.1; * significant at p &lt; 0.05.</p><p>Transfer Post-Test Score. A two-way ANCOVA test, using Policy {Ensemble-Imme, Ensemble-Delay, Random} and Type {Responsive, Unresponsive} as two factors and the pre-test score as covariate, shows that there is a significant main effect of Type, F (1, 87) = 4.45, p = .037, and a significant interaction effect on transfer post-test scores, F (2, 87) = 3.90, p = .024. Figure <ref type="figure">8</ref> presents the cross-over interaction between Policy and Type on the adjusted transfer post-test score, which is the transfer post-test score adjusted by the linear regression model built to describe the relation between the pre-and transfer post-test score.</p><p>Table <ref type="table">9</ref> presents the results of contrast tests using Tukey's adjustment for multiple comparisons. Results indicate that while there is no significant difference among three Unresponsive groups, Ensemble-Imme-Resp achieved significantly higher adjusted transfer post-test score than Random-Resp: p = 0.01.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.3.">Experiment 2: Conclusion</head><p>Our empirical results suggest that the ATI effect exists in Experiment 2: while no significant difference is found among the three Unresponsive groups, a significant difference is found among the three Responsive groups in that students following the Ensemble-Imme policy score significantly higher on the transfer post-test than their peers following the Random policy. This suggests that immediate reward can facilitate the MDP framework to induce an effective policy and that the ensemble feature selection approach is able to extract a good subset of features for MDP to induce a more effective policy compared with the Random policy. Finally, since it was shown that the immediate reward is more effective than the delayed reward for policy induction in the MDP framework in Experiment 2, we will only use the immediate reward to induce policy in the following two experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.">EXPERIMENT 3: LOW CORRELATION-BASED FEATURE SELECTION</head><p>Experiment 3 was conducted in the Spring of 2017, and the goal was to further investigate the effectiveness of our feature selection methods. So the research question for Experiment 3 is Q1 (State): can effective feature selection methods empirically improve the effectiveness of the induced policy?</p><p>Results of feature selection showed that the policy with the highest ECR is induced when WIG-Low is applied and the number of selected features is six (see Figure <ref type="figure">5</ref> in Section 5), so in Experiment 3 we implemented and empirically evaluated the induced WIG-det policy. Table <ref type="table">10</ref> shows the selected features and WIG-det policy, which contains only 9 rules associated with PS but 46 rules for WE.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.1.">Experiment 3: Participants and Conditions</head><p>A total of 92 students were randomly assigned into two different groups: Random (N = 45) and WIG-det (N = 47). In the end, a total of 82 students completed the assignment and were distributed as follows: Random (N = 38) and WIG-det (N = 44). Pearson's chi-squared test revealed no significant relationship between completion rate and condition &#967; 2 (1, N = 92) = .034, p = .852.</p><p>The last row in Table <ref type="table">11</ref> (a) shows the mean and SD for either condition's corresponding learning performance. No significant difference was found between WIG-det and Random on either pre-test (F (1, 80) = 2.02, p = 0.16) or transfer post-test (F (1, 80) = 1.74, p = 0.19). Furthermore, as in Experiments 1 and 2, we perform a median split using the "average response time on level 1" (level1-avgstepTime) to split students and find that students whose level1-avgstepTime &lt; 8.34 sec are less sensitive to the effectiveness of pedagogical strategies  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.2.">Experiment 3: Results</head><p>Table <ref type="table">11</ref> presents the mean and SD for students' corresponding learning performance in Experiment 3. One-way ANOVA tests show that no significant difference is found on the pre-test score either between WIG-det and Random, F (1, 80) = 2.03, p = 0.16, or between Responsive and Unresponsive groups, F (1, 80) = 0.67, p = 0.42. Additionally, no significant difference is found either between the two Responsive groups, F (1, 40) = 0.87, p = 0.36, or between the two Unresponsive groups, F (1, 38) = 1.21, p = 0.28.  Transfer Post-Test Score. A two-way ANCOVA test using Policy and Type as two factors and the pre-test score as covariate shows that there is no significant main effect of either Policy or Type, but there is a significant interaction effect on post-test score, F (1, 77) = 6.94,p = .010. Figure <ref type="figure">9</ref> depicts the cross-over interaction between Policy and Type on the adjusted transfer post-test score.</p><p>Furthermore, planned contrasts using Tukey's adjustment indicate a significant difference between the two Responsive groups in that WIG-Resp achieved the significantly higher adjusted transfer post-test score than Random-Resp, t(77) = 2.54, p = .013, while there is no significant difference between two Unresponsive groups.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.3.">Experiment 3: Conclusion</head><p>Again results from Experiment 3 shows that there is an ATI effect. The Unresponsive groups are less sensitive to the policies in that they achieve a similar performance on the transfer post-test scores, while the Responsive groups are more sensitive in that their performances are strongly dependent on the effectiveness of the policy. Specifically, the WIG-det policy is more effective than the Random policy for the Responsive groups.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.">EXPERIMENT 4: STOCHASTIC POLICY EXECUTION</head><p>In Experiments 1-3, all RL policies were executed deterministically, that is, the action was fully carried out given a state according to the induced RL-policies. However, one classic problem in RL is finding a balance between exploration (discovering more about the world) and exploitation (using what we already know to maximize performance). One approach to improving deterministic policies is to execute them stochastically, where each action is associated with a probability and has a chance to be selected. Therefore, we converted the WIG-det policy in Experiment 3 into a stochastic policy, called WIG-sto, and conducted Experiment 4 in the Fall of 2017. Our purpose here is to investigate two research questions:</p><p>&#8226; Q1 (State): Can effective feature selection methods empirically improve the effectiveness of the induced policy?</p><p>&#8226; Q3 (Execution): Can stochastic policy execution be more effective than deterministic policy execution?</p><p>The crucial part of stochastic policy execution is to assign a probability to each action. Note that in a policy &#960;, each action a for a particular state s is associated with a Q-value, called Q &#960; (s, a) calculated by using Equation 1 in Section 3. Thus, we transform Q &#960; (s, a) into probability p &#960; (s, a) by the softmax function <ref type="bibr">(Sutton and Barto, 1998)</ref>, shown as follows:</p><p>Here &#964; is a positive parameter, which controls the variance of probabilities for the state and action pair. Generally speaking, when &#964; &#8594; 0, the stochastic policy execution is close to random decision-making. When &#964; &#8594; +&#8734;, the stochastic policy execution becomes deterministic. In order to determine the optimal &#964; , we use Importance Sampling <ref type="bibr">(Peshkin and Shelton, 2002)</ref> which can mathematically evaluate the effectiveness of policies with different &#964; values. Specifically, Importance Sampling (IS) of a policy &#960; is formulated as follows:</p><p>Where N D denotes the number of trajectories in the training corpus D; L i is the length of the ith trajectory; s i t , a i t and r i t are the state, action and reward at the tth time step of the ith trajectory respectively; and p d (s i t , a i t ) is the probability of taking the action a i t for the state s i t , calculated based on the other policy d, which generates the training corpus D. In our case, the decision in the training corpus is randomly decided, thus p d (s i t , a i t ) always equal to 0.5. In general, the higher value of IS(&#960;|D), the better policy &#960; is supposed to be.</p><p>We explored a wide range of &#964; and found that the optimal value of &#964; is 0.06 for the MDPbased policies. Moreover, it is important to note that based on Equation <ref type="formula">9</ref>, for a given state s, that if the Q-value of the optimal action a * is much higher than the Q-values of other alternative suboptimal actions, then the stochastic policy execution becomes deterministic in that the probability of carrying out the optimal action would be closer to 1; if the difference between them is small, then the stochastic policy execution becomes closer to random. A total 101 of students were randomly split into two distinct groups, Random (N = 51) and WIG-sto (N = 50). In the end, a total of 88 students completed the experiment, distributed as Random (N = 44) and WIG-sto (N = 44). Pearson's chi-squared test shows that no significant relationship exists between completion rate and condition, &#967; 2 (1, N = 101) = 0, p = 1. Table <ref type="table">12</ref> presents the mean and SD for students' corresponding learning performance in Experiment 4. There is no significant difference between WIG-det and Random on either pretest, F (1, 86) = 0.22,p = 0.64, or transfer post-test, (1, 86) = 0.02,p = 0.96, due to a ceiling effect: about 72.8% of students receive a transfer post-test score of 100. As a result, the WIG-sto group scores as high on the transfer post-test as the Random group.</p><p>Furthermore, as in Experiments 2 and 3, we conduct a median split using the "average response time on level 1" (level1-avgstepTime, median = 5.29 sec). Note that this median time is much lower than those used in Experiments 1-3. After splitting, the ceiling effect was found among all four groups of students.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.2.">Experiment 4: Conclusion</head><p>Despite the fact that we used the same DT version, had similar test items in the transfer posttest, and had balanced assignment of students involved in Experiment 4, we found a ceiling effect on the transfer post-test score, which is a significant limitation of Experiment 4. While it is not clear whether the stochastic policy execution would indeed have an effective impact on students' learning performance, it did show that when conducting empirical studies in this domain, we still face many challenges that need to be addressed, especially how to effectively evaluate the induced policies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.5.">CONCLUSIONS OF EXPERIMENTS</head><p>We investigated the impact of reward function, state representation, and policy execution on the effectiveness of RL-induced policies using the MDP framework. Four experiments were conducted to compare a series of RL-induced policies with that of a Random policy. With the exception of a ceiling effect found in Experiment 4, an ATI effect is consistently observed across Experiments 1-3 after splitting students into the Responsive and Unresponsive groups using their level1-avgstepTime. Specifically, the Unresponsive groups are less sensitive to the effectiveness of policies in that they perform similarly to their random peers regardless of the policies, while the Responsive groups are more sensitive to the RL-induced policies.</p><p>For the reward function, we found that using Immediate rewards works more effectively than using Delayed rewards in Experiment 2, while no significant difference is found between Ensemble-Delay-Resp and Random-Resp. For policy execution, unfortunately, we can not determine the effectiveness of the stochastic policy execution due to a ceiling effect on transfer post-test scores.</p><p>For the state representation, we find that by combining effective feature selection methods with RL, our MDP-induced policies can be more effective than the random policy for Responsive students: for Experiment 1, while no significant difference was found between the Random-Res and Random-Unre groups, the MDP-ECR-Resp group scores significantly higher than the MDP-ECR-Unres group. For Experiment 2, while no significant difference is found among the three Unresponsive groups on the transfer post-test scores, the Ensemble-Imme-Resp group scores global median split, we combine all the students in all policy groups who were in our analysis across Experiments 1-3. Particularly, we find that students whose level1-avgstepTime &lt; 8.01 sec are less sensitive to the effectiveness of pedagogical strategies than their peers whose level1-avgstepTime &#8805; 8.01 sec. In the following section, we refer the former as the Unresponsive group and the latter as the Responsive group.</p><p>Combining Policy {WIG-det, Ensemble-Imme, Ensemble-Delay, MDP-ECR, Com-Random} with Type factor {Responsive, Unresponsive}, we have a total of 10 groups. Table <ref type="table">13</ref> shows the number of the students in each group, and a Pearson's chi-squared test indicates that there is no significant difference in the distribution of Responsive vs. Unresponsive among the five policies, &#967; 2 (4, N = 239) = 3.10, p = 0.54. In the post-hoc analysis, we compare the three MDP policies against the Com-Random policy to determine the impact of the feature selection methods. All three MDP policies (WIG-det Ensemble-Imme, and MDP-ECR) are induced by applying different feature selection methods with RL using immediate rewards, DT-Imme. Additionally, to determine the impact of the reward function we compared the Ensemble-Imme and Ensemble-Delay against Com-Random since the former two use the same feature selection method. For the impact of the reward function, the same patterns are found in the post-hoc comparison as in Experiment 2: while no significant difference is found among the three Unresponsive groups, the Ensemble-Imme-Resp significantly out-performs the Random-Resp and no significant difference is found between the Ensemble-Delay-Resp and Random-Resp. Therefore, in the following, we will focus on exploring the impact of the feature selection on RL-induced policies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.2.">THE IMPACT OF FEATURE SELECTION ON RL POLICIES</head><p>Table <ref type="table">14</ref> presents the mean and SD for students' pre-test and transfer post-test scores for eight groups of students: four Policies {WIG-det, Ensemble-Imme, MDP-ECR, Com-Random} &#215; 2 Types {Responsive, Unresponsive}. It is important to note that since all students are split using the new global median value, the pre-test and transfer post-test scores are different from those listed in the tables for the individual experiments.</p><p>Pre-test scores. A two-way ANOVA test using Policy and Type as two factors show that there is no significant main effect of Type, no significant interaction effect of Policy and Type, but a significant main effect of Policy on pre-test score: F (3, 201) = 3.54, p = .016. Specifically, planned contrasts using Tukey's adjustment indicate a significant difference between WIGdet and Com-Random in that the former achieved the significantly higher pre-test score than the later, t(201) = 3.22, p = .009, while there is no significant difference for other pair of policies.</p><p>Transfer Post-Test Score. To take the differences among the eight groups on the pretest into account, we run a two-way ANCOVA test, using Policy and Type as the two factors and  Table <ref type="table">15</ref> presents the results of contrast tests using Tukey's adjustment for multiple comparisons. Results indicate that WIG-det-Resp scored significantly higher than MDP-ECR-Resp and Random-Resp: p = .022 and p = .027 respectively; Ensemble-Imme-Resp achieved higher score than MDP-ECR-Resp and Random-Resp, where the difference is significant p = .045 and marginally significant p = .054 respectively. No significant difference is found between other pairs.  Unres had the significant more WE than both Ensemble-Delay-Unres and MDP-ECR-Unres. For Responsive groups, WIG-det-Resp assigned the significant more WE than both Ensemble-Delay-Resp and MDP-ECR-Resp.</p><p>As the summary, although the PS and WE counts reflect the difference of policies, it is not the key reason why Ensemble-Imme and WIG-det policies outperform Random, which requires further data analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">CONCLUSIONS, LIMITATIONS, &amp; DISCUSSION</head><p>We conducted four experiments to investigate the effectiveness of reinforcement learning induced policies using the MDP framework. Overall, an aptitude-treatment interaction effect consistently exists among Experiments 1-3 and the post-hoc comparisons. Furthermore, our students were split based on their response time, and we found the Unresponsive groups have similar learning performance under different policies employed by the ITS, whereas Responsive groups are more sensitive to the induced policies in that those under an effective policy would perform significantly better than those under an ineffective policy.</p><p>When applying RL to induce policies, we explored the impact of reward function, state representation, and policy execution. For policy execution, no significant improvement was found for the stochastic policy execution due to a ceiling effect. For future studies, the ceiling effect may be eliminated if we assign harder questions to students during the transfer post-test and adjust the grading rubric for the post-test to provide more finely grained evaluation and continuous scores.</p><p>In many domains, RL is applied with an immediate reward function. For example, in an automatic call center system, the agent can receive an immediate reward for every question it asks because the impact of each question can be assessed instantaneously <ref type="bibr">(Williams, 2008)</ref>. Immediate rewards are often chosen for RL-based policy induction because it is easier to assign appropriate credit or blame when the feedback is tied to a single decision. The more that rewards or punishments are delayed, the harder it becomes to properly assign credit or blame. However, for an ITS, the most appropriate rewards to use are student learning gains, which are typically unavailable until the entire tutoring session is complete. This is due to the complex nature of the learning process, making it difficult to assess student learning moment by moment. More importantly, many instructional interventions that boost short-term performance may not be effective over the long-term; for example, an instructional intervention may reduce the time a student spends solving a problem, but may also lead to shallow learning of the material <ref type="bibr">(Baker et al., 2004)</ref>. We explored both immediate and delayed rewards in our policy induction and empirically evaluated the impact of the induced policies on student learning. Our results show that using immediate rewards can be more effective than using delayed rewards, probably because of the vanishing reward problem: the discount factor in the MDP framework makes the rewards in the early decisions become extremely small with respect to the delayed reward.</p><p>For state representation, we explored feature selection based on the MDP framework. Although many feature selection methods such as embedded incremental feature selection <ref type="bibr">(Wright et al., 2012)</ref>, LSPI <ref type="bibr">(Li et al., 2009)</ref>, and Neighborhood Component Analysis <ref type="bibr">(Goldberger et al., 2005)</ref> can be applied to RL, most of these methods are designed for model-free RL, and we focus on model-based RL due to the high cost of collecting training data on ITSs. While correlationbased feature selection methods have been widely used for supervised learning for selecting the most relevant state features to the output label <ref type="bibr">(Hall, 1999;</ref><ref type="bibr">Yu and Liu, 2003)</ref>, in this work we explored five correlation-based metrics with two options: one option is to select the next feature that is the most correlated (High) to the currently selected feature set whereas the other option is to select the least correlated (Low). Choosing the most correlated feature may be effective since the feature is more likely to be related to decision making; however, it may not make much more of a contribution than the currently selected feature set. Alternatively, choosing the least correlated feature may raise the diversity of the feature set, enriching the state representation; however, this has the risk of selecting irrelevant or noisy features. Our results show that low correlation methods perform significantly better than high correlation methods, the RL-based approach from our previous work <ref type="bibr">(Chi et al., 2011)</ref>, and the baseline random method in terms of the expected cumulative reward (ECR). In particular, low correlation methods improve over high correlation methods as much as 142.48%, with an average of 45.2% improvement in ECR. In general, we have: Low correlation-based &gt; Ensemble &gt; High correlation-based &gt; ECR-based Random (Sec. 5.6). Empirical results from Experiments 2 and 3 show that by applying effective feature selection to MDP, the Responsive groups using an RL-induced policy can significantly outperform their peers using a random policy. Additionally, post-hoc comparison results (Sec. 8.2) show that the empirical effectiveness of policies can be ordered as: WIG-det &gt; MDP-ECR, Random (Sec. 8.2). Therefore, our results suggest that a low correlation-based feature selection approach is more effective than other feature selection methods for RL.</p><p>There are several caveats in our experiments that provide enlightenment regarding future work. First of all, we retrospectively split students into Responsive vs. Unresponsive groups using response time because we do not fully understand why the differences between Responsive vs. Unresponsive groups exist. To answer such a question, we need to perform deep log analysis for our future work. Second, although we detect different performance among the different RL-induced policies, it is still not clear what makes them effective or why they are effective. Future work is needed to shed some light on understanding the induced policies and to compare the machine induced policies with existing learning theory. Third, we mainly compare the RL-induced policies with a Random policy in our experiments and it is not clear if the same results would hold if we compare them against a stronger baseline such as those used in previous research <ref type="bibr">(McLaren and Isotani, 2011;</ref><ref type="bibr">McLaren et al., 2014;</ref><ref type="bibr">Najar et al., 2014;</ref><ref type="bibr">Salden et al., 2010)</ref>. Finally, in this work, we selected a small set of features from 133 observable state features which severely limits the effectiveness of tabular MDP methods. Many of the relevant factors such as motivation, affect, and prior knowledge, cannot be observed directly nor are they described explicitly. On the other hand, Partially-observable MDPs (POMDPs) model unobserved factors by using a belief state space. Thus POMDPs for ITSs can explicitly represent two sources of uncertainty: non-determinism in the control process and partial observability of the students' knowledge levels. In the former case the outcome of the tutorial actions and the students' knowledge levels are represented by a probability distribution, and in the latter case, the underlying knowledge levels are observed indirectly via incomplete or imperfect observations. In short, using the belief state space gives POMDP two potential advantages over MDPs: better handling of uncertainty in the state representation, and the ability to incorporate a large range of state features. As a result, we believe that POMDPs will be more effective than tabular MDPs for ITSs.</p><p>Furthermore, previous work <ref type="bibr">(Renkl, 2002;</ref><ref type="bibr">Gerjets et al., 2006;</ref><ref type="bibr">Taylor et al., 2006;</ref><ref type="bibr">Atkinson et al., 2003)</ref> has shown that adding self-explain steps in WE and PS (prompting for selfexplanation) can significantly improve students learning. In the future, we will expand our research scope on not only WE vs. PS but also on whether or not to ask students to self-explain.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Journal of Educational Data Mining, Volume 10, No 3, 2018</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>Journal of Educational Data Mining, 10, No 3, 2018</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>Journal of Data Mining, Volume 10, No 3, 2018</p></note>
		</body>
		</text>
</TEI>
