<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Optimal Treatment Regimes: A Review and Empirical Comparison</title></titleStmt>
			<publicationStmt>
				<publisher>Wiley</publisher>
				<date>02/22/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10467921</idno>
					<idno type="doi">10.1111/insr.12536</idno>
					<title level='j'>International Statistical Review</title>
<idno>0306-7734</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Zhen Li</author><author>Jie Chen</author><author>Eric Laber</author><author>Fang Liu</author><author>Richard Baumgartner</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[A treatment regime is a sequence of decision rules, one per decision point, that maps accumulated patient information to a recommended intervention. An optimal treatment regime maximises expected cumulative utility if applied to select interventions in a population of interest. As a treatment regime seeks to improve the quality of healthcare by individualising treatment, it can be viewed as an approach to formalising precision medicine. Increased interest and investment in precision medicine has led to a surge of methodological research focusing on estimation and evaluation of optimal treatment regimes from observational and/or randomised studies. These methods are becoming commonplace in biomedical research, although guidance about how to choose among existing methods in practice has been somewhat limited. The purpose of this review is to describe some of the most commonly used methods for estimation of an optimal treatment regime, and to compare these estimators in a series of simulation experiments and applications to real data. The results of these simulations along with the theoretical/methodological properties of these estimators are used to form recommendations for applied researchers.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In clinical practice, treatment decisions are made over the progression of a patient's disease and thus may depend on baseline information as well as evolving treatment and outcome history. A treatment regime is a sequence of decision rules, one per decision point, that maps accumulated patient information to a recommended intervention <ref type="bibr">(Chakraborty &amp; Moodie, 2013;</ref><ref type="bibr">Murphy, 2003;</ref><ref type="bibr">Kosorok &amp; Moodie, 2015;</ref><ref type="bibr">Robins, 2004;</ref><ref type="bibr">Tsiatis et al., 2019)</ref>. An optimal treatment regime optimises expected cumulative utility when applied to select treatments in the population of interest (for other notions of optimality, see Remark 1). Thus, an optimal treatment regime improves the overall quality of healthcare by personalising intervention decisions to the uniquely evolving health status of each patient. Identification of an optimal treatment regime is closely tied to the problem of subgroup identification, where the goal is to identify patient characteristics that are associated with a favourable response to a given treatment <ref type="bibr">(Ballarini et al., 2018;</ref><ref type="bibr">Lipkovich et al., 2017)</ref>.</p><p>The potential to improve both the quality and efficiency of healthcare has generated tremendous interest in precision medicine. As treatment regimes are the primary mathematical formalisation of precision medicine in biomedical research, there has been a surge of methodological work focused on estimation of optimal treatment regimes from observational and randomised study data (reviews include <ref type="bibr">Chakraborty &amp; Moodie, 2013;</ref><ref type="bibr">Chakraborty &amp; Murphy, 2014;</ref><ref type="bibr">Clifton &amp; Laber, 2020;</ref><ref type="bibr">Kosorok &amp; Laber, 2019;</ref><ref type="bibr">Tsiatis et al., 2019)</ref>. Within the treatment regimes community, estimators are often classified into one of the following three groups: (i) regression-based, (ii) direct-search, or (iii) model-based planning. As with most coarse categorisations, these groupings are imperfect, and hybrid approaches exist that do not fit cleanly into one of these categories; nevertheless, it is useful to review these broad classes as they (or combinations thereof) cover the vast majority of existing estimators. Regression-based approaches estimate the optimal treatment regime using approximate dynamic programming, implemented through a series of regression models for the conditional mean at each stage, given patient history and treatment. The most widely used regression-based approach is Q-/A-learning <ref type="bibr">(Murphy, 2003;</ref><ref type="bibr">2005b;</ref><ref type="bibr">Qian &amp; Murphy, 2011;</ref><ref type="bibr">Robins, 2004;</ref><ref type="bibr">Schulte et al., 2014)</ref> along with its many variants <ref type="bibr">(Chakraborty et al., 2010;</ref><ref type="bibr">Ertefaie &amp; Strawderman, 2018;</ref><ref type="bibr">Ertefaie et al., 2021;</ref><ref type="bibr">Goldberg &amp; Kosorok, 2012;</ref><ref type="bibr">Moodie &amp; Richardson, 2010;</ref><ref type="bibr">Moodie et al., 2014;</ref><ref type="bibr">Laber et al., 2014;</ref><ref type="bibr">Leqi &amp; Kennedy, 2022;</ref><ref type="bibr">Song et al., 2015;</ref><ref type="bibr">Zhao et al., 2011;</ref><ref type="bibr">Zhou &amp; Kosorok, 2017)</ref>. Direct-search methods, also known as value-search methods, construct an estimator of the mean cumulative utility as a function of the treatment regime and then choose the maximiser over a pre-specified class of regimes as the estimated optimal regime. Early work on direct-search estimators includes marginal mean models <ref type="bibr">(Orellana et al., 2010a;</ref><ref type="bibr">2010b)</ref> and classification-based estimators <ref type="bibr">(Rubin &amp; van der Laan, 2012;</ref><ref type="bibr">Zhang et al., 2012;</ref><ref type="bibr">Zhao et al., 2012)</ref>. These methods have since been extensively studied and generalised (see <ref type="bibr">Huang et al., 2019;</ref><ref type="bibr">Laber &amp; Zhao, 2015;</ref><ref type="bibr">Luckett et al., 2020;</ref><ref type="bibr">Zhang &amp; Zhang, 2018;</ref><ref type="bibr">Zhao et al., 2019;</ref><ref type="bibr">Zhang &amp; van der Schaar, 2020, and references therein)</ref>. Model-based planning methods estimate the data-generating distribution and then use either estimating equations or Monte Carlo methods to identify an optimal regime. Within the statistics literature, g -computation is the primary model-based planning method <ref type="bibr">(Robins, 1986;</ref><ref type="bibr">Robins et al., 1997;</ref><ref type="bibr">Yu &amp; van der Laan, 2002)</ref> and underpins fully Bayesian approaches to estimation of an optimal treatment regime <ref type="bibr">(Guan et al., 2020;</ref><ref type="bibr">Lee et al., 2015;</ref><ref type="bibr">Saarela et al., 2015;</ref><ref type="bibr">Thall et al., 2000;</ref><ref type="bibr">Thall et al., 2007;</ref><ref type="bibr">Xu et al., 2016)</ref>. In the reinforcement learning literature, model-based planning methods are typically used when parsimonious models of the system dynamics are available (see <ref type="bibr">Ghavamzadeh et al., 2015;</ref><ref type="bibr">Sutton &amp; Barto, 2018</ref>, for a review and further references).</p><p>The large and diverse collection of methodologies to estimate an optimal treatment regime reflects the importance of precision medicine and the increasing maturity of the field. Nevertheless, the field is continuing to evolve and a thorough empirical study designed to evaluate if and when certain methods perform favourably is lacking. In this review, we provide a brief description of a subset of the methods introduced in the preceding paragraph and compare them in a series of simulation experiments and applications. Thus, our intent is to provide an accessible review together with some guidance about how to choose among these methods based on empirical evidence.</p><p>In Section 2, we set notation and formalise the notion of an optimal treatment regime. In Section 3, we expound a series of estimators of the optimal treatment regime. In Section 4, we present the results of our simulation experiments. Two real-world data applications are presented in Section 5. A discussion of the simulations and data applications is provided in Section 6, and final conclusions are given in Section 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Setup and Problem Formulation</head><p>Suppose that we are planning for the next K treatment decisions in the course of a patient's healthcare. We formalise such a treatment plan as a treatment regime: a sequence of K decision rules, one per treatment decision, each mapping up-to-date patient information to a recommended treatment <ref type="bibr">(Chakraborty &amp; Moodie, 2013;</ref><ref type="bibr">Murphy, 2003;</ref><ref type="bibr">Robins, 2004;</ref><ref type="bibr">Tsiatis et al., 2019)</ref>. We assume that the available data are composed of n patient trajectories, each an independent copy of S 1 ; A 1 ; S 2 ; A 2 ; &#8230;; S K ; A K ; Y f g , where S 1 &#8712; S 1 denotes baseline patient information; A k &#8712; A k denotes the treatment assigned at time point k &#188; 1; &#8230;; K; S k &#8712; S k denotes interim patient information collected during the course of treatment A k &#192; 1 for k &#188; 2; &#8230;; K ; and Y &#8712; &#8477; denotes the outcome of interest, measured after the final treatment A K , which is coded so that higher values correspond to more desirable health outcomes. The set of available treatments, A k ; k &#188; 1; &#8230;; K, is typically finite, although in some applications, for example, dose finding or treatment timing, this might be modelled as a continuum <ref type="bibr">(Chen et al., 2016;</ref><ref type="bibr">Guo &amp; Yuan, 2017;</ref><ref type="bibr">Laber &amp; Zhao, 2015;</ref><ref type="bibr">Nie et al., 2020;</ref><ref type="bibr">Schulz &amp; Moodie, 2020;</ref><ref type="bibr">Rich et al., 2014;</ref><ref type="bibr">Zhu et al., 2020)</ref>. Denote patient history by S k &#188; &#240;S 1 ; &#8230;; S k &#222; and treatment history by</p><p>is the information available to the decision maker at time k &#188; 1; &#8230;; K, where we define H 1 &#8801;S 1 and H 1 &#8801;S 1 for convenience. A treatment regime d &#188; &#240;d 1 ; &#8230;;</p><p>is a sequence of decision rules such that d k : H k &#8594;A k is a mapping from available patient information to a recommended treatment. Thus, under d, a patient with history H k &#188; h k at time k will be recommended the intervention d k &#240;h k &#222;. In general, the observed data are not collected under an optimal treatment regime (lest we could forgo the entire enterprise), and thus, we need additional structure that will allow for identification of the optimal regime in terms of the data-generating model (see <ref type="bibr">Luckett et al., 2017;</ref><ref type="bibr">Wallace et al., 2018</ref>, for a discussion of modelling clinician behaviour in observational data). We state these assumptions using the potential outcomes framework <ref type="bibr">(Rubin, 1978;</ref><ref type="bibr">Robins, 1986)</ref>.</p><p>Define S * k &#240;&#257; k &#192; 1 &#222; to be the potential covariate information under treatment sequence &#257; k &#192; 1 at time k &#188; 2; &#8230;; K and let Y * &#240;&#257; K &#222; be the potential outcome under treatment sequence &#257; K . The potential history at stage k under &#257; k &#192; 1 is, therefore,</p><p>. The set of all potential outcomes is</p><p>where we have defined S * 1 &#240;a 0 &#222;&#8801;S 1 for convenience. For any treatment regime</p><p>where 1&#240; &#8226; &#222; is the indicator function. For any regime, d &#8712; D, define V &#240;d&#222; &#188; &#58628;Y * &#240;d&#222; to be the mean outcome under d. The function V, termed the value function, is the primary measure of the performance of a regime. We say a regime,</p><p>This definition makes plain that our focus is on estimating an optimal regime within a given (pre-specified) class of regimes. This class may be chosen to be parsimonious, for example, linear regimes or those representable as trees or lists <ref type="bibr">(Laber &amp; Zhao, 2015;</ref><ref type="bibr">Lakkaraju &amp; Rudin, 2017;</ref><ref type="bibr">Qian &amp; Murphy, 2011;</ref><ref type="bibr">Sies &amp; Van Mechelen, 2017;</ref><ref type="bibr">Tao et al., 2018;</ref><ref type="bibr">Zhu et al., 2017;</ref><ref type="bibr">Zhang et al., 2018;</ref><ref type="bibr">Zhang et al., 2015)</ref>, or may be a large and flexible class, for example, a Reproducing Kernel Hilbert Space generated by a universal kernel or other machine learning methods <ref type="bibr">(Luedtke &amp; van der Laan, 2016b;</ref><ref type="bibr">Liu et al., 2018;</ref><ref type="bibr">Zhao et al., 2011;</ref><ref type="bibr">Zhao et al., 2012)</ref>. To characterise the optimal regime in terms of the data-generating distribution, we make the following assumptions:</p><p>C3) sequential ignorability, A k &#8869;WjH k for k &#188; 1; &#8230;; K. We further assume that there is no interference between units and that there are not multiple versions of treatment. These assumptions are standard and have been extensively studied in the literature. For additional discussion see <ref type="bibr">Tsiatis et al. (2019)</ref>.</p><p>Throughout, we assume for simplicity that all interventions in A k are allowable (feasible) for all patients at stage k; in general, this may not hold as the set of allowable treatments may depend on a patient's health status, for example, this set may depend on the presence of co-morbid conditions, patient preference, cost, presence/absence of a biomarker, etc. The methods presented here all generalise immediately to this more general setting (see <ref type="bibr">Tsiatis et al., 2019;</ref><ref type="bibr">van der Laan &amp; Petersen, 2007)</ref>.</p><p>Remark This review, in keeping with vast majority the precision medicine literature, defines an optimal regime as maximising the marginal mean outcome in the target population. However, the mean is not always the best measure of performance. One concern with using the mean is that it is sensitive to extreme observations, so that optimal regime might be heavily influenced by rare events. Another is that it may not align with the clinical goals in a particular application; for example, if the goal is to protect against poor outcomes, one might instead prefer to maximise a lower quantile of the outcome distribution. <ref type="bibr">(Linn et al., 2015;</ref><ref type="bibr">2017)</ref> and <ref type="bibr">Wang et al. (2018)</ref> consider estimating a regime which maximises the marginal median of the potential outcome distribution, that is, d opt m &#188; argmax D Median Y * &#240;d&#222; f g (see also <ref type="bibr">Kallus et al., 2019)</ref>. Recently, <ref type="bibr">Leqi &amp; Kennedy (2022)</ref> noted that d opt m has the rather undesirable property that d * m &#240;x&#222; can depend on the outcome distribution for patients with X &#8800; x. As an alternative, they propose to estimate the regime which maximises the average conditional median &#58628; Median Y * &#240;d&#222;jX f g f g . Other criterion used to define an optimal regime include maximising efficacy subject to a constraint on side-effects <ref type="bibr">(Linn et al., 2015;</ref><ref type="bibr">Laber et al., 2018;</ref><ref type="bibr">Wang et al., 2018)</ref> or resources <ref type="bibr">(Caniglia et al., 2021;</ref><ref type="bibr">Luedtke &amp; van der Laan, 2016c)</ref>; maximising choice <ref type="bibr">(Ertefaie et al., 2016;</ref><ref type="bibr">Fard &amp; Pineau, 2011;</ref><ref type="bibr">Laber et al., 2014;</ref><ref type="bibr">Lizotte &amp; Laber, 2016;</ref><ref type="bibr">Wu, 2016)</ref>; and maximising patient (latent) utility <ref type="bibr">(Butler et al., 2018;</ref><ref type="bibr">Luckett et al., 2020)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methods for Optimal Treatment Regies</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Model-based Planning</head><p>The first approach to estimating an optimal treatment regime that we consider involves modelling the complete trajectory distribution under any allowable treatment regime. From such a model, one can derive the marginal mean outcome under any regime d, that is, the value function, V &#240;d&#222; &#188; &#58628;Y * &#240;d&#222;, and consequently d opt &#188; argmax d &#8712; D V &#240;d&#222;. Using the g-computation formula <ref type="bibr">(Chakraborty &amp; Moodie, 2013;</ref><ref type="bibr">Robins, 1986;</ref><ref type="bibr">Tsiatis et al., 2019)</ref>, the marginal density of Y * &#240;d&#222; is obtained by first factoring the joint trajectory distribution under d and integrating out over the history. In particular, the density of Y * &#240;d&#222; at y is</p><p>where</p><p>Each of these densities is identifiable in the observed data and can be estimated using standard</p><p>methods, for example, maximum likelihood. Plugging these estimators into the integrand in (1) yields an estimator b f Y * &#240;d&#222;; n of f Y * &#240;d&#222; . The estimated optimal regime is thus b d n &#188; argmax d &#8712; D &#8747;y b f Y * &#240;d&#222;; n &#240;y&#222;dy . In many realistic applications, it is not possible to obtain b f Y * &#240;d&#222;; n in closed form as the integral in (1) is intractable. Instead, one must use Monte Carlo methods whereby one simulates trajectories under any candidate regime d and then averages the simulated outcomes to form an estimator of the associated marginal mean outcome. If D is complex, one may need to employ sophisticated stochastic optimisation methods to construct an estimator (Gosavi <ref type="bibr">et al., 2015)</ref>; we will not discuss such computational details here. Implementation of the g -computation formula requires positing models for each of the conditional densities in (1); thus, this approach is best suited to situations where there is either sufficient data to support flexible estimators of these densities or strong underlying clinical theory to inform parsimonious parametric models. When high-quality models can be constructed for these densities, g-computation can have good empirical performance (see <ref type="bibr">Chakraborty &amp; Moodie, 2013;</ref><ref type="bibr">Sutton &amp; Barto, 2018;</ref><ref type="bibr">and;</ref><ref type="bibr">Tsiatis et al., 2019</ref> for textbook-level reviews; for recent applications see; <ref type="bibr">Laber et al., 2018;</ref><ref type="bibr">Guan et al., 2020;</ref><ref type="bibr">and;</ref><ref type="bibr">Josefsson &amp; Daniels, 2020)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1">Model-based planning implementation and extensions</head><p>For the purpose of illustration we describe a simple implementation of model-based planning using location-scale models <ref type="bibr">(Carroll &amp; Ruppert, 1988)</ref> for the requisite conditional distributions and a non-parametric model for the distribution of the first stage history. The extension to more complex models is straightforward. We assume that</p><p>and &#1013; k &#8712; &#8477; p k is an independent additive error with mean zero and identity covariance. One can estimate &#952; k; 0 using (nonlinear) least squares, that is,</p><p>where jj &#8226; jj 2 denotes the Euclidean norm. Similarly, one can estimate &#952; k; 1 using</p><p>where jj &#8226; jj F denotes the Frobenius norm. Define the standardised empirical residuals b &#1013; k;</p><p>and define b f &#1013;; k; n to be the empirical distribution of the standardised empirical residuals, that is, b</p><p>Note that, in some settings, one might prefer to model the standardised residuals using a Gaussian copula or other suitable multivariate model.</p><p>For the conditional distribution of the outcome Y given H K and A K we posit a location-scale model of the form</p><p>where &#956; Y :</p><p>and &#951; Y &#8712; &#8477; is an independent additive error with mean zero and unit variance. Estimators b &#952; Y ; 0; n and b &#952; Y ; 1; n of &#952; Y ; 0 and &#952; Y ; 1 are constructed using (nonlinear) least squares as described above, that is,</p><p>For each i &#188; 1; &#8230;; n, define the standardised empirical residuals</p><p>the empirical distribution of these standardised residuals or some other suitable estimator of the distribution of &#951; Y . A draw from the estimated conditional distribution of Y given H K &#188; h K and A K &#188; a K , say &#7928;, is thus obtained by drawing &#951;Y from b f &#951;; Y ; n and setting &#7928; &#188; &#956; Y &#240;h K ; a K ; b &#952; Y ; 0; n &#222;&#254;&#963; Y &#240;h K ; a K ; b &#952; Y ; 1; n &#222;&#951; Y . Given estimated densities for the histories and outcome, we can now use Monte Carlo sampling with the g-computation formula to construct an estimator b V n &#240;d&#222; for any regime d. Let d be fixed, and let B be large positive integer. An estimator b V n &#240;d&#222; is constructed as follows: (a) Generate H &#240;1&#222; 1 ; &#8230;; H &#240;B&#222; 1</p><p>Finally, given a class of regimes D one can compute b</p><p>V n &#240;d&#222;. In two-stage decision problems, the use of mean-variance models can be combined with Q -learning (described below) to avoid estimating multivariate conditional densities for the histories <ref type="bibr">(Laber et al., 2014;</ref><ref type="bibr">Linn et al., 2017)</ref>. However, the use of location-scale models, while convenient, is not necessary. One alternative is to use non-parametric Bayesian models, which have shown strong empirical performance in a number of complex domains <ref type="bibr">(Xu et al., 2016;</ref><ref type="bibr">Laber et al., 2018;</ref><ref type="bibr">Guan et al., 2020)</ref>. For a survey of model-based planning approaches beyond biomedical applications, see <ref type="bibr">Polydoros &amp; Nalpantidis (2017)</ref>, <ref type="bibr">Wang et al. (2019)</ref>, and references therein.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Regression-based Methods</head><p>The underlying generative model is not needed to identify the optimal treatment regime. In this and the following subsections, we express the optimal regime using estimating equations that require modelling only part of the generative model. These approaches are desirable in settings where there is insufficient data or underlying theory to inform a high-quality model for patient trajectories or if one wants to reduce bias by imposing less structure on the data-generating model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1">Q-learning</head><p>Q-/A-learning are the most commonly used methods to estimate an optimal treatment regime in biomedical applications. We describe Q-learning first as it is easy to implement, flexible, extensible, and robust to some forms of model misspecification <ref type="bibr">(Schulte et al., 2014)</ref>. Q-learning is a regression-based approximate dynamic programming method that is derived from the Bellman optimality equations <ref type="bibr">(Murphy, 2005b;</ref><ref type="bibr">Sutton &amp; Barto, 2018;</ref><ref type="bibr">Clifton &amp; Laber, 2020)</ref>. Define the Q-function at stage K as</p><p>(2) so that Q K &#240;h K ; a K &#222; is the mean outcome for a patient who presents with history H K &#188; h K and is assigned intervention A K &#188; a K at stage K. The mean outcome for a patient presenting with history H K and who is assigned treatment according to decision rule d K is thus</p><p>g , and we define the optimal decision rule in D K at stage K to be</p><p>(3)</p><p>Thus, the optimal regime in D can be characterised through the Q-functions, which are identifiable in terms of the data-generating model. Furthermore, Equations (2) and (3) immediately suggest an estimator of d opt . First, one can construct an estimator b Q K; n of Q K by estimating the regression of Y on H K and A K and subsequently computing the plug-in estimator</p><p>(5) where &#8473; n denotes the empirical expectation operator. Using (3), one can construct an estimator</p><p>The estimator (6) allows for arbitrary regression models for the Q-functions, for example, these could be simple parametric models or more flexible machine learning models, which can be chosen independently from the class of regimes D (however, one should choose the regression model to be sufficiently expressive to capture the complexities in D , see <ref type="bibr">Zhang et al., 2012;</ref><ref type="bibr">Zhang et al., 2015;</ref><ref type="bibr">Zhang et al., 2018;</ref><ref type="bibr">Zhang &amp; Zhang, 2018)</ref>.</p><p>When the set of regimes is unrestricted, for example, if D k contains all measurable maps from</p><p>is the optimal decision rule at the last stage and, recursively for</p><p>This expression for the optimal regime suggests an alternative estimation procedure in which D is implicitly defined by the class of models posited for the Q-functions. That is, one first constructs an estimator b Q K; n of Q K by estimating the regression of Y on H K and A K as described above. The estimated optimal decision rule at stage K is then taken to be b</p><p>When estimation of the Q-functions is carried out this way, it is plain to see that the models for the Q-functions and the class of regimes under consideration are conflated. On one hand, this is natural as the Q-functions determine the optimal (unrestricted) regime and thus should determine the class of possible regimes. On the other hand, to generate new clinical insights or to inform actual clinical decision making it is often necessary that the estimated regimes be interpretable to domain experts. Thus, if one wants a parsimonious regime when taking this approach, one would need to posit a parsimonious set of Q-functions, which may raise concerns about model misspecification. These concerns have, at times, caused some confusion about Q-learning forcing an unpleasant choice between model misspecification (via parsimonious models) or unintelligible decision rules (via flexible models). However, as we have seen, these concerns are largely unjustified as one can specify a flexible class of models for the Q-functions while enforcing an interpretable or parsimonious class of regimes through specification of D (see <ref type="bibr">Zhang et al., 2012;</ref><ref type="bibr">Taylor et al., 2015;</ref><ref type="bibr">Zhang &amp; Zhang, 2018;</ref><ref type="bibr">Tsiatis et al., 2019)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2">Q-learning implementation and extensions</head><p>One of the most appealing aspects of Q-learning when the class of regimes is unrestricted (and regression-based procedures in general) is that it can be implemented easily using standard statistical software. All that is required for implementation is the ability to fit regression models and to maximise the fitted regression models over a class of decision rules. Of course, the difficulty associated with each of these steps depends on the complexity of the models and the class of decision rules. To fix ideas, we first describe linear Q-learning with a class of linear decision rules induced by the form of the Q -functions. Throughout, we assume that</p><p>denote a known feature mapping, for example, this may contain interaction terms or nonlinear basis expansions. We posit models of the form</p><p>We consider decision rules of the form</p><p>this class of decision rules and suppose that</p><p>The estimated optimal regime can thus be constructed using a series of regressions as follows. Define b &#946; K; n to be the least squares estimator</p><p>and subsequently define b</p><p>We now consider a more complex example in which the Q-functions are estimated using a non-parametric class and the class of decision rules is linear. We show that identifying the optimal decision rule in this context can be recast as a mixed integer program. Thus, it requires the use of specialised optimisation libraries and may become prohibitively expensive in large problems or when trying to optimise over all stages jointly (rather than using the one-stage-at-a-time heuristic considered here). In these cases, one can use stochastic search algorithms or weighted classification methods, which we also discuss briefly.</p><p>Let Q k ; k &#188; 1; &#8230;; K denote a class of models for Q k , for example, these could be neural networks, random forest, super learners <ref type="bibr">(Polley &amp; Van Der Laan, 2010)</ref>, or linear models with nonlinear basis functions. Consider linear decision rules as before, for example,</p><p>We describe the procedure at stage K as the procedure for stages</p><p>and for any decision rule</p><p>which can be seen to be a mixed integer program. Such problems can be solved to machine precision using specialised optimisation libraries such as CPLEX or XPRESS <ref type="bibr">(Anand et al., 2017;</ref><ref type="bibr">Rudin &amp; Ertekin, 2018)</ref>, but run-times can be excessive when the dimension is large (for example, see <ref type="bibr">Laber &amp; Murphy, 2011)</ref>. In cases where computation is not feasible, one can use stochastic search methods, for example, simulated annealing, genetic algorithms, or simultaneous perturbation <ref type="bibr">(Spall, 2005)</ref>.</p><p>Another approach to computing argmax &#947; K b V K; n &#240;&#947; K &#222; is to recast it as a weighted classification problem and then to use existing classification algorithms to approximate a solution <ref type="bibr">(Zhang et al., 2012;</ref><ref type="bibr">Zhang et al., 2013;</ref><ref type="bibr">Zhao et al., 2012;</ref><ref type="bibr">Zhang &amp; Zhang, 2018)</ref></p><p>g may be the class of linear decision rules as discussed above or it may be the set of rules representable as trees or decision lists. Write arg max</p><p>(7)</p><p>where we have used the fact that maximisation is equivalent to minimising the negative of the same objective and the fact that the term max &#8467; b Q K; n &#240;H K ; &#8467;&#222; does not affect the argmin as it does not depend on &#947; K . It can be seen that each term in ( <ref type="formula">7</ref>) is non-negative and is equal to zero when</p><p>&#8467;&#222;. Thus, one can view the minimisation problem in ( <ref type="formula">7</ref>) is as a weighted classification problem in which the 'correct class' at history H K &#188; h K is given by argmax &#8467; b Q K; n &#240;H K ; &#8467;&#222; (with an element chosen arbitrarily if the argmax is not a singleton) and the cost associated with predicting class a at</p><p>that is, the input is H K , the correct label is argmax &#8467; b Q K; n &#240;H K ; &#8467;&#222;, and the associated vector of</p><p>. This data can then be fed into any multi-class classification algorithm that accommodates sample weights (see <ref type="bibr">Zadrozny et al., Zadrozny et al.;</ref><ref type="bibr">Abe et al., 2004</ref>, and references therein for methods that can be used to make any classification algorithm cost-sensitive).</p><p>Remark Note that the classification dataset we constructed was only at the observed histories</p><p>; however, one could generate an arbitrary number of histories from any surrogate distribution, in particular, one might generate histories from the estimated density of H K computed using g-computation under the estimated optimal strategy. The reasons one might do this are twofold. First, the empirical distribution of H K is under the data-generating model and not the counterfactual distribution induced by the optimal decision rules at times k &#188; 1; &#8230;; K &#192; 1; consequently, optimising b</p><p>V n &#240;&#947; K &#222; over &#947; K needs not lead to a global optimiser unless the class of rules indexed by &#947; K includes the globally optimal rule h K &#8614;argmax &#8467; Q K &#240;h K ; &#8467;&#222;. Using the estimated distribution could restore global consistency if the g-computation model is correctly specified. Second, one can generate many more than n data points thereby potentially reducing variance <ref type="bibr">(Breiman &amp; Shang, 1996)</ref>. Note that using inverse probability weighting (IPW) or augmented IPW (AIPW) will re-weight the observed histories to ensure the expectation is taken with respect to the correct counterfactual distribution; however, because these estimators average over histories that are consistent at previous stages, they can quickly run out of data in multi-stage problems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.3">Other methods</head><p>We have reviewed a select subset of estimators that we felt illustrated key ideas in regression-based estimation of an optimal treatment regime. In this section, we catalogue several important estimators which were omitted from our detailed discussion along with references for further study.</p><p>A-learning. A-learning <ref type="bibr">(Murphy, 2003;</ref><ref type="bibr">Schulte et al., 2014</ref>) is a doubly-robust alternative to Q-learning. For simplicity, we consider A k &#188; f0; 1g and suppose that the set of regimes is unrestricted. In this setting, A-learning targets the so-called contrast function</p><p>A description of the A -learning estimator and discussion of its merits relative to Q-learning can be found in <ref type="bibr">Schulte et al. (2014)</ref>.</p><p>Targeted maximum likelihood. van der Laan &amp; Luedtke (2014) proposed a regression method to estimate the blip function by super learning, which is an ensemble machine learning approach. In their setting, the decision rule d k depends on H k only through &#240;X k ; A k &#192; 1 &#222;, where X k is a summary function of H k . This approach can consider a more interpretable class of regimes, and it is sometimes more computationally efficient to estimate the optimal treatment regime within the restricted class. For convenience, we consider A &#188; f0; 1g and the problem of a single decision point, that is, K &#188; 1. In this case, the blip function is defined as</p><p>, where x 1 is a summary function of h 1 . The authors proposed a loss function for the blip function based on IPW and performed regression by applying super learners (see Section 3 of van der Laan &amp; Luedtke, 2014). Under this approach, the estimated optimal decision rule is b</p><p>, where b B n is the estimator of the blip function. The method is shown to possess the double-robustness property, that is, if either Q k &#240;h k &#222; or P&#240;a k jh k &#222; is correctly specified, the proposed loss function is shown to be valid in that the true blip function is the minimiser of the loss function.</p><p>Robust Q-learning. <ref type="bibr">Ertefaie et al. (2021)</ref> proposed a variant of Q-learning, termed robust Q-learning, which posits a parametric model for the contrasts C k &#240;h k &#222; and non-parametric models for the main effects</p><p>Because non-parametric models are used for the main effects, robust Q-learning is less prone to misspecification than fully-parametric Q-learning. Using sample splitting, <ref type="bibr">Ertefaie et al. (2021)</ref> obtain asymptotic normality for the coefficients indexing the contrast functions even if the estimated main effects converge at sub-parameteric rates. Robust Q-learning is an example of double-machine learning which has become an extremely active area of research in semi-parametric methods as of late (see <ref type="bibr">Chernozhukov et al., 2018;</ref><ref type="bibr">Semenova &amp; Chernozhukov, 2020;</ref><ref type="bibr">Kennedy, 2022, and references therein)</ref>.</p><p>Remark In this review, we have primarily focused on settings in which one had available a parsimonious representation of the history. However, in some settings, one may wish to tailor treatment using genetic, functional, imaging, or other high-dimensional data <ref type="bibr">(McKeague &amp; Qian, 2011;</ref><ref type="bibr">2014;</ref><ref type="bibr">Ciarleglio et al., 2016;</ref><ref type="bibr">Laber &amp; Staicu, 2018;</ref><ref type="bibr">Ciarleglio et al., 2018)</ref>. In such settings, one can apply penalised variants of many of the methods discussed in this review by simply adding a sparse penalty, for example, a lasso-or elastic net-type penalty to the estimation procedure. Examples include penalised Q-learning <ref type="bibr">(Qian &amp; Murphy, 2011;</ref><ref type="bibr">Zhu et al., 2019)</ref>, A-learning <ref type="bibr">(Shi et al., 2018)</ref>, and value-search <ref type="bibr">(Song et al., 2015)</ref>. One interesting exception is the penalised Q-learning method of <ref type="bibr">Song et al. (2015)</ref> which shrinks the treatment contrast in Q-learning rather than the coefficients themselves which induces a kind of fused-lasso effect (see also <ref type="bibr">Chakraborty et al., 2010;</ref><ref type="bibr">Moodie &amp; Richardson, 2010)</ref>. In our simulation study, we examine Q-learning with a lasso penalty.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Value-Search</head><p>The Q -function representation given in Section 3.2 characterises the optimal regime recursively through the so-called optimal Q -functions (the qualifier 'optimal' reflects the fact that the Q-functions are maximised at each step). However, with only slight modification, this recursion can be used to identify the map d&#8614;V &#240;d&#222; and subsequently to construct an estimator d&#8614; b</p><p>V n &#240;d&#222; from which we obtain an estimated optimal regime b d n &#188; argmax d&#8712;D b</p><p>V n &#240;d&#222;. Estimated optimal regimes constructed by maximising an estimator b V n of V over a class of regimes are known as value-search or policy-search estimators <ref type="bibr">(Sutton &amp; Barto, 2018;</ref><ref type="bibr">Tsiatis et al., 2019)</ref>. The form of a value-search estimator is appealing in that it mimics the optimality criterion that defines an optimal regime. Furthermore, given a consistent estimator b</p><p>V n of V, using a value-search estimator directly targets the best regime in the class D regardless of whether this class contains the optimal regime within the class of all measurable maps from history to actions (see <ref type="bibr">Qian &amp; Murphy, 2011</ref>, for further discussion).</p><p>In the remainder of this section, we describe how to use Q-functions to obtain an estimator b V Q n &#240;d&#222; of any regime. We then describe an alternative formulation using inverse probability weighting (IPW), which does not require models for the Q-functions. Next, we discuss augmented inverse probability weighting (AIPW), which combines these estimators and enjoys the double-robustness property. Finally, we introduce some other common value-search approaches such as C-learning and DR-IPCW. We focus on two-stage problems K &#188; 2 and binary treatments coded so that A k &#188; 0; 1 f g for k &#188; 1; 2; the fully general case is described in <ref type="bibr">Tsiatis et al. (2019)</ref>.</p><p>Let d &#188; &#240;d 1 ; d 2 &#222; be an arbitrary regime. As in Section 3.2, define the terminal (second stage)</p><p>Rather than maximising, we simply plug-in the second stage decision rule to obtain the first stage Q-function under second stage decision rule d 2 as follows</p><p>is the expected outcome for a patient presenting with H 1 &#188; h 1 , assigned treatment A 1 &#188; a 1 and then treated with d 2 at the second stage. The value of d is, therefore,</p><p>As mentioned previously, one is free to select regression models that are deemed appropriate based on domain science and/or exploratory analyses.</p><p>While flexible models can be used for the Q-functions, choosing such models may not always be straightforward and numerous authors have expressed concerns about misspecification (e.g. <ref type="bibr">Zhao et al., 2012</ref>). An alternative approach, which we now review, is to use AIPW, which is consistent as long as either the treatment assignment mechanism (the propensity scores) or the Q-functions are correctly specified. Furthermore, such estimators enjoy a number of important theoretical advantages when flexible models are used for the propensities or the Q-functions <ref type="bibr">(Chernozhukov et al., 2017;</ref><ref type="bibr">Zhang &amp; Zhang, 2018)</ref>.</p><p>For each a k and history h k , define the propensity score</p><p>denote the estimated propensities, for example, these could be constructed using multinomial logistic regression. The inverse probability weighted estimator (IPWE) of the V &#240;d&#222; is</p><p>which can be seen to be the Horvitz-Thompson estimator of the mean outcome under d <ref type="bibr">(Horvitz &amp; Thompson, 1952)</ref>. One could construct an estimator of the optimal treatment regime via</p><p>However, the IPWE is highly unstable as it only uses a small fraction of the observed data, for example, if data were collected in a two-stage binary treatment sequential multiple assignment randomised trial (SMART <ref type="bibr">Murphy, 2005a)</ref> then only a quarter of the observed data would be used in the IPWE (in expectation). For this reason, it is common to augment the IPWE with a regression term to gain efficiency. The Augmented IPWE (AIPWE) of V &#240;d&#222; is given by b</p><p>where b Q 1; n and b Q 2; n are the estimated Q-functions as described previously. The estimated optimal regime is given by b</p><p>The AIPWE is doubly-robust in that it is consistent if either the propensity scores or Q-functions are correctly specified and furthermore attains the semi-parametric efficiency bound when both are correct (for reviews, see <ref type="bibr">Tsiatis (2007)</ref>, <ref type="bibr">Molenberghs et al. (2014)</ref>, <ref type="bibr">&amp; Tsiatis et al. (2019)</ref> and for some interesting recent developments see <ref type="bibr">Luedtke et al. (2017)</ref> and references therein).</p><p>C-learning <ref type="bibr">(Zhang &amp; Zhang, 2018</ref>) is another value-search method, which can be viewed as a weighted classification problem considering A k &#188; f0; 1g. The loss function is defined as a 0-1 loss multiplied by a weight of the contrast function, and the optimal decision rule minimises the weighted loss:</p><p>where C k is the contrast function given in (8), k &#188; 1; 2. Estimation for d opt k reduces to constructing an estimator b C k; n of C k , which can be carried out using regression as shown in <ref type="bibr">Zhang et al. (2012)</ref>. The estimated optimal decision rule at stage k is then given by</p><p>The method of double robust inverse probability of treatment and censoring weighting (DR-IPCW) is an alternative value-search method (van der Laan &amp; Luedtke, 2014). The authors proposed a loss function for d &#8712; D with expectation equaling to &#192;&#58628;Y * &#240;d&#222; if either Q k &#240;h k ; a k &#222; or P&#240;a k jh k &#222; is correctly specified. Thus, estimating the optimal decision rule d opt is equivalent to minimising the loss function over D.</p><p>Other direct optimisation methods include targeted learning (van der Laan &amp; Rose, 2011; 2018), decision lists <ref type="bibr">(Zhang et al., 2018)</ref>, outcome weighted learning <ref type="bibr">(Zhao et al., 2015)</ref>, quantile learning <ref type="bibr">(Wang et al., 2018)</ref> and marginal mean models <ref type="bibr">(Orellana et al., 2010a;</ref><ref type="bibr">2010b)</ref>. Because the majority of the above methods directly estimate the expected outcome across a class of regimes, an estimator of &#58628;Y * &#240;d opt &#222; can also be obtained by applying these methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1">Value-search implementation and extensions</head><p>Implementation details in the estimation of the Q-functions has already been discussed in Section 3.2.2.; thus, the remaining considerations are (i) estimation of the propensity scores and (ii) optimisation over the class of candidate regimes. Estimation of the propensity scores can be carried out using any classification algorithm that produces probability estimates. Logistic regression is often used in practice although one could use kernel-methods, neural networks, or boosting (possibly with Platt scaling, etc.) There is currently a rich line of research on the use of sample-splitting to obtain ffiffi ffi n p -consistent and asymptotically normal estimators when flexible (e.g. machine learning) models are used to estimate the propensity score and the Q-functions <ref type="bibr">(Naimi &amp; Kennedy, 2017;</ref><ref type="bibr">Chernozhukov et al., 2017;</ref><ref type="bibr">Chernozhukov et al., 2018;</ref><ref type="bibr">Semenova &amp; Chernozhukov, 2020)</ref>; however, the development of these methods is ongoing and rather technical so we do not include them in the present review.</p><p>Given an estimator b V n &#240;d&#222; , for example, constructed using (A)IPWE or Q -learning, one must then construct an approximate maximiser of d&#8614; b</p><p>V n &#240;d&#222; over a class of regimes d &#8712; D . In cases where D comprises parametric regimes, that is,</p><p>, where B is a compact subset of Euclidean space, one can use stochastic search methods to approximate a solution. Let</p><p>n denote a starting value, for example, one might construct a warm-start b &#946; &#240;0&#222; n using the ad hoc iterative Q-learning procedure described in the preceding section. A simple stochastic gradient descent algorithm uses updates of the form</p><p>for k &#8805; 1, where Z &#240;1&#222; ; Z &#240;2&#222; ; &#8230;, are independent random vectors with mean zero and identity covariance matrix and &#955; &#240;1&#222; ; &#955; &#240;2&#222; ; &#8230;, are (possibly data-dependent) tuning parameters that govern the learning rate <ref type="bibr">(Spall, 2005;</ref><ref type="bibr">Bottou, 2010)</ref>.</p><p>If the class D is comprised of trees or lists, one can use greedy heuristics analogous to those used in classification and regression trees <ref type="bibr">(Loh, 2014;</ref><ref type="bibr">Laber &amp; Zhao, 2015;</ref><ref type="bibr">Zhang et al., 2018)</ref>. For kernel-based decision rules, <ref type="bibr">(Zhao et al., 2015)</ref> use convex surrogates for b</p><p>V n to approximate iterative (stage-wise) and simultaneous (joint) optimal decision rules. For more complex classes of treatment regimes, one can use simulated annealing, genetic algorithms, or other heuristics to approximate a solution <ref type="bibr">(Zhang et al., 2013)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Simulation Studies</head><p>In this section, we investigate the performance of some of the methods described in the preceding section using Monte Carlo simulations. We explore their performance in estimating the optimal treatment regime as well as their performance in estimating the expected outcome under the optimal treatment regime. The simulation setup is similar to <ref type="bibr">Schulte et al. (2014)</ref> and van der Laan &amp; Luedtke (2014); additional details are provided in the next section.</p><p>For the estimation of the optimal treatment regime, we consider the following methods: (1) Q -learning with multiple linear regression (QMR), (2) penalised QMR, (3) Q-learning with random forests (QRF, <ref type="bibr">Taylor et al., 2015)</ref>, (4) A-learning, (5) targeting the blip function (van der Laan &amp; Luedtke, 2014), ( <ref type="formula">6</ref>) targeting the regret function <ref type="bibr">(Murphy, 2003)</ref>, (7) DR-IPCW (van der Laan &amp; Luedtke, 2014), (8) C -learning <ref type="bibr">(Zhang &amp; Zhang, 2018)</ref>, (9) backward outcome weighted learning (BOWL) <ref type="bibr">(Zhao et al., 2015)</ref>, (10) decision lists <ref type="bibr">(Zhang et al., 2018)</ref> and (11) model-based planning. In terms of the three broad categories we introduced previously:</p><p>(1)-( <ref type="formula">6</ref>) are regression-based methods; (7)-( <ref type="formula">10</ref>) are direct/value-search methods; and ( <ref type="formula">11</ref>) is (of course) a model-based planning method. We measure the performance of these estimators using V-efficiency as defined in <ref type="bibr">Schulte et al. (2014)</ref>; that is,</p><p>where b d n denotes the estimated optimal treatment regime constructed from data on n patients. V-efficiency captures the extent to which b d n achieves the value of the true optimal regime. Higher values of R&#240; b d n &#222; indicate better performance. When presenting results, we scale the V-efficiency by 100 to obtain the percentage of &#58628;Y * &#240;d opt &#222; obtained by b d n on average. In addition to V-efficiency, we also calculate the expected potential outcome &#58628;Y * &#240; b d n &#222; and the average agreement with the optimal treatment, that is,</p><p>, for each method.</p><p>We next evaluate several methods for estimating V &#240;d opt &#222;; this quantity is of interest when one wishes to assess the value of applying a personalised treatment strategy in a given domain <ref type="bibr">(Laber et al., 2016;</ref><ref type="bibr">Rose et al., 2019)</ref>. Thus, one must construct an estimator b d n and evaluate it via b</p><p>V n &#240; b d n &#222;. To estimate the expected outcome under the estimated optimal treatment regime, we consider (1) Q-learning with multiple linear regression (QMR), (2) Q-learning with random forests (QRF), (3) A-learning, (4) targeting the regret function, (5) targeted learning (TMLE) (van der <ref type="bibr">Laan &amp; Rubin, 2006)</ref>, and (6) inverse probability treatment weighting (IPTW) (van der Laan &amp; Rose, 2018). For TMLE and IPTW, we first use DR-IPCW to estimate the optimal treatment regime (because of its superior performance in estimating the optimal treatment regime) and then apply these two methods to estimate the expected outcome under the estimated optimal treatment regime. For the measure of performance, we consider the mean squared error</p><p>Lower MSE corresponds to better performance. In Appendix D1, we also show results on regime evaluation in which we compare estimators b V n &#240;d&#222; of V &#240;d&#222; for fixed regimes d.</p><p>Most of the methods considered here require that the Q-function Q k &#240;h k ; a k &#222; be estimated. With the exceptions of QMR and model-based planning, we estimate Q k &#240;h k ; a k &#222; using a random forest (we use the default settings in the randomForest package in R). For QMR, we estimate Q k &#240;h k ; a k &#222; using linear regression on the main effects, that is, h k ; a k , and the first-order interaction terms between treatment and history. We considere L 2 penalisation terms in penalised QMR. For the method of model-based planning, we estimate the Q-and V-functions using Monte Carlo methods (see details in Section 3). We assume that the transition and conditional outcome density follow normal/binomial distributions with linear mean models and constant covariance-matrices which are estimated using maximum likelihood (additional details about the working models are given in Appendix B1). Thus, model-based planning coincides with QMR when K &#188; 1.</p><p>Under the scenarios for which the propensity is unknown, we estimate P&#240;a k jh k &#222; using logistic regression on the main terms of h k . Note that not all methods require the estimation of P&#240;a k jh k &#222;. For methods such as C-learning and BOWL, for which the class of regimes D k must be computationally tractable (e.g. parametric), we consider D k to be a class of linear regimes. For all others, the class of regimes is unrestricted, that is, D k contains all measurable maps from H k into A k ; k &#188; 1; &#8230;; K. We use Q-learning with correctly specified models as the gold standard.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Data Generating Models</head><p>We consider the scenarios where continuous and binary outcomes are measured after K &#188; 1; 2; 4; 8 decision points. For each scenario, we take the number of patients to be n &#188; 250; 500; 1000. The available treatments at each decision point are binary. We perform 500 Monte Carlo runs for each scenario.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1">Continuous outcome</head><p>First, we consider a continuous outcome in K &#188; 1; 2; 4; 8 decision points. For K &#188; 1; 2, we use the data-generating models shown below, which are similar to those used in <ref type="bibr">Schulte et al. (2014, Section 6)</ref>. The data-generating models for K &#188; 4; 8 are more cumbersome and therefore relegated to Appendix A1.</p><p>In the case of one decision point, suppose that we observe i.i.d. data</p><p>where &#981; 0 1 &#188; &#240;0; &#192;2; 1&#222;; &#946; 0 1 &#188; &#240;1; 1; 5; 1&#222;; &#968; 0 1 &#188; &#240;1; 0:5&#222; and expit&#240;x&#222; &#188; e x =&#240;1 &#254; e x &#222;. In the case of two decision points, suppose we observe i.i.d. data &#240;S 1i ;</p><p>where m&#240;s 1 ; s</p><p>22 s 2 &#222;; &#981; 0 1 &#188; &#240;0:3; 0:5&#222;; &#948; 0 1 &#188; &#240;0; 0:5; &#192;0:75; 0:25&#222;; &#981; 0 2 &#188; &#240;0; 0:5; 0:1; &#192;1; &#192;0:1; 1&#222;; &#946; 0 2 &#188; &#240;3; 0; 0:1; &#192;0:5; &#192;0:5; 5&#222;, and &#968; 0 2 &#188; &#240;1; 0:25; 0:5&#222;. The true value of the expected outcome under the optimal treatment regime (&#968;) for one and two decision points are 10.009 and 16.222, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2">Binary outcome</head><p>Here, we consider a binary outcome in either one or two decision points using data-generating models similar to those used in van der Laan &amp; Luedtke (2014).</p><p>In the case of one decision point, suppose that we observe i.i.d. data</p><p>In the case of two decision points, suppose that we observe i.i.</p><p>d. data &#240;S 1i ; A 1i ; S 2i ; A 2i ; Y i &#222;; i &#188; 1; &#8230;; n with S 11 ; S 12 &#8764; iid Unif &#240;&#192;1; 1&#222;; A 1 jS 11 ; S 12 &#8764; Bernoulli&#240;0:5&#222;; U 1 ; U 2 jA 1 ; S 11 ; S 12 &#8764; iid Unif &#240;&#192;1; 1&#222;; S 21 jS 11 ; S 12 ; A 1 &#8764; U 1 &#240;1; 25A 1 &#254; 0:25&#222;; S 22 jS 11 ; S 12 ; A 1 ; S 21 &#8764; U 2 &#240;1; 25A 1 &#254; 0:25&#222;; A 2 jS 11 ; S 12 ; A 1 ; S 21 ; S 22 &#8764; Bernoulli&#240;0:5&#222;; Y jS 11 ; S 12 ; A 1 ; S 21 ; S 22 ; A 2 &#8764; Bernoullif0:4 &#254; 0:069b&#240;S 11 ; S 12 ; A 1 ; S 21 ; S 22 ; A 2 &#222;g; where b&#240;S 11 ; S 12 ; A 1 ; S 21 ; S 22 ; A 2 &#222; &#188; 0:5A 1 &#189;&#192;0:8 &#192; 3fsign&#240;S 11 &#222;&#254;S 11 g &#192; S 2 12 &#254;A 1 f&#192;0:35 &#254; &#240;S 21 &#192; 0:5&#222; 2</p><p>The true value of the expected outcome under the optimal treatment regime (&#968;) for one and two decision points are 0.561 and 0.487, respectively. For the regression-based methods, we employ the logistic regression to model the binary outcome and for the model-based planning, we assume the conditional outcome follows a binomial distribution with a mean which is linear in the history.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Simulation Results</head><p>Table <ref type="table">1</ref> displays the values of R&#240; b d n &#222; for each method when the outcome is continuous. In this case, we provide results when the propensity is either known or unknown, thus reflecting data from randomised and observational studies. It can be seen that the non-parametric methods, that is, Blip, DR-IPCW and QRF, typically have better predictive performance compared with (semi-)parametric methods (QMR, Planning, A-learning, C-learning, Regret, BOWL, Decision lists), and their performance is close to the gold standard. The performance of BOWL is superior when K &#188; 1 but degrades significantly when K &#188; 2; 4; 8. Because Q-learning and decision lists do not require estimating the propensity score P&#240;a k jh k &#222;, their results do not depend on whether or not the propensity score is known. It is interesting to note that a few methods (e.g. DR-IPCW for K &#188; 1; 2 ) perform better when the propensity score is unknown (such behaviour is anticipated by known results in semi-parametric theory <ref type="bibr">Tsiatis, 2007)</ref>. Table <ref type="table">2</ref> and 3 show the expected potential outcomes and the percentage of selecting the optimal treatment under various methods, which are consistent with the observation of R&#240; b d n &#222;. Table <ref type="table">4</ref> displays the MSE of the estimates for the expected outcome under the optimal treatment regime when the outcome is continuous. We see that non-parametric methods like Q-learning with random forest (QRF) always outperform (semi-)parametric methods. Moreover, the methods perform better when propensity scores P&#240;a k jh k &#222; are known.</p><p>For binary outcomes, we assume only known propensity scores</p><p>and MSE are presented in Tables <ref type="table">5</ref><ref type="table">6</ref><ref type="table">7</ref><ref type="table">8</ref>, respectively. Similar to the results for continuous outcomes, we can see that the predictive performance of the non-parametric methods is most often superior to that of the (semi-)parametric methods in terms of R&#240; b <ref type="table">1</ref><ref type="table">2</ref><ref type="table">3</ref><ref type="table">4</ref><ref type="table">5</ref><ref type="table">6</ref><ref type="table">7</ref><ref type="table">8</ref>, we summarise our findings in Table <ref type="table">9</ref>, where the methods are ranked based on their performance. We see that non-parametric methods like Blip, DR-IPCW and QRF have higher rankings compared with (semi-)parametric methods when estimating the optimal treatment regime for both when the propensity scores are known and unknown. For the estimated outcome under the optimal regime, QRF consistently outperforms the other methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Real Data Applications</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">HIV Data Analysis</head><p>In this section, we apply several dynamic treatment regime methods to the ACTG175 data set, which is available through the R package speff2trial <ref type="bibr">(Juraska &amp; Juraska, 2010)</ref>. The data set consists of 2139 patients infected with HIV who were randomised to one of four treatment arms: (i) zidovudine (AZT) monotherapy, (ii) AZT+didanosine (ddI), (iii) AZT+zalcitabine (ddC) and (iv) ddI monotherapy. The outcome measure is the CD4 T-cell count at 96 &#177; 5 weeks as CD4 count represents a vital signal for disease progression in HIV-infected patients. A basic</p><p>Table 1. The value of R&#240; b d n &#222; (standard error) under the settings in Section 4.1.1 for observational (Obs.) data type with unknown propensity scores and experimental (Exp.) data type with known propensity scores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data</head><p>Obs. QMR 97.0 (0.1) 94.0 (0.2) 97.0 (0.1) 94.5 (0.1) 97.1 (0.1) 95.1 (2.2) Penalised QMR 97.0 (0.1) 94.0 (0.2) 96.9 (0.1) 94.6 (0.1) 96.8 (0.1) 95.1 (0.1) QRF 98.9 (0.1) 96.2 (0.3) 99.1 (0.1) 97.0 (0.2) 99.1 (0.1) 97.8 (0.1) Planning 97.0 (0.1) 94.0 (0.2) 97.0 (0.1) 94.6 (0.1) 97.1 (0.1) 95.2 (0.1) A-learning 98.4 (0.1) 96.1 (0.3) 98.2 (0.1) 97.1 (0.2) 98.2 (0.1) 98.2 (0.1) C-learning 99.6 (0.1) 87.5 (0.2) 99.7 (0.1) 88.2 (0.1) 99.8 (0.1) 88.8 (0.1) BOWL 99.9 (0.1) 83.9 (0.1) 99.8 (0.1) 83.8 (0.1) 99.9 (0.1) 84.0 (0.1) Regret 94.2 (0.3) 85.4 (0.2) 94.0 (0.2) 85.7 (0.2) 93.6 (0.1) 86.0 (0.1) Blip 99.9 (0.1) 97.1 (0.3) 99.9 (0.1) 98.0 (0.2) 99.9 (0.1) 99.0 (0.1) DR-IPCW 99.9 (0.1) 97.2 (0.3) 99.9 (0.1) 98.2 (0.1) 99.9 (0.1) 98.9 (0.1) Decision list 93.5 (0.1) 85.8 (0.3) 93.5 (0.1) 85.2 (0.1) 93.4 (0.1) 84.5 (0.1) Exp. QMR 97.0 (0.1) 94.0 (0.2) 97.0 (0.1) 94.5 (0.1) 97.1 (0.1) 95.1 (0.1) Penalised QMR 97.0 (0.1) 94.0 (0.2) 96.9 (0.1) 94.6 (0.1) 96.8 (0.1) 95.1 (0.1) QRF 98.9 (0.1) 96.2 (0.3) 99.1 (0.1) 97.0 (0.2) 99.1 (0.1) 97.8 (0.1) Planning 97.0 (0.1) 94.0 (0.2) 97.0 (0.1) 94.6 (0.1) 97.1 (0.1) 95.2 (0.1) A-learning 97.4 (0.1) 92.8 (0.4) 97.9 (0.1) 94.2 (0.2) 98.5 (0.1) 95.5 (0.1) C-learning 98.9 (0.1) 87.0 (0.2) 99.1 (0.1) 87.3 (0.1) 99.4 (0.1) 87.8 (0.1) BOWL 99.9 (0.1) 83.6 (0.1) 99.8 (0.1) 83.6 (0.1) 99.9 (0.1) 83.6 (0.1) Regret 95.0 (0.3) 92.7 (0.4) 94.9 (0.2) 93.0 (0.2) 94.9 (0.1) 92.7 (0.2) Blip 98.5 (0.1) 95.8 (0.3) 99.2 (0.1) 97.3 (0.2) 99.3 (0.1) 98.1 (0.1) DR-IPCW 98.5 (0.1) 96.3 (0.3) 99.3 (0.1) 97.4 (0.2) 99.5 (0.1) 98.1 (0.1) Decision list 93.5 (0.1) 85.8 (0.3) 93.5 (0.1) 85.2 (0.2) 93.4 (0.1) 84.5 (0.1) Q-learning (corr)</p><p>99.8 (0.1) 96.4 (0.3) 99.9 (0.1) 97.6 (0.1) 99.9 (0.1) 98.4 (0.1)</p><p>Obs. QMR 80.8 (3.1) 85.0 (2.2) 84.5 (2.1) 83.3 (1.4) 82.5 (1.4) 80.5 (1.1) Penalised QMR 79.7 (3.0) 82.6 (2.0) 84.1 (2.1) 82.3 (1.4) 82.3 (1.4) 79.8 (1.1) QRF 89.1 (2.9) 93.4 (1.9) 94.9 (1.9) 94.1 (1.1) 93.1 (1.3) 98.0 (0.9) Planning 76.2 (3.1) 75.4 (1.9) 75.7 (2.2) 71.4 (1.4) 70.6 (1.5) 72.4 (1.0) A-learning 73.3 (3.2) 63.5 (2.5) 81.3 (2.2) 63.0 (1.8) 80.0 (1.4) 63.1 (1.3) C-learning 83.7 (3.1) 85.8 (1.9) 88.9 (2.1) 88.7 (1.3) 85.3 (1.5) 90.2 (0.9) BOWL 63.6 (2.0) 58.3 (1.2) 63.0 (1.4) 58.7 (0.8) 63.3 (1.0) 58.8 (0.6) Regret 62.7 (3.2) 67.3 (2.1) 66.8 (2.2) 65.4 (1.4) 64.8 (1.5) 63.9 (1.1) Blip 82.3 (0.2) 80.9 (0.2) 85.7 (0.1) 83.8 (0.1) 90.9 (0.1) 86.7 (0.1) DR-IPCW 82.4 (0.2) 81.0 (0.2) 85.9 (0.1) 83.8 (0.1) 91.0 (0.1) 86.6 (0.1) Decision list 74.8 (2.8) 79.6 (1.9) 71.6 (1.9) 78.7 (1.5) 69.8 (1.4) 75.2 (1.1) Exp. QMR 80.8 (3.1) 85.0 (2.2) 84.5 (2.1) 83.3 (1.4) 82.5 (1.4) 80.5 (1.1) Penalised QMR 79.7 (3.0) 82.6 (2.0) 84.1 (2.1) 82.3 (1.4) 82.3 (1.4) 79.8 (1.1) QRF 89.1 (2.9) 93.4 (1.9) 94.9 (1.9) 94.1 (1.1) 93.1 (1.3) 98.0 (0.9) Planning 76.2 (3.1) 75.4 (1.9) 75.7 (2.2) 71.4 (1.4) 70.6 (1.5) 72.4 (1.0) A-learning 72.3 (3.3) 65.2 (2.5) 79.2 (2.2) 63.4 (1.8) 80.3 (1.5) 64.0 (1.3) C-learning 85.1 (3.0) 84.8 (1.9) 85.5 (2.1) 86.6 (1.4) 91.9 (1.5) 89.5 (1.0) BOWL 63.6 (2.2) 55.9 (1.4) 63.2 (1.5) 55.4 (1.0) 65.0 (1.1) 56.2 (0.7) Regret 62.7 (3.2) 67.3 (2.1) 66.8 (2.2) 65.4 (1.4) 64.8 (1.5) 63.9 (1.1) Blip 82.4 (0.3) 81.7 (0.2) 85.9 (0.1) 83.8 (0.1) 91.0 (0.1) 86.6 (0.1) DR-IPCW 82.5 (0.3) 81.8 (0.2) 85.8 (0.1) 83.9 (0.1) 91.1 (0.1) 86.5 (0.1) Decision list 74.8 (2.8) 79.6 (1.9) 71.6 (1.9) 78.7 (1.5) 69.8 (1.4) 75.2 (1.1) Q-learning (corr) 99.6 (2.8) 99.9 (1.8) 99.9 (1.9) 99.3 (1.2) 99.6 (1.3) 97.8 (0.9)</p><p>conclusion from the study is that for patients who had taken AZT before entering the trial, treatment with ddI or AZT+ddI were better than continuing to take AZT alone (see <ref type="bibr">Hammer et al., 1996,</ref> for more information about the study). In this analysis we expand on this</p><p>Table 2. The potential outcomes under the settings in Section 4.1.1 for observational (Obs.) data type with unknown propensity scores and experimental (Exp.) data type with known propensity scores.</p><p>Obs. QMR 9.7 (0.2) 15.2 (0.5) 9.7 (0.1) 15.3 (0.5) 9.7 (0.1) 15.4 (0.4) Penalised QMR 9.7 (0.2) 15.2 (0.5) 9.7 (0.1) 15.3 (0.5) 9.7 (0.1) 15.4 (0.4) QRF 9.9 (0.1) 15.6 (0.7) 9.9 (0.1) 15.7 (0.5) 9.9 (0.1) 15.9 (0.5) Planning 9.7 (0.2) 15.3 (0.5) 9.7 (0.1) 15.3 (0.4) 9.7 (0.1) 15.4 (0.3) A-learning 9.8 (0.1) 15.6 (0.7) 9.8 (0.1) 15.7 (0.6) 9.8 (0.1) 15.9 (0.5) C-learning 10.0 (0.1) 14.2 (0.4) 10.0 (0.1) 14.3 (0.4) 10.0 (0.1) 14.4 (0.4) BOWL 10.0 (0.1) 13.6 (0.3) 10.0 (0.1) 13.6 (0.3) 10.0 (0.1) 13.6 (0.3) Regret 9.4 (0.4) 13.8 (0.6) 9.4 (0.4) 13.9 (0.6) 9.4 (0.</p><p>4) 14.0 (0.6) Blip 10.0 (0.1) 15.8 (0.6) 10.0 (0.1) 15.9 (0.6) 10.0 (0.1) 16.1 (0.4) DR-IPCW 10.0 (0.1) 15.8 (0.7) 10.0 (0.1) 15.9 (0.5) 10.0 (0.1) 16.1 (0.4) Decision list</p><p>9.4 (0.1) 13.9 (0.6) 9.4 (0.1) 13.8 (0.5) 9.4 (0.1) 13.7 (0.4) Exp.</p><p>QMR 9.7 (0.2) 15.2 (0.5) 9.7 (0.1) 15.3 (0.5) 9.7 (0.1) 15.4 (0.4) Penalised QMR 9.7 (0.2) 15.2 (0.5) 9.7 (0.1) 15.3 (0.5) 9.7 (0.1) 15.4 (0.4) QRF 9.9 (0.1) 15.6 (0.7) 9.9 (0.1) 15.7 (0.5) 9.9 (0.1) 15.9 (0.5) Planning 9.7 (0.2) 15.3 (0.5) 9.7 (0.1) 15.3 (0.4) 9.7 (0.1) 15.4 (0.3) A-learning 9.8 (0.2) 15.1 (0.9) 9.8 (0.2) 15.3 (0.9) 9.9 (0.2) 15.5 (0.7) C-learning 9.9 (0.2) 14.1 (0.4) 9.9 (0.2) 14.2 (0.4) 9.9 (0.2) 14.2 (0.4) BOWL 10.0 (0.1) 13.6 (0.2) 10.0 (0.1) 13.6 (0.2) 10.0 (0.1) 13.6 (0.3) Regret 9.5 (0.4) 15.0 (0.9) 9.5 (0.4) 15.1 (0.9) 9.5 (0.4) 15.0 (0.9) Blip 9.9 (0.2) 15.5 (0.7) 9.9 (0.2) 15.8 (0.6) 9.9 (0.2) 15.9 (0.5) DR-IPCW 9.9 (0.2) 15.6 (0.7) 9.9 (0.2) 15.8 (0.6) 10.0 (0.1) 15.9 (0.5) Decision list 9.4 (0.1) 13.9 (0.6) 9.4 (0.1) 13.8 (0.5) 9.4 (0.1) 13.7 (0.4) Q-learning (corr) 10.0 (0.1) 15.6 (0.7) 10.0 (0.1) 15.8 (0.5) 10.0 (0.1) 16.0 (0.4)</p><p>Obs. QMR 4.2 (2.5) 8.7 (3.5) 4.4 (2.4) 8.5 (3.3) 4.3 (2.3) 8.2 (3.5) Penalised QMR 4.1 (2.5) 8.4 (3.3) 4.3 (2.4) 8.4 (3.2) 4.2 (2.3) 8.1 (3.4) QRF 4.6 (2.4) 9.5 (3.0) 4.9 (2.2) 9.6 (2.6) 4.8 (2.2) 10.0 (2.9) Planning 3.9 (2.5) 7.7 (3.1) 3.9 (2.6) 7.3 (3.2) 3.6 (2.4) 7.4 (3.3) A-learning 3.8 (2.6) 6.5 (4.1) 4.2 (2.5) 6.4 (4.1) 4.1 (2.3) 6.4 (4.3) C-learning 4.3 (2.5) 8.8 (3.0) 4.6 (2.4) 9.1 (3.1) 4.4 (2.5) 9.2 (3.0) BOWL 3.3 (1.6) 6.0 (2.0) 3.2 (1.6) 6.0 (1.9) 3.3 (1.7) 6.0 (2.1) Regret 3.2 (2.6) 6.9 (3.4) 3.4 (2.5) 6.7 (3.3) 3.3 (2.5) 6.5 (3.5) Blip 4.2 (0.2) 8.3 (0.3) 4.4 (0.2) 8.6 (0.2) 4.7 (0.1) 8.9 (0.2) DR-IPCW 4.2 (0.2) 8.3 (0.2) 4.4 (0.2) 8.6 (0.2) 4.7 (0.1) 8.9 (0.2) Decision list 3.9 (2.3) 8.1 (3.1) 3.7 (2.2) 8.0 (3.3) 3.6 (2.4) 7.7 (3.6) Exp. QMR 4.2 (2.5) 8.7 (3.5) 4.4 (2.4) 8.5 (3.3) 4.3 (2.3) 8.2 (3.5) Penalised QMR 4.1 (2.5) 8.4 (3.3) 4.3 (2.4) 8.4 (3.2) 4.2 (2.3) 8.1 (3.4) QRF 4.6 (2.4) 9.5 (3.0) 4.9 (2.2) 9.6 (2.6) 4.8 (2.2) 10.0 (2.9) Planning 3.9 (2.5) 7.7 (3.1) 3.9 (2.6) 7.3 (3.2) 3.6 (2.4) 7.4 (3.3) A-learning 3.7 (2.7) 6.7 (4.0) 4.1 (2.6) 6.5 (4.1) 4.1 (2.4) 6.5 (4.1) C-learning 4.4 (2.4) 8.7 (3.0) 4.4 (2.4) 8.8 (3.2) 4.7 (2.4) 9.1 (3.2) BOWL 3.3 (1.8) 5.7 (2.3) 3.3 (1.7) 5.6 (2.4) 3.3 (1.8) 5.7 (2.4) Regret 3.2 (2.6) 6.9 (3.4) 3.4 (2.5) 6.7 (3.3) 3.3 (2.5) 6.5 (3.5) Blip 4.2 (0.2) 8.3 (0.3) 4.4 (0.2) 8.6 (0.2) 4.7 (0.1) 8.8 (0.2) DR-IPCW 4.2 (0.2) 8.3 (0.2) 4.4 (0.2) 8.6 (0.2) 4.7 (0.1) 8.8 (0.2) Decision list 3.9 (2.3) 8.1 (3.1) 3.7 (2.2) 8.0 (3.3) 3.6 (2.4) 7.7 (3.6) Q-learning (corr) 5.1 (2.3) 10.3 (2.8) 5.3 (2.2) 10.1 (2.7) 5.1 (2.2) 10.0 (2.9)</p><p>conclusion seeking to determine the optimal assignment of these two treatments for the subset of patients that has previously taken AZT. Specifically, we select n &#188; 532 patients who had taken AZT before the study and received AZT+ddI or ddI monotherapy in the study and for</p><p>Table 3. The percentage of correct decisions (standard error) under the settings in Section 4.1.1 for observational (Obs.) data type with unknown propensity scores and experimental (Exp.) data type with known propensity scores.</p><p>Data</p><p>Obs. QMR 80.1 (0.5) 68.6 (0.4) 80.6 (0.3) 69.3 (0.3) 80.5 (0.1) 70.2 (0.1) Penalised QMR 79.9 (0.5) 68.7 (0.4) 80.0 (0.3) 69.5 (0.2) 80.1 (0.1) 70.1 (0.1) QRF 92.1 (0.2) 80.5 (0.9) 92.8 (0.1) 83.2 (0.6) 92.8 (0.1) 85.4 (0.4) Planning 80.1 (0.5) 68.5 (0.4) 80.6 (0.3) 69.3 (0.2) 80.5 (0.1) 70.1 (0.1) A-learning 70.2 (0.6) 77.4 (1.0) 68.4 (0.3) 80.9 (0.7) 67.8 (0.1) 84.7 (0.4) C-learning 95.6 (0.2) 53.5 (0.7) 96.6 (0.1) 56.1 (0.5) 97.1 (0.1) 59.1 (0.3) BOWL 97.6 (0.1) 33.2 (0.3) 97.6 (0.1) 33.0 (0.2) 97.6 (0.1) 33.7 (0.2) Regret 50.8 (2.3) 37.5 (0.6) 50.0 (1.5) 38.3 (0.4) 47.7 (1.4) 38.1 (0.2) Blip 97.6 (0.1) 84.5 (0.8) 97.5 (0.1) 88.2 (0.5) 97.7 (0.1) 91.4 (0.3) DR-IPCW 97.6 (0.1) 84.2 (0.9) 97.5 (0.1) 88.7 (0.5) 97.7 (0.1) 91.7 (0.3) Decision list 27.0 (0.2) 31.8 (0.5) 27.0 (0.1) 30.4 (0.3) 27.0 (0.1) 28.7 (0.1) Exp. QMR 80.1 (0.5) 68.6 (0.4) 80.6 (0.3) 69.3 (0.3) 80.5 (0.1) 70.2 (0.1) Penalised QMR 79.9 (0.5) 68.7 (0.4) 80.0 (0.3) 69.5 (0.2) 80.1 (0.1) 70.1 (0.1) QRF 92.1 (0.2) 80.5 (0.9) 92.8 (0.1) 83.2 (0.6) 92.8 (0.1) 85.4 (0.4) Planning 80.1 (0.5) 68.5 (0.4) 80.6 (0.3) 69.3 (0.2) 80.5 (0.1) 70.1 (0.1) A-learning 72.4 (1.0) 61.6 (1.3) 76.6 (0.7) 67.1 (0.9) 81.0 (0.4) 72.1 (0.6) C-learning 89.4 (0.9) 47.8 (0.8) 91.5 (0.6) 50.8 (0.5) 93.3 (0.3) 53.4 (0.3) BOWL 97.6 (0.1) 32.5 (0.3) 97.6 (0.1) 32.5 (0.2) 97.6 (0.1) 32.9 (0.1) Regret 51.5 (2.3) 70.8 (1.1) 50.9 (1.7) 72.0 (0.8) 50.6 (1.2) 71.1 (0.5) Blip 86.8 (0.9) 76.7 (1.0) 91.2 (0.6) 83.8 (0.6) 93.6 (0.4) 87.2 (0.4) DR-IPCW 87.6 (0.7) 77.6 (1.0) 92.2 (0.4) 83.7 (0.9) 94.8 (0.2) 87.8 (0.4) Decision list 27.0 (0.2) 31.8 (0.5) 27.0 (0.1) 30.4 (0.3) 27.0 (0.1) 28.7 (0.1) Q-learning (corr) 95.7 (0.3) 79.0 (1.0) 97.3 (0.1) 83.5 (0.6) 98.1 (0.1) 87.3 (0.4)</p><p>Obs. QMR 66.9 (0.</p><p>3) 70.6 (0.4) 67.7 (0.2) 69.3 (0.2) 68.1 (0.1) 67.3 (0.1) Penalised QMR 65.4 (0.3) 64.1 (0.4) 66.8 (0.2) 65.4 (0.2) 67.7 (0.1) 65.3 (0.1) QRF 85.1 (0.3) 86.2 (0.2) 90.0 (0.1) 90.7 (0.1) 93.3 (0.1) 93.9 (0.1) Planning 60.4 (0.1) 57.8 (0.1) 60.6 (0.1) 57.9 (0.1) 60.7 (0.1) 57.9 (0.1) A-learning 61.9 (0.6) 50.2 (0.4) 65.7 (0.3) 52.3 (0.3) 67.1 (0.2) 54.5 (0.2) C-learning 73.6 (0.3) 72.0 (0.3) 76.5 (0.2) 76.3 (0.2) 78.7 (0.1) 79.5 (0.1) BOWL 46.1 (0.1) 36.9 (0.2) 46.3 (0.1) 37.9 (0.1) 46.5 (0.1) 38.6 (0.1) Regret 49.8 (0.7) 50.1 (0.6) 49.8 (0.5) 50.1 (0.4) 49.7 (0.3) 50.1 (0.3) Blip 72.0 (0.4) 66.8 (0.3) 76.5 (0.2) 70.7 (0.2) 84.2 (0.2) 75.0 (0.2) DR-IPCW 71.8 (0.4) 67.3 (0.3) 77.5 (0.2) 70.9 (0.2) 83.7 (0.2) 74.6 (0.2) Decision list 58.8 (0.3) 57.1 (0.4) 57.2 (0.3) 58.5 (0.3) 55.2 (0.2) 59.5 (0.2) Exp. QMR 66.9 (0.3) 70.6 (0.4) 67.7 (0.2) 69.3 (0.2) 68.1 (0.1) 67.3 (0.1) Penalised QMR 65.4 (0.3) 64.1 (0.4) 66.8 (0.2) 65.4 (0.2) 67.7 (0.1) 65.3 (0.1) QRF 85.1 (0.3) 86.2 (0.2) 90.0 (0.1) 90.7 (0.1) 93.3 (0.1) 93.9 (0.1) Planning 60.4 (0.1) 57.8 (0.1) 60.6 (0.1) 57.9 (0.1) 60.7 (0.1) 57.9 (0.1) A-learning 59.8 (0.6) 50.5 (0.4) 64.3 (0.3) 52.0 (0.3) 67.4 (0.2) 53.9 (0.2) C-learning 73.5 (0.3) 71.8 (0.3) 76.6 (0.2) 76.3 (0.2) 78.7 (0.1) 79.6 (0.1) BOWL 36.5 (0.3) 32.5 (0.3) 37.3 (0.2) 32.5 (0.2) 38.1 (0.2) 32.9 (0.2) Regret 49.8 (0.7) 50.1 (0.6) 49.8 (0.5) 50.1 (0.4) 49.7 (0.3) 50.1 (0.3) Blip 72.1 (0.4) 67.8 (0.3) 76.9 (0.2) 70.4 (0.2) 84.5 (0.2) 74.7 (0.2) DR-IPCW 71.7 (0.4) 68.2 (0.3) 77.4 (0.2) 71.0 (0.2) 84.1 (0.2) 75.2 (0.2) Decision list 58.8 (0.3) 57.1 (0.4) 57.2 (0.3) 58.5 (0.3) 55.2 (0.2) 59.5 (0.2) Q-learning (corr) 100 (0) 100 (0) 100 (0) 100 (0) 100 (0) 100 (0)</p><p>whom full CD4 and the selected covariates are available. This problem can be formulated as a one-decision-point problem for which the treatment indicator A i is set to be 1 if patient i is assigned to AZT+ddI therapy and 0 if the patient is assigned to ddI monotherapy. Because the trial is randomised with equal randomisation for both therapies, the propensity score equals 0.5. For illustrative purposes, we consider only two covariates in the estimation of the optimal treatment regimes: the baseline body weight S 1 and the baseline CD4 T-cell count S 2 . We include the body weight in our analysis because it has been observed that body weight has a significant role on AZT pharmacokinetic profile. <ref type="bibr">Burger et al. (1994)</ref> reported that AZT clearance </p><p>Obs. QMR 1.93 (0.08) 41.75 (2.28) 1.51 (0.04) 33.64 (1.49) 1.53 (0.04) 28.92 (0.93) QRF 0.28 (0.02) 1.58 (0.12) 0.15 (0.01) 0.80 (0.06) 0.08 (0.01) 0.45 (0.03) A-learning 18.77 (0.85) 66.61 (5.67) 16.50 (0.49) 38.75 (2.64) 16.48 (0.35) 27.46 (1.71) Regret 2.80 (0.18) 11.03 (0.53) 2.61 (0.15) 10.57 (0.40) 2.41 (0.14) 10.68 (0.38) TMLE 1.64 (0.18) 9.19 (0.39) 0.91 (0.03) 5.19 (0.22) 0.80 (0.02) 4.25 (0.14) IPTW 1.93 (0.20) 10.37 (0.48) 1.15 (0.04) 5.43 (0.23) 1.12 (0.03) 4.46 (0.15) Exp. QMR 1.93 (0.08) 41.75 (2.28) 1.51 (0.04) 33.64 (1.49) 1.53 (0.04) 28.92 (0.93) QRF 0.28 (0.02) 1.58 (0.12) 0.15 (0.01) 0.80 (0.06) 0.08 (0.01) 0.45 (0.03) A-learning 2.09 (0.28) 62.28 (5.46) 1.18 (0.14) 22.14 (2.40) 0.51 (0.05) 9.06 (0.72) Regret 25.71 (0.27) 51.54 (1.14) 25.94 (0.20) 56.39 (1.18) 26.38 (0.15) 31.48 (0.34) TMLE 1.44 (0.07) 8.71 (0.48) 0.85 (0.04) 5.11 (0.21) 0.73 (0.02) 4.35 (0.18) IPTW 1.87 (0.10) 9.91 (0.59) 1.10 (0.05) 6.03 (0.28) 0.98 (0.03) 5.66 (0.35) Q-learning (corr) 0.22 (0.01) 2.92 (0.19) 2.61 (0.15) 1.51 (0.09) 2.41 (0.14) 0.69 (0.04) 'corr' represents 'correctly specified model'. The standard errors are shown in parentheses. The boldfaced-values indicate the best results among different methods. </p><p>85.1 (0.2) 93.2 (0.3) 85.8 (0.1) 95.4 (0.1) 86.4 (0.1) 96.8 (0.1) Penalised QMR 80.0 (0.2) 92.5 (0.3) 79.4 (0.1) 94.7 (0.1) 78.7 (0.1) 96.2 (0.1) QRF 92.1 (0.2) 89.9 (0.3) 94.2 (0.1) 93.2 (0.1) 96.0 (0.1) 95.7 (0.1) Planning 85.1 (0.2) 93.2 (0.3) 85.8 (0.1) 95.4 (0.1) 86.4 (0.1) 96.8 (0.1) A-learning 85.0 (0.2) 88.9 (0.4) 85.7 (0.1) 90.8 (0.2) 86.3 (0.1) 92.6 (0.1) C-learning 87.4 (0.2) 92.4 (0.3) 88.6 (0.1) 94.6 (0.1) 89.8 (0.1) 96.3 (0.1) BOWL 79.5 (0.2) 94.9 (0.4) 78.9 (0.1) 96.9 (0.1) 78.3 (0.1) 97.8 (0.1) Regret 82.6 (0.2) 82.1 (0.1) 82.7 (0.1) 82.1 (0.1) 82.7 (0.1) 82.2 (0.1) Blip 88.2 (0.2) 90.7 (0.4) 91.2 (0.1) 93.3 (0.2) 95.3 (0.1) 96.0 (0.1) DR-IPCW 88.2 (0.2) 91.2 (0.4) 91.3 (0.1) 94.1 (0.2) 95.4 (0.1) 96.7 (0.1) Decision list 84.2 (0.2) 80.5 (0.1) 85.1 (0.1) 80.4 (0.1) 84.8 (0.1) 80.5 (0.1) The boldfaced-values indicate the best results among different methods.</p><p>is significantly lower in patients with a lower body weight, which indicates a qualitative interaction of body weight with AZT. 2  We apply the methods discussed previously in Section 4 to estimate the optimal treatment regime. Moreover, we provide IPW estimators given by van der Laan &amp; Rose (2018) for the mean outcome and associated 95% confidence intervals under each estimated policy. For training and testing purposes, we randomly split the data set into a training set with 300 patients and a testing set with 232 patients. Then we estimate the optimal treatment regime using the training set and give IPW estimators and confidence intervals for the mean outcome under the estimated optimal treatment regime, for which the results are provided in Table <ref type="table">10</ref>. It can be seen that the estimators of the mean outcome under the optimal treatment regime estimated by QMR and A-learning are the largest; however, the confidence intervals do not show a significant difference among all the methods. This result implies a larger sample size is needed to test which methods outperform others. Because we posit the same parametric model for QMR and A-learning, the estimators are the same for these two methods.</p><p>Recall that non-parametric methods such as QRF, Blip and DR-IPCW do not provide explicit forms for the estimated optimal treatment regime, whereas the parametric methods do. The interpretability of decision rules obtained from parametric methods are seen by many researchers as an advantage over the black-box decision rules generated by non-parametric methods, especially in the field of precision medicine. Although explicit forms cannot generally be extracted   </p><p>003 (0.0002) 0.004 (0.0002) 0.004 (0.0001) 0.001 (0.0001) 0.004 (0.0001) 0.0006 (0.0001) QRF 0.003 (0.0001) 0.001 (0.0001) 0.002 (0.0001) 0.0006 (0.0001) 0.002 (0.0001) 0.0004 (0.0001) A-learning 0.003 (0.0002) 0.038 (0.0002) 0.004 (0.0001) 0.014 (0.001) 0.005 (0.0001) 0.005 (0.0003) Regret 0.011 (0.0003) 0.012 (0.0005) 0.013 (0.0002) 0.012 (0.0004) 0.014 (0.0002) 0.010 (0.0004) TMLE 0.013 (0.0005) 0.040 (0.002) 0.006 (0.0003) 0.017 (0.001) 0.007 (0.0001) 0.003 (0.0001) IPTW 0.018 (0.0006) 0.042 (0.002) 0.008 (0.0003) 0.018 (0.001) 0.009 (0.0001) 0.003 (0.0002) The bold-faced-values indicate the best results among different methods.  from non-parametric methods, we illustrate how these rules depend on the selected covariates in Figure <ref type="figure">1</ref>, which displays the original and optimal treatment assignment using QRF for the 232 patients in the testing set.</p><p>Combining Table <ref type="table">10</ref> and Figure <ref type="figure">1</ref>, we can see that for both parametric and non-parametric methods, the estimated optimal treatment regime tends to recommend AZT+ddI combination therapy for patients with large body weight and small baseline CD4 count; otherwise, ddI monotherapy is recommended. The level of agreement across methods can be seen in Table <ref type="table">11</ref> and Figure <ref type="figure">1</ref> where we compare the recommended treatment for each patient of the testing set. Although the recommended treatment is not always consistent across the methods, the treatment recommended by a parametric method may be preferred because of interpretability of the results due to known roles of body weight and baseline CD4 count in disease progression, which can be easily communicated with physicians in medical practice. However, when domain knowledge about disease progression is limited, non-parametric methods can be desirable to provide guidance for the treating physician in choosing the optimal treatment in order to achieve better prediction accuracy.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Melanoma Data Analysis</head><p>Melanoma is the fifth most common malignant tumour among men and the sixth most common malignant tumour among women, with a total of 6850 deaths in the US in 2020 (ACS, 2020). It is currently the deadliest form of skin cancer with a median survival time of approximately 24 months when diagnosed and treated in the earlier stages <ref type="bibr">(Song et al., 2015)</ref>. Typically, treatment includes a sequence of therapies (such as chemotherapies and immunotherapies) that are selected based on the patient's baseline and time-evolving covariates (including disease progression). Here, we apply some of the methods described in Section 3 to a real-world dataset comprising 7730 patients diagnosed with melanoma. We seek to estimate an optimal treatment regime that maximises the mean survival rate at two years after a patient starts a first-line treatment.</p><p>The dataset was derived from the Flatiron health database. Each patient received the first-line of therapy during January 2011 and May 2017 and continued to receive multiple lines of treatment over multiple years. In this analysis, we consider only five of the major therapies available at the time and encode them as A, B, C, D, and E (to maintain confidentiality, the names of the treatment are not disclosed). Based on domain knowledge and a common variable selection method (SIS-Lasso) <ref type="bibr">(Fan &amp; Lv, 2008)</ref>, we select the following baseline and time-dependent covariates: age, gender, Tstage (tumour size), Mstage (metastatic status), Nstage (lymph nodes), Gstage (tumour grade), BRAF (a human gene that encodes a protein called B-Raf that helps to control cell growth) and protein in serum or plasma. The subset of patients used in this analysis includes the 781 patients that received one of the selected treatment and for whom complete data are available.</p><p>To formalise this problem in the context of dynamic treatment regime, we consider 8 decision points over 2 years (3-month interval) and 6 treatment options at each decision point: A, B, C, D, E, and no-treatment. Because there are multiple treatment options available at each decision point, we only consider estimating the optimal treatment regime using Q-learning with multiple regression (QMR), Q-learning with random forest (QRF) and compare these to a method based on Nelson-Aalen estimators commonly used in the survival analysis literature <ref type="bibr">(Shen et al., 2017)</ref>. We find that the optimal dynamic treatment regime estimated using the method in <ref type="bibr">Shen et al. (2017)</ref> selects 'no-treatment' for the majority of decision points, which does not have clinical value. Although the precise reason for this unsatisfactory result is not clear, it might be due to the misspecified models for system dynamics or the class of candidate regimes might be unsuitable. The estimated mean survival rate under the optimal treatment regime for QMR and QRF are 0.726 and 0.596, respectively, with the empirical survival rate 0.410. According to the simulation performance and domain knowledge, the estimate for QMR is biased upwards while that for QRF is relatively accurate.</p><p>To study which treatment is optimal for patients at different decision points, we separate the data sets into a training set with size of 500 and a testing set with size of 281. We estimate the optimal treatment regime with the training set and compute the optimal treatment for patients in the testing set. Table <ref type="table">12</ref> shows the frequency (as a proportion) with which each treatment option was recommended by QRF and QMR at all available decision points for the 281 patients in the testing set. We see that these two methods most often select treatment A as the optimal treatment, whereas treatment E, the most frequently selected treatment in the data, is less likely to be recommended. Figure <ref type="figure">2</ref> shows the original and optimal treatment assignments applying QRF and QMR for the 281 patients in the testing set. The regime estimated under QMR switches treatment more frequently than that of QRF. However, the QMR regime is more interpretable in that the influence of prior treatments and patient covariates on the decision rule can be easily identifiable, whereas QRF yields a black-box decision making mechanism that lacks clear clinical interpretation.</p><p>Unlike the HIV example in Section 5.1 where there is only one decision point and the treatment is recommended based on baseline body weight and CD4 count, in this example there are 8 decision points and the treatment at each decision point is recommended based on prior history (e.g. prior treatments received and other evolving covariates) of individual patients, leading to the complexity of decision-making process. Therefore, interpretability of the results is highly desirable in order to communicate with the treating physician for the choice of right therapy during the treatment. In this sense, QMR is always preferred.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Summary and Appraisal of the Methods</head><p>In this review, we introduced and evaluated a number of the most widely used estimators of an optimal treatment regime and the expected outcome under the optimal treatment regime. As might have been expected, no single method was uniformly best, we discuss here the pros and cons of each method.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Predictive Performance</head><p>In settings with strong prior knowledge that can be used to inform a high-quality system dynamics mode, model-based planning can offer excellent predictive performance while still being  meaningful (interpretable) to domain experts. However, when such information is not available one is often faced with a potentially difficult trade-off between performance and interpetability/parsimony. Our simulation results showed that simple parametric models can be sensitive to misspecification while non-parametric models (esp., targeting the blip function, DR-IPCW, and random forests) provided good performance across the range of examples considered. While this point is often emphasised by proponents of non-parametric methods, it should be noted that the model diagnostics and interactive model-building that are carried out in practice are likely to catch severe misspecification thus weakening these criticisms. Furthermore, as estimation of an optimal regime is often carried out as part of exploratory and hypothesis-generating analyses, it is critical that the estimated optimal regime be interpretable to clinical and intervention scientists.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Model Constraints</head><p>Even if the true optimal regime is a complex function of patient covariates, there is often interest in estimating an optimal regime within a pre-specified class of regimes, for example, the class or linear regimes or those representable as trees etc. <ref type="bibr">(Zhang et al., 2012;</ref><ref type="bibr">Zhang et al., 2013;</ref><ref type="bibr">Zhang et al., 2018)</ref>. In some cases one can estimate a regime within the class by restricting the form of the Q-function or contrast although this may risk misspecification. A potentially more robust solution is to use direct-search methods such as C-learning or model-based planning. Such methods are also amenable to constraints on cost, risk of harm/adverse events, and local availability as these constraints can be implemented through constraints on the class of regimes or by augmenting the value function <ref type="bibr">(Linn et al., 2015;</ref><ref type="bibr">Luedtke &amp; van der Laan, 2016c;</ref><ref type="bibr">Laber et al., 2018;</ref><ref type="bibr">Wang et al., 2018)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Interpretability</head><p>As we have noted previously, interpetability is critical in when optimal dynamic treatment regimes are being used to generate new clinical insights and guide subsequent research. Clinicians are often unwilling to inform clinical decision making using an unintelligble black box. Furthermore, there is strong empirical and theoretical evidence to suggest the existence of effective but parsimonious treatment regimes <ref type="bibr">(Zhang et al., 2015;</ref><ref type="bibr">Wang et al., 2018;</ref><ref type="bibr">Rudin, 2019)</ref>. However, in online learning, problems where decision rules are being used to select patient interventions in real-time performance (subject to safety constraints) may be paramount. In such settings, the use of more flexible models may be justified.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Other Considerations</head><p>In an era of increasingly large and complex data types issues such as extensibility and computational scalability are becoming important factors in selection of a method for estimating an optimal intervention strategy. Q-learning is among the most versatile methods in that it applies in virtually any setting in which regression can be applied; for example, recent examples include the use of functional predictors <ref type="bibr">(McKeague &amp; Qian, 2014;</ref><ref type="bibr">Laber &amp; Staicu, 2018;</ref><ref type="bibr">Ciarleglio et al., 2018;</ref><ref type="bibr">Dziak et al., 2019)</ref>, and ordinal treatments <ref type="bibr">(Chen et al., 2018)</ref>.</p><p>Another concern is the efficiency and robustness of accompanying inference procedures. It is well-established that the value function is a non-smooth functional of the data-generating model and thus confidence intervals and tests for the value function are non-regular and may not perform well without adaptation. <ref type="bibr">(Robins, 2004;</ref><ref type="bibr">Chakraborty et al., 2010;</ref><ref type="bibr">Moodie &amp; Richardson, 2010;</ref><ref type="bibr">Laber et al., 2014;</ref><ref type="bibr">Chakraborty et al., 2014;</ref><ref type="bibr">Song et al., 2015;</ref><ref type="bibr">Luedtke &amp; Van Der Laan, 2016a)</ref>. Valid inference procedures for many parametric approaches exist, for example, A-/Q-learning or BOWL with parametric decision rules, but work on inference for non-parametric/machine learning methods is an ongoing and active area of research (see <ref type="bibr">Chernozhukov et al., 2017;</ref><ref type="bibr">Naimi &amp; Kennedy, 2017;</ref><ref type="bibr">Shi et al., 2020, and references therein)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>In this review we sought to provide an accessible introduction to a variety of commonly used methods for estimation of an optimal treatment regime along with an empirical evaluation of these methods. The results of our empirical experiments illustrated trade-offs between interpretability and performance and showed that there is no method that is likely to work best in all settings. Estimation and inference for optimal treatment regimes remains an active and exciting area of research. We hope that this review might serve as a bridge for researchers to engage with this important area.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Endnotes</head><p>1 The expectation here is taken over the data-generating distribution. If the class D K contains the function d opt K &#240;h K &#222; &#188; argmax a K Q K &#240;h K ; a K &#222; the data-generating distribution does not matter in the sense that d opt K maximises d K &#8614;&#58628;Q K H K ; d K &#240;H K &#222; f gover any distribution on H K . Otherwise, the argmax over D K may depend on this distribution and one may wish to consider globally optimising over all of d as described later in the next section. A similar argument applies for each k &#188; 1; &#8230;; K &#192; 1. Optimising sequentially need not find the globally optimal regime, or even the optimal regime in D, unless d opt k is in D k for all k. See <ref type="bibr">Tsiatis et al. (2019)</ref> for additional discussion.</p><p>2 In medicine, drug clearance is a pharmacokinetic measurement of the rate at which the active drug is removed from the body and drug clearance is correlated with the time course of a drug's action.</p><p>Table D1. Comparisons across different regime evaluation methods under the data-generating settings in Section 4.1.1. Trt. MSE (n &#188; 250) MSE (n &#188; 500) MSE (n &#188; 1000) regime Method K &#188; 1 K &#188; 2 K &#188; 1 K &#188; 2 K &#188; 1 K &#188; 2 d opt TMLE 0.88 (0.31) 6.36 (0.25) 0.23 (0.01) 5.79 (0.19) 0.13 (0.01) 5.12 (0.13) IPTW 1.22 (0.32) 4.48 (0.21) 0.51 (0.03) 3.47 (0.15) 0.42 (0.02) 2.75 (0.10) d const TMLE 1.73 (0.32) 4.59 (0.29) 0.95 (0.04) 3.55 (0.17) 0.75 (0.02) 3.29 (0.13) IPTW 1.97 (0.32) 4.80 (0.30) 1.20 (0.04) 3.73 (0.18) 1.07 (0.03) 3.49 (0.14) d random TMLE 0.81 (0.17) 11.27 (0.43) 0.45 (0.02) 10.78 (0.30) 0.41 (0.02) 10.52 (0.24) IPTW 0.94 (0.23) 10.69 (0.42) 0.34 (0.02) 15.64 (0.43) 0.22 (0.01) 25.06 (0.45)</p><p>The propensity score is unknown here and is estimated using super learning. The MSE (standard error) is calculated for three treatment regimes d opt ; d const and d random .</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>International Statistical Review (2023) &#169; 2023 International Statistical Institute. 17515823, 0, Downloaded from https://onlinelibrary.wiley.com/doi/10.1111/insr.12536 by Duke University Libraries, Wiley Online Library on [06/10/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>17515823, 0, Downloaded from https://onlinelibrary.wiley.com/doi/10.1111/insr.12536 by Duke University Libraries, Wiley Online Library on [06/10/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>'corr' represents 'correctly specified model'. Standard errors of the estimators are shown in parenthesis. The boldfaced-values represent the best results among different methods.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_3"><p>'corr' represents 'correctly specified model'. Standard errors of the potential outcomes are shown in parenthesis. The boldfaced-values represent the best results among different methods.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_4"><p>'corr' represents 'correctly specified model'. Standard errors of the estimators are shown in parentheses. The boldfaced-values represent the best results among different methods.</p></note>
		</body>
		</text>
</TEI>
