<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Optimal Stopping of Adaptive Dose-Finding Trials</title></titleStmt>
			<publicationStmt>
				<publisher>INFORMS</publisher>
				<date>06/01/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10209407</idno>
					<idno type="doi">10.1287/serv.2020.0261</idno>
					<title level='j'>Service Science</title>
<idno type="issn">2164-3962</idno>
<biblScope unit="volume">12</biblScope>
<biblScope unit="issue">2-3</biblScope>
<biblScope unit="page" from="80" to="99"/>

					<author>Amir Ali Nasrollahzadeh</author><author>Amin Khademi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The primary objective of this paper is to develop computationally efficient methods for optimal stopping of an adaptive Phase II dose-finding clinical trial, where the decision maker may terminate the trial for efficacy or abandon it as a result of futility. We develop two solution methods and compare them in terms of computational time and several performance metrics such as the probability of correct stopping decision. One proposed method is an application of the one-step look-ahead policy to this problem. The second proposal builds a diffusion approximation to the state variable in the continuous regime and approximates the trial’s stopping time by optimal stopping of a diffusion process. The secondary objective of the paper is to compare these methods on different dose-response curves, particularly when the true dose-response curve has no significant advantage over a placebo. Our results, which include a real clinical trial case study, show that look-ahead policies perform poorly in terms of the probability of correct decision in this setting, whereas our diffusion approximation method provides robust solutions.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The cost of inventing, developing, and introducing a new drug to market has surpassed $2.6 billion (Tufts Center for the Study of Drug Development 2014). The biggest drivers of this high cost are clinical trials, the cost of which depends on several factors such as the number of participants, locations of research facilities, and complexity of the trial protocol <ref type="bibr">(Roy 2012)</ref>. In fact, the total cost can reach $300-$600 million for large clinical trials <ref type="bibr">(Griffin et al. 2010)</ref>. The main goal of dose-finding clinical trials is to identify a "target dose," which is used in later stages with more patients to confirm its effects; identifying it is considered a critical step in the drug development process <ref type="bibr">(Bornkamp et al. 2007)</ref>. This is because a poor selection of the target dose may cause the Food and Drug Administration (FDA) to disapprove Phase III, the next and most costly phase in drug development, as a result of insignificant positive evidence (futility; see <ref type="bibr">Snapinn et al. (2006)</ref>) or exposure to unnecessary risk (adverse effects). In fact, during 2000-2012, failure to select optimal drug doses was a leading factor in the delay or denial of drug submissions in the first submission round by the FDA <ref type="bibr">(Sacks et al. 2014)</ref>. In particular, several studies have shown that more than 50% of Phase III clinical trials fail to establish the efficacious benefits of the treatment (see, e.g., <ref type="bibr">Grignolo and Pretorius (2016)</ref>). This high attrition rate is of particular concern because it demonstrates that more than half of the trials that clear Phase II by indicating positive evidence of efficacy fail to produce similar outcomes in Phase III. 
<ref type="bibr">Grignolo and Pretorius (2016)</ref> suggested that inadequate Phase II studies and suboptimal dose selection policies are among the major factors contributing to this high failure rate. The high attrition rate and costs of Phase III clinical trials motivate a new approach that is able to detect futility in earlier stages (Phase II) and provide a reliable assurance for efficacious results. Early futility detection in Phase II reduces the cost and risk of exposure to futile treatments, whereas an assurance for efficacious results may prevent the high attrition rate in Phase III. Therefore, formulating an optimal stopping problem capable of making sound stopping decisions for efficacy or futility while adapting to different treatments and patient assignment procedures has the potential to impact the drug development process in practice. Compared with classic static designs of clinical trials, adaptive designs generally can reduce the cost of conducting a clinical trial because of their ability to modify the design while the trial is still in progress. This enables an adaptive clinical trial to abandon an ineffective treatment early in the process <ref type="bibr">(Berry et al. 2002)</ref>, and thus the optimal stopping of adaptive clinical trials is naturally motivated.</p><p>We formulate the optimal stopping of an adaptive dose-finding Phase II trial as a finite-horizon stochastic dynamic program (DP), where at each intermediate decision epoch, the decision maker (DM) may abandon the trial as a result of futility, continue the trial to collect more evidence about the dose-response curve, or terminate the trial for efficacy and move to the confirmatory phase. The dose-finding trial is assumed to investigate the efficacy of multiple doses (&#8805; 2) whose patient responses are normally distributed with an unknown mean and a known observation variance. 
Considering that the main goal of a Phase II trial is to identify a target dose for which the efficacy should be tested versus a standard treatment or placebo in a large patient population in Phase III, a key feature in our formulation is the incorporation of an approximation of the probability of success at the end of a confirmatory phase (Phase III) when deciding to terminate the trial. Before further discussion on this key feature, note that we decouple the dose allocation procedure (i.e., selecting the assignment dose for the next patient when the decision is to continue the trial) from the stopping problem and fix it to some given policy. This choice will potentially result in easier implementation in practice because most clinical trials are still administered with a balanced randomized allocation policy (similar to the setting studied in <ref type="bibr">Chick et al. (2017)</ref> and our case study, <ref type="bibr">Hall et al. (2011)</ref>). In fact, we include a randomized allocation policy in addition to other adaptive benchmarks in our sensitivity analyses to demonstrate the performance of the proposed stopping rules with respect to different dose allocation policies.</p><p>In order to approximate the probability of success in Phase III, it is natural for the DM to consider the power of the hypothesis test H_0: df* &#8804; 0 versus H_1: df* &gt; 0, where df* denotes the expected response improvement over a placebo or standard treatment (assuming higher responses are favorable) <ref type="bibr">(M&#252;ller et al. 2006)</ref>. By approximating the probability of success in the next phase, and considering the cost of sampling patients and monetary benefits of identifying a significant improvement over the placebo, a utility function is constructed that incorporates costs and benefits as well as the quality of the stopping decisions, thus fulfilling our motivation. Moreover, adaptive designs of clinical trials may address some ethical issues. 
An adaptive approach considering early abandonment is ethically motivated because it may prevent the DM from assigning patients to ineffective or toxic doses in early stages. Also, one can argue that considering a term for approximating the probability of success and ensuring a level of certainty for treatments that clear Phase II to be tested for a larger population in Phase III is also ethically motivated. However, ethical debates on the design of clinical trials are not settled; see, for example, <ref type="bibr">Berry et al. (2004)</ref> and <ref type="bibr">Bothwell and Kesselheim (2017)</ref> for an extensive literature that spans a couple of decades.</p><p>Two main challenges are involved in the definition of df* in a Bayesian setting: (i) the target dose is a random variable at the beginning of each decision epoch given the history of the states, actions, and observations; and (ii) the expected response of any dose (including the target dose) is also a random variable at the beginning of each decision period given the said history. These conditions introduce difficulties in evaluating the distribution of df*. Addressing such challenges requires a proper dose-response model and a stochastic DP setup, which are discussed in Sections 3 and 4, respectively. The resulting stochastic DP formulation, however, suffers heavily from the curse of dimensionality because the state space is multidimensional and unbounded.</p><p>For this problem, <ref type="bibr">Brockwell and Kadane (2003)</ref> proposed an approximation procedure, which was partially applied to the optimal stopping of a fully Bayesian dose-finding trial <ref type="bibr">(Berry et al. 2002)</ref>. The approximation is based on discretizing the state space over a grid, using forward simulation until the last decision epoch to create sample paths, and utilizing backward induction to estimate the value function in each cell of the grid at each time period. 
This method is extremely computationally demanding, and <ref type="bibr">Berry et al. (2002)</ref> stated that applying this method at each decision epoch in a fully adaptive design is "impractical." Moreover, <ref type="bibr">Grieve and Krams (2005)</ref> noted the computational difficulties of this method as the main reason for considering posterior estimates instead of forward simulation and backward induction as a basis for the stopping rule in the Acute Stroke Therapy by Inhibition of Neutrophils (ASTIN) trial. Therefore, the main goals of this study are (i) developing computationally efficient algorithms for the optimal stopping problem of a dose-finding trial and (ii) testing the performance of the proposed policies with available benchmarks via simulation in settings where the correct stopping decision is abandonment or termination.</p><p>We propose two solution methods for this problem. The first one adapts the one-step look-ahead framework, in which the DM assumes that the next decision epoch is the last one. This approach is computationally much less demanding than the benchmark method because it only requires one-step forward simulations. However, the stopping time induced by this method may occur earlier than the optimal stopping time (Proposition 1). In the second proposed method, we consider a two-armed bandit version of the problem with one unknown arm (the target dose) and one known arm (placebo), where the posterior of df* (i.e., the advantage of the target dose over placebo) is normally distributed as a result of our normality assumption on patients' responses. 
Therefore, in a continuous sampling regime, a scaled mean of df* follows an It&#244; process, which enables us to formulate a Bellman equation for the continuous-time optimal stopping counterpart. By using It&#244;'s lemma, we show that the optimal value function of the continuous-time Bellman equation satisfies a diffusion-advection partial differential equation with boundary conditions (Proposition 2). In addition, the solution to the partial differential equation depends only on the utility function (objective function of the DM), which can be found up front, and identifies a continuation region over the mean response (vertical axis) and time (horizontal axis), which is easy to understand and implement. We show how the two-armed bandit results can be used to make effective decisions in the setting of multiple doses and adapt the method to the original stopping problem where the patients arrive in discrete decision epochs. This method is also computationally appealing because it finds the decision regions without resorting to time-consuming forward simulations.</p><p>In summary, our formulation is a methodological exercise in exploring the potential advantages of the diffusion approximation method in correctly terminating, abandoning, or continuing a clinical trial. We test the performance of the two proposed methods along with the benchmark developed by <ref type="bibr">Berry et al. (2002)</ref> via simulation. We also conduct a case study where data from a real clinical trial <ref type="bibr">(Hall et al. 2011)</ref> are used to investigate the performance of the proposed solutions in a real-world setting. In addition to the monetary value of a stopping decision, which is the primary objective function, we report the probability of correct decision at the stopping time for each method. 
We test the results in two settings: one where there is a significant difference between the average response of the target dose and placebo (the ultimate decision is termination) and one where the said difference is negligible (the ultimate decision is abandonment). Our simulation results shed light on the behavior and performance of each method and reveal an important insight into the performance of these methods for implementation in practice. Although the adaptive optimal stopping problem is motivated because classic solutions require complex simulations and extensive computational efforts, our results show that there is a key distinction in performance with respect to treatments that do not produce significant advantageous responses over the placebo. In fact, we show that the diffusion approximation is particularly effective in correctly abandoning ineffective treatments, whereas the one-step look-ahead policy and the simulation-based gridding method fail to correctly abandon the trial in our setting. This result is also emphasized in our case study, in which the original trial did not conclude with a clear termination or abandonment result and suggested further investigation to establish the efficacy of the underlying treatment. In our retrospective analysis of the trial, the diffusion approximation method also decided to continue the trial because positive evidence suggested efficacy, but it was not significant enough to justify a termination decision. Because abandoning the trial as a result of futility of the treatment is an important factor in reducing the cost and health risks associated with clinical trials, our results are of potentially important practical value in designing adaptive Phase II clinical trials.</p><p>Appendices in the online supplement include detailed discussions of the dose-response approximation model, proofs, details of the proposed algorithms, additional numerical results, and sensitivity analyses.</p></div>
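The one-step look-ahead idea described above (assume the next decision epoch is the last, and continue only if one more observation is expected to pay for its sampling cost) can be sketched generically. Everything in this sketch is a hypothetical stand-in, not the paper's algorithm: `stop_value` plays the role of the best stopping utility at a state, `sample_next` plays the role of one simulated posterior transition, and the scalar state is an illustrative stand-in for the posterior advantage over placebo.

```python
import numpy as np

def one_step_lookahead(stop_value, sample_next, s, c1, rng, reps=200):
    """Generic one-step look-ahead stopping rule (a sketch): pretend the
    next epoch is the last, and continue only if one more (simulated)
    observation raises the expected best stopping value by more than the
    sampling cost c1."""
    value_now = stop_value(s)                       # best of abandon/terminate at s
    future = np.mean([stop_value(sample_next(s, rng)) for _ in range(reps)])
    return 1 if future - c1 > value_now else 0      # 1 = continue, 0 = stop

# Toy illustration: state = current mean advantage over placebo,
# stopping value = max(0, advantage), one more sample adds N(0, 1) noise.
rng = np.random.default_rng(0)
stop_value = lambda s: max(0.0, s)
sample_next = lambda s, r: s + r.normal(0.0, 1.0)
decision_hopeless = one_step_lookahead(stop_value, sample_next, s=-5.0, c1=0.5, rng=rng)
decision_uncertain = one_step_lookahead(stop_value, sample_next, s=0.0, c1=0.05, rng=rng)
```

The sketch also illustrates the limitation noted in Proposition 1: because the rule only looks one epoch ahead, it can stop earlier than the optimal policy when the value of information accrues over several future samples.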
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Optimal stopping is an important decision-making problem and is studied in different communities because of its vast applications. The optimal stopping of a clinical trial has also received significant attention because of its importance. For an overview of advancements in optimal stopping of clinical trials, see a survey by <ref type="bibr">Hee et al. (2016)</ref> from a Bayesian perspective and <ref type="bibr">Jennison and Turnbull (1999)</ref> from a frequentist perspective, as well as references therein. In addition, <ref type="bibr">Jitlal et al. (2012)</ref> and <ref type="bibr">Deichmann et al. (2016)</ref> reviewed hundreds of clinical trials considering early stopping rules as a result of futility, effectiveness, or safety concerns across different phases. <ref type="bibr">Stallard et al. (2001)</ref> reviewed different stopping rules specifically for Phase II clinical trials.</p><p>Here, we only focus on Bayesian decision-theoretic methods developed for optimal stopping of dose-finding trials. One main method used in Bayesian decision-theoretic designs is based on forward simulation of the trial to the end and backward induction to estimate the value function over a grid to ultimately evaluate the stopping region. Details of this methodology are presented in <ref type="bibr">Brockwell and Kadane (2003)</ref>, and it is adapted in <ref type="bibr">Berry et al. (2002)</ref> for a fully adaptive trial. However, this method is computationally extremely demanding, and we propose two more efficient algorithms to solve this problem. Our simulation results compare the performance of the proposed solutions and the method developed by <ref type="bibr">Berry et al. 
(2002)</ref> with respect to estimated and true utility, stopping time, and the probability of correctly deciding whether to terminate or abandon the trial.</p><p>Our first proposal is to adapt the one-step look-ahead approach to the optimal stopping of dose-finding trials. This approach is used for sequential sampling in Bayesian settings, and <ref type="bibr">Frazier and Powell (2008)</ref> applied this method to the optimal stopping of a ranking and selection problem. However, the optimal stopping of a dose-finding trial is different from a ranking and selection setup because the DM has to consider the effects of the termination decision on the next confirmatory phase of the drug development process. This is accommodated by a hypothesis test in which the significance of the advantage of the target dose over placebo, calculated by subtracting the placebo response from that of the (random) target dose, is tested. <ref type="bibr">Rojas-Cordova and Bish (2018)</ref> analyzed an adaptive sequential allocation and termination of the trial and investigated the trade-off between establishing the efficacy early on and more sampling to increase the accuracy of the estimation. However, their setup differs in objective and focuses on Phase III trials with binary responses.</p><p>Our second approach is inspired by the work presented in <ref type="bibr">Chernoff (1961)</ref>, where a diffusion approximation is used to test whether the mean of a normal distribution is positive. Such a method is used in the optimal stopping of a Phase III clinical trial <ref type="bibr">(Chick et al. 2017)</ref> and in the optimal stopping of a clinical trial with correlated treatments <ref type="bibr">(Chick et al. 2018)</ref>. However, the structure of our problem is different from previous studies because the DM has to consider the power of a hypothesis test. Moreover, the heuristic to extend the diffusion results to multiple-dose settings is particularly tailored to our setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Dose-Response Model</head><p>The relationship between the treatment dose of a drug and its induced response (e.g., change in a measurable medical outcome) is essential in dose-finding studies and is usually described by a curve or function referred to as a dose-response curve. For example, Figure <ref type="figure">1</ref> presents three typical dose-response curves, where the sigmoid shape in Figure <ref type="figure">1</ref>(a) is one of the most recurring dose-response relationships in theory and practice <ref type="bibr">(Gadagkar and Call 2015)</ref>.</p><p>There are two main classic approaches to estimate the dose-response curve. The first approach considers a functional form up front (parametric), dictating the shape of the underlying dose-response curve, and seeks to estimate the parameters by adapting the model to available observations (e.g., <ref type="bibr">Kotas and Ghate 2018</ref>). Another approach is to use a piecewise linear approximation (nonparametric) to the curve over a discrete set of available doses <ref type="bibr">(Berry et al. 2002)</ref>. Here, we present a first-order Bayesian nonparametric piecewise linear approximation to the curve, and we refer the reader to Online Supplement Section 1 for further discussion on the choice of the dose-response model, the pros and cons of the parametric and nonparametric models with respect to misspecification error, and their computational and potential practical implications.</p><p>Denote the numerical score of a patient's response by y and the prescribed dose by z &#8712; Z, where Z := {Z_j : j = 1, . . . , J} is the set of all admissible doses. Let &#920; := (&#952;_1, . . . , &#952;_J) denote a J-sized column vector of unknown parameters, where &#952;_j refers to the mean response to dose Z_j (index j is used in lieu of Z_j throughout this work). 
In particular, we assume that at any given dose j, the response is given by y_j = &#952;_j + &#949;, where &#949; &#8764; N(0, &#963;^2) (see, e.g., <ref type="bibr">Berry et al. (2002)</ref>), and &#963;^2 refers to the uncertainty in observing a patient's response. By this construction, the response of patients at dose j is normally distributed with an unknown mean &#952;_j and a known variance &#963;^2. We can interpret this construction as approximating the dose-response curve by fitting a piecewise linear function connecting consecutive &#952;_j's. We consider a Bayesian setup where the DM has a multivariate normal belief about &#920; and updates her belief with each patient's response. Although some early Phase II studies investigate binary or categorical responses to describe the success or failure of a treatment, finding the most promising dose usually involves a continuous response <ref type="bibr">(Berry et al. 2010)</ref>. In fact, there are cases when categorical responses are produced by considering a priori cutoff thresholds for underlying continuous responses (European Network for Health Technology Assessment 2013). Patients' responses such as improvement in life-years, progression-free survival, blood pressure, temperature, weight, Hamilton depression score, time to event (e.g., time to heal), and various scoring systems are some examples of continuous responses (see, e.g., <ref type="bibr">Biswas and Bhattacharya (2016)</ref>). In constructing this dose-response model, we assume that the error is normally distributed with a known variance. This is a standard assumption, as the primary endpoint (response) in the majority of trials with continuous response is assumed to follow a normal distribution <ref type="bibr">(Julious 2004)</ref>. See, for example, <ref type="bibr">Krams et al. (2003)</ref> for implementation of such a setup in practice. 
In addition, our case study considers a continuous outcome, as it reports the probability of achieving five-year quality-adjusted life-years (QALYs) for the standard and treatment arms. Furthermore, considering a known variance is usually justified for Phase II because an estimate of the variance is already available from Phase I results <ref type="bibr">(Spiegelhalter et al. 2004)</ref>.</p></div>
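As a concrete illustration of this response model, the sketch below simulates patient responses y_j = &#952;_j + &#949; with &#949; &#8764; N(0, &#963;^2) and evaluates the piecewise linear approximation connecting consecutive &#952;_j's. The dose grid and mean-response values are hypothetical, not taken from the paper's case study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-dose trial (illustrative values only).
doses = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # admissible doses Z_1..Z_J
theta = np.array([0.0, 0.4, 1.1, 1.5, 1.6])     # unknown mean responses theta_j
sigma = 1.0                                      # known observation std dev

def sample_response(j, rng):
    """One patient's response at dose index j: y_j = theta_j + eps, eps ~ N(0, sigma^2)."""
    return theta[j] + rng.normal(0.0, sigma)

def curve(z, theta_vec):
    """Piecewise linear dose-response approximation connecting consecutive theta_j's."""
    return float(np.interp(z, doses, theta_vec))
```

Averaging many simulated responses at a fixed dose recovers the corresponding &#952;_j, mirroring the normal-likelihood assumption used throughout the model.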
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Target Dose</head><p>The ultimate goal of a Phase II study is to identify a target dose. We focus on the efficacy of the target dose in Phase II of dose-finding clinical trials in order to define a utility function for the stopping decision, and thus ED_95, the smallest dose achieving 95% of the maximal response, is considered the target dose. We formally define ED_95 := min{z &#8712; Z : &#952;_z &#8805; 0.95 &#952;_{z_max}}, where z_max denotes the dose with maximal response. The target dose is different from the assignment dose given to the patients throughout the trial by a fixed allocation policy. The responses of patients to these assignment doses are used to update the mean response of different doses. The motivation for ED_95 is that the highest response may correspond to high dosages (toxic doses), which may induce undesired adverse side effects.</p></div>
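Given an estimate of the mean-response vector, ED_95 can be computed directly as the smallest dose whose mean response reaches 95% of the maximal mean response. The sketch below assumes nonnegative responses (so that the 95% threshold is meaningful); the dose and response values are illustrative.

```python
import numpy as np

def ed95(doses, theta_vec):
    """Smallest admissible dose whose mean response reaches 95% of the
    maximal mean response (assumes nonnegative responses)."""
    target_level = 0.95 * np.max(theta_vec)
    eligible = np.flatnonzero(theta_vec >= target_level)
    return doses[eligible[0]]

doses = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
theta = np.array([0.0, 0.4, 1.5, 1.52, 1.55])
target = ed95(doses, theta)   # dose 0.5 already attains 95% of the maximum 1.55
```

Note how a lower dose is preferred over the dose with the literally maximal response, reflecting the motivation of avoiding unnecessarily high (potentially toxic) dosages.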
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Problem Formulation</head><p>In this section, we present a stochastic DP formulation for the response-adaptive optimal stopping of dose-finding Phase II clinical trials. At each decision epoch, using the information accrued so far, a DM decides whether to (i) abandon the trial because of significant evidence of the inefficacy of the treatment, (ii) continue the trial to collect more information if there is insignificant evidence of the efficacy of the treatment with an expectation of improvement, or (iii) terminate the trial for efficacy and move to a confirmatory study when efficacy is verified by testing the treatment for a large population.</p><p>The optimal stopping problem is accompanied by an allocation procedure that, upon a continuation decision, identifies the next dose assignment. In this work, we assume that the allocation decision is made according to some predetermined rule. In particular, we use a one-step look-ahead policy introduced in <ref type="bibr">Nasrollahzadeh and Khademi (2018)</ref> for dose assignment. We test the effects of other dose assignment procedures on the performance of our proposed solutions in Online Supplement Section 4.6. The objective of the optimal stopping problem is expressed in terms of monetary values. This objective, which is known as net present value, is appropriate in stopping problems where sampling costs/rewards are financial measures <ref type="bibr">(Brealey et al. 2012)</ref>.</p><p>Recall that &#920; represents a vector of unknown expected responses corresponding to doses in set Z, where the resulting dose-response function is approximated by connecting consecutive &#952;_j's. Let n denote decision epochs, let N be the total number of (potential) homogeneous patients in the trial, and let y_{n+1} = &#952;_{z_n} + &#949;_{n+1} be the observed response of patient n + 1 after assignment to dose z_n, where (y_{n+1} | &#920;, z_n) &#8764; N(&#952;_{z_n}, &#963;^2). 
We assume that the response of a patient is observed before the next decision epoch. Define F_n as the &#963;-algebra generated by z_0, y_1, z_1, y_2, . . . , z_{n-1}, y_n. Note that z_0 is the assignment dose before observing any response, z_n is given by the fixed allocation policy, and &#964; represents some stopping time at which F_&#964; describes the accrued information gathered by sampling &#964; patients. We use y and &#375; to denote true and simulated observations, respectively.</p></div>
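The decision process of this section can be sketched as a simulation loop in which a fixed allocation policy picks doses and a pluggable stopping rule chooses among abandon/continue/terminate at each epoch. Everything below (the round-robin allocator and the naive threshold stopping rule) is an illustrative placeholder, not the paper's policy; it only shows how the stopping problem decouples from dose allocation.

```python
import numpy as np

ABANDON, CONTINUE, TERMINATE = 0, 1, 2

def run_trial(theta, sigma, N, allocate, decide, rng):
    """Simulate one trial: at epoch n the stopping rule sees the history F_n;
    on continuation the (decoupled) allocation policy assigns dose z_n and
    the response y_{n+1} = theta[z_n] + eps is observed."""
    history = []                           # list of (dose index, response) = F_n
    for n in range(N):
        action = decide(history, n)
        if action != CONTINUE:
            return action, n               # stopping time tau = n
        j = allocate(history, n)
        history.append((j, theta[j] + rng.normal(0.0, sigma)))
    return decide(history, N), N           # at n = N the rule must stop

def allocate(history, n):
    return n % 3                           # balanced round-robin over 3 doses

def decide(history, n, burn_in=12, horizon=30):
    """Naive placeholder rule tracking the presumed target dose (index 2)."""
    ys = [y for j, y in history if j == 2]
    m = float(np.mean(ys)) if ys else 0.0
    if n < burn_in:
        return CONTINUE
    if m > 1.0:
        return TERMINATE
    return ABANDON if (m < 0.0 or n >= horizon) else CONTINUE

rng = np.random.default_rng(1)
action, tau = run_trial(np.array([0.0, 0.6, 1.4]), 1.0, 30, allocate, decide, rng)
```

The two proposed methods of the paper differ only in how `decide` is constructed (one-step look-ahead versus decision regions from the diffusion approximation); the surrounding loop stays the same.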
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">State Space</head><p>Decision epochs are set at the times when a patient becomes available. We assume a possibly correlated multivariate normal prior on our belief about &#920;; that is, &#920; &#8764; N(&#956;_0, &#931;_0). Observations y form a normal likelihood, resulting in a Bayesian conjugate setup where posterior distributions on &#920; are also multivariate normal. Define &#956;_n := E[&#920; | F_n] and &#931;_n := Cov[&#920; | F_n] as the posterior moments of the belief about &#920;. At decision epoch n, the DM decides on abandoning, continuing, or terminating based only on the current estimate of the dose-response curve, which is summarized by the posterior distribution on parameter &#920; given historical information F_n (i.e., P(&#920; | F_n)). This posterior can be completely described by the state variable s_n := (&#956;_n, &#931;_n). Thus, the state space S consists of all pairs (&#956;, &#931;) with &#956; &#8712; R^J and &#931; &#8712; S^J_+, where S^J_+ denotes the set of J &#215; J positive semidefinite matrices, together with an absorbing state marking the end of the decision-making process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Action Space</head><p>At each decision epoch, if enough evidence (in the form of the current estimate of the dose-response curve) has emerged to suggest that an effective target dose is identified, and sampling more patients will not improve the estimate by a significant margin considering the cost of sampling, the DM may decide to "terminate" the trial and switch to a confirmatory phase where the target dose is further tested to confirm its effectiveness. On the contrary, the DM might learn that the current estimate of the dose-response curve shows no signs of effectiveness (e.g., a flat dose-response curve), and sampling more patients will only increase trial costs, and thus the DM may "abandon" the trial. However, if the current estimate of the dose-response curve suggests that an effective target dose may be identified, and continuing the trial with more sampling may potentially lead to a significant improvement of the estimate and utility, then the DM may "continue" the trial by allocating a dose to the next patient, observe the response, and update the current estimate of the dose-response curve. Recall that the allocation scheme is assumed to be given and independent of the optimal stopping problem. Define a_n(s) &#8712; {0, 1, 2} as the decision variable in state s, where 0 indicates that the decision is to abandon the trial, 1 indicates continuation of the trial, and 2 indicates that the trial is terminated. Thus, the action space is described by A(s) := {a_n(s) &#8712; {0, 1, 2}, &#8704;n &#8804; N}, where at stopping time n = &#964;, or at the horizon n = N, a_n &#8712; {0, 2}. For the absorbing state, set A(s) := &#8709;.
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Transitions</head><p>Terminating or abandoning the trial at decision epoch n determines the stopping time as &#964; = n, and the system transits to the absorbing state, where no more sampling is allowed and the current estimate of the dose-response curve remains unchanged. However, if the decision is to continue the trial, a dose is selected according to an allocation policy, and its observed response is used to update the current estimate of the dose-response curve (i.e., transit to a new state). The new state s_{n+1} := (&#956;_{n+1}, &#931;_{n+1}) is described by &#956;_{n+1} = &#956;_n + &#963;&#771;(&#931;_n, j) Z_{n+1} and &#931;_{n+1} = &#931;_n - &#963;&#771;(&#931;_n, j) &#963;&#771;(&#931;_n, j)^T, where j denotes the allocated dose, &#963;&#771;(&#931;_n, j) := &#931;_n e_j / &#8730;(&#963;^2 + &#931;_n^{jj}), e_j is a J-vector of 0's and a single 1 at the jth index, and Z_{n+1} := (y_{n+1} - &#956;_n^j) / &#8730;(&#963;^2 + &#931;_n^{jj}) is standard normal given F_n.</p></div>
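The transition on a continuation decision is the standard conjugate normal update. The sketch below implements the vector &#931;_n e_j / &#8730;(&#963;^2 + &#931;_n^{jj}) given in the text as `sigma_tilde`, together with the associated mean and covariance updates (stated here as the standard normal-normal Bayesian update consistent with that definition):

```python
import numpy as np

def update_state(mu, Sigma, j, y, sigma2):
    """Posterior update of (mu_n, Sigma_n) after observing response y at
    dose index j, under a normal likelihood with known variance sigma2."""
    s = sigma2 + Sigma[j, j]
    sigma_tilde = Sigma[:, j] / np.sqrt(s)    # Sigma_n e_j / sqrt(sigma^2 + Sigma_jj)
    z = (y - mu[j]) / np.sqrt(s)              # standardized innovation Z_{n+1}
    mu_next = mu + sigma_tilde * z            # equals mu + Sigma e_j (y - mu_j) / s
    Sigma_next = Sigma - np.outer(sigma_tilde, sigma_tilde)
    return mu_next, Sigma_next

mu0 = np.zeros(3)
Sigma0 = np.eye(3)
mu1, Sigma1 = update_state(mu0, Sigma0, j=1, y=2.0, sigma2=1.0)
```

Because the update touches the whole column &#931;_n e_j, a correlated prior lets one observation inform the belief about neighboring doses as well, which is what makes the piecewise linear Bayesian model sample-efficient.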
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Objective Function</head><p>We consider maximizing a monetary equivalent of benefits acquired as a result of early termination or abandonment of the trial versus costs incurred by continuing the trial with more sampling. If the decision is to abandon the trial (i.e., a_n = 0), then no immediate reward or cost is incurred. If the decision is to continue the trial (i.e., a_n = 1), then only a sampling cost c_1 &gt; 0 is paid. When the decision is to terminate the trial (i.e., a_n = 2), the immediate reward consists of the monetary value of the advantage over a placebo, if such an advantage is significant, minus the setup/sampling cost in the confirmatory phase. Define utility function u(a_n, s_n, F_n) as the expected immediate benefit (reward minus cost) incurred when deciding on action a_n in state s_n given information F_n by u(a_n, s_n, F_n) = 0 for a_n = 0, u(a_n, s_n, F_n) = -c_1 for a_n = 1, and u(a_n, s_n, F_n) = c_2 m_n E[1_{B_n} | F_n] - c_1 n_p for a_n = 2, where c_1 n_p is the cost of sampling n_p patients in the confirmatory phase (c_1 &gt; 0), c_2 &gt; 0 is the payoff per unit advantage of the target dose over the placebo, and m_n and the event B_n are defined below. Considering economic health outcomes as primary or secondary objectives is well established in the literature and practice. For example, refer to <ref type="bibr">Flight et al. (2019)</ref> for a review of clinical trials with economic health outcomes. A typical outcome in clinical trials is QALYs gained, which may have a corresponding monetary value that can be used to estimate the benefit per unit improvement over the placebo (see, e.g., <ref type="bibr">Hall et al. (2011)</ref>). Let m_n := E[df* | F_n] denote the expected advantage over the placebo, where df* := &#952;_{z*} - &#952;_0, z* is the (random) target dose, and &#952;_0 is the known and fixed response of the placebo. Because z* is random with respect to F_n, &#952;_{z*} denotes the posterior expected response at dose z*, and thus df* identifies the posterior advantage over the placebo. 
Furthermore, the indicator function 1_{B_n} determines the significance of the advantage over the placebo by considering the event B_n in which the null hypothesis is rejected when testing H_0: df* ≤ 0 against H_1: df* > 0. In particular,</p><p>where ȳ* and ȳ_0 denote the n_p-sample average responses of the estimated target dose and the placebo, respectively. The expectation E[1_{B_n} | F_n] can be estimated by Monte Carlo simulation: sample a dose-response curve from the posterior of Θ and identify its target dose via Equation (<ref type="formula">1</ref>); create n_p samples from N(θ_z*, σ²) and N(θ_0, σ²); calculate ȳ* and ȳ_0 and identify whether the event B_n occurs; and continue this process for enough samples, taking a sample average to estimate E[1_{B_n} | F_n]. The utility function defined here is tailored to Phase II clinical trials and is designed to capture the important tradeoffs in this decision-making process. For example, if the trial stops early because of futility, the treatment has no merit, which is reflected in its reward of 0. The cost of sampling patients up to that point is captured by the cost associated with the previous continuation decisions. If positive evidence for an advantage over the placebo emerges, the DM has to make sure that the evidence is significant and then consider the cost of setting up a confirmatory phase in addition to the cost of sampling patients up to that point.</p><p>Given that a decision to abandon or terminate the trial has been made at stopping time n = τ, the optimal expected utility is given by</p><p>Therefore, for every n < τ, the decision has to be a_n = 1, and a sampling cost c_1 is paid as the expected immediate utility (i.e., u(a_n = 1, s_n, F_n) = −c_1). 
Let l^π(s_0) denote the expected utility at stopping time τ, given historical information F_τ under policy π, when the initial prior belief about Θ is s_0 = (μ_0, Σ_0); that is,</p><p>where Π is the set of all nonanticipative admissible policies, and the DM selects a policy π ∈ Π such that V(s_0) = sup_{π∈Π} l^π(s_0). Therefore, the optimal value function is the solution to the following optimality equations:</p><p>The state space defined on Θ is unbounded, and thus standard stochastic DP techniques are computationally intractable. We describe and implement the solution developed by <ref type="bibr">Berry et al. (2002)</ref> and propose two alternative techniques to solve the optimal stopping problem.</p></div>
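The Monte Carlo estimator of E[1_{B_n} | F_n] described in this section can be sketched in Python. This is an illustration only: `target_dose` is a hypothetical stand-in for Equation (1), and the one-sided z-test form of the event B_n is our assumption about the confirmatory test:

```python
import numpy as np

def prob_significant(mu, Sigma, theta0, sigma2, n_p, q_alpha,
                     target_dose, n_mc=1000, rng=None):
    """Monte Carlo estimate of E[1_{B_n} | F_n]: the posterior probability
    that the confirmatory-phase test rejects H0: df* <= 0.
    target_dose(theta) maps a sampled curve to its target-dose index
    (a stand-in for Equation (1))."""
    rng = np.random.default_rng() if rng is None else rng
    hits = 0
    for _ in range(n_mc):
        theta = rng.multivariate_normal(mu, Sigma)  # sample a curve from the posterior
        j = target_dose(theta)
        # n_p-sample averages at the estimated target dose and the placebo
        ybar_star = rng.normal(theta[j], np.sqrt(sigma2 / n_p))
        ybar_0 = rng.normal(theta0, np.sqrt(sigma2 / n_p))
        z = (ybar_star - ybar_0) / np.sqrt(2 * sigma2 / n_p)
        hits += z > q_alpha  # event B_n: null hypothesis rejected
    return hits / n_mc
```

When the posterior puts the target dose far above (or below) the placebo, the estimate approaches 1 (or 0), matching the intuition behind the utility function.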
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Approximate Solutions</head><p>In this section, we explain three approximate methods. <ref type="bibr">Berry et al. (2002)</ref> used a simulation-based gridding algorithm, discussed in Section 5.1, to evaluate stopping times. This approach is computationally very expensive to implement and gives rise to "static terminators": for a few sample dose-response curves, a large number of trials are simulated forward in time, and their average expected utility is computed over a discretized grid by backward induction. These approximations are then used statically to evaluate stopping times for "similar" dose-response curves. Section 5.2 proposes a one-step look-ahead policy to find stopping times, and Section 5.3 proposes a diffusion approximation method; both are computationally more efficient.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Simulation-Based Gridding Approximation</head><p>Brockwell and Kadane (2003) and <ref type="bibr">Müller et al. (2007)</ref> proposed an approximation method for the problem defined in Section 4 in which the state space is discretized by a grid over which a sufficient number of experiments are run to estimate the final-stage value function. The key idea is that in each cell of this grid, say cell j, the value of termination and abandonment can be evaluated easily, and the value of continuation is the sample average over all the cells that are visited at the next decision epoch by the experiments currently visiting cell j. We present the details of the approach for completeness and to clarify the differences between the assumptions used in this approach and those in our setup.</p><p>To construct the grid, <ref type="bibr">Berry et al. (2002)</ref> simulate trials forward in time. In order to estimate m_n and ν²_n, after the current Θ is evaluated, simulate samples from Θ, identify the target dose for each sample, and calculate the posterior mean and standard deviation of df* through sample mean and variance estimation. Record the trajectory of each trial (i.e., the sequence of (m_n, ν_n)) over the bivariate grid for (m, ν). For example, Figure <ref type="figure">2</ref>(a) shows 30 trial trajectories of m_n on a simplified univariate grid (only m_n versus n) for N = 10. It is likely that some of the grid cells remain empty; that is, no simulated trial resulted in m and ν values corresponding to those cells, which affects the approximation quality. To fix this, take the (m_n, ν_n) values corresponding to those cells as priors and simulate a number of trials starting from those cells. Thus, the entire grid is populated.</p><p>Remark 1. We assume a correlated multivariate normal prior on Θ; that is, Θ ∼ N(μ_0, Σ_0). 
However, z* = ED95 is random with respect to F_n, and thus θ_z* is not normally distributed with respect to F_n. In the simulation-based gridding algorithm, the actual unknown distribution of θ_z* is approximated in the literature by a normal distribution. We do not make such an assumption in our proposed solutions.</p><p>To evaluate the optimal decision in each cell, start from the last decision epoch N, when the continuation decision is not possible and the value function can be computed by Equation (5). Denote by A_n^j the subset of indices i ∈ {1, ..., M} whose trajectories terminate in the jth cell (which corresponds to an (m, ν) pair) of the grid (m_n, ν_n, n). For the last decision epoch N, this is illustrated by the darker trajectories that end up in a specific cell in Figure <ref type="figure">2</ref>(a). The termination utility in the jth cell is evaluated by taking a sample average of the value functions corresponding to trial simulations whose trajectories terminated in that grid cell; that is,</p><p>where Û_N^j(a_N = 2) is the approximated utility function at decision epoch N in grid cell j when the decision is to terminate the trial, | · | denotes set cardinality, and the utility function is evaluated at m_N^j and ν_N^j, the jth cell values of m and ν, respectively. Therefore, the expected utility of termination at the last decision epoch N is given by</p><p>where 2σ²/n_p + (ν_N^j)² denotes the posterior predictive variance of ȳ* − ȳ_0. Thus, the approximated value function in each cell of the grid at decision epoch N is</p><p>where, if V*_N^j = 0, the optimal decision is to abandon the trial in the jth grid cell (i.e., a*_N^j = 0); otherwise, the optimal decision is to terminate the trial, a*_N^j = 2. (Service Science, 2020, vol. 12, No. 2-3, pp. 80-99, © 2020 INFORMS.) Working backward, the utility function in the jth cell for n < N when the decision is to continue the trial is given by</p><p>where j(i) denotes the cell that trajectory i visits at decision epoch n + 1. Therefore, the approximated value function in each cell of the grid at decision epoch n is</p><p>Enumerating the entire grid backward until decision epoch n identifies the optimal decision and value function for each cell. Figure <ref type="figure">2</ref>(b) shows a hypothetical example of optimal decisions on the grid (m, ν) at a particular decision epoch n. Algorithm 1 in Online Supplement Section 3 describes the method in more detail. <ref type="bibr">Berry et al. (2002)</ref> applied this approach under a set of "typical dose-response curves," where the approximate value function for each grid cell was computed by taking the average of expected utilities under the same set of dose-response curves. Therefore, when a true observation from a dose-response curve investigated in the trial becomes available at decision epoch n, an (m_n, ν_n) tuple is evaluated, and depending on which grid cell it falls into, the optimal decision is identified. This approach may be problematic, particularly when the unknown dose-response curve does not closely resemble those in the typical set. Furthermore, when the shape of the dose-response curve is unknown and a response-adaptive dynamic allocation scheme is trying to learn it, the resulting response-adaptive optimal stopping problem becomes computationally demanding. In Sections 5.2 and 5.3, we propose alternative methods that are significantly more efficient and can be used in a fully sequential setting.</p></div>
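The backward induction over the simulation-populated grid can be sketched as follows. This is a simplified Python illustration of the idea, not the authors' implementation; the cell indexing and utility inputs are assumptions:

```python
import numpy as np

def gridded_values(cells_visited, u_terminate, u_abandon=0.0, c1=1.0):
    """Backward induction on a simulation-populated grid (a simplified
    sketch of the gridding approach; names are illustrative).

    cells_visited[i][n] -- grid-cell index visited by simulated trial i
                           at decision epoch n (all trials run to epoch N).
    u_terminate[n][j]   -- estimated termination utility of cell j at epoch n.
    Returns V[n][j], the approximate value of cell j at epoch n."""
    M = len(cells_visited)
    N = len(cells_visited[0]) - 1
    n_cells = len(u_terminate[0])
    V = [dict() for _ in range(N + 1)]
    # last epoch: continuation is not allowed
    for j in range(n_cells):
        V[N][j] = max(u_abandon, u_terminate[N][j])
    for n in range(N - 1, -1, -1):
        # group trajectories by the cell they occupy at epoch n
        succ = {}
        for i in range(M):
            succ.setdefault(cells_visited[i][n], []).append(cells_visited[i][n + 1])
        for j in range(n_cells):
            v_cont = -np.inf
            if j in succ:  # continuation value: average over next-epoch cells visited
                v_cont = -c1 + np.mean([V[n + 1][k] for k in succ[j]])
            V[n][j] = max(u_abandon, u_terminate[n][j], v_cont)
    return V
```

The continuation value of a cell is driven entirely by where the simulated trajectories passing through it end up one epoch later, which is why empty cells degrade the approximation and must be repopulated.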
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">A One-Step Look-Ahead Policy</head><p><ref type="bibr">Frazier and Powell (2008)</ref> proposed a one-step look-ahead (knowledge gradient) policy for optimal stopping in ranking and selection problems, obtained by assuming that the experiment has to terminate at the next decision epoch. We adapt this framework to the optimal stopping of a dose-finding trial, which poses unique challenges. In particular, we consider three actions at each decision epoch (abandonment, continuation, and termination), whereas in most standard ranking and selection problems, only continuation and termination decisions are available. Furthermore, when the decision is to terminate the trial, our utility function contains the term</p><p>which emanates from evaluating the significance of the advantage over the placebo via a hypothesis test.</p><p>To quantify the value gained by continuing the trial, define V^KG_{a_n=1}(s_n) as a function that measures the difference between terminating or abandoning the trial at time n and continuing the trial, incurring the cost of sampling, and terminating or abandoning the trial at time n + 1; that is,</p><p>where the knowledge gradient policy π^KG, hereafter referred to as the KG policy, decides to continue the trial (i.e., a^{π^KG}(s_n) = 1) when V^KG_{a_n=1}(s_n) > 0. In the case that V^KG_{a_n=1}(s_n) ≤ 0, the decision is identified by a^{π^KG}(s_n) ∈ arg max_{a_n ∈ {0,2}} u(a_n, s_n, F_n). Note that a^{π^KG}(s_n) is a function returning the decision selected in state s_n under the KG policy π^KG. In order to evaluate V^KG_{a_n=1}(s_n), one needs to estimate both the current expected utility u(a_n, s_n, F_n) and the one-step utility u(a_n+1, s_n+1, F_n+1) by taking a sample average (Monte Carlo). Algorithm 2 in Online Supplement Section 3 presents the details of this procedure. 
This approach replaces multistep forward simulations of the trial from decision epoch n to N with one-step forward simulations, which significantly reduces the complexity and computational time of the algorithm.</p><p>The following result bounds the optimal stopping time from below and shows that the KG policy may stop sooner than the optimal policy; that is, whenever the KG policy decides to continue the trial, the optimal decision is also to continue the trial. This proposition motivates a sensitivity analysis with respect to the history of the trial. We later show that stopping sooner than the optimal policy may result in a low probability of correct decision in certain situations. This is also showcased in our case study, where both the simulation-based gridding algorithm and the KG policy result in unacceptable probabilities of correct decision.</p><p>Proposition 1. The optimal stopping time τ is bounded below by the KG stopping time τ^KG; that is, τ^KG ≤ τ.</p></div>
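The KG decision rule can be sketched generically in Python. Here `stop_utility` and `simulate_next` are hypothetical callbacks that evaluate u(·) at the current and one-step-ahead states (in practice, both would themselves be Monte Carlo estimates as in Algorithm 2):

```python
import numpy as np

def kg_decision(stop_utility, simulate_next, c1, n_mc=500, rng=None):
    """One-step look-ahead (KG-style) stopping rule, as a generic sketch.

    stop_utility()     -- returns (u_abandon, u_terminate) in the current state.
    simulate_next(rng) -- simulates one more patient and returns
                          (u_abandon, u_terminate) in the resulting state.
    Returns 1 (continue), 0 (abandon), or 2 (terminate)."""
    rng = np.random.default_rng() if rng is None else rng
    u0, u2 = stop_utility()
    stop_now = max(u0, u2)
    # expected best stopping utility one step ahead (Monte Carlo average)
    one_step = np.mean([max(*simulate_next(rng)) for _ in range(n_mc)])
    v_kg = one_step - c1 - stop_now  # KG value of continuing
    if v_kg > 0:
        return 1
    return 0 if u0 >= u2 else 2
```

Because the rule stops as soon as one more sample no longer pays for itself, it inherits the early-stopping behavior formalized in Proposition 1.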
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Diffusion Approximation</head><p>Although the complexity and computational time of the knowledge gradient method are significantly better than those of the simulation-based gridding method, both require forward simulations to approximate the optimal solution to the value functions in (7). Instead, we propose a method that places a prior belief on the actual benefit of the target dose over the placebo and approximates its increments over time by a continuous-time Wiener process, which enables us to construct the optimal stopping boundaries up front. This framework is inspired by <ref type="bibr">Chernoff (1961)</ref>, where a diffusion approximation is used to sequentially test whether the drift of a Wiener process is positive. Our approach likewise approximates the stopping time of sequential normal means (i.e., the advantage over the placebo) by solving a continuous-time Bellman equation. To that end, we first consider a setting with a single unknown dose versus a known placebo. Then, we design a heuristic that uses the resulting boundaries for decision making in the multiple-dose setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.1.">A Single Dose with Unknown Mean Response vs. a Placebo</head><p>For now, assume that the trial involves a placebo with known mean response and a single dose with unknown mean response. Without loss of generality, assume that y_0 ∼ N(0, σ²) and y* ∼ N(θ, σ²), where θ is unknown and has a prior</p><p>For a single dose, the advantage over the placebo is given by df* = θ − 0 = θ; see Section 4. Therefore, at each time period, a sample from the dose with unknown mean response is observed, and the posterior of θ, and therefore of df*, becomes df* | F_n ∼ N(m_n, σ²/t_n), where</p><p>In this setting, df* naturally follows a normal distribution. Recall that the utility calculation involves an expectation, which by this construction has a closed form. In particular,</p><p>where 2σ²/n_p + σ²/t_n is the posterior predictive variance of ȳ* − ȳ_0, and Φ(·) denotes the standard normal cumulative distribution function.</p><p>Redefine the state variable as ŝ_n = (m_n, t_n), and, using ŝ_0 = (m_0, t_0), let l̂^π(ŝ_0) denote the expected utility at stopping time τ under policy π ∈ Π when the prior is parameterized by (m_0, t_0); that is,</p><p>where the DM selects a policy π ∈ Π such that V*(ŝ_0) = sup_{π∈Π} l̂^π(ŝ_0). Define x_0 = m_0 t_0 and x_n = x_0 + Σ_{i=1}^n ŷ*_i, so that m_n = x_n / t_n. Using these definitions, the state variable can be rewritten as ŝ_n = (x_n, t_n). Let G(x_τ, t_τ) denote the expected utility at the stopping time, given by</p><p>Because the utility functions are uniformly bounded for any state and action, and the action space is finite, there exists a Markovian and deterministic optimal policy. 
The optimal policy for V*(m_0, t_0) = sup_{π∈Π} l̂^π(m_0, t_0) is the solution to the following Bellman equation:</p><p>where t_n+1 = t_n + 1 and x_n+1 = x_n + ŷ*_n+1. Optimality equation (<ref type="formula">16</ref>) has a continuous state space, and thus it is computationally intractable to solve exactly. Therefore, in order to approximate the solution to (<ref type="formula">16</ref>), suppose that patients' responses are observed continuously rather than at discrete decision epochs t_n. This assumption is needed only for a rigorous development of the method and is not required for implementation in practice. In fact, we use stopping boundaries developed in the continuous regime for the stopping decisions of trials in which patients arrive at discrete time epochs. One can think of the continuous equivalent of the cumulative sum x_n = x_0 + Σ_{i=1}^n ŷ*_i as a Brownian motion with unknown drift θ and variance σ² per unit time, which satisfies the following stochastic differential equation:</p><p>where x_t is the extension of x_n to real-valued t, and W_t is a standard Brownian motion. Extend the definition of the filtration F_n to F_t^ct, the natural σ-algebra generated by the process {x_s}_{s∈[t_0,t]}. Therefore, the continuous-time approximation of the Bellman equation in (<ref type="formula">16</ref>) is given by</p><p>The following proposition shows that B^ct(x_t, t) is the solution to a free boundary problem with an advection-diffusion partial differential equation and two boundary conditions.</p><p>where B^ct(x_t, t) = G(x_t, t) outside of the continuation set C. The free boundary ∂C is given by</p><p>The boundaries of the continuation set C can be found without any trial simulation, which significantly reduces the complexity and computational effort required to obtain optimal stopping times. 
The solution algorithm for the free boundary problem is described in Online Supplement Section 3. We use a trinomial tree discretization method to solve the partial differential equation of Proposition 2. Figure <ref type="figure">3</ref> shows an example of the solution to the free boundary problem.</p><p>The approximated optimal decision is identified by determining which region the state variable falls into after each new observation.</p></div>
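As a rough illustration of how such stopping regions can be computed, the following Python sketch solves the discrete Bellman recursion on an (x, t) grid by backward induction. It is a crude substitute for the trinomial tree PDE solver, not the paper's algorithm: the stopping utility G and the grid parameters are assumptions, and the posterior predictive increment N(x/t, σ²(t+1)/t) plays the role of the one-step transition:

```python
import numpy as np

def stopping_regions(G, x_grid, t0, N, sigma2, c1):
    """Approximate the continuation/stopping regions of the Bellman equation
    by backward induction on an (x, t) grid (a simplified sketch; the paper
    solves the equivalent free boundary PDE with a trinomial tree).
    G(x, t) is the utility of stopping in state (x, t). Returns a boolean
    array cont[n, k]: True if (x_grid[k], t0 + n) is in the continuation set."""
    V = np.array([G(x, t0 + N) for x in x_grid])  # no continuation at epoch N
    cont = np.zeros((N, len(x_grid)), dtype=bool)
    for n in range(N - 1, -1, -1):
        t = t0 + n
        V_new = np.empty_like(V)
        for k, x in enumerate(x_grid):
            # posterior predictive of the next increment: N(x/t, sigma2*(t+1)/t)
            mean, sd = x / t, np.sqrt(sigma2 * (t + 1) / t)
            w = np.exp(-0.5 * ((x_grid - (x + mean)) / sd) ** 2)
            v_cont = -c1 + np.dot(w / w.sum(), V)  # discretized expectation
            v_stop = G(x, t)
            cont[n, k] = v_cont > v_stop
            V_new[k] = max(v_stop, v_cont)
        V = V_new
    return cont
```

With a convex stopping utility, continuation is most valuable near x = 0, where one more observation carries the most information; a large sampling cost collapses the continuation set entirely.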
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.2.">Multiple Doses with Unknown Mean Responses</head><p>In the previous subsection, we constructed the continuation boundaries for the case of a single dose with an unknown mean response. However, the original problem consists of multiple doses whose mean responses are unknown. Therefore, the target dose z* = ED95 is random, and each continuation decision may yield a different target dose along the sample path. This results in an unknown distribution for df* when multiple doses are considered. Indeed, if there were only a single dose with an unknown mean response, the allocated dose for continuation decisions and the target dose would coincide, and the posterior advantage over the placebo, df*, would be normally distributed. In the multiple-dose setting, however, the allocated dose for a continuation decision may differ from the estimated target dose, so df* does not enjoy conjugacy with respect to the patient's response. In the literature, a variety of heuristic approaches have been proposed to extend results for a single alternative to multiple-alternative settings; see, for example, <ref type="bibr">Chick and Frazier (2012)</ref> and <ref type="bibr">Chick et al. (2018)</ref>. However, those approaches rely on the assumption that the reward equals the maximum expected reward of simulating the arm with the highest mean, whereas in our formulation, the arm with the highest mean does not necessarily yield the maximum expected utility. Thus, they are not applicable in our setting, and we propose the following heuristic to extend the single-dose boundaries to the multiple-dose case.</p><p>Recall that for each dose Z_j, there is a prior on the expected response θ_j. The diffusion approximation boundaries depend only on the prior and the shape of the utility function. 
Therefore, for each dose j, we construct the continuation boundaries up front. The idea is that, at each decision epoch, we estimate m*_n = x*_n / t_n for the target dose z* and make decisions by checking whether m*_n falls into the termination, abandonment, or continuation region of the decision regions corresponding to dose z*. To that end, at each decision epoch, we draw samples from the posterior on Θ, and for each sample, we use Equation (<ref type="formula">1</ref>) to find the target dose. Then, we take a sample average to estimate the target dose and, because this sample average may not lie in Z, round it to the closest dose. Given the estimated target dose z*, we simply have m*_n = E[θ_z* | F_n]. The decision is found by referring to the optimal decision regions corresponding to dose z* and checking whether m*_n belongs to the abandonment, continuation, or termination zone at t_n. Given a continuation decision at time n, a patient is assigned to a dose according to the allocation scheme; the response y_n+1 is observed and is used to update the estimate of the dose-response curve (i.e., Θ). This process continues until the stopping time is reached or all patients have been tested.</p></div>
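The heuristic above can be sketched in a few lines of Python. Here `regions` is a hypothetical lookup built from the single-dose boundaries of Section 5.3.1, and `target_dose` stands in for Equation (1); both names are illustrative:

```python
import numpy as np

def heuristic_decision(mu, Sigma, regions, t_n, target_dose, n_mc=500, rng=None):
    """Multiple-dose heuristic (sketch): estimate the target dose by posterior
    sampling, then read the decision off the single-dose stopping regions
    precomputed for that dose.

    regions[j](m, t) -- hypothetical lookup returning 0 (abandon),
                        1 (continue), or 2 (terminate) for dose j.
    target_dose      -- maps a sampled curve to its target-dose index."""
    rng = np.random.default_rng() if rng is None else rng
    samples = rng.multivariate_normal(mu, Sigma, size=n_mc)
    doses = [target_dose(theta) for theta in samples]
    z_star = int(round(float(np.mean(doses))))  # round to the closest dose
    m_star = mu[z_star]                         # m*_n = E[theta_{z*} | F_n]
    return z_star, regions[z_star](m_star, t_n)
```

The design choice is to localize the problem: once the target dose is pinned down by sampling, the multi-dose decision reduces to a table lookup in that dose's precomputed single-dose regions.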
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Numerical Results</head><p>In this section, we present implementation results for the simulation-based gridding algorithm, the one-step look-ahead policy, and the diffusion approximation in a variety of settings. Because the performance of these solution methods may differ depending on the adaptive dose allocation scheme, we assume that the dose allocation algorithm is that of <ref type="bibr">Nasrollahzadeh and Khademi (2018)</ref> for all of the solution algorithms. We include a sensitivity analysis in Online Supplement Section 4.6 with different allocation policies to investigate the robustness of our approach. To that end, we implement a typical balanced randomization allocation policy, which assigns patients to each dose with equal probabilities and is described as the design norm in many areas of medical research; an example of such an allocation policy appears in our case study, where the trial used balanced randomization for patient assignment. We also implement an adaptive randomization policy rooted in Thompson sampling, presented in <ref type="bibr">Berry et al. (2010)</ref>, chapter 4. To assess the quality of the solution methods with respect to termination, abandonment, and continuation decisions, two types of dose-response curves are tested:</p><p>i. A sigmoid curve with a significant advantage over the placebo, one of the most common dose-response shapes in practice (e.g., <ref type="bibr">Gadagkar and Call 2015</ref>). This curve is used to test the performance of the different stopping rules with respect to continuation and termination decisions. For this curve, the optimal decision at stopping is to terminate the trial for efficacy.</p><p>ii. A flat dose-response curve, which is used to assess the quality of the different algorithms when the correct decision at stopping is to abandon the trial for futility. 
For further analysis of the shape of the underlying dose-response curve, see Online Supplement Section 4.5, where we discuss the performance of the proposed methods on a bimodal, nonmonotonic dose-response curve.</p><p>Recall that the problem is modeled as a Bayesian Markov decision process, so the policies are optimal when assessed in a fully Bayesian setup, that is, averaged over problem instances: true dose-response curves must be generated randomly from the prior, and the performance must be measured in expectation under that prior. However, because of the computational difficulty of generating results for the simulation-based gridding approach, we assess the performance of these approximation methods with respect to two fixed dose-response curves (a frequentist setting). Assessing different algorithms with respect to a specific configuration is not unprecedented, particularly in clinical trials; see, for example, <ref type="bibr">Berry et al. (2002)</ref> and <ref type="bibr">Krams et al. (2003)</ref>. A sigmoid and a flat curve are considered to highlight the performance of these algorithms when facing favorable and unfavorable cases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Simulation Initialization</head><p>A typical number of doses under investigation in Phase II clinical trials is between 4 and 12 (e.g., <ref type="bibr">Berry et al. 2002</ref>). We consider 11 doses including a placebo. The first dose is the placebo, and its known and fixed mean response marks the baseline score for any particular treatment. At each decision epoch, if the decision is to continue the trial, a dose must be allocated to the next patient. We use a one-step look-ahead policy to select the dose that minimizes the one-step posterior variance of the target dose ED95. Thereafter, the patient's response is generated from the true distribution and is used to update the posterior estimate of the dose-response curve. In line with the literature, we assume that the stopping algorithm is applied only after a certain number of patients (e.g., 20) have already been through the trial. The total number of patients volunteering for the trial is assumed to be 400. We assume that the observation variance is known and fixed at 10 units. A sensitivity analysis on this assumption is conducted in Online Supplement Section 4.4, where we increase the observation variance to test the performance of the proposed policies when observations are less informative. The significance level is 1% across all experiments.</p><p>We assume a sampling cost c_1 = 1,000, a sampling cost in the confirmatory phase c̃_1 = 1,000, and a reward per unit advantage over the placebo c_2 = 1,000,000. This replicates the original settings of the study by <ref type="bibr">Berry et al. (2002)</ref> and is also a reasonable initialization for our case study, where c_1 and c_2 are in the same range. 
We also conduct a sensitivity analysis in Online Supplement Section 4.7, where we investigate the sensitivity of the proposed solutions to the cost/benefit ratio for different configurations of c_1, c̃_1, and c_2. The prior (μ_0, Σ_0) is set according to μ_0 = (0, ..., 0), and Σ_0 is initiated by a Gaussian covariance function with Cov(θ_i, θ_j) = β exp{−γ(i − j)²}, where β is usually estimated by Var(θ_i). The Gaussian structure of the covariance function allows for less correlation when doses are farther apart. To keep the symmetry of the covariance matrix, β is chosen to be equal to (Var(θ_i) + Var(θ_j))/2 = 100, and γ, the length-scale factor, is set to 0.01 for both the sigmoid and flat curves. This ensures that the initial values of the expected responses carry little prior information about the shape of the dose-response curve. A thinning factor of 5 is used in generating random variables: only every fifth random variable created is used, to avoid serial correlation in the sequence of random numbers. In reporting the results, 30 simulations with different sequences of random numbers are considered. The simulation is coded in the R programming language and is run on an Intel Core i7 3.7 GHz processor with 16 GB of RAM.</p><p>In the case of the simulation-based gridding algorithm, recall that the advantage over the placebo (i.e., df*) is assumed in the literature to be normally distributed according to N(m, ν²). The prior values of m_0 and ν_0 are set to 0 and 10, respectively, to ensure that the prior carries little information about the belief on df*. 
In constructing the grid over m and ν, we consider the range of m to be 20 units (i.e., [0, 20]) and the range of ν to be 10 units (i.e., [0, 10]). The grid is divided into 40 and 20 intervals along the m and ν axes, respectively. Initially, to populate the grid, M = 1,000 experiments are run, and their (m, ν) trajectories are recorded over the grid. Afterward, from each empty cell in the grid, 10 more simulations are initiated, and their trajectories are recorded. To implement the algorithm in an online fashion, we parallelize the forward simulations to speed up the computation. For more details, we refer readers to Online Supplement Section 3.</p><p>For the diffusion approximation method, the prior values of m_0 and ν_0 are chosen to replicate those of the simulation-based gridding algorithm. We also assume the same range for m as in the gridding algorithm (i.e., m ∈ [0, 20]). The discretization in the diffusion approximation differs from the grid construction in the simulation-based gridding algorithm: here, the grid is constructed over values of x and t. Because 20 patients have already been through the trial, t is considered to be in [20 + t_0, 400 + t_0], where t_0 parameterizes the prior. The details of calculating both axis intervals are given in Online Supplement Section 3. We later conduct a sensitivity analysis on the grid sizes of (m, ν) and (m, t) for the simulation-based gridding and diffusion approximation policies, respectively, to investigate whether finer grids improve the performance significantly; see Online Supplement Section 4.8.</p><p>Because the simulation of the trial is the same for all three methods, we report the computational time required to find the stopping decision for each method. At each decision epoch, the gridding algorithm runs a forward simulation and uses backward induction, which takes 1 hour on average. 
In this method, the computational time in the early stages, when there are many remaining patients to consider, is considerably longer than in the later stages, when fewer patients are left. At each time period, the one-step look-ahead policy takes about 30 seconds to find the decision. The diffusion approximation creates the stopping regions up front, so for a given dose allocation and its response, finding the stopping decision is instantaneous. These results confirm that the proposed methods are much less demanding than the standard method.</p></div>
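The prior covariance described in this section can be constructed in a few lines. This is a Python sketch of the Gaussian covariance structure (the paper's simulation itself is coded in R):

```python
import numpy as np

def gaussian_prior_cov(J, beta=100.0, gamma=0.01):
    """Prior covariance with the Gaussian (squared-exponential) structure
    Cov(theta_i, theta_j) = beta * exp(-gamma * (i - j)^2), so that nearby
    doses are more strongly correlated than distant ones."""
    idx = np.arange(J)
    return beta * np.exp(-gamma * (idx[:, None] - idx[None, :]) ** 2)
```

With γ = 0.01 and 11 doses, even the most distant pair retains correlation exp(−0.01·100) ≈ 0.37, so a single observation informs the belief across the whole curve while still carrying little information about its shape.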
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Results</head><p>Figures <ref type="figure">4</ref> and <ref type="figure">5</ref> show the state of the dose-response estimation after assigning 20 patients. In particular, Figure <ref type="figure">4</ref>, (a) and (b), shows the posterior estimates of the dose-response curve, where each point on the piecewise linear dotted line is the sample average of 30 posterior estimates of μ in Θ ∼ N(μ, Σ) after observing 20 patients. Furthermore, Figure <ref type="figure">5</ref>, (a) and (b), shows the maximum posterior variance, where each point denotes the sample average of 30 posterior estimates of the maximum of Σ_jj, j = 1, ..., J, for each patient number.</p><p>Tables <ref type="table">1</ref> and <ref type="table">2</ref> show the estimated utility at the stopping time, the true utility at stopping, the average stopping time, and the probability of correct decision (PCD) for the three stopping rules with respect to the sigmoid and flat dose-response curves, respectively. In reporting the true utility at stopping, we assume that the true underlying dose-response curve, and thus its ED95 and true benefit over the placebo, is known. Therefore, the probability of success at the end of the confirmatory phase, the term E[1_{B_n} | F_n], reduces to the probability 1 − Φ(q_α − df* √n_p / √(2σ²)). What differentiates the policies is the cost of patient sampling, which depends on the stopping time corresponding to each policy. In the case of detecting a significant advantage over the placebo, the correct decision is to terminate the trial; if the true dose-response curve is flat, abandoning the trial is the correct decision. We also include a discussion of the probability of correctly identifying the target dose as a performance measure in Online Supplement Section 4.1. 
Notice that all three algorithms correctly terminate the sequential sampling process when the dose-response curve is sigmoid with a significant advantage over the placebo. The KG policy stops sooner, but the estimated utility at the stopping time is higher for the simulation-based gridding algorithm in spite of its observing more patients. This is because, upon stopping, the gridding algorithm has a slightly higher estimate of the advantage over the placebo, and the benefit per unit advantage over the placebo is much higher than the sampling cost. Recall that in our setting this ratio is c_2/c_1 = 1,000,000/1,000, so detecting even marginal improvements in the estimate of the advantage over the placebo will overwhelm the sampling cost. The simulation-based gridding algorithm is uniquely effective at detecting these marginal improvements given high-quality prior knowledge, because it simulates a large number of trials to the end of the horizon and can trade off sampling patients for beneficial marginal improvements in the estimate of the target dose's mean response. The KG policy is weaker in this regard because its decision is made with respect to the advantage over the placebo one step into the future, and it may thus fail to identify similar trade-off opportunities several steps ahead. The diffusion approximation method achieves lower estimated utility and stops later. In particular, Figure <ref type="figure">6</ref>(a) shows a few simulated state-variable paths crossing from the continuation region into the termination region. The average stopping times and expected utilities reported in Tables <ref type="table">1</ref> and <ref type="table">2</ref> are averages over 30 sample paths. 
The reported probability of correct decision considers only the stopping decisions and is independent of the target dose selection in case a significant advantage is detected.</p><p>When sampling from a flat dose-response curve, the gridding algorithm and the KG policy incorrectly terminate the trial most of the time. In the case of the KG policy, as soon as the next-step expected utility is estimated to be less than the current one, the policy stops sampling. If the policy overestimates the expected response of the target dose, the current estimated utility may become positive, leading to the incorrect decision to terminate the trial instead of abandoning it. Table <ref type="table">2</ref> shows that the diffusion approximation algorithm correctly abandons 96% of the time when the dose-response curve is flat, although the average abandonment time comes significantly later in the trial. Also, notice the difference between the estimated and true utilities of the simulation-based gridding and KG policies. Both methods incorrectly estimate a monetary benefit, whereas the underlying dose-response curve will only result in sampling costs. Figure <ref type="figure">6</ref>(b) shows a few simulated state variable paths of the diffusion approximation method crossing into the abandonment region. Furthermore, the estimated utility at the stopping time for the diffusion approximation algorithm, although lower than those of the simulation-based gridding algorithm and the KG policy, is closer to the true expected utility for the flat dose-response curve. Therefore, one might conclude that the gridding and KG algorithms do not produce reliable estimates when the true dose-response curve is flat. 
Results show that in this setting the standard method may produce very poor solutions, which may have severe consequences in terms of the costs of the next phase and the health of future patients; see Rojas-Cordova and Hosseinichimeh (2018) for a discussion of the consequences of misspecification errors in adaptive clinical trials. One approach to addressing this shortcoming of the gridding and KG policies is to consider stopping decisions only once enough evidence has been gathered regarding the dose-response curve. This evidence may be interpreted as the accuracy of the dose-response estimation (i.e., the diagonal of the covariance matrix Σ in the state variable s_n). Next, we address such an extension.</p></div>
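The boundary-crossing behaviour illustrated in Figure 6 can be mimicked with a toy first-passage simulation of a drifted Brownian motion standing in for the diffusion approximation of the state variable. The drift, volatility, and boundary values below are hypothetical, chosen only to show how a path is tracked until it leaves the continuation region.

```python
import random

def first_exit(drift, vol, upper, lower, dt=0.01, max_steps=10_000, seed=0):
    """Simulate one Euler path of dX = drift*dt + vol*dW and report the
    first time it leaves the continuation region (lower, upper).
    Returns (decision, step): 'terminate' at the upper boundary,
    'abandon' at the lower one, 'continue' if neither is hit in time."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    x, step = 0.0, 0
    while step < max_steps:
        x += drift * dt + vol * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        step += 1
        if x >= upper:
            return "terminate", step
        if x <= lower:
            return "abandon", step
    return "continue", step
```

A positive drift (a real advantage over the placebo) makes upper-boundary exits, i.e., termination for efficacy, overwhelmingly likely; a negative drift pushes paths into the abandonment region, mirroring the flat-curve behaviour discussed above.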
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">The Effect of History F_n</head><p>Motivated by our results, we propose applying the stopping rule only after a certain number of patients' responses have already been observed. As more patients' responses are added to the history, the accuracy of the estimation of Θ increases. This is because sampling dose j lowers Σ_jj, which in turn is a measure of uncertainty about the dose-response estimation at dose j. Therefore, imposing a bound on max_j Var[θ_j | F_n] ensures a minimum level of accuracy in the dose-response curve estimation. This modification does not add to the complexity of the stopping rule because Var[θ_j | F_n] = Σ_jj is already available to the DM as a part of the state space. We propose the following heuristic: at each decision epoch for any solution method, if the decision is to continue, continue; if the decision is to stop, check whether max_j Var[θ_j | F_n] ≤ V is satisfied; if it is, follow the decision; otherwise, continue.</p><p>One can tune V to change the amount of evidence gathered before the stopping decisions are applied. We set V = 4 units in presenting the results. Figure <ref type="figure">7</ref> shows the state of the dose-response estimation, in terms of the posterior estimate of the dose-response curve, when max_j Var[θ_j | F_n] ≤ 4 for the first time. Figure <ref type="figure">8</ref> shows the maximum posterior variance from the start of the trial until max_j Var[θ_j | F_n] ≤ 4 for the first time. In the case of the sigmoid dose-response curve, max_j Var[θ_j | F_n] ≤ 4 when n ≥ 280, and sup_{π∈Π} l^π(s_0) = 10,011,765, which is achieved at patient 286. For the flat dose-response curve, max_j Var[θ_j | F_n] ≤ 4 when n ≥ 243, and sup_{π∈Π} l^π(s_0) = −312, which is achieved at patient 296. 
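The heuristic above can be written as a thin wrapper around any base stopping rule. The decision labels and the vector of posterior variances are hypothetical stand-ins for the paper's state-variable quantities, not its implementation.

```python
def gated_decision(base_decision, posterior_variances, v_threshold=4.0):
    """Constrained stopping heuristic, sketched: follow the base policy's
    stop/abandon decision only once max_j Var[theta_j | F_n] <= V, i.e.,
    only once the whole dose-response curve is estimated accurately
    enough; otherwise keep sampling."""
    if base_decision == "continue":
        return "continue"
    if max(posterior_variances) <= v_threshold:
        return base_decision   # enough evidence: honor the stop decision
    return "continue"          # not enough evidence yet: keep sampling
```

Because the gate only ever converts a stop into a continuation, it can delay but never induce an early stop, which is why it leaves the continuation behaviour of the base policies unchanged.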
Similar to Section 6.2, Tables <ref type="table">3</ref> and <ref type="table">4</ref> show the performance measures for the three stopping rules with respect to the sigmoid and flat dose-response curves, respectively. See Online Supplement Section 4.2 for state variable paths of the diffusion approximation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.">Sensitivity Analyses</head><p>To assess the robustness of the proposed solutions with respect to model/simulation parameters, dose-response model specifications, and different allocation procedures, we conduct further sensitivity analyses on the variance of observations, the shape of the underlying dose-response curve, the dose allocation procedure, the monetary parameters, and the discretization parameters. In particular, in Online Supplement Section 4.4, the variance of observations is increased by a factor of 10 to add uncertainty to the information present in each observation. Our results demonstrate that the proposed policies are robust with respect to the variance of observations in that their relative performance in terms of the estimated utility, stopping time, and probability of correct decision persists for the sigmoid dose-response curve. However, as expected, the number of patients required to satisfy the condition of Section 6.3 increases, which improves the performance of the simulation-based gridding and KG policies because the cost of sampling more patients may cancel out any incorrect identification of the advantage over the placebo. In Online Supplement Section 4.5, the performance of the proposed solutions is investigated for a bimodal curve as a representative of a more general dose-response relationship. The results suggest that the proposed policies are robust with respect to the shape of the underlying dose-response curve. Online Supplement Section 4.6 compares the performance of the proposed policies with respect to the underlying dose allocation procedure. Balanced randomization and another adaptive randomization procedure (developed by <ref type="bibr">Berry et al. (2010)</ref>) are also investigated and show results similar to those of Section 6.2. Note that in Section 6.1, we assume that c_1 = $1,000 and c_2 = $1,000,000. 
Such a cost/benefit ratio may not hold for every trial, and thus Online Supplement Section 4.7 increases c_1 by factors of 2 and 5 and decreases c_2 by factors of 10 and 100. The effects of all these combinations on the performance of KG and the diffusion approximation are reported there. The results show that the diffusion approximation outperforms KG in every case. Finally, the discretization parameters for the simulation-based gridding and diffusion approximation methods are changed to generate finer grids. However, our results show that the improvement in both cases is marginal.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5.">Case Study: Application to Clinical Data</head><p>To show the performance of our policies with respect to real data from clinical trials, we simulate the proposed policies for a treatment studied by <ref type="bibr">Hall et al. (2011)</ref> on the effects of genomic test-directed chemotherapy for early-stage lymph node-positive breast cancer. This treatment was studied for its potential to spare patients unnecessary chemotherapy and to reduce costs. The trial investigated two treatment methods: standard care, which means chemotherapy for all patients, and genomic test-directed chemotherapy. For participants in the standard care arm, the probability of a five-year disease-free survival is 0.79 with a standard error of 0.028. The genomic test-directed chemotherapy achieved a 0.88 probability of a five-year disease-free survival on average with a standard error of 0.038. A total of 300 patients were recruited for the preliminary trial and were allocated to each arm according to a balanced randomization policy. Compared with the standard care, the genomic test-directed chemotherapy had an incremental cost of €860 ($1,125). <ref type="bibr">Hall et al. (2011)</ref> estimated that, on average, the genomic test-directed chemotherapy gained 0.16 years in QALYs with an incremental cost-effectiveness ratio of €5,529 per year per patient. A conservative estimate of the market value per QALY is presented in Online Supplement Section 4.3. The significance level is assumed to be 5%. Table <ref type="table">5</ref> summarizes the trial's specifications and parameters for this case study. Other parameters not mentioned in Table <ref type="table">5</ref> are initialized according to Section 6.1.</p><p>Table <ref type="table">6</ref> shows the estimated utility, the true utility, the stopping time, and the PCD for the gridding algorithm, the KG policy, and the diffusion approximation method considering the underlying dose-response curve of <ref type="bibr">Hall et al. (2011)</ref>. Aligned with the actual trial, we use a randomized allocation rule for assigning patients to treatments upon the continuation decision. The results are consistent with our numerical experiments and further demonstrate that the proposed diffusion approximation is more reliable. In fact, the standard method produces inaccurate estimates of the utility and terminates the trial for efficacy. However, <ref type="bibr">Hall et al. (2011, p. 57)</ref> concluded that "there is substantial uncertainty regarding the cost-effectiveness of Oncotype DX-directed chemotherapy" at the end of the trial and stated that further research has to be done to collect more information about the cost effectiveness of the treatment. Accordingly, we consider the correct decision in our simulations to be abandonment or continuation at the end of the trial. Specifically, our results show that the majority of state variable paths in the diffusion approximation remain in the continuation region at the end of the trial with 300 patients. 
Notice that the results in Table <ref type="table">6</ref> agree with the previous results of Tables <ref type="table">2</ref> and <ref type="table">3</ref>, where the simulation-based gridding algorithm performs best with respect to the estimated utility, whereas KG achieves the highest true utility. However, both methods are susceptible to incorrectly detecting a significant response, and their probability of correct decision is much lower than that of the diffusion approximation method. Also note that in Tables <ref type="table">2</ref> and <ref type="table">3</ref>, simulation-based gridding and KG produce the lowest probabilities of correct decision and significantly overestimate the utility.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions</head><p>In this work, we studied the optimal stopping problem of an adaptive dose-finding clinical trial capable of terminating the trial for efficacy or abandoning it as a result of futility. We implemented a simulation-based gridding solution method and compared it with two proposed methods in terms of solution quality and computational effort. Our first proposed method assumes that the next decision epoch is the last one (KG) and produces stopping decisions accordingly. Our second proposal considers a two-dose continuous version of the sampling and stopping problem and constructs an Itô process for the state transition, by which solving the continuous Bellman equation coincides with solving a partial differential advection-diffusion equation. We proposed a heuristic approach to extend the algorithm to the multiple-dose setting.</p><p>Our results show that if, in the true dose-response curve, the target dose has a significant advantage over the placebo, all three methods make the right decision of terminating the trial for efficacy; the KG policy stops soonest, followed by the simulation-based gridding algorithm, whereas the diffusion approximation requires more sampling epochs. Therefore, the estimates of the utility for the standard approach and the KG policy are higher than that of the diffusion approximation. However, if the target dose does not have a significant advantage over the placebo in the true dose-response curve, the gridding and KG methods perform extremely poorly in terms of the probability of correct decision and in estimating the utility. In particular, these two methods decide on termination 90% of the time on average, although the correct decision is abandonment; that is, the error probability is 0.9 for these methods, which may have significant adverse consequences and is unacceptable for regulatory approvals. 
In fact, these two methods stop too early and significantly overestimate the benefits upon termination. In stark contrast, the diffusion approximation method produced abandonment decisions 96% of the time in this setting, resulting in only a 4% error, which shows that the diffusion method stops the trial much later, when it has enough evidence for making decisions.</p><p>Our results suggest that applying the standard method in a fully adaptive setting from early on, where a DM can terminate or abandon the trial at each decision epoch, may have severe consequences when the correct decision is to abandon. Motivated by such observations, we proposed a modified stopping rule, where the stopping decision is activated only if the maximum posterior variance of the mean response Θ falls below a threshold. In fact, max_j Var[θ_j | F_n] is a metric that measures the uncertainty about the whole dose-response curve and is available to DMs at each decision epoch because it is a part of the state variable in the stopping problem. Our results show that using this constrained method significantly improves the performance of the simulation-based gridding algorithm and the KG policy.</p><p>Therefore, for recommendation purposes, the diffusion approximation method proved to be more robust with respect to different shapes of the underlying dose-response curve. In fact, based on Food and Drug Administration (2018) estimates, only 33% of treatments clear Phase II of clinical trials, which shows that in the majority of trials the underlying dose-response curve does not include a dose with a significant advantage over the placebo. Therefore, our results suggest that the diffusion approximation method is potentially more accurate across different scenarios. However, if strong evidence is available that a significant advantage over the placebo exists, our results suggest that the KG policy potentially stops sooner across different dose-response curves and dose assignment procedures, and it is computationally less demanding than the standard gridding method. We also note that a potential extension of our methodology is to consider seamless Phase II/III trials, where the probability of success at the end of Phase III can be integrated naturally into the utility function of the stopping problem.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Nasrollahzadeh and Khademi: Optimal Stopping of Adaptive Dose-Finding Trials</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>Service Science, 2020, vol. 12, No. 2-3, pp. 80-99, © 2020 INFORMS</p></note>
		</body>
		</text>
</TEI>
