<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Deep Spatial Q-Learning for Infectious Disease Control</title></titleStmt>
			<publicationStmt>
				<publisher>Springer</publisher>
				<date>07/08/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10467914</idno>
					<idno type="doi">10.1007/s13253-023-00551-4</idno>
					<title level='j'>Journal of Agricultural, Biological and Environmental Statistics</title>
<idno type="issn">1085-7117</idno>

					<author>Zhishuai Liu</author><author>Jesse Clifton</author><author>Eric B. Laber</author><author>John Drake</author><author>Ethan X. Fang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Infectious diseases are a cause of humanitarian and economic crises across the world. In developing regions, a severe epidemic can result in the collapse of healthcare infrastructure or even the failure of an affected state. The most recent 2013-2015 outbreak of Ebola virus disease in West Africa is an example of such an epidemic. The economic, infrastructural, and human costs of this outbreak provide strong motivation for the examination of adaptive treatment strategies that allocate resources in response to and anticipation of the evolution of an epidemic. We formalize adaptive management of an emerging infectious disease spreading across a set of locations as a treatment regime that maps up-to-date information on the epidemic to a subset of locations identified as high-priority for treatment. An optimal treatment regime in this context is defined as maximizing the expectation of a pre-specified cumulative utility measure, e.g., the number of disease-free individuals or the estimated reduction in morbidity or mortality relative to a baseline intervention strategy. Because the disease dynamics are not known at the beginning of an outbreak, an optimal treatment regime must be estimated online, i.e., as data accumulate; thus, an effective estimation algorithm must balance choosing interventions that lead to information gain and thereby model improvement with interventions that appear to be optimal under the current estimated model. We develop a novel model-free algorithm for the online management of an infectious disease spreading over a finite set of locations and an indefinite or infinite time horizon. The proposed algorithm balances exploration and exploitation using a semi-parametric variant of Thompson sampling. We also introduce a graph neural network-based estimator in order to improve the performance of this class of algorithms. 
Simulations, including those mimicking the spread of the 2013-2015 Ebola outbreak, suggest that an adaptive treatment strategy has the potential to significantly reduce mortality relative to ad hoc management strategies.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Infectious diseases are a persistent and serious threat to public health worldwide <ref type="bibr">(Mathers 2008;</ref><ref type="bibr">Lozano et al. 2013;</ref><ref type="bibr">Bloom and Cadarette 2019)</ref>. Despite technological advances and increasingly vigilant biosurveillance, global rates of infectious diseases are not decreasing <ref type="bibr">(Smith et al. 2014</ref>). An effective real-time intervention strategy for an emerging infectious disease could have significant benefits, including reduction of mortality, morbidity, and healthcare costs. Consequently, the development of such a strategy is a priority for public health and security policy-makers <ref type="bibr">(Cecchine and Moore 2006)</ref>. We formalize such an intervention system as a treatment regime that maps the current status of the epidemic to a subset of locations identified as high-priority for treatment. An optimal treatment regime maximizes the mean of a pre-specified cumulative utility measure, e.g., the number of disease-free individuals throughout the epidemic.</p><p>Our goal is to find an optimal treatment regime for the management of emerging infectious diseases that, given the current outbreak information at different locations and resource constraints, identifies which locations should be prioritized for treatment. This problem is complicated by the following three issues: (i) spillover effects make the number of possible interventions an exponential function of the number of treatment units; i.e., treatment at one location can affect outcomes at other locations, so one must consider the joint treatment allocation across all locations. 
(ii) Disease dynamics are unknown at the time of the outbreak, so one must balance choosing interventions that lead to a significant information gain and subsequently an improved disease dynamic model, with choosing interventions that appear to be optimal based on current model estimates. (iii) Resource constraints impose additional restrictions on how and where interventions can be applied. One approach to estimating an optimal treatment regime is to posit a model for the disease dynamics and then to use simulation-based optimization to estimate an optimal treatment regime <ref type="bibr">(Carr and Roberts 2010;</ref><ref type="bibr">Kasaie and Kelton 2013;</ref><ref type="bibr">Nowzari et al. 2015;</ref><ref type="bibr">Hu et al. 2017;</ref><ref type="bibr">Laber et al. 2018a;</ref><ref type="bibr">Kompella et al. 2020;</ref><ref type="bibr">Guan et al. 2022)</ref>. If the posited model is low-dimensional and accurately reflects the disease process, this approach can be particularly effective early in the epidemic when data are scarce. However, such methods can perform poorly if the posited model is misspecified. An alternative is to construct a semi-parametric estimator of the optimal treatment regime that does not require a correctly specified dynamic model; examples of such estimators in non-spatiotemporal domains include regression-based estimators <ref type="bibr">(Murphy 2005;</ref><ref type="bibr">Henderson et al. 2010;</ref><ref type="bibr">Almirall et al. 2010;</ref><ref type="bibr">Zhao et al. 2011;</ref><ref type="bibr">Chakraborty and Moodie 2013;</ref><ref type="bibr">Schulte et al. 2014;</ref><ref type="bibr">Moodie et al. 2014;</ref><ref type="bibr">Kosorok and Moodie 2015;</ref><ref type="bibr">Laber et al. 2017;</ref><ref type="bibr">Ertefaie et al. 2021</ref>) and direct-search estimators <ref type="bibr">(Orellana et al. 2010;</ref><ref type="bibr">Rubin and van der Laan 2012;</ref><ref type="bibr">Zhang et al. 
2012;</ref><ref type="bibr">Zhao et al. 2012;</ref><ref type="bibr">Zhang et al. 2013, 2015;</ref><ref type="bibr">Zhao et al. 2015;</ref><ref type="bibr">Zhou et al. 2017;</ref><ref type="bibr">Liu et al. 2018;</ref><ref type="bibr">Pan and Zhao 2020)</ref>. Thus, a natural strategy is to apply a parametric simulation-optimization approach during the early stages of an epidemic and subsequently migrate to a semi-parametric estimator as data accumulate. Our goal is to develop a class of online semi-parametric estimators that can be used in such a strategy. The class of estimators that we propose is based on fitted Q-iteration (FQI; <ref type="bibr">Watkins 1989;</ref><ref type="bibr">Maei et al. 2010;</ref><ref type="bibr">Ernst et al. 2005;</ref><ref type="bibr">Ertefaie 2014;</ref><ref type="bibr">Riedmiller 2005</ref>) and Thompson sampling <ref type="bibr">(Thompson 1933)</ref>. A key challenge associated with extending reinforcement learning algorithms to spatio-temporal decision problems is using spatial information in a statistically and computationally efficient manner. To this end, we propose an automated (i.e., data-driven) feature construction algorithm based on graph neural networks <ref type="bibr">(Yan et al. 2006;</ref><ref type="bibr">Cai et al. 2018;</ref><ref type="bibr">Fey and Lenssen 2019;</ref><ref type="bibr">Ma et al. 2020</ref>) that succinctly summarizes local information which is then used in FQI. Our method is a variant of single-agent deep reinforcement learning, which has achieved great empirical success in the past decade <ref type="bibr">(Riedmiller 2005;</ref><ref type="bibr">Mnih et al. 2015, 2016;</ref><ref type="bibr">Arulkumaran et al. 2017)</ref>. We note that our setup is related to but distinct from the cooperative multi-agent reinforcement learning problem <ref type="bibr">(Sunehag et al. 2017;</ref><ref type="bibr">Wang et al. 
2020;</ref><ref type="bibr">Hernandez-Leal et al. 2019)</ref>. Whereas the cooperative multi-agent setting involves a collection of agents learning locally, the optimal treatment regime in our setting is centralized, with treatments coordinated jointly across locations.</p><p>This work is motivated by our involvement in a retrospective study of the 2013-2015 outbreak of Ebola virus disease in West Africa <ref type="bibr">(Kramer et al. 2016a;</ref><ref type="bibr">Li et al. 2017)</ref>. The Ebola outbreak resulted in more than 10,000 deaths and the near-total collapse of healthcare infrastructure in affected areas (WHO Ebola Response Team 2014; <ref type="bibr">Hamel and Slutsker 2015)</ref>. We consider the daily allocation of treatments across 290 contiguous geopolitical regions. Our goal is to learn an optimal treatment regime that can be used to control future outbreaks by studying if and how the spread of the 2013-2015 outbreak could have been better controlled through adaptive treatment allocations subject to resource constraints. Both simulation and real data analysis results indicate that management strategies based on the proposed method can lead to significant reductions in the spread of the disease relative to ad hoc allocation strategies.</p><p>The rest of this paper is organized as follows. In Sect. 2, we review the 2013-2015 outbreak of Ebola virus disease. In Sect. 3, we define an optimal treatment regime under the framework of potential outcomes when the data-generating model forms a Markov decision process. In Sect. 4, we define a spatial FQI with graph embeddings and a semi-parametric variant of Thompson sampling. In Sect. 5, we review model-based policy search <ref type="bibr">(Laber et al. 2018a)</ref> for spatio-temporal problems. We illustrate the proposed methods using a suite of simulation experiments in Sect. 6 and a simulation of the spread of Ebola in West Africa in Sect. 7. Open problems are discussed in Sect. 8.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">EBOLA VIRUS</head><p>Ebola Virus Disease (EVD) is an acute hemorrhagic illness caused by a handful of viruses collectively known as the ebolaviruses. The 2013-2015 West Africa Ebola epidemic, caused by the Zaire ebolavirus, originated in the Gu&#233;ck&#233;dou Prefecture of Guinea, from which it spread to neighboring Liberia and Sierra Leone. A major outbreak ensued, resulting in more than 28,000 cases and igniting small outbreaks in Nigeria, Mali, and the United States. Patients with EVD may exhibit a range of symptoms, including fever, muscular pain, vomiting, diarrhea, rash, organ failure, and death <ref type="bibr">(Feldmann and Geisbert 2011)</ref>. The overall case fatality rate of the West Africa epidemic exceeded 39%. Person-to-person transmission of Ebola occurs typically through exposure to infected body fluids. Although infectious cases are typically symptomatic, Ebola is difficult to contain without adequate and quickly implemented infection control procedures. Additionally, the social disruption caused by Ebola outbreaks can make the scope of incipient outbreaks challenging to determine.</p><p>Several models for the spread of Ebola in West Africa have been constructed. <ref type="bibr">Rainsch et al. (2015)</ref> fit regression models to weekly data on the incidence of infection. They found that the case counts, population data, and distances between affected and unaffected areas were significant predictors of transmission. <ref type="bibr">Merler et al. (2015)</ref> developed an agent-based simulator that accounts for the early decline of spread in terms of the increasing availability of Ebola treatment units, safe burials, and distribution of household protection kits. Finally, <ref type="bibr">Kramer et al. 
(2016a)</ref> fit a coarse-grained gravity model to understand how the spread of the infection to new areas was affected by the attributes of donor and recipient regions. Their model considered only the first infection in a region to be of interest, and thus focused on the spread path. They found that the spread was best explained by the distances between source and recipient locations, population density, and border closures among neighboring countries (Figs. 1, 2 and 3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">MODEL</head><p>We consider a decision process evolving in discrete time, t = 0, 1, . . ., and across a finite set of locations L = {1, . . . , L}. In our application to EVD, the time points correspond to days, and the locations are geopolitical units. We assume that at each location and each time point t, the following sequence of events transpires: (i) A set of measurements is taken; these, together with past measurements, are summarized as the current state of the location, S t &#8712; S 0 &#8838; R m .</p><p>(ii) The decision maker selects a binary treatment A t &#8712; A 0 = {0, 1}. Thus, without any restrictions, there are 2 L possible treatment allocations at each time point.</p><p>(iii) We observe an outcome Y t &#8712; Y 0 .</p><p>Let S = S L 0 , A = A L 0 , and Y = Y L 0 .</p><p>We assume that Y t is contained in S t+1 (this is without any loss of generality, as the state can always be expanded to ensure that this is the case). Let 2 A denote the power set of binary vectors A of length L. We assume that there exists a function &#968; : S &#8594; 2 A \ &#8709; such that &#968;(s) denotes the set of feasible treatments when the state is s. Further, we assume that there exists a function u 0 : Y 0 &#8594; R such that U &#8467;t = u 0 (Y &#8467;t ) is the utility associated with location &#8467; at time t, and the total utility at time t is U t = &#8721; &#8467;&#8712;L U &#8467;t .</p><p>The assumption of an additive utility model is a mild restriction in applications, as common measures are aggregated across locations in this way, e.g., the number of infected locations.</p><p>A treatment regime in this context is a map &#960; : S &#8594; A which satisfies &#960;(s) &#8712; &#968;(s) for all s &#8712; S. Define &#960; &#8467; (s) as the &#8467;-th element of &#960;(s). A decision maker following &#960; applies treatment &#960;(s t ), i.e., applies &#960; &#8467; (s t ) to location &#8467; &#8712; L, if presented with S t = s t at time t. 
An optimal treatment regime maximizes the mean discounted utility if used to select treatments at each time point. We formalize this definition using potential outcomes <ref type="bibr">(Rubin 1974;</ref><ref type="bibr">Robins 1986, 1987;</ref><ref type="bibr">Splawa-Neyman et al. 1990;</ref><ref type="bibr">Tsiatis et al. 2019)</ref>. Let a t = (a 0 , . . . , a t ) and s t = (s 0 , . . . , s t ) denote the history up to time t. Let S * t (a t-1 ) be the potential state under treatment sequence a t-1 , where S * 0 (a -1 ) &#8801; S 0 , and let Y * t (a t ) be the potential outcome under treatment sequence a t . The potential state at time t under a regime &#960; is S * t (&#960;) = &#8721; a t-1 S * t (a t-1 ) &#8719; t-1 v=0 1 &#960;{S * v (a v-1 )}=a v , where 1 &#957; is an indicator that evaluates to one if the clause &#957; is true and zero otherwise. The potential outcome Y * t (&#960;) is defined analogously.</p><p>Letting &#947; &#8712; (0, 1) be a fixed discount factor, the value of treatment regime &#960; is V (&#960;) = E &#8721; t&#8805;0 &#947; t u{Y * t (&#960;)}, so that the optimal regime, &#960; opt , satisfies V (&#960; opt ) &#8805; V (&#960;) for all &#960;. Note that our framework can be extended to other measures of cumulative utility, e.g., average utility or total utility over a finite horizon (see <ref type="bibr">Linn et al. 2017;</ref><ref type="bibr">Wang et al. 2018;</ref><ref type="bibr">Rowland et al. 2019</ref>, for additional discussion in non-spatial applications).</p><p>Let W * = {S * t (a t-1 ), Y * t (a t ) : a t &#8712; {0, 1} L&#215;(t+1) , t &#8805; 0} denote the set of potential states and outcomes. To identify &#960; opt in terms of the data generating model, we impose a series of assumptions which are standard in the dynamic treatment regimes literature <ref type="bibr">(Murphy 2003;</ref><ref type="bibr">Robins 2004</ref>).</p><p>Assumption 1. 
We assume that for all t: (A1) Consistency:</p><p>In the context of online estimation, where treatment assignment is under the control of the decision maker, (A2) and (A3) can be ensured by construction. We note that these assumptions may not hold if the decision maker deviates from the recommendation of the estimated regime. We discuss this point further in Sect. 8. Hereafter, we assume that (A1)-(A3) hold implicitly.</p><p>In addition, it is standard in the context of dynamic treatment regimes to assume that there are independent replicates, e.g., patients in a study, that make the optimal regime nonparametrically identifiable. However, because of spatial interference <ref type="bibr">(Karwa and Airoldi 2018;</ref><ref type="bibr">Tec et al. 2022;</ref><ref type="bibr">Forastiere et al. 2021)</ref>, one cannot treat the locations as independent, and consequently additional structure must be imposed on the model to identify &#960; opt <ref type="bibr">(Hudgens and Halloran 2008;</ref><ref type="bibr">Laber et al. 2018a)</ref>. We assume that the decision process, possibly transformed, is Markov and, under this assumption, impose a semi-parametric model on the conditional mean discounted utility given state and treatment. These modeling assumptions are standard in problems with an infinite or indefinite time horizon <ref type="bibr">(Powell 2007;</ref><ref type="bibr">Szepesv&#225;ri 2010;</ref><ref type="bibr">Hern&#225;ndez-Lerma and Lasserre 2012;</ref><ref type="bibr">Puterman 2014;</ref><ref type="bibr">Sutton and Barto 2018)</ref>. In particular, we assume that the states S t have been constructed so that the induced decision process is Markov, i.e., P(S t+1 &#8712; B | S 0 , A 0 , . . . , S t , A t ) = P(S t+1 &#8712; B | S t , A t ) with probability one for any (measurable) set B &#8838; S, and this probability does not depend on the time t.</p></div>
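The decision process of Sect. 3 can be made concrete with a small sketch. In this toy illustration (all names and numbers are invented, not from the paper), the feasible set &#968;(s) is taken to be all allocations treating at most a fixed budget of locations, and the regime is a simple priority rule standing in for &#960;:

```python
import numpy as np

# Illustrative sketch only: a treatment regime pi maps the joint state s
# (one feature row per location) to a binary allocation a in psi(s),
# taken here to be all vectors treating at most `budget` locations.
# Prioritizing by the first state feature (e.g., case count) is a
# made-up stand-in for an estimated regime.
def greedy_regime(s, budget):
    L = s.shape[0]
    a = np.zeros(L, dtype=int)
    order = np.argsort(-s[:, 0])   # descending by first feature
    a[order[:budget]] = 1
    return a

s = np.array([[5.0], [1.0], [9.0], [0.0]])   # L = 4 locations
a = greedy_regime(s, budget=2)               # feasible: sum(a) <= 2
```

Any rule of this shape respects the resource constraint by construction; the methods in Sect. 4 replace the ad hoc priority with an estimated Q-function.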
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">SPATIAL FITTED-Q ITERATION</head><p>Under Assumption 1, the optimal treatment regime can be characterized using a recursive regression equation known as the Bellman optimality equation <ref type="bibr">(Bellman 1957;</ref><ref type="bibr">Maei et al. 2010;</ref><ref type="bibr">Hern&#225;ndez-Lerma and Lasserre 2012;</ref><ref type="bibr">Puterman 2014;</ref><ref type="bibr">Ertefaie and Strawderman 2018</ref>). For any s &#8712; S and a &#8712; &#968;(s), the Q-function Q(s, a) is the expected cumulative discounted utility starting from state S t = s, applying treatment A t = a, and then following the optimal regime thereafter. It can be shown that &#960; opt (s) = argmax a&#8712;&#968;(s) Q(s, a) <ref type="bibr">(Bertsekas et al. 1995)</ref>. Furthermore, the Q-function satisfies the Bellman recursion (1), which, importantly, identifies the Q-function in terms of the data generating model. One approach to constructing an estimator of Q(s, a), and subsequently &#960; opt , is to use (1) to construct an estimating function to which Q(s, a) is a unique solution. For example, one might posit a linear model of the form Q(s, a; &#946;) = &#966;(s, a) &#8868; &#946;, where &#966;(s, a) is a known feature vector, and &#946; is a vector of unknown coefficients; in this case, if a solution &#946; * of the corresponding estimating equation exists, it can be used to construct an estimator &#946; t of &#946; * given data observed through time t (see <ref type="bibr">Ertefaie and Strawderman 2018;</ref><ref type="bibr">Luckett et al. 2020;</ref><ref type="bibr">Saghafian 2021)</ref>.</p><p>The estimating equation approach to constructing an estimator of Q(s, a) is elegant due to its simplicity and direct connection to the Bellman optimality equations. However, it is not without drawbacks. 
One such drawback is that the form of the estimating equation requires limiting the class of models for Q(s, a) to those which are amenable to root-solving via standard software, such as smooth parametric models, or writing bespoke root-finding algorithms. A second, and potentially more important, drawback is that the estimating equation formulation does not permit interactive model-building in the way typical supervised learning does. This is critical when the model is being used to inform public health decisions in a pandemic. Thus, instead of using the estimating equation characterization to construct an estimator, we use the fact that Q(s, a) is the fixed point of the so-called Bellman optimality operator to express the optimal policy as the limit of a series of regressions.</p><p>Define the Bellman optimality operator B, which acts on functions w : S &#215; A &#8594; R.</p><p>The Bellman optimality equation is thus succinctly expressed as Q = BQ. Moreover, Q is the unique fixed point of the operator B, which is a contraction in the sup-norm. The associated fixed-point iteration algorithm, known as value iteration, is given by the update Q k = BQ k-1 for k &#8805; 1, where the initial value Q 0 is initialized arbitrarily; under mild regularity conditions, Q k converges geometrically to Q <ref type="bibr">(Bertsekas and Tsitsiklis 1996)</ref>. This characterization leads to an iterative algorithm for the semi-parametric estimation of the optimal Q-function given observations from the underlying decision process.</p></div>
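The fixed-point characterization can be illustrated with a minimal value-iteration sketch on a toy two-state, two-action MDP; the transition kernel P, utilities u, and discount factor below are invented for illustration and are not from the paper:

```python
import numpy as np

# Toy illustration of value iteration: repeatedly apply the Bellman
# optimality operator B until Q = BQ (B is a sup-norm contraction,
# so the iterates converge geometrically).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition kernel
              [[0.5, 0.5], [0.1, 0.9]]])
u = np.array([[1.0, 0.0], [0.0, 2.0]])    # u[s, a]: utility
gamma = 0.9

def bellman(Q):
    # (BQ)(s, a) = u(s, a) + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q(s', a')
    return u + gamma * P @ Q.max(axis=1)

Q = np.zeros((2, 2))                      # arbitrary initialization Q_0
for _ in range(1000):
    Q_new = bellman(Q)
    if np.abs(Q_new - Q).max() < 1e-12:   # (approximate) fixed point reached
        break
    Q = Q_new

pi_opt = Q.argmax(axis=1)                 # greedy regime from the fixed point
```

At convergence, Q satisfies Q = BQ up to numerical tolerance, and the greedy regime with respect to Q is optimal for the toy model.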
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">FITTED Q-ITERATION</head><p>Fitted Q-iteration is a regression-based approximation of the value iteration algorithm <ref type="bibr">(Ernst et al. 2005;</ref><ref type="bibr">Riedmiller 2005;</ref><ref type="bibr">Busoniu et al. 2010)</ref>. We first describe the algorithm without exploiting any underlying spatial structure. Let Q denote the posited class of Q-functions. At time t, let Q t,0 &#8801; 0, and for k &#8805; 1, let Q t,k be the least squares estimator in (2), which can be viewed as an application of approximate value iteration in which the empirical distribution has been used in place of the true expectation. The estimated optimal regime is &#960; t (s) = argmax a&#8712;&#968;(s) Q t,K (s, a), where K is the desired number of iterations. Note that the least squares estimator in (2) can be constructed using essentially any regression estimator, e.g., trees, neural networks, Gaussian process models, and so on. Furthermore, these models can be built interactively at each step to avoid severe misspecification.</p><p>When the process under study is a disease spreading across a network, the number of locations may far exceed the number of time points, e.g., if the elements of L are individuals in a social network. Furthermore, the effects of treatment are likely to be local in that the greatest impact will be on those to whom treatment is applied and their close contacts. With this in mind, we now show how the generic fitted Q-iteration algorithm can be extended to pool information across locations and thus increase efficiency. We do this by showing that the Q-functions defined in the preceding section, {Q k } k&#8805;0 , can be expressed as sums of 'location-specific Q-functions' which can be estimated using local data, thereby increasing the number of observations in (2) from t to t &#215; L.</p><p>The first (non-trivial) Q-function, Q 1 , decomposes as a sum of location-specific terms, and iterating this argument shows that each Q k+1 (s, a) is likewise a sum of local Q-functions, one per location. 
We note that the Q-function in equation (1) corresponds to Q &#8734; here.</p><p>Let &#934; be a class of maps from S &#215; A into R L&#215;J such that for any &#966; &#8712; &#934;, we have &#966;(s, a) = (&#966; 1 (s, a), . . . , &#966; L (s, a)), where &#966; &#8467; (s, a) &#8712; R J is a feature vector for location &#8467; constructed from (s, a). Let Q be a class of maps from R J into R; choices for &#934; and Q are discussed in the next section. We posit working models for q &#8467;,k (s, a) of the form q &#8467;,k (s, a) = q k {&#966; &#8467; (s, a)}, where q k &#8712; Q and &#966; &#8712; &#934;. The fitted Q-iteration algorithm with this class of models thus comprises the following steps. At time t, the first (non-trivial) iteration of the fitted-Q algorithm is</p></div>
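A compact sketch of the generic fitted Q-iteration loop (before the spatial decomposition) may help fix ideas. The linear working model, the scalar-state simulated transitions, and all constants below are invented for illustration; any regression estimator could replace the least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fitted Q-iteration sketch: regress (utility + discounted max of the
# current Q-estimate at the next state) on features of (state, action),
# and iterate. Dynamics and utilities are made up for illustration.
def phi(s, a):
    return np.array([1.0, s, float(a), s * float(a)])

n = 200
S = rng.normal(size=n)                                   # observed states
A = rng.integers(0, 2, size=n)                           # observed treatments
U = S + 2.0 * A + rng.normal(scale=0.1, size=n)          # observed utilities
S_next = 0.5 * S + A + rng.normal(scale=0.1, size=n)     # observed next states

gamma, K = 0.9, 30
X = np.stack([phi(s, a) for s, a in zip(S, A)])
X0 = np.stack([phi(s, 0) for s in S_next])               # next-state features, a' = 0
X1 = np.stack([phi(s, 1) for s in S_next])               # next-state features, a' = 1
beta = np.zeros(4)                                       # Q_{t,0} = 0
for _ in range(K):
    target = U + gamma * np.maximum(X0 @ beta, X1 @ beta)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)

def q_hat(s, a):
    return phi(s, a) @ beta
```

The estimated regime treats whenever `q_hat(s, 1) > q_hat(s, 0)`; here treatment raises both immediate and downstream utility, so it is preferred.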
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.">Spatial Fitted Q-Iteration with Graph Embeddings</head><p>Graph embeddings have been widely studied and used to great empirical success in analyzing structured data such as text (e.g., <ref type="bibr">Yan et al. 2006;</ref><ref type="bibr">Mikolov et al. 2013;</ref><ref type="bibr">Cai et al. 2018)</ref>. They have also been used to estimate optimal policies in single-stage decision problems on networks <ref type="bibr">(Ma et al. 2020)</ref>. We consider a supervised approach in which the embeddings are adaptive, i.e., data-driven, to improve the quality of the estimated Q-functions. To the best of our knowledge, the construction used here is new and may be of independent interest.</p><p>Our goal is to construct a class &#934; of features &#966; : S &#215; A &#8594; R L&#215;J so that each &#966; &#8712; &#934; creates a vector of local summaries of (s, a), &#966;(s, a) = (&#966; 1 (s, a), . . . , &#966; L (s, a)), one for each location, that is amenable to estimation as described in the preceding section. Recall that s &#8712; R m . Let H be a class of functions mapping R m+1 into R J , and let G be a class of functions mapping from R J &#215; R J into R J . In our implementation, we consider each of these to be feed-forward neural networks <ref type="bibr">(Bebis and Georgiopoulos 1994)</ref>, though other choices are possible. For given h &#8712; H and g &#8712; G, we construct the embedding of location &#8467; at time t as follows. Define f (1) : R m+1 &#8594; R J as f (1) (b; g, h) = h(b) for all b &#8712; R m+1 ; thus, f (1) (s &#8467; , a &#8467; ; g, h) = h(s &#8467; , a &#8467; ) is a summary of the state and treatment at location &#8467;. The higher-order summaries are then defined recursively by folding the binary operation g along orderings of the neighbor set and averaging the results.</p><p>We consider all permutations to make the embedding invariant to the order in which locations are processed. Let N &#8467; &#8838; {1, . . . 
, L} be a neighborhood of location &#8467;, e.g., all m-th order neighbors or the locations within some pre-specified distance.</p><p>If the set N &#8467; is large, computing all possible permutations of its elements may be computationally infeasible. In this case, one can sample permutations uniformly at random. The collection of feature maps is thus</p><p>This construction can be understood as comprising the following steps: (i) computing a basis expansion h &#8712; H; (ii) recursively applying a binary operation, g &#8712; G, to this basis expansion along all orderings of elements in a given neighborhood of &#8467; and taking the average of the results; and (iii) applying this binary operation to the features at &#8467; and this average. This construction has a number of advantages. First, it can process a neighbor set of any size. Second, the learned feature vector of location &#8467; is invariant to the ordering of its neighbors in N &#8467; . Finally, it considers the interaction of all neighbor pairs through the binary operation. We note that there are alternative constructions, such as message-passing graph neural networks, which possess the first two properties <ref type="bibr">(Lee et al. 2019;</ref><ref type="bibr">Fey and Lenssen 2019)</ref>.</p></div>
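The three steps above can be sketched directly. In this toy version, h and g are simple closed-form stand-ins for the feed-forward networks the paper uses, and the permutation average is computed exactly for a small neighbor set:

```python
import itertools
import math
import numpy as np

# Sketch of the permutation-averaged graph embedding: h maps a
# location's (state, action) pair into R^J; the binary operation g is
# folded along every ordering of the neighbor set; the results are
# averaged (making the embedding invariant to neighbor order); and g
# then combines the location's own features with that average.
# h and g below are toy stand-ins, not the paper's networks.
J = 3

def h(s_l, a_l):                        # basis expansion, R^{m+1} -> R^J (m = 1 here)
    b = np.array([s_l, float(a_l)])
    return np.array([b.sum(), b.prod(), (b ** 2).sum()])

def g(x, y):                            # binary operation, R^J x R^J -> R^J
    return np.tanh(x + 0.5 * y)

def embed(l, neighbors, s, a):
    base = [h(s[n], a[n]) for n in neighbors]
    acc = np.zeros(J)
    for perm in itertools.permutations(range(len(base))):
        z = base[perm[0]]
        for i in perm[1:]:              # fold g along this ordering
            z = g(z, base[i])
        acc += z
    avg = acc / math.factorial(len(base))
    return g(h(s[l], a[l]), avg)        # combine location l with its neighborhood

s = np.array([0.2, 1.0, -0.5, 0.7])
a = np.array([1, 0, 1, 0])
e1 = embed(0, [1, 2, 3], s, a)
e2 = embed(0, [3, 1, 2], s, a)          # same neighborhood, different order
```

Because the average runs over all orderings, `e1` and `e2` agree, illustrating the claimed invariance; subsampling permutations would trade exactness for speed on large neighbor sets.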
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">GENERALIZED BOOTSTRAP AND THOMPSON SAMPLING</head><p>Methodologically, online sequential decision problems reside at the intersection of two disciplines: (i) sequential optimal design, which focuses on choosing treatments to maximize information gain and thereby create high-quality estimated models <ref type="bibr">(Chernoff 1972;</ref><ref type="bibr">Atwood 1973;</ref><ref type="bibr">Lai and Wei 1982;</ref><ref type="bibr">Bartroff et al. 2012)</ref>; and (ii) dynamic programming, which seeks to efficiently derive an optimal treatment regime using the estimated models <ref type="bibr">(Bellman 1957;</ref><ref type="bibr">Powell 2007;</ref><ref type="bibr">Busoniu et al. 2010;</ref><ref type="bibr">Sutton and Barto 2018)</ref>. Because the primary goal is maximizing the discounted cumulative reward, many algorithms for online decision making experiment, i.e., deviate from the current estimated optimal treatment regime, only if and when such experimentation is likely to produce information that pays dividends in terms of long-term performance. The need to judiciously balance experimentation for information gain against optimizing for immediate utility gain is known as the exploration-exploitation trade-off in computer science. A common strategy to ensure sufficient information gain is to force exploration, either through randomization or by maximizing at each step an objective function that includes an additional term for information gain <ref type="bibr">(Pronzato 2000;</ref><ref type="bibr">Auer 2000</ref>; Russo and Van Roy 2014; Lattimore and Szepesv&#225;ri 2020).</p><p>We consider a randomized treatment allocation strategy that can be viewed as a semi-parametric variant of the celebrated Thompson sampling. This work builds on resampling- and perturbation-based approaches to (non-spatial) decision problems (e.g., see <ref type="bibr">Eckles and Kaptein 2014;</ref><ref type="bibr">Fortunato et al. 
2017;</ref><ref type="bibr">Plappert et al. 2017;</ref><ref type="bibr">Osband et al. 2019</ref>). Thompson sampling <ref type="bibr">(Thompson 1933)</ref> was originally proposed as a Bayesian approach to exploration wherein, at each time point, one draws a model from the posterior given the current data and then selects the optimal regime assuming the drawn model is correct <ref type="bibr">(Scott 2010;</ref><ref type="bibr">Agrawal and Goyal 2011;</ref><ref type="bibr">Kaufmann et al. 2012;</ref><ref type="bibr">Agrawal and Goyal 2013;</ref><ref type="bibr">Korda et al. 2013;</ref><ref type="bibr">Gopalan et al. 2014;</ref><ref type="bibr">Hu et al. 2017</ref>). However, Thompson sampling cannot be directly applied in this form, as we have not specified a model for the complete system dynamics; instead, we use the more general concept of a confidence distribution, in which the sampling distribution serves as a surrogate for the posterior from which probabilities (confidence levels) of interest are computed <ref type="bibr">(Xie and Singh 2013)</ref>. Because Q 0 , Q 1 , . . . , Q K are M-estimators, we use a multiplier bootstrap appropriate for Markov processes to estimate the sampling distributions of these estimators <ref type="bibr">(Jin et al. 2001;</ref><ref type="bibr">Chatterjee and Bose 2005;</ref><ref type="bibr">Minnier et al. 2011)</ref>. In particular, let W t,0,1 , . . . , W t,t-1,L denote i.i.d. random variables, each with unit mean and unit variance, and define </p></div>
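A minimal sketch of the multiplier-bootstrap step on a plain least-squares problem: each squared-error term is reweighted by an i.i.d. draw with unit mean and unit variance, and each reweighted fit plays the role of one posterior-like draw in the Thompson-sampling step. The simulated data and the Exp(1) choice of multiplier are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated regression data standing in for one of the Q-iteration
# regressions (design X, response y).
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 2.0]) + rng.normal(scale=0.3, size=n)

def multiplier_draw(X, y, rng):
    # W_i i.i.d. with E[W] = Var(W) = 1; Exp(1) satisfies both.
    w = rng.exponential(1.0, size=len(y))
    Xw = X * w[:, None]
    # weighted least squares: solve (X' W X) beta = X' W y
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

draws = np.stack([multiplier_draw(X, y, rng) for _ in range(200)])
```

In the online algorithm, one such draw per time point is treated as if it were the truth when the next allocation is selected, so the induced randomization concentrates as the sampling distribution tightens.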
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">OPTIMIZATION</head><p>Finding a maximizer of the estimated Q-function involves searching a combinatorially large space, which is intractable for moderately large L. In order to approximate argmax a&#8712;&#968;(s) Q t,k (s, a), we maximize a quadratic approximation, yielding a tractable binary quadratic program. In particular, we introduce the approximation</p><p>At each time t, iteration k, and location &#8467;, we compute</p><p>where a i , i = 1, . . . , I , is a collection of allocations in A; in our experiments we generate a i uniformly at random. For a given treatment budget &#961;, the approximate maximizer is the solution of the binary quadratic program (3).</p><p>Problem (<ref type="formula">3</ref>) can be solved efficiently with off-the-shelf solvers such as Gurobi (Bixby 2007).</p></div>
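For small L, the budgeted binary quadratic program can be solved by enumeration, which is enough to illustrate the objective handed to a solver such as Gurobi at realistic scales; the main-effect and interaction coefficients below are invented:

```python
import itertools
import numpy as np

# Sketch of a budgeted binary quadratic program in the spirit of (3):
# maximize a quadratic surrogate b'a + a'M a of the Q-function over
# binary allocations treating at most rho locations. Enumeration
# stands in for the integer-program solver used at scale.
def solve_bqp(b, M, rho):
    L = len(b)
    best_a, best_val = None, -np.inf
    for bits in itertools.product([0, 1], repeat=L):
        a = np.array(bits)
        if a.sum() > rho:               # enforce the treatment budget
            continue
        val = b @ a + a @ M @ a
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val

b = np.array([3.0, 1.0, 2.0, 0.5])      # per-location main effects
M = np.array([[0.0, -1.0, 0.5, 0.0],    # pairwise interaction terms
              [-1.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
a_star, v_star = solve_bqp(b, M, rho=2)
```

Note how the interaction terms matter: locations 0 and 2 reinforce each other while 0 and 1 conflict, so the optimum treats {0, 2} rather than the two largest main effects alone.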
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">ESTIMATING THE OPTIMAL STRATEGY VIA POLICY-SEARCH</head><p>The primary alternative to Q-learning that we consider in our simulation experiments is model-based policy-search, which we briefly review here. In model-based policy-search, one estimates a model for the underlying system dynamics, e.g., the disease model, and then uses the fitted model to identify the optimal strategy within a pre-specified class via Monte Carlo simulation <ref type="bibr">(Laber et al. 2018a)</ref>.</p><p>For the purpose of illustration, we consider a class of candidate strategies of the form {&#960;(&#8226;; &#952;) : &#952; in a compact parameter space contained in R p } and</p><p>where &#981; is a known feature vector.</p><p>Computing the value function, V (&#960; ), in closed form is not possible in general. Thus, we estimate the value function using Monte Carlo approximation applied over a finite horizon.</p><p>Let the expected cumulative discounted utility over the first T steps, starting from S 0 = s, be a surrogate V T (&#960; ) for the value function; it follows under mild moment conditions that V (&#960; ) &#8776; V T (&#960; ) when T is large <ref type="bibr">(Bertsekas 2007)</ref>. To construct an estimator V T (&#960; ) of V T (&#960; ), we postulate a model for the system dynamics. As the process is assumed to be Markov and homogeneous, it is completely determined by the transition kernel. We posit a parametric model for this kernel, &#954;(s |s, a; &#946;), which is indexed by &#946; &#8712; B &#8838; R q ; thus, under this model with parameters &#946; we have</p><p>where &#955; is a dominating measure. 
The T -step value function for strategy &#960; under parameter vector &#946;, starting from state s, is thus equal to</p><p>where &#955; is a dominating measure, f (s 0 |s -1 , a -1 , y -1 ) is taken to be a point mass at s, and we have used the fact that Y t is S t+1 measurable so that we can write u(Y t ) = u(S t+1 ).</p><p>We use Thompson sampling to balance exploration and exploitation when estimating the optimal strategy. This yields the following algorithm, which executes at each time point: (i) obtain a posterior distribution for &#946; using the observed data (at time t = 0 this is taken to be the prior); (ii) draw &#946; from the posterior; and (iii) estimate the optimal strategy under the belief that the drawn &#946; indexes the true underlying model. The policy-search estimator of the optimal strategy at each time point is thus</p><p>with exploration induced by the sampling variability of &#946;.</p><p>When the model is not severely misspecified, model-based policy-search built on a low-dimensional dynamics model can often yield better estimates of the optimal policy than model-free approaches when data are scarce, i.e., early in the decision process. However, as we illustrate in the next section, model-based methods can be highly sensitive to misspecification.</p></div>
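The Monte Carlo approximation of the T-step value can be sketched as follows. The step, policy, and utility callables, as well as the discount factor and number of rollouts, are illustrative stand-ins for the posited transition kernel &#954;, a candidate strategy &#960;(&#8226;; &#952;), and the utility u; none of the names are from the paper.

```python
import numpy as np

def monte_carlo_value(policy, step, utility, s0, T=25, gamma=0.9,
                      n_rollouts=100, rng=None):
    """Monte Carlo approximation of the T-step value from state s0.
    `step(s, a, rng)` samples the next state from the posited
    transition model; `policy(s)` returns an allocation; `utility(s)`
    scores a state. Averages the discounted return over rollouts."""
    if rng is None:
        rng = np.random.default_rng()
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = s0, 0.0
        for t in range(T):
            a = policy(s)
            s = step(s, a, rng)          # one draw from the model
            ret += gamma ** t * utility(s)
        total += ret
    return total / n_rollouts
```

Evaluating this surrogate at many candidate &#952; (e.g., within a Bayesian optimization loop, as in Sect. 6.3) yields the policy-search estimate.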
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">SIMULATION EXPERIMENTS</head><p>In our experiments, we consider a replicating agent spreading over a network according to a susceptible-infected-susceptible (SIS) model <ref type="bibr">(Weiss and Dishon 1971)</ref>; this model was chosen in part because it allows for correct specification of the model-based estimators that we use as a baseline for comparison with our proposed method. Under the SIS model, each location alternates between two states: susceptible and infected. Infection spreads from infected to susceptible locations. As soon as a location has been infected, it has the potential to recover from the disease; once a location recovers, it immediately becomes susceptible to the disease again (see <ref type="bibr">Keeling and Eames 2005</ref>, for a discussion of epidemic models and references).</p><p>We let Y t &#8712; {0, 1} denote the infection status of a location at time t, i.e., Y t = 1 if the location is infected at time t and zero otherwise. Define I t = { &#8712; L : Y t = 1} to be the set of infected locations at time t and I c t its complement. Each location also has an associated covariate x t &#8712; R. The state at time t is thus S t = (X t , Y t-1 ). The evolution of the state is governed by the following models</p><p>(5) where &#966;(&#8226;) is the probability density function of the standard normal distribution,</p><p>and logit p ,0 (s, a;</p><p>Thus, the model is indexed by &#946; = (&#957; 0 , &#957; 1 , &#951; 0 , . . . , &#951; 6 ) . Under this model, treating an uninfected location reduces the probability that it becomes infected, while treating an infected location has the dual effect of reducing its likelihood of transmitting infection to adjacent uninfected locations and accelerating its recovery.</p><p>To study the effects of model misspecification, we introduce a contamination model g Contam for the conditional infection probabilities. 
For each value of the contamination parameter in [0, 1], we define a contaminated model whose infection probabilities are</p><p>Our contamination model g Contam is based on a "shield-state" variant of the SIS model in which infection probabilities are mediated by the covariate x (in particular, the indicator 1 {x &#8804; 0}), modified such that the transmissions of infection to a location from its neighbors are no longer independent. More details of the contamination model are given in the Supplemental Materials. Thus, in the case of full contamination (contamination parameter equal to 1), the SIS model defined in equations ( <ref type="formula">5</ref>)-( <ref type="formula">7</ref>) is severely misspecified. Meanwhile, a logistic regression model with sufficiently expressive features &#981; is approximately correctly specified in each case, though with high variance.</p></div>
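A minimal sketch of one SIS transition on a network, implementing the qualitative treatment effects described above (treating a susceptible location lowers its infection probability; treating an infected location raises its recovery probability). All parameter values and names are illustrative, not the paper's calibrated model.

```python
import numpy as np

def sis_step(y, a, adj, p_infect=0.1, p_recover=0.25,
             trt_infect=0.75, trt_recover=0.5, rng=None):
    """One transition of an SIS process on a network.
    y: binary infection statuses; a: binary treatment allocation;
    adj: adjacency matrix. Treating a susceptible location scales
    its per-neighbor infection probability by trt_infect < 1;
    treating an infected location scales its probability of *not*
    recovering by trt_recover < 1."""
    if rng is None:
        rng = np.random.default_rng()
    y_next = np.zeros(len(y), dtype=int)
    for l in range(len(y)):
        if y[l]:
            # infected: remain infected with probability `stay`
            stay = (1.0 - p_recover) * (trt_recover if a[l] else 1.0)
            y_next[l] = int(rng.random() < stay)
        else:
            # susceptible: escape infection from each infected neighbor
            esc = 1.0
            for m in np.flatnonzero(adj[l]):
                if y[m]:
                    esc *= 1.0 - p_infect * (trt_infect if a[l] else 1.0)
            y_next[l] = int(rng.random() < 1.0 - esc)
    return y_next
```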
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">EXPERIMENT SETUP</head><p>Parameters indexing the generative models are tuned to have specified infection rates using a network of size four; in tuning these parameters we have L = {1, 2, 3, 4} and we assume that location 1 is a neighbor of all other locations. Each equation is given in terms of location 1. The rates are defined as</p><p>where &#8226; represents any value and + represents a positive value. In particular, these equations set the latent probability of infection without treatment to 0.01; the probability of infection when three neighbors are infected without treatment to 0.5; preventative treatments to reduce the probability of infection by a factor of 0.75 when three neighbors are infected and none of them is treated; active treatments to reduce the probability of infecting a neighbor by a factor of 0.25 assuming exactly three infected neighbors, all of them treated; the base probability of recovery with no treatment to 0.25; and the probability of not recovering to decrease by a factor of 0.5 with treatment.</p><p>Figure <ref type="figure">3</ref> shows an instance of the "lookahead" network structure, in which the location labeled A is infected. A myopic strategy with a budget of &#964; = 1 might treat location A in order to minimize the expected number of infections at the next step, whereas a non-myopic strategy would treat location B, as it has more neighbors and would therefore cause more infections if infected. In our experiments, a lookahead network of size L consists of L/7 repetitions of this structure.</p><p>We present simulation results from rolling out each of the strategies under consideration for T = 25 time steps on lattice, random nearest neighbor, and custom network structures with L = 100 locations, with contamination parameter values in {0, 0.5, 1}. We provide more details in the Supplemental Materials.</p></div>
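To illustrate how such rate constraints pin down parameters, here is a hedged sketch that backs out a logistic-scale intercept and per-infected-neighbor slope from the first two targets. The names eta0 and eta1 echo, but are not guaranteed to match, the paper's &#951; parameters.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Back out parameters from two stated targets: the latent infection
# probability with no treatment is 0.01, and the infection probability
# with three untreated infected neighbors is 0.5.
eta0 = logit(0.01)                 # intercept on the logit scale
eta1 = (logit(0.5) - eta0) / 3.0   # per-infected-neighbor increment
```

The remaining targets (treatment effects, recovery rates) constrain the other coefficients in the same way, one logit equation per stated rate.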
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">FEATURE CONSTRUCTION FOR MODEL-FREE ESTIMATION IN SIS ENVIRONMENTS</head><p>We consider classes of Q-functions which are linear in (1) the raw features at each location (i.e., infection status, treatment status, and any covariates); (2) handcrafted features designed to efficiently encode information about each location's neighbors; and (3) learned features using the graph neural network described in Sect. 4.1.1. We describe the construction of the handcrafted features here; details on the graph neural network implementation, such as the number of neurons, tuning procedures, etc., can be found in the Supplemental Materials.</p><p>In order to pool information across individual locations in the network, we construct features at each location which contain that location's state and treatment status as well as its neighbors' states. In our case, we define the binary covariate &#953; as 1{x &#8804; 0}; this is informed by the shield-state variant of the SIS model mentioned above, in which infection probabilities are mediated by the value of &#953;. We summarize the features of a location's neighbors using binary encodings. Let N k be the set of paths of length k beginning at a location, and for each path r k &#8712; N k with covariates {(&#953; (1) , a (1) , y (1) ), . . . , (&#953; (k) , a (k) , y (k) )}, let b(r k ) = &#8721; i=1..k [2^{3(i-1)} &#953; (i) + 2^{3(i-1)+1} a (i) + 2^{3(i-1)+2} y (i) ] + 1. Then, writing the j-th basis vector in R^{2^{3k}} as e j , the handcrafted feature of the location is given by &#8721; r k &#8712;N k e b(r k ) , that is, the vector of counts of each feature combination over the paths of length k from the location.</p><p>Writing &#8746; for vector concatenation, the feature function for each location is</p><p>where &#953; a y is the binary vector of length 8 corresponding to (&#953; , a , y ). 
For example, the first-order neighbor feature vector is of length 16 and the second-order neighbor feature vector is of length 72.</p></div>
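The path-encoding features can be sketched in a few lines. This follows the b(&#8226;) construction above, except that indices are 0-based (so the "+1" in the text is dropped); all names are ours.

```python
import numpy as np

def path_features(paths, iota, a, y, k):
    """Handcrafted neighborhood features for one location: each
    length-k path is encoded as an integer index built from the
    (iota, a, y) bits of the locations it visits, and the feature
    vector counts how often each encoding occurs among the paths."""
    feats = np.zeros(2 ** (3 * k))
    for path in paths:          # each path: tuple of k location indices
        b = 0
        for i, l in enumerate(path):
            b += (2 ** (3 * i) * iota[l]
                  + 2 ** (3 * i + 1) * a[l]
                  + 2 ** (3 * i + 2) * y[l])
        feats[b] += 1.0         # b lies in {0, ..., 2^(3k) - 1}
    return feats
```

Concatenating a location's own length-8 indicator with `path_features` for k = 1 gives the 16-dimensional first-order feature vector mentioned above (8 + 2^3); k = 2 gives 8 + 2^6 = 72.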
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">EXPERIMENT RESULTS</head><p>We compare the following learning algorithms:</p><p>-Model-free estimators of Q 1 , using either a graph neural network architecture, a linear architecture with raw features, or a linear architecture with the handcrafted features described in Sect. 6.2. We refer to these as M-g, M-r, and M-h, respectively ("M" represents myopic);</p><p>-Model-free estimators of Q 2 , which we refer to as F-g-2, F-r-2, and F-h-2 ("F" represents FQI since these entail applying two steps of spatial FQI);</p><p>-Model-free estimators of Q 3 , which we refer to as F-g-3, F-r-3, and F-h-3;</p><p>-The policy search (PS) algorithm of Laber et al. (2018a) (Sect. 5), using Bayesian optimization (implemented in the BayesOpt package in Python <ref type="bibr">(Nogueira 2018</ref>)) to carry out the requisite optimizations.</p><p>Additionally, we compute the average performance of the random strategy, which chooses a random subset of &#961; locations to treat. We also estimate the performance of two "oracle" strategies. The first is the model-based policy search learning algorithm described above, except that rollouts are conducted with the exact (and correctly specified) model rather than an estimate (OPS). The second oracle strategy conducts one step of FQI using the true infection probability (TP). (See the Supplemental Materials for more details on these algorithms.)</p><p>Consider a learning algorithm, i.e., a map dom H t &#8594; A, where H t is a history of observations. Define the utility to be the total number of infected locations at time t, i.e., u( y t ) is the sum of the location-wise infection indicators. Then let V ( &#8226; ; &#946;) be the expected cumulative discounted utility over T = 25 time steps incurred by a learning algorithm when the generative model is given by &#946;:</p><p>Figure <ref type="figure">4</ref> displays the normalized mean infection counts under each learning algorithm and each SIS environment considered. 
There are several features of these results worth noting:</p><p>&#8226; For the myopic and FQI learning algorithms, the handcrafted features and the neural network features improve performance significantly relative to the raw features. This result is anticipated by the literature emphasizing the importance of feature construction in reinforcement learning <ref type="bibr">(Song et al. 2016)</ref>. It is encouraging that the neural network features perform better than the handcrafted features, which were based on the structure of the true generative model.</p><p>&#8226; Given the structure of these networks, FQI learning algorithms with handcrafted features or a neural network architecture outperform, or perform similarly to, their myopic counterparts. Because of the dense connections among the locations in the lattice and nearest neighbor networks, FQI learning algorithms with a neural network architecture have a significant advantage over the random strategy. As the connections in the custom network are very sparse, the decrease in normalized mean infection there is relatively small. It is worth noting that FQI with a neural network architecture is consistently among the best.</p><p>&#8226; As expected, model-based policy-search performs well when the model is correctly specified or moderately misspecified, but its performance deteriorates significantly as the degree of misspecification increases (brittleness to model misspecification is also shown in <ref type="bibr">Rose et al. (2019)</ref>).</p><p>&#8226; The FQI learning algorithm with a neural network architecture is the most stable method in that it is robust to the network type and model contamination. It is the only strategy that outperforms the random strategy in all settings. Under full contamination, its performance is similar to that of the oracle myopic strategy and better than that of the oracle policy search strategy.</p><p>More results can be found in the Supplemental Materials.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">MANAGEMENT OF THE EBOLA VIRUS</head><p>Demonstrating our method on Ebola virus disease requires some changes to our setup. First, the dynamic model for Ebola is taken from <ref type="bibr">Kramer et al. (2016a)</ref>. This model is called the gravity model, and the state information is constant, given by the population n of each location as well as the pairwise distances d between locations. We take the neighbors N of each location to be the four locations with the smallest values of the distance divided by the product of the two populations. Under the gravity model there is no recovery from infection, and the probability of transmission of infection from an infected location to an uninfected location is defined by</p><p>This model acquires its name from the second term, known as the gravity term: its numerator is the distance between the two locations, normalized in the denominator by the product of their populations. To stabilize the estimation of the model, we force the coefficient on the gravity term and the exponent on the population product to both be positive.</p><p>We fit the gravity model with no treatment terms to the observed infection data from the 2013-2015 Ebola epidemic to obtain &#951; MLE 0 , &#951; MLE 1 , &#951; MLE 2 , and tuned the generative model used in our experiments to meet two conditions. In the first, we took &#960; 1 to be the policy that applies no treatment and considered models with parameters of the form &#951;(&#964; 1 ) = {&#964; 1 &#951; MLE 0 , log(&#964; 1 ) + &#951; MLE 1 , &#951; MLE 2 , 0.0, 0.0}. We then chose &#964; 1 such that</p><p>where E {&#960;,&#951;} refers to the expectation taken with respect to trajectories in which the dynamics are given by the gravity model with parameter &#951; and strategy &#960; is followed. 
In the second, we took &#960; 2 to be the greedy policy under the true generative model; &#960; 3 to be the uniformly random policy with budget 0.15L ; and, considering models with parameters of the form &#951;(&#964;</p><p>over a grid of values between 0 and 10.</p></div>
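A sketch of a gravity-style transmission probability, illustrating why the positivity constraints stabilize the model: with eta1, eta2 &#8805; 0, transmission becomes less likely between distant, small locations. This is a common form of the gravity model, not necessarily the exact (elided) parameterization fitted in the paper; names are ours.

```python
import numpy as np

def gravity_transmission_prob(d, n_from, n_to, eta0, eta1, eta2):
    """Transmission probability from an infected to an uninfected
    location under a gravity-style model. The gravity term is the
    pairwise distance divided by the product of the two populations
    raised to a power."""
    gravity = d / (n_from * n_to) ** eta2
    # logit of the transmission probability is -(eta0 + eta1 * gravity)
    return 1.0 / (1.0 + np.exp(eta0 + eta1 * gravity))
```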
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">FEATURE CONSTRUCTION FOR MODEL-FREE ESTIMATION IN GRAVITY ENVIRONMENT</head><p>As before, in addition to the graph neural net-based policies described above, we use a linear Q-function architecture with hand-crafted features. The feature vector at a location contains its own population, action, and infection status, i.e., n , a , y , together with, for each of its neighbors, that neighbor's population, action, and infection status and its distance to the location. Thus the feature vector at a location at time t is (using &#8746; to denote vector concatenation): &#981; (s t , a t , y t ) = n a t y t &#8746; &#8712;N n a t y t d , .</p></div>
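The hand-crafted gravity-environment features can be assembled directly; a minimal sketch with our own variable names, following the concatenation above.

```python
import numpy as np

def gravity_features(l, nbrs, n, a, y, d):
    """Feature vector for location l in the gravity environment:
    the location's own population, action, and infection status,
    followed by, for each neighbor, its population, action,
    infection status, and distance to l. `d` is a pairwise
    distance matrix; `nbrs` maps a location to its neighbor list."""
    feats = [n[l], a[l], y[l]]
    for m in nbrs[l]:
        feats.extend([n[m], a[m], y[m], d[l][m]])
    return np.asarray(feats, dtype=float)
```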
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">RESULTS</head><p>Figure <ref type="figure">5</ref> displays the results for each of the learning algorithms described above (in the case of policy search, assuming correct specification of the model). We find that Q 3 with the graph neural network architecture outperforms the other learning algorithms. Q 2 with handcrafted features, Q 2 with the neural network architecture, and Q 1 with handcrafted features have similar performance, very close to that of the oracle TP strategy. Interestingly, FQI with the graph neural network architecture performs better than the model-based policy search algorithm even with a correctly specified infection model. We conclude that our methods can generate effective real-time intervention strategies, which lead to significant reductions in the spread of EVD compared to random allocation strategies. The learned strategies can support adequate and timely infection control measures during incipient outbreaks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">CONCLUSION</head><p>We develop a semiparametric (model-free) approach to the online control of an emerging infectious disease. In simulation experiments, this approach provided better control of an infectious disease than ad hoc strategies and was robust to certain kinds of model misspecification relative to the model-based policy search approach.</p><p>There are a number of important and interesting open problems associated with spatiotemporal decision-making. Disease surveillance data can be noisy, sparse, or incomplete. Extending the proposed methods to accommodate such data, perhaps by generalizing the underlying Markov decision process framework to a partially observable Markov decision process (see, e.g., <ref type="bibr">Ross et al. 2008)</ref>, could potentially improve solution quality. While we have proposed an online methodology, the updates to the policy estimator require significant computation time. This is acceptable in settings where decisions operate on a scale of days, but for decision support systems operating on a finer time scale or deployed on CPU-limited devices (e.g., mobile phones) it would be desirable to develop estimators that can be updated in linear time.</p><p>A set of critical open problems involves incorporating decision strategies such as those presented here into a broader decision-support system. Estimated optimal treatment strategies for human diseases are intended to inform, not dictate, decision-making. Thus, communication and visualization tools are needed to assist decision makers in using data-driven management strategies like the one proposed here. Additionally, if the recommendations of the estimated strategy are only one input into the decision-making of the relevant actors-in particular, if decision makers sometimes deviate from the recommendations on the basis of some additional information-then the sequential decision-making model developed in Sect. 
3 no longer adequately models the decision process. (For instance, the assumption of strong ignorability (A3) may be violated.) The development of tools for reinforcement learning and causal inference that more fully account for the fact that reinforcement-learning-based decision support tools will likely be only one input into the final decisions made regarding, say, the control of an epidemic remains an essential and fascinating unsolved problem.</p><p>One potential drawback of the proposed method is its black-box nature. It is difficult to delineate the information used to construct the graph embeddings and, hence, the structure of interference. The development of interpretable methods that provide additional insights into the learned optimal strategy is an important area for future work.</p><p>[Received <ref type="bibr">November 2022</ref><ref type="bibr">. Revised May 2023</ref><ref type="bibr">. Accepted May 2023.]</ref> </p></div></body>
		</text>
</TEI>
