<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Data‐driven decision‐focused surrogate modeling</title></titleStmt>
			<publicationStmt>
				<publisher>Wiley</publisher>
				<date>04/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10512376</idno>
					<idno type="doi">10.1002/aic.18338</idno>
					<title level='j'>AIChE Journal</title>
					<idno type="ISSN">0001-1541</idno>
					<biblScope unit="volume">70</biblScope>
					<biblScope unit="issue">4</biblScope>

					<author>Rishabh Gupta</author><author>Qi Zhang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We introduce the concept of decision‐focused surrogate modeling for solving computationally challenging nonlinear optimization problems in real‐time settings. The proposed data‐driven framework seeks to learn a simpler, for example, convex, surrogate optimization model that is trained to minimize the decision prediction error, which is defined as the difference between the optimal solutions of the original and the surrogate optimization models. The learning problem, formulated as a bilevel program, can be viewed as a data‐driven inverse optimization problem to which we apply a decomposition‐based solution algorithm from previous work. We validate our framework through numerical experiments involving the optimization of common nonlinear chemical processes such as chemical reactors, heat exchanger networks, and material blending systems. We also present a detailed comparison of decision‐focused surrogate modeling with standard data‐driven surrogate modeling methods and demonstrate that our approach is significantly more data‐efficient while producing simple surrogate models with high decision prediction accuracy.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>and the surrogate optimization problem achieve very different optimal solutions despite having a highly accurate embedded surrogate model.</p><p>In this article, we propose a new surrogate modeling framework that explicitly aims to construct surrogate models that minimize the decision prediction error, defined as the difference between the optimal solutions of the original and the surrogate optimization problems; we hence refer to it as decision-focused surrogate modeling. We take a data-driven approach in which the original optimization problem is solved offline with different model inputs, resulting in a dataset where each data point is an input-decision pair. We then develop an inverse optimization <ref type="bibr">1</ref> approach that directly learns from the given data a surrogate optimization model that has a simpler (e.g., convex) form and minimizes the decision prediction error subject to the restrictions on the form of the surrogate model. Results from multiple computational case studies show that the proposed approach outperforms alternative surrogate modeling approaches in decision accuracy, data efficiency, and/or computational efficiency of the resulting surrogate optimization model.</p><p>In the remainder of this article, we first provide a systematic review of the main existing surrogate modeling frameworks and other related works in Section 2. We introduce the concept of decision-focused surrogate modeling, provide the corresponding mathematical problem formulation, and present a solution algorithm in Section 3. Several numerical case studies based on typical nonlinear chemical engineering systems are presented in Section 4. Finally, we conclude in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">| BACKGROUND AND RELATED WORK</head><p>In this section, we present an overview of the major surrogate modeling frameworks and other related work. We formally describe the different approaches to make clear the main conceptual differences, which will help us motivate the proposed decision-focused surrogate modeling framework and highlight its unique features in Section 3.</p><p>We focus on data-driven approaches but make reference to other related methods wherever appropriate.</p><p>Without loss of generality, we assume the original optimization problem to be of the following form: minimize_{x ∈ X} f(x, u) subject to g(x, u) ≤ 0, (1) where x are the decision variables and X ⊆ ℝ^n. We assume that the input parameters u can be chosen from a given set U ⊆ ℝ^p. The constraints describing the feasible region of (1) are defined by the functions g : ℝ^n × ℝ^p → ℝ^k, and f : ℝ^n × ℝ^p → ℝ is the objective function of the problem.</p></div>
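To make the setting concrete, problem (1) can be sketched for a small hypothetical instance in Python. The functions f and g below are illustrative stand-ins, not models from the article, and scipy's SLSQP serves as a generic local NLP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of problem (1): minimize f(x, u) s.t. g(x, u) <= 0.
# f and g are hypothetical stand-ins for an expensive nonlinear model.
def f(x, u):
    return (x[0] - u) ** 2 + 0.1 * np.sin(5.0 * x[0])  # mildly nonconvex

def g(x, u):
    return np.array([x[0] ** 2 - 2.0 * u])  # feasible region: x^2 <= 2u

def solve_original(u, x0=np.zeros(1)):
    # scipy encodes inequality constraints as c(x) >= 0, hence the sign flip
    cons = {"type": "ineq", "fun": lambda x: -g(x, u)}
    return minimize(lambda x: f(x, u), x0, method="SLSQP", constraints=cons).x

# Solving the problem offline for one realization of the input u
x_star = solve_original(u=1.0)
```

Repeating such a solve for every new input u is exactly what becomes too expensive in real-time settings, motivating the surrogate approaches reviewed below.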
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">| Optimization with embedded surrogate models</head><p>We first describe the traditional surrogate modeling framework as outlined in Section 1, which we call optimization with embedded surrogate models. To explain the main idea of this approach, we rewrite (1) as follows: minimize_{x ∈ X} f(x, u) subject to h(x, u) ≤ 0, d(x, u) ≤ 0, (2) where we divide the constraint functions g into two sets of functions h and d. This surrogate modeling approach is typically applied when problem (2) is such that its computational complexity mainly stems from the functions d; hence, the goal is to replace them with a simpler set of functions d̂. Such surrogate models are typically generated using simplifying assumptions based on physical and engineering insights, model order reduction techniques, or data-driven methods. <ref type="bibr">2</ref> In a data-driven approach, one first generates a set of N data points where for each point i ∈ I = {1, …, N}, x_i and u_i are sampled from X and U, respectively, and the corresponding function evaluations d(x_i, u_i) are recorded.</p><p>Assuming that d̂(·) can be chosen from the set of functions D, one then solves the following empirical risk minimization problem: minimize_{d̂ ∈ D} Σ_{i ∈ I} ℓ(d(x_i, u_i), d̂(x_i, u_i)), (3) where ℓ(·, ·) denotes a loss function, for example, the Euclidean distance, that is a measure of the difference between the observation and the estimate.</p><p>Once we have obtained d̂, the surrogate model for d, we can formulate the following surrogate optimization model: minimize_{x ∈ X} f(x, u) subject to h(x, u) ≤ 0, d̂(x, u) ≤ 0. (4) There exist a large number of statistical learning methods that are used to train these embedded surrogate models; popular choices in PSE include response surface methods, <ref type="bibr">[3]</ref><ref type="bibr">[4]</ref><ref type="bibr">[5]</ref> kriging (or Gaussian process regression), <ref type="bibr">[6]</ref><ref type="bibr">[7]</ref><ref type="bibr">[8]</ref> and deep 
learning. <ref type="bibr">[9]</ref><ref type="bibr">[10]</ref><ref type="bibr">[11]</ref> Data-driven approaches are often combined with first-principles modeling, resulting in gray-box models. For many physical systems, gray-box models have proven to perform better in terms of model accuracy and interpretability compared to purely data-driven models; hence, first-principles elements are often incorporated as long as they are not overly complex. <ref type="bibr">[12]</ref><ref type="bibr">[13]</ref><ref type="bibr">[14]</ref> A key challenge in surrogate modeling is the balance between model accuracy and computational efficiency; recent efforts aim at generating surrogate models composed of the simplest functional forms possible, given a desired model accuracy, to facilitate their use in mathematical optimization. A prominent example of such an approach is the ALAMO framework proposed by Sahinidis et al. <ref type="bibr">15,</ref><ref type="bibr">16</ref> The constraints in an optimization problem define the feasible region for the variables of the problem; thus, generating surrogate constraint functions can also be interpreted as creating a surrogate representation of the feasible region. Approaches that are directly derived from this interpretation of surrogate modeling consider datasets that consist of feasible and infeasible points. With that, constructing a surrogate model essentially becomes a binary classification problem. Existing works have performed this kind of surrogate modeling using classical classification approaches such as support vector machines (SVMs). <ref type="bibr">17,</ref><ref type="bibr">18</ref> The intricate shapes of the feasible regions arising in many process systems applications have further motivated the development of more complex geometry-based approaches. Ierapetritou <ref type="bibr">19</ref> proposes to approximate a convex feasible region with the convex hull of a set of points sampled at the boundary. 
For nonconvex models, Goyal and Ierapetritou <ref type="bibr">20</ref> develop an approach in which the feasible region is approximated by subtracting outer polytopes around the infeasible region from an expanded convex hull obtained from simplicial approximation. <ref type="bibr">21</ref> The nonconvex case has also been addressed using shape reconstruction techniques. <ref type="bibr">22</ref> Zhang et al. <ref type="bibr">23</ref> propose an algorithm for constructing a union of multiple polytopes that approximates the nonconvex feasible region from which the data points are sampled. In a similar spirit, Schweidtmann et al. <ref type="bibr">24</ref> apply persistent homology, a topological data analysis method, to identify potential holes and clusters in the data; this information is then used to obtain a representation of the nonconvex feasible region using one-class SVMs. In another approach, feasibility is captured using a so-called feasibility function, <ref type="bibr">25</ref> and data-driven approaches are applied to approximate that function; see Bhosekar and Ierapetritou <ref type="bibr">26</ref> for a comprehensive review of such methods.</p></div>
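The empirical risk minimization step (3) behind embedded surrogate models can be sketched as follows. The complicating function d and the quadratic surrogate class D are hypothetical choices made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "complicating" constraint function d(x, u) to be replaced.
def d(x, u):
    return np.exp(0.5 * x) * np.sin(u) - 1.0

# Step 1: sample (x_i, u_i) from X x U and record function evaluations of d.
N = 200
xs = rng.uniform(-1.0, 1.0, N)
us = rng.uniform(0.0, np.pi, N)
ys = d(xs, us)

# Step 2: empirical risk minimization (3) over a simple surrogate class D,
# here quadratic polynomials in (x, u), with squared-error loss.
Phi = np.column_stack([np.ones(N), xs, us, xs * us, xs ** 2, us ** 2])
coef, *_ = np.linalg.lstsq(Phi, ys, rcond=None)

def d_hat(x, u):
    # Surrogate constraint function used in place of d in problem (4).
    return np.array([1.0, x, u, x * u, x ** 2, u ** 2]) @ coef
```

Note that the fit minimizes the pointwise error of the constraint function itself; as discussed below, a small function-value error does not guarantee a small decision prediction error.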
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">| Learning optimization proxies</head><p>The tremendous advances in machine learning, especially deep learning, have enabled the modeling of highly complex relationships for accurate prediction. Recently, this has also spurred a growing interest in learning directly the mapping from the input parameters of an optimization problem to its optimal solution. In this setting, the required dataset is {(u_i, x*_i)}_{i ∈ I}, where x*_i denotes an optimal solution to the original problem (1) for the input u_i. The goal is to learn a model m(·) ∈ M that returns an optimal (or near-optimal) solution for a given input by solving the following expected risk minimization problem: minimize_{m ∈ M} Σ_{i ∈ I} ℓ(m(u_i), x*_i). (5)</p><p>The corresponding surrogate model is simply x̂ = m(u), (6) which arrives at the solution through a function evaluation rather than by solving an optimization problem. Thus, m serves as a computationally efficient proxy for the original optimization model. Importantly, as indicated by the loss function in problem (5), this approach explicitly aims to minimize the decision prediction error. This is in contrast to the methods reviewed in Section 2.1, which minimize the prediction error for individual constraint function values but cannot provide any guarantees in terms of how this affects the decision prediction error.</p><p>Although, in theory, any type of machine learning model (specified through M) can be used in problem (5), the preferred choice in the literature has been deep neural networks. <ref type="bibr">[27]</ref><ref type="bibr">[28]</ref><ref type="bibr">[29]</ref><ref type="bibr">[30]</ref><ref type="bibr">[31]</ref> A major challenge in deep learning is the difficulty of enforcing constraints on the predictions, which in this case often leads to predicted solutions that are infeasible in the original optimization problem. 
Zamzam and Baker <ref type="bibr">32</ref> address this challenge, when constructing neural network proxies for the AC optimal power flow (OPF) problem, by generating a training set of strictly feasible solutions through a modified AC OPF formulation and by using the natural bounds of the sigmoid activation function to enforce generation and voltage limits. Van Hentenryck and coworkers <ref type="bibr">33</ref> apply Lagrangian duality to consider constraints in their proposed deep learning framework, where the loss function in (<ref type="formula">5</ref>) is augmented with penalty terms derived from the Lagrangian dual of the original optimization problem and the corresponding Lagrange multipliers are updated using a subgradient method. This approach has been applied in several applications, including AC OPF, <ref type="bibr">33</ref> security-constrained OPF, <ref type="bibr">34</ref> and job shop scheduling. <ref type="bibr">35</ref> Another remedy is to correct an infeasible prediction by projecting it onto a suitably chosen set such that the projection is a feasible solution. For example, in MPC, such a set can be derived from the maximal control invariant set of the system to ensure recursive feasibility. <ref type="bibr">36,</ref><ref type="bibr">37</ref> While the development of the machine learning approaches described above is a more recent trend, exact methods for the construction of optimization proxies have long been studied in the area of multiparametric programming. In multiparametric programming, optimal solutions are derived as explicit functions of the model parameters; these functions (or policies) can change depending on the region in which the specific parameter vector lies. The goal is to determine these so-called critical regions and the optimal policy associated with each critical region. Once these are obtained offline, the online optimization reduces to simply selecting and applying the right function from a look-up table. 
This approach forms the basis for explicit MPC, which has been successfully applied in various real-world settings. <ref type="bibr">38,</ref><ref type="bibr">39</ref> However, a major challenge in multiparametric programming is the curse of dimensionality, as the number of critical regions grows exponentially with the problem size. We refer the reader to Oberdieck et al. <ref type="bibr">40</ref> for a review of the extensive literature on multiparametric programming.</p><p>Remark 1. In addition to the methods reviewed in this section, there is an extensive body of literature on the use of surrogate-based derivative-free optimization (DFO) techniques to solve complex optimization problems. <ref type="bibr">41</ref> We have consciously omitted referencing such work here. This is because our focus is specific, namely surrogate modeling for online optimization applications, where the same optimization model must be frequently solved with different values for its input parameters; hence, the surrogate model's parametric nature is crucial. The DFO approach lacks this parametric nature: the problem must be solved anew with the chosen DFO strategy every time the model parameters change, which means that the construction of the surrogate models within the DFO algorithm would have to happen online. As a result, DFO is typically not efficient enough for use in real-time applications. Parametric methods, like the ones reviewed here, offer a distinct advantage: one trains a surrogate optimization model just once offline, and it can then be employed online to solve for any inputs u ∈ U without retraining the surrogate model.</p></div>
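Problem (5) can be illustrated with a deliberately simple stand-in: a polynomial regression model (in place of the deep networks cited above) trained on input-solution pairs generated from a made-up original problem:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical original problem: min_{x >= 0} (x - sqrt(u))^2 + 0.05*x^3.
# Its solution map is approximated here by a brute-force grid search,
# standing in for an expensive NLP solve.
def x_star(u):
    grid = np.linspace(0.0, 3.0, 3001)
    return grid[np.argmin((grid - np.sqrt(u)) ** 2 + 0.05 * grid ** 3)]

# Offline: build the dataset {(u_i, x*_i)} by solving the problem repeatedly.
us = rng.uniform(0.1, 4.0, 100)
xs = np.array([x_star(u) for u in us])

# "Training" the proxy m(u) of problem (5): cubic polynomial least squares,
# a minimal stand-in for the deep networks used in the literature.
coef = np.polyfit(us, xs, deg=3)

def m(u):
    return np.polyval(coef, u)

# Online: a prediction is a cheap function evaluation, not an optimization.
x_pred = m(2.0)
```

The sketch also exposes the main weakness discussed above: nothing in the regression enforces feasibility of m(u) with respect to the original constraints.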
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">| DECISION-FOCUSED SURROGATE MODELING</head><p>The surrogate modeling frameworks reviewed in Section 2 (also summarized in Figure <ref type="figure">1</ref>) all have their advantages and disadvantages.</p><p>Optimization with embedded surrogate models is an intuitive approach that allows preservation of much of the structure of the original model. Domain knowledge can be effectively leveraged since, for someone familiar with the physical system, it is usually easy to identify the part of the model that needs to be replaced as well as to determine a suitable structure for the corresponding surrogate model.</p><p>Often, only a small number of constraints are complicating; in that case, simply keeping the remaining constraints can help ensure feasibility. However, surrogate models generated using these methods may lead to solutions that are quite different from the optimal solutions to the original problem. This shortcoming can be overcome by learning optimization proxies in a decision-focused fashion. Using tailored deep learning architectures, this approach can achieve highly accurate and fast surrogate models. However, it is often less data-efficient and cannot easily incorporate safety-critical hard constraints. In the following, we introduce the concept of decision-focused surrogate modeling (Figure <ref type="figure">1C</ref>), where we try to combine the desirable characteristics of the existing surrogate modeling approaches. We further present an inverse optimization approach to constructing such surrogate models for a certain class of problems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">| General formulation</head><p>In the proposed framework, the data generation process is the same as in optimization proxy learning, where the dataset is {(u_i, x*_i)}_{i ∈ I} with x*_i denoting the true optimal solution to the original problem with input u_i. Given such data, we directly train a surrogate optimization model, defined by an objective function f̂ and constraint functions ĝ, that minimizes the decision prediction error. This learning problem can be formulated as follows: minimize_{f̂ ∈ F, ĝ ∈ G} Σ_{i ∈ I} ℓ(x̂_i, x*_i) subject to x̂_i ∈ argmin_{x ∈ X} {f̂(x, u_i) : ĝ(x, u_i) ≤ 0} for all i ∈ I, (DFSLP) where f̂(·) and ĝ(·) can be chosen from some sets of functions F and G, respectively. The solution predicted by the surrogate optimization model for input u_i is denoted by x̂_i; hence, the loss function is analogous to the one in the optimization proxy learning problem (5). As a result, this approach is decision-focused while allowing full flexibility in specifying the structure of the surrogate optimization model, including which original constraints to keep.</p><p>Note that since (DFSLP) replaces the original constraints g in (<ref type="formula">1</ref>) with their surrogates ĝ, solving the resulting surrogate optimization problem could lead to solutions that are infeasible to (1). This presents a trade-off for enhancing computational efficiency, one that is also present in many of the other methods reviewed in Section 2.</p><p>Typical strategies for handling infeasible solutions involve projecting the predicted solution to the closest point on the feasible set or using the prediction to warm-start a fast local solver. With our surrogate modeling approach, infeasibility can be mitigated by retaining most of the original constraints and replacing only those posing computational challenges (similar to the methods outlined in Section 2.1). 
In a subsequent paper, <ref type="bibr">42</ref> we introduce a robust optimization method to identify regions of potential infeasibility within U, which are then used to improve the surrogate model by acquiring additional training samples from these identified regions.</p><p>Remark 2. A framework closely related to our proposed decision-focused surrogate modeling approach is decision-focused learning. <ref type="bibr">43</ref> However, it is not motivated by the need for fast online optimization; instead, it addresses the following problem: In traditional data-driven optimization, we often follow a two-stage predict-then-optimize approach, that is, we first predict unknown input parameters u from data with external features r and then solve the optimization problem with those predicted inputs. Here, the learning step focuses on minimizing the parameter estimation error; however, this does not necessarily lead to the best decisions (evaluated with the true parameter values) in the optimization step. In contrast, decision-focused learning, also known as smart predict-then-optimize, <ref type="bibr">44</ref> predict-and-optimize, <ref type="bibr">45</ref> and end-to-end learning for optimization, <ref type="bibr">46</ref> integrates the two steps to explicitly account for the quality of the optimization solution in the learning of the model parameters.</p></div>
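(DFSLP) is in general a bilevel program, but a deliberately simple surrogate class makes the decision-focused idea concrete. In the sketch below, the surrogate min_x ½x² − θ(u)x has the closed-form optimizer x̂(u) = θ(u), so minimizing the decision prediction error collapses into ordinary least squares; the assumed solution map x*(u) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Observed decisions from a hypothetical original (nonconvex) problem;
# assume its solution map is x*(u) = u + 0.3*sin(3u).
us = rng.uniform(0.0, 2.0, 50)
x_obs = us + 0.3 * np.sin(3.0 * us)

# Surrogate class: min_x 0.5*x^2 - theta(u)*x, whose unique optimizer is
# x_hat(u) = theta(u). With theta(u) a cubic polynomial in u, minimizing
# the decision prediction error sum_i (x_hat_i - x*_i)^2 reduces, for
# this special class only, to an ordinary least-squares fit.
coef = np.polyfit(us, x_obs, deg=3)

def surrogate_solution(u):
    # Optimal solution of the learned convex surrogate for input u.
    return np.polyval(coef, u)

err = float(np.mean((surrogate_solution(us) - x_obs) ** 2))
```

In general the lower-level argmin admits no closed form, which is why the KKT-based solution method of Section 3.3 is needed.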
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">| An inverse optimization approach</head><p>Problem (DFSLP) can be viewed as a data-driven inverse optimization problem. Given an observed decision made by some agent, the goal of inverse optimization <ref type="bibr">1</ref> is to determine an optimization problem whose optimal solution matches and hence best explains the agent's decision; traditionally, only the objective function of the optimization model is assumed to be unknown. <ref type="bibr">47</ref> In the (noisy) data-driven setting, multiple decisions for different inputs are observed, and the goal is to find an optimization problem whose optimal solutions match the observations as closely as possible. <ref type="bibr">48</ref> Applying this interpretation to the decision-focused surrogate modeling problem, the original optimization problem acts as the agent, and for each observation i, x*_i represents the observed decision, and u_i is the corresponding input; the surrogate optimization problem is then the optimization problem that we try to find such that its optimal solutions best resemble the observations. Interpreting (DFSLP) as an inverse optimization problem allows us to leverage existing methods from that literature to solve the problem.</p><p>Recently, we developed an efficient data-driven inverse optimization framework that can incorporate both unknown objective functions and constraints. <ref type="bibr">49</ref> This approach can be readily applied to a certain class of decision-focused surrogate modeling problems, which we formally define in the following. Here, we consider the original optimization problem to be a generally nonconvex nonlinear program (NLP) of the form (1), and we learn a surrogate problem of the form minimize_{x ∈ X} f̂(x, u; θ) subject to ĝ(x, u; ω) ≤ 0, ĥ(x, u; ω) = 0, (SP) where the function f̂ is strictly convex, the functions ĝ are convex, and the functions ĥ are linear in x.</p><p>In (SP), we parameterize the objective function and constraints with parameters θ and ω, respectively. 
This parameterized surrogate problem can be learned in a decision-focused manner by solving (DFSLP) with θ and ω as its decision variables; we use the term decision-focused surrogate optimization model (DFSOM) to describe the resulting (SP).</p><p>In this article, we use the mean squared error as the loss function, that is, ℓ(x̂_i, x*_i) = ‖x̂_i − x*_i‖₂².</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">| Solution method</head><p>The learning problem (DFSLP) is a bilevel program <ref type="bibr">50</ref> with multiple lower-level optimization problems. Despite the convexity of the lower-level problems, (DFSLP) is difficult to solve due to its poor scalability. <ref type="bibr">51</ref> In our previous work, <ref type="bibr">49</ref> we addressed this problem with an efficient decomposition-based strategy that generates high-quality solutions; the same algorithm is employed here. For the sake of brevity, we only provide a brief overview of our solution method in this subsection and refer the interested reader to Gupta and Zhang <ref type="bibr">49</ref> for a more detailed discussion.</p><p>Our solution approach involves reformulating the bilevel problem into a single-level optimization problem by replacing the lower-level problems with their Karush-Kuhn-Tucker (KKT) optimality conditions. It is important to highlight that this reformulation requires that (SP) satisfies some constraint qualification conditions. This requirement can be met by designing (SP) with all linear constraints (as done in our computational experiments in Section 4), or by ensuring that the set defined by ĝ &lt; 0 and ĥ = 0 is always feasible. The reformulation of (DFSLP) using the KKT conditions results in an NLP of the following form: minimize over θ ∈ Θ, ω ∈ Ω, and (x̂_i, λ_i, μ_i) for all i ∈ I the loss Σ_{i ∈ I} ℓ(x̂_i, x*_i) (7a), subject to, for every i ∈ I, the stationarity condition ∇_x f̂(x̂_i, u_i; θ) + ∇_x ĝ(x̂_i, u_i; ω)^⊤ λ_i + ∇_x ĥ(x̂_i, u_i; ω)^⊤ μ_i = 0 (7b), primal feasibility ĝ(x̂_i, u_i; ω) ≤ 0 (7c) and ĥ(x̂_i, u_i; ω) = 0 (7d), and dual feasibility with complementary slackness λ_i ≥ 0, λ_i^⊤ ĝ(x̂_i, u_i; ω) = 0 (7e), where the dual variables for the inequality and equality constraints of the lower-level problems in (DFSLP) are respectively denoted by λ and μ. Constraints (7b)-(7e) formulate the stationarity, primal feasibility, dual feasibility, and complementary slackness conditions for the lower-level problems.</p><p>The reformulated problem is a nonconvex NLP with complementarity constraints that violate standard constraint qualification conditions. 
As these problems are known to cause convergence difficulties for NLP solvers, we further consider an exact penalty reformulation <ref type="bibr">52</ref> of (<ref type="formula">7</ref>). This reformulation yields the following problem with a regularized feasible region, which satisfies the necessary constraint qualifications provided that the descriptions of the sets Θ and Ω also satisfy the necessary regularity conditions: minimize over θ ∈ Θ, ω ∈ Ω, and (x̂_i, λ_i ≥ 0, μ_i) for all i ∈ I the penalized objective Σ_{i ∈ I} ℓ(x̂_i, x*_i) + c^⊤ [Σ_{i ∈ I} ‖∇_x f̂(x̂_i, u_i; θ) + ∇_x ĝ(x̂_i, u_i; ω)^⊤ λ_i + ∇_x ĥ(x̂_i, u_i; ω)^⊤ μ_i‖₁; Σ_{i ∈ I} ‖max{0, ĝ(x̂_i, u_i; ω)}‖₁; Σ_{i ∈ I} ‖ĥ(x̂_i, u_i; ω)‖₁; Σ_{i ∈ I} |λ_i^⊤ ĝ(x̂_i, u_i; ω)|], (8)</p><p>where c is the vector of positive penalty parameters. The exactness of this reformulation holds only for penalty parameters greater than a certain threshold, which is hard to determine a priori in practice. As a result, we take an iterative approach, starting with small values for c and gradually increasing them in subsequent iterations (by a factor of ρ) until the solution of (<ref type="formula">8</ref>) is also a feasible solution for (7). We use a feasibility tolerance of ϵ as the termination criterion for the algorithm. A pseudocode that summarizes our overall approach is shown in Algorithm 1.</p><p>While (<ref type="formula">7</ref>) can be solved using the penalty reformulation, the reformulated problem (8) remains a nonconvex NLP whose large instances are generally difficult to solve. 
Here, we notice that for certain classes of (SP), including quadratic programs (QPs), (<ref type="formula">8</ref>) becomes a multiconvex optimization problem <ref type="bibr">53</ref> that can be solved with a block coordinate descent (BCD) strategy or directly with an NLP solver (e.g., IPOPT).</p><p>Moreover, the nonsmooth problem (8) can be converted into one with a smooth objective function by introducing additional variables to linearize the penalty terms. This reformulation increases the size of the problem; however, in our previous work, <ref type="bibr">49</ref> we show that solving (8) in this manner provides significantly better solutions than when IPOPT <ref type="bibr">54</ref> is applied directly to (7).</p></div>
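The penalty loop of Algorithm 1 can be sketched on a tiny instance with a one-dimensional lower-level problem. The surrogate class, the data, and the quadratic (rather than exact ℓ1) penalty below are simplifications made only for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Training data {(u_i, x*_i)}; the observed decisions are constructed to be
# reproducible by the surrogate class with omega = 2 (a made-up instance).
u = rng.uniform(0.5, 2.0, 10)
x_star = 2.0 * u

# Toy surrogate (SP): min_x 0.5*x^2  s.t.  omega*u - x <= 0.
# KKT: x - lam = 0,  omega*u - x <= 0,  lam >= 0,  lam*(omega*u - x) = 0.
def kkt_violation(omega, x, lam):
    return max(np.max(np.abs(x - lam)),
               np.max(np.maximum(0.0, omega * u - x)),
               np.max(np.abs(lam * (omega * u - x))))

def penalized_obj(z, c):
    omega, x, lam = z[0], z[1:1 + len(u)], z[1 + len(u):]
    loss = np.sum((x - x_star) ** 2)                      # decision error
    station = np.sum((x - lam) ** 2)                      # stationarity
    primal = np.sum(np.maximum(0.0, omega * u - x) ** 2)  # primal feasibility
    compl = np.sum((lam * (omega * u - x)) ** 2)          # complementarity
    # Smooth quadratic penalty; the article uses an exact l1 penalty instead.
    return loss + c * (station + primal + compl)

# Iterative penalty loop in the spirit of Algorithm 1: increase c by a
# factor rho until the lower-level KKT conditions hold to tolerance eps.
z = np.concatenate([[1.0], x_star, x_star])   # start: omega=1, x=x*, lam=x*
bounds = [(None, None)] * (1 + len(u)) + [(0.0, None)] * len(u)
c, rho, eps = 1.0, 10.0, 1e-4
for _ in range(8):
    z = minimize(penalized_obj, z, args=(c,), method="SLSQP", bounds=bounds).x
    omega, x, lam = z[0], z[1:1 + len(u)], z[1 + len(u):]
    if kkt_violation(omega, x, lam) < eps:
        break
    c *= rho
```

Since the data here are exactly reproducible by the surrogate class, the penalty loop should drive both the decision error and the KKT residuals toward zero, recovering omega close to 2.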
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">| COMPUTATIONAL CASE STUDIES</head><p>We present numerical results from three case studies based on typical nonlinear chemical engineering systems where we apply the proposed decision-focused surrogate modeling strategy to simplify the optimization task. The first two case studies are based on single-input systems for which we present detailed analyses demonstrating how our approach can provide excellent decision prediction accuracy with a much simpler model than is typically required by traditional methods that construct surrogate optimization problems with embedded surrogate models. The third case study is based on a larger multi-input multi-output blending network system where we show the data efficiency of our approach in comparison to black-box optimization proxies. We also demonstrate the effectiveness of the resulting DFSOM in reducing the computational burden of optimizing the system in real time. All computer code along with the datasets used for these case studies is available at <ref type="url">https://github.com/ddolab/DecFocSurrMod</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">| RTO of a continuous stirred tank reactor system</head><p>We consider the problem of RTO of a continuous stirred tank reactor (CSTR) operating in a stochastic environment. Specifically, we optimize the operation of an ideal adiabatic CSTR <ref type="bibr">55</ref> carrying out an exothermic reversible reaction between reactant A and product R.</p><p>The concentration of the inlet stream, which does not contain any R, is subject to observable disturbances. Here, the primary goal of RTO is to maximize the concentration of the product R in the outlet stream by manipulating the temperature of the inlet stream. This can be formulated as the following optimization problem: maximize_{T_i, T_o, A_o, R_o} R_o subject to the steady-state mass and energy balances of the reactor, (9) where A_i is the concentration of A in the inlet stream, and T_i is the inlet temperature. Similarly, variables with the subscript o characterize the properties of the outlet stream. In (9), the first two constraints are the mass balances, whereas the third constraint specifies the energy balance for the reactor. The paper by Economou et al. <ref type="bibr">55</ref> provides details of all model parameters used in this case study.</p><p>Problem (<ref type="formula">9</ref>) is a nonconvex NLP due to the dependence of the forward and backward reaction rate constants (k_1 and k_−1) on the reactor temperature (T_o). We use decision-focused surrogate modeling to learn a convex surrogate optimization model for the RTO problem.</p><p>We achieve this by replacing the original rate law expression with the approximation in Equation (10), which is linear in the decision variables of (9) and whose coefficients a, b, and c are functions of the input parameter A_i. In addition, we also learn scalar coefficient values for a new quadratic objective function for the DFSOM.</p></div>
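The structure of the resulting DFSOM can be sketched as follows. The coefficient polynomials, the quadratic objective, and the constraint layout below are invented placeholders rather than the quantities learned in the article; they only illustrate how the online solve reduces to a small convex problem:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical learned coefficient polynomials of the input A_i (as in (10));
# in the article these are trained via (DFSLP), here they are simply made up.
a_coef = [0.018, 0.15]   # a(A_i), highest degree first (np.polyval order)
b_coef = [0.0, 0.002]    # b(A_i)
c_coef = [0.02, 0.3]     # c(A_i)

def solve_dfsom(A_i):
    # Evaluate the input-dependent coefficients of the linear rate surrogate.
    a = np.polyval(a_coef, A_i)
    b = np.polyval(b_coef, A_i)
    c = np.polyval(c_coef, A_i)
    # Decision vector x = (T_i, T_o, A_o, R_o); placeholder convex objective
    # and linear constraints, so the online solve is fast and reliable.
    def obj(x):
        T_i, T_o, A_o, R_o = x
        return (R_o - 0.9 * A_i) ** 2 + 0.01 * (T_o - T_i) ** 2
    cons = [
        # overall mass balance: A_i = A_o + R_o
        {"type": "eq", "fun": lambda x: A_i - x[2] - x[3]},
        # hypothetical linearized rate law: R_o = a + b*T_o + c*A_o
        {"type": "eq", "fun": lambda x: x[3] - (a + b * x[1] + c * x[2])},
    ]
    x0 = np.array([350.0, 350.0, 0.5 * A_i, 0.5 * A_i])
    return minimize(obj, x0, method="SLSQP", constraints=cons).x

# Online RTO solve for one disturbance realization A_i = 1.0.
T_i, T_o, A_o, R_o = solve_dfsom(1.0)
```

Because the surrogate is convex with constraints linear in the decision variables, each online solve is a small QP-like problem, which is what makes the DFSOM suitable for real-time use.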
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1">| Surrogate model complexity and data efficiency</head><p>We begin by examining how using different functional forms for the coefficients a(A_i), b(A_i), and c(A_i) in (10) affects the accuracy of the learned DFSOM. To simplify our analysis, we focus on these coefficients being polynomial functions of the input disturbance value. We test our model's accuracy by varying the complexity of the functions from constants up to cubic polynomials of A_i. In each case, we train the DFSOM using a dataset I of 1000 samples, with each sample containing the input parameter A_i and the corresponding optimal solution.</p><p>The results of the model complexity analysis are shown in Figure <ref type="figure">2A</ref>. In the plot, the "Sparse 3" column represents the case in which we specify the coefficient models to be cubic but add a term to the objective function of (DFSLP) that penalizes the sum of the absolute values of the polynomial coefficients. This technique is commonly used in machine learning to control model complexity. We find that this approach results in simple yet highly accurate polynomials, so we choose "Sparse 3" as the model for our further analysis.</p><p>Figure <ref type="figure">2A</ref> shows that increasing the model complexity generally leads to a more accurate optimization problem. However, it is important to note that this added complexity lies in the input parameter space, and the optimization problem in the decision variable space stays convex. This is one of the unique features of our approach, allowing us to transfer complexity from the decision variable space to the input space while still learning the relevant characteristics of the original optimization model. 
None of the traditional methods that employ embedded surrogate models can learn a similar model because their model learning step is entirely separate from the optimization step; hence, there is no differentiation between the input parameter space and the decision variable space when constructing the surrogate model.</p><p>We then investigate how the size of the training dataset affects the accuracy of our model. This is important because constructing the dataset requires solving a complex optimization problem. Figure <ref type="figure">2B</ref> shows that our method can produce an accurate and robust model with as few as 25 samples, indicating that a large dataset may not be necessary. In the next subsection, we compare the performance and data efficiency of our method with traditional methods using embedded surrogate models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2">| Comparison with traditional embedded surrogate model approach</head><p>To compare our proposed strategy to other surrogate modeling methods popular in PSE, we construct surrogate models for the rate law expression. We use a large training dataset of 3000 points, where each point consists of (A_o, R_o, T_o) as inputs and the true rate value as the output. These surrogate models learn a function r̂(A_o, R_o, T_o) that produces approximately the same reaction rate as the original rate expression (on the left-hand side) in (10). We then substitute the original rate law in (9) with r̂ to obtain a surrogate optimization model, which we solve using the local NLP solver IPOPT. 
<ref type="bibr">54</ref> Note that although the resulting surrogate optimization models are generally nonconvex, we use a local solver because it is typically preferred in RTO settings due to the high computational effort required by global solvers.</p><p>We consider the following two surrogate modeling methods for the rate law expression:</p><p>• ES-ALAMO - We use ALAMO <ref type="bibr">15</ref> (version 2022.10.7) to estimate r. We allow all available basis functions except for the sine and cosine functions. We manually substituted the algebraic expressions for r obtained using ALAMO into an RTO problem implemented in the JuMP <ref type="bibr">56</ref> modeling environment available in the Julia programming language.</p><p>• ES-NN - We train a neural network (NN) to estimate the reaction rate value. The NN model comprises three inputs, four hidden layers with 10 nodes each, and a single output node. We use smooth sigmoid activation functions for the hidden layers to ensure that the resulting optimization model remains solvable with IPOPT. We used the Python library TensorFlow <ref type="bibr">57</ref> v2.10.0 to train the NN models. These models were incorporated into the optimization problems using the OMLT <ref type="bibr">58</ref> package v1.0.</p><p>Figure <ref type="figure">3</ref> shows a comparison of the optimal T i values obtained using the different methods. The DFSOM, trained with only 50 data points, significantly outperforms the two traditional methods. While ES-ALAMO yields a robust model that does not vary much with changing data, the predicted optimal T i values differ significantly from their true values. In contrast, while ES-NN yields mean T i values close to the true optimum, its output is highly sensitive to the training dataset used. 
The decision-focused model provides both high accuracy and robustness, independent of the quality of the dataset.</p><p>We further evaluate the quality of the different models by comparing the feasible regions of the surrogate optimization models with the true feasible region. This analysis is shown in Figure <ref type="figure">4</ref>. Methods that embed surrogate models in the optimization problem, as compared in Figure <ref type="figure">4A</ref>, aim to exactly replicate the feasible region, often requiring complex models that need larger training datasets. However, even with a good surrogate model that closely replicates the true feasible region, the decision prediction error can still be high due to small discrepancies, as we observe for ES-NN. In contrast, decision-focused learning achieves high accuracy with simple models by focusing only on predicting the optimal solutions. As shown in Figure <ref type="figure">4B</ref>, although the feasible region of the DFSOM is different from the true feasible region, they coincide exactly at the optimum of the original model. The learning problem (DFSLP) adjusts the surrogate optimization model's objective function to ensure that this point is the optimal solution. Overall, we find that decision-focused learning allows us to significantly reduce the complexity of the surrogate optimization model while retaining its accuracy in predicting the optimal solutions.</p><p>FIGURE 2 caption: The true optimal solution obtained by solving (<ref type="formula">9</ref>) is denoted by T * i , and T̂ i represents the solution of the decision-focused surrogate optimization model. The box plots show the interquartile ranges of the prediction error for the surrogate models obtained with 10 different instances of randomly generated training data. A separate test dataset of 100 samples is used to compute the prediction error.</p><p>FIGURE 3 caption: Line plots show the mean optimal T i values obtained using different surrogate modeling methods, with shaded areas indicating one standard deviation. Five surrogate optimization models were developed for each method using different datasets.</p></div>
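The sparsity penalty behind the "Sparse 3" coefficient models can be illustrated with a small stand-alone fit. The sketch below is not the authors' implementation: in the paper, the L1 penalty enters the objective of the bilevel learning problem (DFSLP), whereas here we simply fit one hypothetical coefficient model a(A i ) by lasso-style coordinate descent on synthetic data; the data-generating function and the penalty weight are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*|.| (the soft-thresholding step of lasso)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def sparse_cubic_fit(A, y, lam=0.5, sweeps=2000):
    """Minimize 0.5*||X w - y||^2 + lam*||w||_1 over cubic features of A."""
    X = np.column_stack([A**p for p in range(4)])   # [1, A, A^2, A^3]
    w = np.zeros(4)
    for _ in range(sweeps):
        for j in range(4):
            # residual with feature j's current contribution removed
            r = y - X @ w + X[:, j] * w[j]
            w[j] = soft_threshold(X[:, j] @ r, lam) / (X[:, j] @ X[:, j])
    return X, w

rng = np.random.default_rng(0)
A = rng.uniform(0.3, 1.0, 200)   # synthetic input-disturbance samples
y = 2.0 - 1.5 * A                # hypothetical target coefficient values
X, w = sparse_cubic_fit(A, y)
mse = float(np.mean((X @ w - y) ** 2))
```

Because the synthetic target is affine in A, the L1 penalty tends to shrink the redundant quadratic and cubic weights, mimicking how "Sparse 3" keeps only the polynomial terms the data supports.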
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">| RTO of a heat exchanger network</head><p>Our second case study considers a heat exchanger network adapted from Biegler et al., <ref type="bibr">59</ref> as shown in Figure <ref type="figure">5</ref>. The inlet temperature of stream H2, denoted by T 5 , has a nominal value of 583 K but is subject to random disturbances. Upon a change in T 5 , we optimize the operation of the heat exchanger network by solving the nonconvex NLP (11), where the cooling duty Q c and the heat capacity flow rate F H2 are the adjustable variables.</p><p>FIGURE 4 caption: The feasible regions of the original and surrogate real-time optimization problems projected onto the two-dimensional variable space between R o and T i . From left to right, the three line plots depict the feasible regions for A i = 0.3, 0.6, and 1.0 mol/L, respectively. The contours in the background represent the objective function for the true model in (A) and for the decision-focused surrogate optimization model in (B), with darker colors indicating larger values.</p><p>FIGURE 5 caption: Given heat exchanger network.</p><p>The nonconvexity in (11) arises from the bilinear term in constraint (11c). Therefore, we apply our approach to replace the nonconvex term with an approximation that is linear in the decision variables of (<ref type="formula">11</ref>), with coefficients a and b that are functions of the input parameter T 5 . Along with this, we also learn new coefficients for the quadratic objective function. Unlike in the previous case study, here these coefficients are also functions of the input parameter T 5 . We specify the model for all unknown parameters as "Sparse 3" polynomials in T 5 (which we defined in Section 4.1.1).</p><p>To train the models for a, b, and the objective coefficients, we solve (DFSLP) with datasets of varying sizes. 
Each data point in the dataset consists of an input T 5 and the corresponding optimal Q * c and F * H2 values that we obtain by solving (11). We evaluate the performance of the resulting surrogate optimization models using a test dataset of 100 data points. The results of this analysis are shown in Figure <ref type="figure">6A</ref>, where we plot the evolution of the prediction quality as a function of the training dataset size. We find that the prediction error of the models converges to its minimum value with as few as 50 data points. Therefore, we set |I| to 50 for our further analysis.</p><p>Next, we compare the performance of our decision-focused approach with the traditional approach based on embedded surrogate models. To do this, we train an NN with two hidden layers, each consisting of ten nodes, whose task is to output the value of the bilinear term given Q c and F H2 as inputs. The dataset used for this purpose consists of 1000 data points. As in the first case study, we use the sigmoid activation function for the hidden layers to guarantee the solvability of the surrogate optimization model with IPOPT. We then embed this NN in (11) by replacing the bilinear term, resulting in a surrogate optimization model that we call ES-NN.</p><p>Due to the simplicity of the function that the NN is intended to replace, the NN learns to approximate the bilinear term almost perfectly. However, as we show in Figure <ref type="figure">6B</ref>, ES-NN does not always yield the true optimal Q c values as its solution. This happens because the NN-based problem is nonconvex, and IPOPT recovers one of the two local solutions that are possible for a given T 5 value. A global solver could remedy this situation, but it may not be practical for RTO due to the associated computational expense. In contrast, the DFSOM, which is convex by design, correctly identifies the discontinuity in the global optimal solution space even when solved with IPOPT. 
Additionally, we find that the DFSOM produces solutions that are nearly identical to those of the true problem for all values of T 5 considered in this case study.</p><p>In conclusion, this case study showcases a notable benefit of decision-focused surrogate modeling: the resulting surrogate optimization models are convex, rendering them amenable to efficient local solvers. This obviates concerns about convergence to suboptimal solutions, which is a common issue when using nonconvex models in RTO applications.</p></div>
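The failure mode discussed above, where a local solver returns one of several local solutions of a nonconvex surrogate depending on its starting point while a convex surrogate has a unique minimizer, can be reproduced on a toy one-dimensional problem. This is purely illustrative (neither function appears in the case study), and plain gradient descent stands in for a local NLP solver such as IPOPT.

```python
# Mimic a local NLP solver with plain gradient descent.
def descend(grad, x0, step=1e-2, iters=20000):
    x = x0
    for _ in range(iters):
        x -= step * grad(x)
    return x

# Nonconvex objective f(x) = (x^2 - 1)^2 with local minima at x = -1 and x = +1:
# the "solution" depends entirely on the starting point.
grad_nonconvex = lambda x: 4 * x * (x**2 - 1)
left = descend(grad_nonconvex, -0.5)   # converges to the minimum near -1
right = descend(grad_nonconvex, 0.5)   # converges to the minimum near +1

# Convex objective g(x) = (x - 1)^2: any starting point reaches x = 1.
grad_convex = lambda x: 2 * (x - 1)
same = descend(grad_convex, -0.5)
```

The convex case is the situation a DFSOM creates by design: the local solver's answer no longer depends on initialization.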
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">| RTO of a blending network</head><p>We consider a network of blending nodes that mix material streams of various specifications to produce products with desired qualities.</p><p>An optimization problem that minimizes the cost of operating this network is formulated in (13), with decision variables f, x, and q. <ref type="bibr">60</ref> Its mass balance constraints take the form ∑ i∈N j f ij = ∑ k∈K x jk ∀j∈J (13b).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>FIGURE 6 caption: The true optimal solution obtained by solving (11) is denoted by Q * c , and Q̂ c represents the solution of the surrogate optimization model. The box plots show the interquartile ranges of the prediction error for the surrogate optimization models obtained with 10 different instances of randomly generated training data. (A) Effect of training dataset size on model accuracy, using a separate test dataset of 100 samples to compute the prediction error. (B) Comparison of DFSOM with the embedded surrogate model.</p><p>The demand constraints of (13) are ∑ j∈J x jk ≤ S k ∀k∈K (13d), and the product-quality constraints are ∑ j∈J q jw x jk ≤ Z kw ∑ j∈J x jk ∀k∈K, w∈W (13e), where J is the set of blending nodes, and N j are the sets of material streams entering node j; we use variable f ij to denote the flow rate of stream i entering node j. The product streams are generated by combining the outflows from the different blending nodes and are represented by the set K. The outflow rate from node j to product stream k is denoted by the variable x jk . Additionally, the set W contains the various components present in the material streams whose specifications need to be maintained in the output streams. We describe the quality of node j with respect to a specific component w using the variable q jw .</p><p>In problem (13), we enforce mass balance for each node through constraints (13b). The quality of a blending node j is determined as a function of the specifications, λ ijw , of the streams entering that node, through constraints (13c). Additionally, we ensure that the outflow rate of a product stream k does not exceed its market demand, S k , using constraints (13d). Finally, constraints (13e) keep the specification of component w in product stream k below its permissible level, Z kw .</p><p>The objective function of (13) comprises three terms. 
The first term represents the linear cost of acquiring a material stream, while the second penalizes the deviation of the inlet flow rates from their nominal values, f̄ ij . The third term represents the revenue generated through the product streams.</p><p>We consider a scenario in which RTO of the blending network operation is desired in the face of changing product demands, S = {S k } k∈K . However, solving (13) is highly challenging due to the presence of bilinearities in constraints (13c) and (13e). In fact, the efficient solution of the blending problem has been the subject of much research due to its importance in the operation of petroleum refineries and wastewater treatment plants, among others. <ref type="bibr">[61]</ref><ref type="bibr">[62]</ref><ref type="bibr">[63]</ref> To address this issue, we propose training a convex DFSOM of (13) offline for online use. In what follows, we show that our approach can significantly speed up the online solution of the blending problem while preserving solution accuracy, as measured against the global solutions of (13).</p></div>
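The constraint structure described above can be made concrete with a small feasibility check. This is an illustrative sketch with made-up numbers, not data from the case study; it evaluates a candidate operating point against the mass-balance constraints (13b), the demand constraints (13d), and the quality constraints (13e), which is a cheap way to sanity-check any surrogate's output.

```python
def feasible(f, x, q, S, Z, tol=1e-6):
    """Check (f, x, q) against constraints (13b), (13d), and (13e).

    f[j]: inlet flows into node j; x[j][k]: flow from node j to product k;
    q[j][w]: quality of node j for component w; S[k]: demand of product k;
    Z[k][w]: permissible level of component w in product k.
    """
    J = range(len(x))
    K = range(len(S))
    W = range(len(Z[0]))
    # (13b) mass balance: total inflow of node j equals its total outflow
    for j in J:
        if abs(sum(f[j]) - sum(x[j])) > tol:
            return False
    # (13d) demand: total flow into product k must not exceed S_k
    for k in K:
        if sum(x[j][k] for j in J) > S[k] + tol:
            return False
    # (13e) quality: blended spec of component w in product k below Z_kw
    for k in K:
        for w in W:
            lhs = sum(q[j][w] * x[j][k] for j in J)
            rhs = Z[k][w] * sum(x[j][k] for j in J)
            if lhs > rhs + tol:
                return False
    return True

# Toy instance: one node, two inlet streams, one product, one component.
ok = feasible(f=[[2.0, 3.0]], x=[[5.0]], q=[[0.5]], S=[6.0], Z=[[0.6]])
bad = feasible(f=[[2.0, 3.0]], x=[[5.0]], q=[[0.9]], S=[6.0], Z=[[0.6]])
```

In the first call all three constraint groups hold; the second violates the quality limit (0.9 · 5.0 exceeds 0.6 · 5.0).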
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1">| Design and training of the DFSOM</head><p>Our approach involves linearizing the nonconvex terms in (13). For each j∈J, k∈K, and w∈W, there is a bilinear term q jw x jk in (<ref type="formula">13</ref>), which we replace with an approximation that is linear in the decision variables, with scalar parameters p′ jkwk′ℓ and q′ jkwk′ℓ to be determined by solving (DFSLP). Similar to the previous two problems, we penalize the sum of the absolute values of p′ jkwk′ℓ and q′ jkwk′ℓ in the objective function to induce sparsity. In this problem, we keep the objective function of the decision-focused surrogate problem the same as in the original problem.</p><p>We test the proposed decision-focused surrogate modeling approach on the blending network presented in Example 2 of Foulds et al. <ref type="bibr">64</ref> This network has four pooling nodes blending a total of six inlet streams into four final products; for the exact parameter values used in this case study, we refer to Foulds et al. <ref type="bibr">64</ref> To generate training data for (DFSLP), we solve (13) with S k values sampled from the uniform distribution U(100, 200) for all k∈K. A single data point here consists of the input vector S and the corresponding optimal solution vector (f, x, q).</p></div>
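The data-generation step just described can be sketched as follows. Only the demand-sampling logic follows the text; `solve_blending_globally` is a hypothetical placeholder for a global solve of (13) (e.g., with Gurobi) and here returns a dummy solution so the loop is runnable.

```python
import random

def solve_blending_globally(S):
    """Placeholder for a global solve of (13) at demand vector S."""
    return {"f": None, "x": None, "q": None}   # dummy optimal solution

random.seed(0)
dataset = []
for _ in range(50):   # |I| training samples
    # Sample one demand S_k ~ U(100, 200) for each of the four products.
    S = [random.uniform(100.0, 200.0) for _ in range(4)]
    dataset.append((S, solve_blending_globally(S)))
```

Each entry pairs the sampled input vector S with the corresponding optimal (f, x, q), which is exactly the form of training data (DFSLP) consumes.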
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2">| Data efficiency of DFSOM in comparison to black-box optimization proxies</head><p>We examine the minimum amount of training data needed to create a high-quality surrogate. To do this, we solve (DFSLP) using different sizes of I and then assess the prediction error on a separate test dataset of 100 points. Figure <ref type="figure">7A</ref> displays the results of this analysis. Each box plot in the figure represents the interquartile range of the prediction error for ten decision-focused surrogate models, each constructed using a different random training dataset of size |I|.</p><p>Remarkably, we find that, even for this high-dimensional problem, a low prediction error can be achieved with just 20 samples. Additionally, the prediction error reaches its minimum value with as few as 75 samples. These findings demonstrate the effectiveness of the proposed approach in developing surrogates from a small dataset, which is crucial since solving (13) to construct a large training dataset can be computationally burdensome.</p><p>The results in Figure <ref type="figure">7A</ref> are in direct contrast to the ones presented in Figure <ref type="figure">7B</ref>, where we assess the accuracy of NN models that are trained to be optimization proxies for (13). We train three different types of networks to produce optimal f and x values given inputs S. The main difference between these networks is their size, which gives them different numbers of learnable parameters (our DFSOMs have 288 learnable parameters); the key characteristics of these networks are summarized in Table <ref type="table">1</ref>. We train these NNs with different numbers of training data points, ranging from 10 to 4000. For each NN and |I| combination, we train ten different models, each with a different random training dataset. 
Similar to the DFSOM case, we evaluate the prediction accuracy of the NNs on a separate test dataset of 100 samples.</p><p>As shown in Figure <ref type="figure">7B</ref>, even with 4000 data points, the NNs perform significantly worse than the DFSOMs. We believe this is a direct consequence of the black-box nature of NNs, which does not allow for the incorporation of known correlations in the data, in contrast to the DFSOM. It is also important to note that the DFSOM output will always satisfy the crucial mass balances in (13b) and the demand satisfaction constraints in (13d), whereas there is no trivial way of ensuring the same for the NN outputs. These findings further highlight that the proposed decision-focused surrogate modeling approach can be a superior alternative to NNs when it comes to generating high-quality surrogates for computationally challenging optimization problems.</p></div>
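The learnable-parameter counts compared above reduce to simple layer-size arithmetic for fully connected networks. The hidden-layer sizes for the blending proxies below are hypothetical placeholders (Table 1 is not reproduced here); the ES-NN layer sizes are the ones stated in Section 4.1.2, and the DFSOMs in this case study have 288 learnable parameters for reference.

```python
def n_params(sizes):
    """Count weights and biases of a dense NN with the given layer sizes."""
    return sum(m * n + m for n, m in zip(sizes[:-1], sizes[1:]))

# Hypothetical small and large proxies mapping the 4 demands S to decisions.
small = n_params([4, 8, 8])
large = n_params([4, 64, 64])

# ES-NN from Section 4.1.2: 3 inputs, four hidden layers of 10, 1 output.
es_nn = n_params([3, 10, 10, 10, 10, 1])
```

Even a modest fully connected proxy quickly exceeds the 288 parameters of the DFSOM, which helps explain the data-efficiency gap seen in Figure 7.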
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.3">| Computational performance of DFSOMs for RTO</head><p>We analyze the computational performance of the DFSOM in contrast to (13). We solve 1000 instances of the DFSOM and of (13), with each instance corresponding to different S values following the same distribution as the training data. The optimization solver Gurobi is used to solve the DFSOM instances, while we solve each instance of (13) twice, once with the global solver Gurobi and again with the local solver IPOPT. The histogram in Figure <ref type="figure">8</ref> shows the distribution of computation times for the three cases. We find that, depending on S, the global solution of (13) can take more than 100 s to find, possibly making it intractable for online application. While the local solver provides a fast solution in most cases, the generated solutions generally suffer from large optimality gaps, as shown in Table <ref type="table">2</ref>. The IPOPT-generated solutions had a mean optimality gap of 61%, which can be as high as 230% in the worst case.</p><p>In contrast, the DFSOM is highly effective in alleviating the computational burden associated with the global solution of this challenging optimization problem. We find that all 1000 instances of the DFSOM solve in less than 0.1 s, and even the worst-case optimality gap is only 0.9%. Therefore, in addition to being a superior surrogate modeling strategy, decision-focused surrogate modeling can also be a better alternative to suboptimal local solvers.</p><p>Remark 4. In all three numerical case studies, we employ polynomial functions of the inputs to represent the nonlinear coefficient models. For instance, in (10), a(A i ), b(A i ), and c(A i ) are modeled as cubic polynomials in A i . The choice of polynomials stems from their linearity in the unknown coefficients.</p><p>TABLE 1 caption: Key characteristics of neural networks (NNs) trained as optimization proxies for (13). 
</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>15475905, 2024, 4, Downloaded from https://aiche.onlinelibrary.wiley.com/doi/10.1002/aic.18338 by UNIVERSITY OF MINNESOTA 170 WILSON LIBRARY, Wiley Online Library on [25/03/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License</p></note>
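The optimality gaps reported in Section 4.3.3 can be computed as follows. The numbers are illustrative, not the paper's data; for a minimization problem, the gap measures how far a candidate objective value lies above the global optimum.

```python
def optimality_gap(obj_candidate, obj_global):
    """Relative optimality gap in percent for a minimization problem."""
    return 100.0 * (obj_candidate - obj_global) / abs(obj_global)

# Example: a local solution whose objective is 61% above the global optimum.
gap = optimality_gap(obj_candidate=-39.0, obj_global=-100.0)   # 61.0
```

A solution from a local solver such as IPOPT can be scored this way against the Gurobi global solution of (13), which is how the mean and worst-case gaps in Table 2 are defined.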
		</body>
		</text>
</TEI>
