<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>From Function to Distribution Modeling: A PAC-Generative Approach to Offline Optimization</title></titleStmt>
			<publicationStmt>
				<publisher>NeurIPS 2024 Workshop on Data-driven and Differentiable Simulations, Surrogates, and Solvers (D3S3)</publisher>
				<date when="2024-10-30">October 30, 2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10596088</idno>
					<idno type="doi"></idno>
					
					<author>Q Zhang</author><author>R Zhou</author><author>Y Shen</author><author>T Liu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This paper considers the problem of offline optimization, where the objective function is unknown except for a collection of "offline" data examples. While recent years have seen a flurry of work on applying various machine learning techniques to the offline optimization problem, the majority of these works focused on learning a surrogate of the unknown objective function and then applying existing optimization algorithms. While the idea of modeling the unknown objective function is intuitive and appealing, from the learning point of view it also makes it very difficult to tune the objective of the learner according to the objective of optimization. Instead of learning and then optimizing the unknown objective function, in this paper we take the less intuitive but more direct view that optimization can be thought of as a process of sampling from a generative model. To learn an effective generative model from the offline data examples, we consider the standard technique of "re-weighting", and our main technical contribution is a probably approximately correct (PAC) lower bound on the natural optimization objective, which allows us to jointly learn a weight function and a score-based generative model from a surrogate loss function. The robustly competitive performance of the proposed approach is demonstrated via empirical studies using the standard offline optimization benchmarks.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Offline optimization refers to the problem of optimizing an unknown real-valued objective function f based only on a collection of "offline" data examples (x i , f (x i )), &#8704;i &#8712; [m] := {1, 2, . . . , m}, where each x i is an independent sample drawn from an unknown distribution p data . Aside from these examples, no additional information on the objective function f is available prior to or during the optimization process, and hence the name "offline optimization". This rather restrictive setting is particularly relevant to the optimization scenarios where: i) the objective function is very complex and no structural information is available; and ii) querying the objective function is very expensive. Obviously, offline optimization is a more challenging setting than standard optimization <ref type="bibr">[6]</ref>, where full structural information on the objective function is available; or black-box optimization <ref type="bibr">[5]</ref>, where even though no structural information on the objective function is available, the objective function can be queried upon during the optimization process. Therefore, instead of aiming at the global optima, for offline optimization we are usually satisfied with finding a few candidates, among which there are significantly better<ref type="foot">foot_0</ref> solutions than the existing offline observations. Under this lesser objective, a direct application of offline optimization is experimental design, for which the candidates are to be efficiently found without querying objective functions through experiments that are often slow or expensive. 
Applications in the literature include the design of proteins <ref type="bibr">[32]</ref>, chemical molecules <ref type="bibr">[21]</ref>, DNA sequences <ref type="bibr">[29]</ref>, aircraft <ref type="bibr">[25]</ref>, robots <ref type="bibr">[39]</ref>, and hardware accelerators <ref type="bibr">[37]</ref>.</p><p>Related work. Traditionally, offline optimization has mainly been approached through the Bayesian view, i.e., by endowing the unknown objective function f with a prior distribution. This has led to a large body of work under the name Bayesian optimization; see Fu and Levine <ref type="bibr">[19]</ref> and the references therein for recent progress in this direction. Motivated by the rapid progress in machine learning, recent years have also seen a flurry of work on offline optimization from a frequentist view <ref type="bibr">[11,</ref><ref type="bibr">22,</ref><ref type="bibr">35,</ref><ref type="bibr">45]</ref>, i.e., by modeling the objective function f as a deterministic but unknown function. However, most of these works have focused on learning a surrogate of the unknown objective function and then applying existing optimization algorithms (see "Function Modeling" in Figure <ref type="figure">1</ref>). Prime examples include Trabucco et al. <ref type="bibr">[45]</ref>, Brookes et al. <ref type="bibr">[11]</ref>, Gupta and Zou <ref type="bibr">[22]</ref>, and Chen et al. <ref type="bibr">[14]</ref>.</p><p>While the idea of modeling the unknown objective function is intuitive and appealing, from the learning point of view it also makes it very difficult to tune the objective of the learner according to the objective of optimization <ref type="bibr">[45,</ref><ref type="bibr">11,</ref><ref type="bibr">22]</ref>. As a result, it is very difficult to gauge whether these previous approaches actually come with any theoretical guarantees.</p><p>Main contribution. 
Figure <ref type="figure">1</ref> illustrates the key differences between function modeling and distribution modeling for offline optimization. In sharp contrast to function modeling, our approach does not explicitly learn a surrogate of the unknown objective function, but directly learns a generative model <ref type="foot">3</ref> from the offline examples using a theoretically-grounded surrogate of the natural optimization objective: J opt (&#952;) = E x&#8764;p &#952; [f (x)]. The main contribution can be summarized as follows:</p><p>We derive a PAC bound on the natural optimization objective, allowing the generative model and weight function to be jointly learned from a surrogate loss function that aligns the learner's and optimization objectives.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">The Method</head><p>In this paper we take on a less intuitive but more direct view of optimization and consider it as a process of sampling from a generative model. There are two natural advantages to this view. First, through sampling exploration is now intrinsic in the optimization process. Second, this view allows us to shift our focus from modeling the objective function to modeling a target distribution. Unlike learning a surrogate on the objective function, as we shall see, the objective of learning a generative model can be naturally aligned with the objective of optimization, thus bringing theoretical guarantees on the optimization performance.</p><p>More specifically, let p &#952; be a generative model from which sampling can produce, with high probability, samples whose objective values are significantly better than the offline observations. Note that unlike the traditional generative models, whose purpose to generate samples that are "similar" to the training examples, the goal of our generative model p &#952; is to generate samples with superior objective values than the offline observations. Relative to the data-generating distribution p data , these targeted samples with superior objective values are the "outliers". Therefore, from the learning perspective, our main challenge here is to learn a generative model that generates outliers rather than the norm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Learning a Generative Model with a Pre-selected Normalized Weight Function</head><p>To facilitate the learning of a desired generative model, in this paper we shall consider the standard technique of "re-weighting" <ref type="bibr">[12]</ref>. Roughly speaking, we shall consider a weight function that assigns higher weights to the domain points with better objective values and then train a generative model using the weighted offline examples. This allows us to tune the generative model towards generating samples with better objective values.</p><p>Formally, let q target (x) := w(f (x)) &#8226; p data (x) be a hypothetical target distribution, where w is a normalized, non-negative weight function such that E x&#8764;p data [ w(f (x))] = 1, and p data is the (unknown) data-generating distribution from which the offline observations x [m] := (x i : i &#8712; [m]) were drawn. In our approach, the hypothetical target distribution q target plays dual roles: On one hand, it serves as the hypothetical learning target of the generative model p &#952; ; on the other hand, it is also connected to the unknown data-generating distribution p data via the normalized weight function w and hence allows a generative model p &#952; to be learned from the offline data examples. Operationally, we would like to train a generative model p &#952; such that p &#952; &#8776; q target . But what would be a suitable choice for the normalized weight function w?</p><p>In this section, we shall focus on training a score-based model, which is mainly motivated by the following connection between the score function of the hypothetical target distribution q target and the gradient of the unknown objective function f :</p><p>where s target and s data are the score functions of q target and p data , respectively. If w is monotone increasing, the derivative w&#8242; (f (x)) &gt; 0 for all x &#8712; X . 
In this case, the score function s target is aligned with the gradient of the objective function f , so sampling along the direction of s target will naturally produce samples with high objective values.</p><p>In practice, the normalized weight function can be either pre-selected or learned from the offline data examples. In the former case, as we have seen, the generative model p &#952; can be learned via a loss function that is simply the weighted version of the loss function for training a standard generative model. In the latter case, however, a priori, it is unclear how to construct a loss function that would allow us to jointly learn a generative model p &#952; and a normalized weight function w.</p></div>
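<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the pre-selected re-weighting scheme of this section, the following sketch computes self-normalized exponential weights w(y) = exp(ψy)/Z from offline objective values. The function name and the empirical normalization by the sample mean are our own choices for illustration, not the paper's exact implementation.</p><p>

```python
import numpy as np

def exponential_weights(y, psi=20.0):
    """Self-normalized exponential weights w(y) = exp(psi * y) / Z.

    psi is the pre-selected hyper-parameter; Z is estimated by the
    empirical mean, so the weights average to one over the offline data.
    """
    y = np.asarray(y, dtype=float)
    raw = np.exp(psi * (y - y.max()))  # shift by max for numerical stability
    return raw / raw.mean()            # empirical normalization

# toy offline objective values f(x_i)
ys = np.array([0.1, 0.5, 0.9])
w = exponential_weights(ys, psi=5.0)   # increasing in y, mean exactly one
```

</p><p>Since w is monotonically increasing in y, higher-valued offline examples receive larger weights, skewing the hypothetical target distribution towards superior objective values.</p></div>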
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Jointly Learning a Generative Model and a Normalized Weight Function</head><p>To address the above challenge, in this section we shall start with the following natural optimization objective J opt (&#952;) for identifying a desired generative model p &#952; . The above natural optimization objective, however, cannot be evaluated for any &#952;, because the objective function f is unknown. Instead of trying to learn a surrogate on f and then use it to guide the training of the generative model, here we consider the more learning-theoretic approach of constructing a probably approximately correct (PAC) <ref type="bibr">[42]</ref> bound on J opt . Unlike J opt , which depends only on &#952;, the PAC bound depends on both &#952; and the normalized weight function w. As we shall see, not only it captures both the utility and learnability considerations for selecting w, it will also naturally suggest a surrogate loss function, from which both a generative model p &#952; and a normalized weight function w can be jointly learned from the offline data examples.</p><p>To construct an appropriate loss function, let us assume without loss of generality that our goal of optimization is to maximize an unknown objective function f . Our proposed loss function is based on the following PAC lower bound on the natural optimization objective J opt (&#952;): Incorporating proper changes to the PAC lower bound (1) leads to the following objective for jointly learning a (un-normalized) weight function w &#981; and a score-function model s &#952; t :</p><p>Details regarding the PAC lower bound (1), the surrogate loss function (2), as well as the training and sampling algorithms are elaborated in A.4, A.5, and A.6, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experimental Results</head><p>We assess the performance of the proposed learning algorithm using the four standard tasks (Superconductor, TF Bind 8, GFP, and UTR) from the Design-Bench benchmark <ref type="bibr">[46]</ref>. In addition, we have included the "Fluorescence" task from Fannjiang et al. <ref type="bibr">[18]</ref> for a comprehensive evaluation.</p><p>Evaluation. We generated a total of N = 128 designs for each task and subsequently computed the mean and standard deviation of the 100th percentile of the normalized ground truth over eight independent trials.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head><p>. The results are listed in Table <ref type="table">1</ref>, where D best denotes the normalized maximum objective value among the initial samples; "Grad" and "COMs" refer to <ref type="bibr">[45]</ref>; "CbAS" refers to <ref type="bibr">[11]</ref>; "&#968; = 20" refers to our proposed approach with a pre-selected weight function w(y) = exp(&#968;y) for &#968; = 20; and "&#945; = 0.25" refers to our proposed approach with a learnable weight function and hyper-parameters &#945; = 0.25, &#955; = 0.1. Results that fall within one standard deviation of the best performance are highlighted in bold.</p><p>Note that across all tasks, our proposed approaches demonstrate not only notable improvement over the best initial samples, but also consistently competitive performances against the other three prominent offline optimization algorithms. Quantitatively, our proposed approach with a learnable weight function achieves the highest average improvement over all five tasks, where the improvement over a specific task is defined as (y -D best )/D best . We believe that this superior consistency is rooted in our modeling perspective and the theoretically-grounded design of the learning algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>In this paper, we introduced a novel generative approach to offline optimization by shifting the focus from traditional function modeling to distribution modeling. This approach leverages the concept of sampling from a generative model rather than optimizing a surrogate of the unknown objective function. We proposed a PAC-bound framework that enables the joint learning of a generative model and a weight function, ensuring that the learning process is aligned with the optimization goals. Our empirical results, validated on standard offline optimization benchmarks, demonstrate the robustness and competitive performance of the proposed method. Future research could explore further refinements of the PAC-bound framework and its applications to more diverse optimization problems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendices</head><p>Contents List of Symbols A Technical Details A.1 The denoising diffusion probabilistic model (DDPM) . . . . . . . . . . . . . . . . A.2 Weighted learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.3 Trade-off between utility and learnability . . . . . . . . . . . . . . . . . . . . . . A.4 The main theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.5 From PAC lower bound to surrogate loss function . . . . . . . . . . . . . . . . . . A.6 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B Proof of the Main Theorem B.1 Wasserstein distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.2 Generalization bound for weighted learning . . . . . . . . . . . . . . . . . . . . . B.3 A distribution-dependent surrogate on the natural optimization objective . . . . . . B.4 Proof of Theorem A.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C Additional Experiments and Details C.1 A toy example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C.2 Benchmark datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C.3 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C.4 Additional experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . D Further Discussions List of Symbols The next list describes several symbols that are used within the entire body of the paper. 
α, λ Hyper-parameters (learnable case)
ℓ DSM Denoising score matching loss function
R Rademacher complexity
D best Maximum objective value within the offline dataset
N Normal distribution
p data Data-generating distribution
ψ Hyper-parameter (pre-selected case)
q target Target distribution (hypothetical)
w Normalized weight function
x Domain point, design, or feature vector
x [m] A set of offline domain points {x 1 , x 2 , . . . , x m }
x i Offline domain point
f Unknown objective function
f (x i ) Offline objective value
f θ Parameterized surrogate objective function
J opt Natural optimization objective
p θ Parameterized generative model
w φ Parameterized weight function (un-normalized)
W p p-Wasserstein distance
y Objective value, label, or score</p><p>A Technical Details</p><p>A.1 The denoising diffusion probabilistic model (DDPM)</p><p>For score-based generative models, we are particularly interested in the denoising diffusion probabilistic model (DDPM) <ref type="bibr">[44,</ref><ref type="bibr">24]</ref> due to its stability and performance on high-dimensional datasets.</p><p>Below we first recall a few essential results on the DDPM.</p><p>Consider a forward process that continuously injects white Gaussian noise into a signal x t : dx t = −(1/2)β(t)x t dt + √β(t) dw t , t ∈ [0, 1], (3) where β : [0, 1] → R ++ is a positive noise scheduler, w t is a standard Wiener process <ref type="bibr">[28]</ref>, and time in this process is assumed to flow in the forward direction from t = 0 to t = 1. Denote by q t the marginal distribution of x t from the forward process (3). The DDPM is mainly motivated by the fact that the marginal distributions q t , t ∈ [0, 1], can be recovered through the following reverse process <ref type="bibr">[3]</ref>: dx t = [−(1/2)β(t)x t − β(t)s θ t (x t )] dt + √β(t) dw̄ t , (4) where s θ t is a model of the score function of q t and w̄ t is (again) a standard Wiener process, but with time flowing backward from t = 1 to t = 0. 
More specifically, let p 1 be the initial distribution of x 1 in the reverse process (4) and let p θ t be the marginal distribution of x t , t ∈ [0, 1), from the reverse process (4). If we let p 1 = q 1 and let s θ t be the exact score function of q t for all t ∈ [0, 1], then p θ t = q t for all t ∈ [0, 1) <ref type="bibr">[10]</ref>. In particular, the initial distribution of the forward process q 0 can be recovered at the end of the reverse process via the score functions of q t , t ∈ [0, 1]. Thus, to learn the initial distribution of the forward process q 0 , it suffices to learn a model s θ t that approximates the score functions of q t for all t ∈ [0, 1].</p><p>There are several methods <ref type="bibr">[26,</ref><ref type="bibr">48,</ref><ref type="bibr">43</ref>] that allow a model s θ t to be learned from a training dataset drawn from q 0 . Here we focus on the denoising score matching (DSM) method due to its scalability to large datasets. For the DSM method, a model s θ t is learned by minimizing the DSM loss E x∼q 0 [ℓ θ DSM (x)], (5) where ℓ θ DSM (x) := E t,z [λ(t) ∥s θ t (x t ) + z/σ(t)∥ 2 ] (6) is the point-wise DSM loss, x t = √(1 − σ(t) 2 ) x + σ(t)z with z ∼ N (0, I), λ : [0, 1] → R ++ can be any positive function, and σ(t) 2 = 1 − exp(−∫ 0 t β(s) ds).</p><p>While the previous DSM loss can be easily estimated from a dataset drawn from q 0 and hence is very conducive to learning, a priori it is unclear how it would connect to any generative loss between q 0 and p θ 0 . Interestingly, it was shown in Theorem 2 and Corollary 3 of Kwon et al. 
<ref type="bibr">[38]</ref> that under some (relatively) mild conditions<ref type="foot">foot_2</ref> on &#946;, q 0 , and s &#952; , by choosing &#955;(t) = &#946;(t) we have</p><p>where W 2 (q, p) is the 2-Wasserstein distance <ref type="bibr">[1]</ref> between the probability distributions q and p, and c 0 and c 1 are constants that only depend on the choice of the noise scheduler &#946; and some prior knowledge on p 0 and s &#952; t but is independent of the model parameter &#952;. In Kwon et al. <ref type="bibr">[38]</ref>, this result was coined as "score-based generative modeling secretly minimizes the Wasserstein distance".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 Weighted learning</head><p>To learn a generative model p &#952; &#8776; q target , we shall let q 0 = q target and p 1 be the standard multivariate Gaussian distribution N . If we denote q 1 and p &#952; 0 by qtarget and p &#952; respectively, by <ref type="bibr">(7)</ref> we have</p><p>We mention here that the output distribution of the forward process qtarget is potentially dependent on the choice of w, even though this dependency is not explicit from the notation. In practice, however, qtarget can be made very close to the standard multivariate Gaussian distribution N with an appropriate choice of the noise scheduler &#946;. Therefore, the Wasserstein distance W 2 (q target , N ) is very small and is usually disregarded from the learning process.</p><p>To estimate the DSM loss E x&#8764;qtarget &#8467; &#952; DSM (x) from the offline data examples, note that by the definition of q target , we have</p><p>where ( <ref type="formula">9</ref>) is also known as "re-weighting" or importance sampling <ref type="bibr">[12]</ref>. By <ref type="bibr">(8)</ref>, minimizing the empirical weighted DSM loss <ref type="bibr">(10)</ref> can help to identify a generative model p &#952; &#8776; q target for a pre-selected normalized weight function w.</p><p>Finally, we note that scaling the normalized weight function w does not change the optimal solution that minimizes the empirical weighted DSM loss <ref type="bibr">(10)</ref>. Therefore, when using <ref type="bibr">(10)</ref> to identify a generative model p &#952; , one can use an un-normalized weight function instead of a normalized one.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3 Trade-off between utility and learnability</head><p>Intuitively, there are two considerations for selecting a normalized weight function w. On one hand, from the utility point of view, we would like to choose w such that the hypothetical target distribution q target focuses most of its densities on the domain points with superior objective values. This can be achieved, for example, by choosing w to be heavily skewed towards superior objective values.</p><p>On the other hand, from the learning viewpoint, the generative model p &#952; is learned from the offline observations, which were generated from the unknown data-generating distribution p data . If w is chosen to be heavily skewed, the hypothetical target distribution q target then becomes very different from the data-generating distribution p data . In this case, learning the generative model p &#952; from the offline data examples may be subject to very high sample complexity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.4 The main theorem</head><p>Our proposed loss function is based on the following PAC lower bound on the natural optimization objective:</p><p>Theorem A.1. Assume that: i) the unknown objective function f is K-Lipschitz and satisfies |f (x)| &#8804; F for all x &#8712; X ; ii) the generative model p &#952; is a DDPM; iii) the point-wise DSM loss &#8467; &#952; DSM satisfies 0 &#8804; &#8467; &#952; DSM (x) &#8804; &#8710; for all x &#8712; X and all &#952; &#8712; &#920;; and iv) the conditions from Section 3.1 of Kwon et al. <ref type="bibr">[38]</ref> on the noise scheduler &#946;, the data-generating distribution p data , the normalized weight function w, and the score-function model s &#952; t are satisfied. Let W be the collection of all normalized weight functions w that are L-Lipschitz and satisfy 0 &#8804; w(y) &#8804; B for any y &#8712; [-F, F ]. With probability &#8805; 1 -&#948;, we have for any w &#8712; W and any &#952; &#8712; &#920;,</p><p>where</p><p>is the empirical utility of w,</p><p>is the empirical weighted DSM loss of s &#952; t ,</p><p>is the empirical variance of w,</p><p>is the empirical Rademacher complexity with respect to the parameter family &#920;, and the last term</p><p>is independent of the model parameter &#952; and w.</p><p>The proof of the above theorem is long and technical and hence is deferred to next section to enhance the flow of the paper. 
We mention here that the proof utilizes several key technical results, including the Kantorovich-Rubinstein duality theorem <ref type="bibr">[2]</ref> for the 1-Wasserstein distance <ref type="bibr">[4]</ref>, the fact that "score-based generative modeling secretly minimizes the Wasserstein distance" <ref type="bibr">[38]</ref>, the Rademacher bound for bounded functions <ref type="bibr">[42]</ref>, the Wasserstein contraction property <ref type="bibr">[9]</ref>, and the covering bound for Lipschitz functions <ref type="bibr">[47]</ref>.</p><p>Note that to maximize the PAC lower bound (1), we need to simultaneously maximize the utility of w and minimize the weighted DSM loss of s θ t and the variance of w. Therefore, the PAC lower bound (1) captures both the utility and learnability considerations for selecting a normalized weight function w.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.5 From PAC lower bound to surrogate loss function</head><p>To jointly learn a generate model p &#952; and a normalized function w, first note that the last two terms of the PAC lower bound (1) are independent of &#952; and w and hence can be ignored from the learning objective. The forth term is due to the "initial" sampling error of the reverse process. As discussed previously in Section A.2, while this term is potentially dependent on the normalized weight function w, in practice it can be made very small by choosing an appropriate noise scheduler &#946; and hence will be ignored from our learning objective.</p><p>To make the first three terms learnable, we consider the following two modifications to the bound.</p><p>First, the coefficients c 0 K and c 0 K &#8730; 2&#8710; in the second and the third term require some prior knowledge on the unknown data-generating distribution p data and the unknown objective function f . In practice, we replace them by two hyper-parameters &#955; and &#945;, respectively. We mention here that the hyper-parameter &#945; plays a particular important role in the learning objective, as it controls the utility-learnability tradeoff for selecting a normalized weight function w.</p><p>Second, the weight function w needs to be normalized with respect to the unknown data-generating distribution p data and the unknown objective function f . In practice, we let w(&#8226;) =</p><p>where w &#981; is an un-normalized weight function parameterized by a second parameter &#981;, and</p><p>] is the normalizing constant. While the exact calculation of Z &#981; again requires the knowledge of p data and f , in practice it can be easily estimated from the offline data examples as &#7824;&#981; = 1 m m i=1 w &#981; (f (x i )).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.6 Algorithms</head><p>Algorithm 1 TRAINING 1: Input: Offline examples (x i , f (x i )); hyper-parameters &#945;, &#955;; learning-rate parameters &#951; 1 , &#951; 2 . 2: General step:</p><p>Algorithm 2 SAMPLING/OPTIMIZATION</p><p>1: Input: Score function model s &#952; * t (x), number of samples N , number of steps T , noise scheduler parameters (&#946; min , &#946; max ), and</p><p>T , . . . , x</p><p>end for 8: end for 9: Output: Optimized samples x</p><p>(1) 0 , x (2) 0 , . . . , x (N ) 0 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Proof of the Main Theorem</head><p>In this section, we introduce a few technical results, which will lead to the proof of the main theorem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1 Wasserstein distance</head><p>Let &#181; and &#957; be two probability distributions on R d . A coupling &#947; between &#181; and &#957; is a joint distribution on R d &#215; R d whose marginals are &#181; and &#957;. The p-Wasserstein distance between &#181; and &#957; (with respect to the Euclidean norm) is given by <ref type="bibr">[1]</ref>:</p><p>where &#915;(&#181;, &#957;) is the set of all couplings between &#181; and &#957;, and &#8741; &#8226; &#8741; denotes the standard Euclidean norm.</p><p>The 1-Wasserstein distance, also known as the earth mover's distance, has an important equivalent representation that follows from the duality theorem of Kantorovich-Rubenstein <ref type="bibr">[2]</ref>:</p><p>where &#8741; &#8226; &#8741; Lip denotes the Lipschitz norm. In our construction, this dual representation of the 1-Wasserstein distance serves as the bridge between the objective-specific generative loss and the generic generative loss.</p><p>By the standard Jensen's inequality <ref type="bibr">[16]</ref>, we also have</p><p>for any two distributions &#181; and &#957;. As we shall see, this simple relationship between the 1-Wasserstein and 2-Wasserstein distances can help to further connect the objective-specific generative loss to the DSM loss for training a score-based generative model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 Generalization bound for weighted learning</head><p>Let &#8467; &#952; : X &#8594; R be a bounded loss function parameterized by &#952; &#8712; &#920; such that 0 &#8804; &#8467; &#952; (x) &#8804; &#8710; for all x &#8712; X and all &#952; &#8712; &#920;. Consider the problem of estimating the expected weighted loss</p><p>where w : X &#8594; R + is a normalized, bounded weight function such that E x&#8764;p [ w(x)] = 1 and 0 &#8804; w(x) &#8804; B for all x &#8712; X . We have the following PAC upper bound, with respect to the parameter family &#920;, on the expected weighted loss L p (&#952;, w) for any given w. Lemma B.1. For any given w, with probability &#8805; 1 -&#948; we have for any &#952; &#8712; &#920;</p><p>where</p><p>2 is the empirical variance of w over x [m] , and</p><p>Proof. By assumption, we have 0 &#8804; w(x) &#8804; B for any x &#8712; X and 0 &#8804; &#8467; &#952; (x) &#8804; &#8710; for any x &#8712; X and any &#952; &#8712; &#920;. It follows immediately that the weighed loss function w(x)&#8467; &#952; (x) satisfies 0 &#8804; w(x)&#8467; &#952; (x) &#8804; B&#8710; for any x &#8712; X and any &#952; &#8712; &#920;. Applying the standard Rademacher bound to the weighted loss function class ( w(x)&#8467; &#952; (x) : &#952; &#8712; &#920;), with probability &#8805; 1 -&#948; we have for any</p><p>where</p><p>is the empirical weighted Rademacher complexity <ref type="bibr">[42]</ref> with respect to the parameter family &#920; over x <ref type="bibr">[m]</ref> . 
The empirical weighted Rademacher complexity R w x [m] (Θ) can be further bounded from above as:</p><p>Further note that</p><p>for any θ ∈ Θ and any realization of σ [m] , where the first inequality follows from the standard Cauchy-Schwarz inequality, the second equality follows from the fact that the square of a Rademacher variable takes a constant value of 1, and the last inequality follows from the assumption that 0 ≤ ℓ θ (x) ≤ Δ for any x ∈ X and any θ ∈ Θ. It follows immediately that</p><p>Substituting ( <ref type="formula">23</ref> ) into <ref type="bibr">(22)</ref> gives</p><p>where</p><p>and</p><p>is the empirical (un-weighted) Rademacher complexity with respect to the parameter family Θ over x <ref type="bibr">[m]</ref> . Substituting <ref type="bibr">(24)</ref> into <ref type="bibr">(21)</ref> completes the proof of <ref type="bibr">(20)</ref> and hence Lemma B.1.</p><p>The main insight from the above lemma is that the generalization error (with respect to the parameter θ) between the expected weighted loss L p (θ, w) and the empirical weighted loss L̂ x [m] (θ, w) can be controlled by controlling the complexity of the model class Θ and the variance of the normalized weight function w.</p></div>
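<div xmlns="http://www.tei-c.org/ns/1.0"><p>The empirical Rademacher complexity appearing in Lemma B.1 can be estimated by Monte Carlo; in the following sketch a finite hypothesis class stands in for the parameter family Θ, which is our illustrative simplification rather than the paper's construction:</p><p>

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(loss_matrix, n_draws=2000):
    """Monte Carlo estimate of the empirical Rademacher complexity
    (1/m) * E_sigma sup_j sum_i sigma_i * loss_matrix[j, i], where row j
    holds the losses of hypothesis j on the m examples."""
    n_hyp, m = loss_matrix.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += np.max(loss_matrix @ sigma) / m  # sup over hypotheses
    return total / n_draws

# a single constant-zero hypothesis has zero complexity
flat = np.zeros((1, 4))
r0 = empirical_rademacher(flat, n_draws=10)
```

</p></div>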
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 A distribution-dependent surrogate on the natural optimization objective</head><p>Proposition B.2. Assume that the unknown objective function f is K-Lipschitz and the generative model p &#952; is a DDPM. Under the assumptions from Section 3.1 of Kwon et al. <ref type="bibr">[38]</ref> on the noise scheduler &#946;, the data-generating distribution p data , the normalized weight function w, and the score-function model s &#952; t , we have</p><p>expected weighted DSM loss</p><p>where &#8467; &#952; DSM is the point-wise DSM loss of s &#952; t as defined in <ref type="bibr">(6)</ref>, qtarget is the output distribution of the forward process (3), &#934; is the standard Gaussian distribution, and c 0 and c 1 are constants that are independent of the model parameter &#952; and w.</p><p>Proof. We start by writing J opt (&#952;) as:</p><p>By the definition of q target :</p><p>where the first inequality follows directly from the assumption that f is K-Lipschitz, and the second equality follows from the dual representation (30) of the 1-Wasserstein distance.</p><p>Under the assumption that p &#952; is a DDPM, we can further bound the 1-Wasserstein distance W 1 (q target , p &#952; ) as:</p><p>where the first inequality follows from <ref type="bibr">(19)</ref>, and the second inequality follows from <ref type="bibr">(8)</ref> and the definition of q target in <ref type="bibr">(27)</ref>. Substituting ( <ref type="formula">28</ref>), <ref type="bibr">(30)</ref>, and (31) into <ref type="bibr">(26)</ref> completes the proof of <ref type="bibr">(25)</ref>.</p><p>Next, we shall convert the distribution-dependent surrogate on the right-hand side of (25) into a PAC lower bound using standard complexity theory.</p></div>
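The Lipschitz step in the proof above is an instance of the Kantorovich–Rubinstein duality for the 1-Wasserstein distance; for reference, that step can be written as:

```latex
% Kantorovich--Rubinstein duality:
%   W_1(q, p) = \sup_{g:\,\mathrm{Lip}(g) \le 1}
%               \mathbb{E}_{x \sim q}[g(x)] - \mathbb{E}_{x \sim p}[g(x)].
% Applying it with g = f/K (so that Lip(g) <= 1 when f is K-Lipschitz) gives
\left|\, \mathbb{E}_{x \sim q_{\mathrm{target}}}[f(x)]
       - \mathbb{E}_{x \sim p_{\theta}}[f(x)] \,\right|
\;\le\; K \, W_1\!\left(q_{\mathrm{target}},\, p_{\theta}\right).
```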
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.4 Proof of Theorem A.1</head><p>For the proof, we shall write q target and qtarget as q w target and q w target respectively, to emphasize their dependencies on the normalized weight function w. Let us first recall from Proposition B.2 that the natural optimization objective J opt (&#952;) can be bounded from below as: </p><p>where</p><p>is the empirical utility of w. Next, by Lemma B.1, with probability &#8805; 1 -&#948; &#8242; /2 we have for any &#952; &#8712; &#920;</p><p>where</p><p>2 is the empirical variance of w over x [m] , and</p><p>is the empirical Rademacher complexity with respect to the parameter family &#920; over x <ref type="bibr">[m]</ref> . Substituting <ref type="bibr">(33)</ref> and ( <ref type="formula">34</ref>) into <ref type="bibr">(32)</ref>, with probability &#8805; 1 -&#948; &#8242; we have for any &#952; &#8712; &#920;</p><p>To remove the conditioning on w, let W&#1013; be an &#1013;-cover <ref type="bibr">[47]</ref> of W under the L &#8734; norm. By <ref type="bibr">(35)</ref>, for any given &#7805; &#8712; W&#1013; , with probability &#8805; 1 -&#948; &#8242; we have for any &#952; &#8712; &#920;</p><p>By the union bound, with probability &#8805; 1 -| W&#1013; |&#948; &#8242; we have for any &#952; &#8712; &#920; and any &#7805; &#8712; W&#1013;</p><p>Choosing &#948; &#8242; = &#948;/| W&#1013; |, with probability &#8805; 1 -&#948; we have for any &#952; &#8712; &#920; and any &#7805; &#8712; W&#1013;</p><p>By the definition of &#1013;-cover, for any w &#8712; W, there exists an &#7805; &#8712; W&#1013; such that | w(f (x))-&#7805;(f (x))| &#8804; &#1013; for any x &#8712; X . 
Note that this immediately implies that:</p><p>where the last inequality follows from the assumption that |f (x)| &#8804; F for any x &#8712; X ;</p><p>where the last inequality follows from the assumption that 0 &#8804; &#8467; &#952; DSM (x) &#8804; &#8710; for any x &#8712; X ;</p><p>where the last inequality follows from the assumption that 0 &#8804; w(f (x)) &#8804; B for any x &#8712; X ; and</p><p>where c 2 is the Wasserstein contraction constant <ref type="bibr">[9]</ref> of the forward process (3), d 2 (X ) := max x,x &#8242; &#8712;X &#8741;x -x &#8242; &#8741; 2 is the diameter of X with respect to the &#8467; 2 norm, and d TV (q &#7805; target , q w target ) denotes the total variation distance between q &#7805; target and q w target . Here, the first inequality follows from the fact that the 2-Wasserstein distance is a metric and hence satisfies the triangle inequality, the second inequality follows from the Wasserstein contraction property <ref type="bibr">[9]</ref> of the forward process, and the third inequality follows from the total-variation bound <ref type="bibr">[20]</ref> on the 2-Wasserstein distance. Substituting ( <ref type="formula">37</ref>)-( <ref type="formula">40</ref>) into <ref type="bibr">(36)</ref>, with probability &#8805; 1 -&#948; we have for any &#952; &#8712; &#920; and any w &#8712; W</p><p>By assumption, any normalized weight function from W is L-Lipschitz and bounded by B. Therefore, the covering number | W&#1013; | is of the order O(exp(1/&#1013;)). Let &#1013; = m -&#947; for some &#947; &#8712; (0, 1), and we have from (41)</p><p>Choosing &#947; = 1/2 in (42) completes the proof of (1) and hence Theorem A.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Additional Experiments and Details C.1 A toy example</head><p>We first experimentally validate the proposed learning algorithm using a toy example in R 2 . In this example, the unknown objective function f is a mixture of two Gaussian density functions:</p><p>where</p><p>, and the unknown data-generating distribution p data is a mixture of two Gaussians.</p><p>The filled contour plot of the objective function f is shown in Figure <ref type="figure">2a</ref>, with warmer colors representing higher objective values. Figure <ref type="figure">2b</ref> shows 300 samples drawn from the data-generating distribution p data , with the color of each sample rendered according to its ground-truth objective value. These 300 samples and their corresponding objective values are the offline examples from which the weight function w &#981; and the score-function model s &#952; are trained. Figure <ref type="figure">3</ref> shows the optimized samples and the learned weight function w &#981; for several different values of the hyper-parameter &#945; while fixing the hyper-parameter &#955; = 0.1.</p><p>The legitimacy of the proposed approach is demonstrated by the following observations. i) Even though the weight function is not constrained to be monotonic a priori, as shown in the bottom row of Figure <ref type="figure">3</ref>, the learned weight functions are monotone increasing and hence assign higher weights to samples with higher objective values. ii) When &#945; = 1, the learned weight function is relatively "flat" across its input domain. As a result, the learned generative model is very close to the data-generating distribution, and the optimized samples are very "similar" to the initial samples. As we decrease the value of &#945; from 1 to 0.3, the learned weight function becomes much more skewed towards the higher input values. 
As a result, some of the optimized samples have been nudged along the direction of the gradient of the objective function and hence have much higher objective values than the initial samples. When we further decrease the value of &#945; to 0, the learned weight function becomes extremely skewed. In this case, the hypothetical target distribution is not learnable. As a result, instead of the gradient direction, the optimized samples have been nudged in all directions. Therefore, the hyper-parameter &#945; can effectively control the utility-learnability tradeoff for selecting a weight function w &#981; . iii) Compared to the data-generating distribution, with an appropriate choice of the hyper-parameters &#945; and &#955;, the learned generative model is substantially more capable of generating samples with higher objective values, as demonstrated by the differences in the 50th, 80th, and 100th percentiles between the samples drawn from these two distributions. More results on this toy example can be found in Appendix C.4.1.</p></div>
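The toy setup above can be sketched as follows. The exact means and covariances used in the paper are not reproduced in this appendix, so the parameters below are purely hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical component means -- the paper's exact parameters are not given here.
mu1, mu2 = np.array([-1.0, 0.0]), np.array([1.5, 1.0])

def f(x):
    """Toy objective: mixture of two (isotropic, unit-variance) Gaussian densities in R^2."""
    d1 = np.exp(-0.5 * np.sum((x - mu1) ** 2, axis=-1))
    d2 = np.exp(-0.5 * np.sum((x - mu2) ** 2, axis=-1))
    return 0.5 * d1 + 0.5 * d2

# 300 offline examples drawn from a (hypothetical) two-component Gaussian p_data.
n = 300
comp = rng.integers(0, 2, size=n)
centers = np.where(comp[:, None] == 0, np.array([-2.0, -1.0]), np.array([0.5, 2.0]))
X = centers + rng.normal(scale=0.7, size=(n, 2))
y = f(X)  # ground-truth objective values attached to the offline examples
```

These `(X, y)` pairs play the role of the 300 offline examples from which the weight function and score model are trained.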
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2 Benchmark datasets</head><p>We conducted experiments on five standard offline optimization tasks:</p><p>&#8226; Superconductor, which aims to design a superconductor with 86 components to maximize the critical temperature; &#8226; TF (Transcription Factor) Bind 8, which aims to find a DNA sequence of 8 base pairs to maximize its binding affinity to a particular transcription factor; &#8226; GFP (Green Fluorescent Protein), which aims to find a protein sequence of 238 amino acids to maximize the fluorescence; &#8226; UTR (Untranslated Region), which aims to find a human 5' UTR DNA sequence of 50 base pairs to maximize the expression level of its corresponding gene; &#8226; Fluorescence, which aims to identify a protein with high brightness. At each position, the selection of an amino acid is limited to those found in the sequences of the two parent fluorescent proteins. These parent proteins differ at precisely 13 positions in their sequences while being identical at all other positions.</p><p>For all of the above tasks except Fluorescence, we utilized the Design-Bench package <ref type="bibr">[46]</ref> to generate the training data, pre-process the data (including the conversion of categorical features to numerical values), and evaluate new designs. For the Fluorescence task, we collected raw data from Fannjiang et al. <ref type="bibr">[18]</ref>. The objective value in this case is represented by the combined brightness. From a total of 2^13 = 8192 samples, we selected the worst 4096 examples as our training dataset. While the features in the Fluorescence dataset are binary, we simply treated them as continuous inputs to our algorithm.</p><p>The key attributes of the aforementioned benchmark datasets can be found in Table <ref type="table">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.3 Implementation details</head><p>Networks. 
In our implementation, we used neural networks to model both the (un-normalized) weight function w &#981; and the score function s &#952; t . The weight function is a simple scalar function, for which we used a 4-layer multi-layer perceptron (MLP) with ReLU activation functions. In addition, we applied an exponential function to the output of the MLP to enforce the non-negativity of the weight function. The architecture for the score function model consists of a time-embedding layer and five blocks of "Dense-BatchNorm-ELU". Before each block, we injected time-embedding information by concatenating it with the input to the block.</p><p>Training. The noise scheduler for the DDPM was chosen as &#946;(t) = &#946; min + (&#946; max -&#946; min )t for t &#8712; [0, 1], where &#946; min = 0.1 and &#946; max = 20. The detailed training procedure is described in Algorithm 1. The training scheme involves first identifying a suitable initialization of &#981; and &#952;, followed by an alternating maximization over &#981; and &#952;. More specifically, to obtain a suitable initialization of &#981; and &#952;, we first note that the model &#952; appears only in the second term of our learning objective <ref type="bibr">(2)</ref>. Maximizing the other two terms over &#981; gives us an initial estimate &#981; 0 (see Line 3 of Algorithm 1). In our implementation, this maximization was performed via full-batch gradient descent (GD), for which we used the Adam optimizer <ref type="bibr">[30]</ref> with a constant learning rate of 10^-3 . Once an initial estimate &#981; 0 has been obtained, we can obtain an initial estimate &#952; 0 by minimizing the second term over &#952; while setting &#981; = &#981; 0 (see Line 4 of Algorithm 1). To minimize the weighted denoising score matching loss, we considered a time range of t &#8712; [10^-3 , 1] and used the Adam optimizer with a variable learning rate via stochastic gradient descent (SGD). 
The learning rate was gradually decreased from 10^-3 to 10^-4 during training. The alternating maximization of the learning objective (2) over the parameters &#981; and &#952; is described in Lines 5-8 of Algorithm 1. Again, the Adam optimizer was used, with learning rates &#951; 1 = &#951; 2 = 10^-4 .</p><p>Sampling/Optimization. The sampling/optimization procedure is described in Algorithm 2. This procedure is identical to the probability-flow sampler in Song et al. <ref type="bibr">[44]</ref>.</p></div>
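The noise schedule and the positivity trick for the weight network can be sketched as follows. The `beta` values match those stated above; treating the exponential as acting on a scalar MLP output is our simplification (the MLP itself is omitted):

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0  # values used in the training setup above

def beta(t):
    """Linear DDPM noise schedule: beta(t) = beta_min + (beta_max - beta_min) * t, t in [0, 1]."""
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t

def positive_weight(mlp_output):
    """Exponentiating the real-valued MLP output enforces a non-negative weight."""
    return np.exp(mlp_output)
```

The exponential output guarantees w > 0 everywhere without constraining the MLP's parameters, which is why the paper applies it on top of the ReLU MLP.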
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.4 Additional experimental results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.4.1 Toy example</head><p>Additional choices of the hyper-parameter &#945;. Previously in Section C.1, we described a toy example in R 2 and used it to validate our proposed approach. In particular, in Figure <ref type="figure">3</ref> we reported the optimized samples and the learned weight function w &#981; * for several choices of the hyper-parameter &#945;. Here in Figures <ref type="figure">4</ref> and <ref type="figure">5</ref> we report the optimized samples and the learned weight function w &#981; * for additional choices of the hyper-parameter &#945;. Note that when &#945; = &#8734;, the learned weight function w &#981; * is completely flat across its domain, and thus the hypothetical target distribution q target is identical to the data-generating distribution p data . These results make clear that the hyper-parameter &#945; can effectively control the utility-learnability tradeoff for selecting a weight function.</p><p>Pre-selected weight function. Instead of considering a learnable weight function w &#981; , we may also consider using a pre-selected weight function to train the generative model p &#952; . Motivated by the learned weight functions w &#981; * from Figure <ref type="figure">5</ref>, here we consider the simple exponential function w &#968; (y) = exp(&#968;y), where &#968; is a hyper-parameter. Note that when &#968; = 0, the weight function w &#968; is completely flat across its domain, and as we increase the value of &#968;, w &#968; becomes increasingly skewed towards the higher values in its domain. The optimized samples and the corresponding pre-selected weight functions are reported in Figures <ref type="figure">6</ref> and <ref type="figure">7</ref>. Note here that we have purposely chosen the values of the hyper-parameter &#968; such that the pre-selected weight functions w &#968; in Figure 7 mimic the learned weight function w &#981; * in Figure 5. 
As a result, the optimized samples from Figure 6 have similar statistical profiles to those from Figure 4. Next, we shall use the benchmark datasets to illustrate that a learnable weight function can significantly outperform a pre-selected weight function in terms of generating samples with a consistent and superior statistical profile. </p></div>
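For concreteness, a minimal sketch of the pre-selected exponential weight, with the empirical self-normalization the method requires (the particular &#968; values below are illustrative, not the ones used in the figures):

```python
import numpy as np

def exp_weights(y, psi):
    """Pre-selected weight w_psi(y) = exp(psi * y), self-normalized so the
    empirical mean weight over the offline labels equals 1."""
    w = np.exp(psi * np.asarray(y, dtype=float))
    return w / w.mean()

y = np.linspace(0.0, 1.0, 5)       # hypothetical normalized objective values
w_flat = exp_weights(y, psi=0.0)   # psi = 0: completely flat weighting
w_skew = exp_weights(y, psi=5.0)   # larger psi: skewed toward high objective values
```

Increasing `psi` concentrates the (unit-mean) weight on the highest-labeled examples, mirroring the behavior described above.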
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.4.2 Benchmark datasets</head><p>Here we report additional results on the benchmark datasets using both the learnable weight function w &#981; and the pre-selected exponential weight function w &#968; . In our experiments, we fixed the value of the hyper-parameter &#955; = 0.1 and considered several different values for the hyper-parameter &#945; (learnable weight function) and &#968; (pre-selected weight function). The mean and standard deviation of the best generated samples for each benchmark dataset are reported in Table <ref type="table">3</ref>. The average improvements for different choices of the hyper-parameter &#945; (learnable weight function) and &#968; (pre-selected weight function) are also reported in Table <ref type="table">3</ref>. It is clear that the use of a learnable weight function with &#945; = 0.25 significantly outperforms any pre-selected weight function considered here in terms of the average improvement. The learned weight functions w &#981; * that correspond to &#945; = 0.25 for each of the benchmark datasets are reported in Figure <ref type="figure">8</ref>.</p><p>On the top row of Figure <ref type="figure">8</ref>, the dashed lines represent pre-selected weight functions, while the solid line represents the learned weight function for each of the benchmark datasets. It is evident that the learned weight functions differ significantly from the pre-selected exponential weight functions. The label histograms for each of the benchmark datasets are shown on the bottom row of Figure <ref type="figure">8</ref>. Noticeably, the labels of the Superconductor and Fluorescence datasets are heavily skewed towards small objective values. Coincidentally, their learned weight functions tend to flatten when the objective values are large (roughly 0.8-1.0). 
Given that the majority of data examples in these datasets have small objective values, assigning higher weights to large objective values would result in a smaller "effective sample size" in weighted learning. Therefore, it makes sense that the weight functions learned on these two datasets adopt a less steep slope to maintain a reasonable effective sample size. This observation underscores that our learning objective (2) can adaptively choose an appropriate weight function to balance the utility-learnability tradeoff. Thus, to fully harness the potential of the proposed generative approach, it is crucial to make the weight function learnable as well.</p><p>In all the experiments involving a learnable weight function, we consistently set &#955; to 0.1. We also conducted additional experiments to investigate the impact of the hyper-parameter &#955; on the results. The findings, as depicted in Figure <ref type="figure">9</ref>, reveal that &#955; proves to be a relatively insensitive hyper-parameter within our method. Across all datasets, variations in &#955; ranging from 0.01 to 1.0 result in only very minor differences in optimization performance.</p></div>
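The "effective sample size" intuition above can be made concrete with the standard Kish formula from importance sampling (our illustration; the paper does not specify this exact diagnostic):

```python
import numpy as np

def effective_sample_size(w):
    """Kish effective sample size (sum w)^2 / sum(w^2) for non-negative weights w."""
    w = np.asarray(w, dtype=float)
    return float(w.sum() ** 2 / np.sum(w ** 2))

m = 1000
uniform = np.ones(m)                              # flat weights: ESS = m
skewed = np.exp(5.0 * np.linspace(0.0, 1.0, m))   # steep exponential weights: ESS << m
```

A steeper weight function concentrates mass on the few highest-labeled examples, shrinking the effective sample size and loosening the generalization bound; this is exactly the utility-learnability tradeoff the learned weight function balances.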
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Further Discussions</head><p>To further elucidate the main contributions of this paper, we put the proposed PAC-generative approach to offline optimization in the context of several related works.</p><p>Table <ref type="table">3</ref>: Mean and standard deviation of the best generated samples for different choices of the hyper-parameter &#945; (learnable weight function) and &#968; (pre-selected weight function).</p><p>Modeling target distribution vs. modeling objective function. As mentioned previously in Section 1, the "standard" approach to offline optimization is to first learn a surrogate of the unknown objective function and then apply existing optimization algorithms. The main challenge for modeling the objective function is the so-called distributional shift. That is, when the optimization algorithm explores regions away from the offline observations, the learned surrogate tends to become less accurate. It is thus crucial to understand how far the optimization algorithm can explore away from the offline observations and how to maintain the accuracy of the learned surrogate throughout the exploration process. Notable efforts in the literature include Qi et al. <ref type="bibr">[41]</ref> and Trabucco et al. <ref type="bibr">[45]</ref>, which considered regularized surrogate models in favor of invariance and conservatism; Fannjiang and Listgarten <ref type="bibr">[17]</ref> and Chen et al. <ref type="bibr">[13]</ref>, which considered surrogate models learned via importance sampling and contrastive learning; and Fannjiang et al. <ref type="bibr">[18]</ref>, which used conformal prediction to quantify the uncertainty of the learned surrogate.</p><p>Despite these efforts, however, it remains unclear how to align the objective of learning a surrogate of the unknown objective function with the objective of optimization. 
This is evidenced by the very recent work Beckham and Pal <ref type="bibr">[7]</ref>, which discussed how one may interpret the conservative approach proposed in Trabucco et al. <ref type="bibr">[45]</ref>, and Beckham et al. <ref type="bibr">[8]</ref>, which suggested that an alternative evaluation metric is potentially better than simply choosing the best candidates using the learned surrogate. In contrast, the PAC-generative approach proposed in this paper is based on modeling a target distribution (as opposed to the objective function). As we have shown, under this generative view, it is possible to tune the objective of the learner according to a natural optimization objective.</p><p>Weighted learning vs. conditional/guided generation. Recent years have seen remarkable success in conditional/guided image generation <ref type="bibr">[15,</ref><ref type="bibr">23]</ref>. Conditional/guided generation can be easily adapted to offline optimization. Specifically, one can simultaneously learn a standard score-based generative model and a surrogate of the objective function and then use the gradient of the learned surrogate to guide the generation of the optimized samples <ref type="bibr">[40]</ref>. Alternatively, one may also model the target distribution q target as the conditional distribution p data given f (x) &#8805; y 0 for some threshold y 0 and train a generative model that approximates this conditional distribution <ref type="bibr">[11,</ref><ref type="bibr">22]</ref>. However, learning the conditional distribution may also require a surrogate of the objective function. In contrast, in our approach we model the target distribution q target using a weight function. As discussed in Section 2.1, in our weighted-learning model, the score of q target is intrinsically aligned with the gradient of the objective function. 
Hence we directly train a generative model from the offline data examples to learn the score of q target , and there is no need to learn a surrogate of the objective function separately.</p><p>Offline optimization vs. offline reinforcement learning (RL). While the focus of this paper is offline optimization, recent years have also seen a substantial amount of interest in offline RL <ref type="bibr">[36,</ref><ref type="bibr">49,</ref><ref type="bibr">27,</ref><ref type="bibr">50]</ref>. Even though these two problems face some similar challenges, in our assessment, offline RL is the considerably more challenging setting. It is thus of interest to see whether the proposed PAC-generative approach can lead to any success in offline RL as well.</p><p>D3S3@NeurIPS Paper Checklist (Optional)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Claims</head><p>Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?</p><p>Answer: <ref type="bibr">[Yes]</ref> Justification: We derived a PAC-style lower bound on the optimization objective and learned generative models based on the resulting surrogate loss function.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the abstract and introduction do not include the claims made in the paper. &#8226; The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. &#8226; The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. &#8226; It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Limitations</head><p>Question: Does the paper discuss the limitations of the work performed by the authors?</p><p>Answer: <ref type="bibr">[Yes]</ref> Justification: We do not have a separate 'Limitations' section, but Appendix D provides a thorough discussion comparing our method with others.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper has no limitations, while the answer No means that the paper has limitations, but those are not discussed in the paper. &#8226; The authors are encouraged to create a separate "Limitations" section in their paper.</p><p>&#8226; The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. &#8226; The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. &#8226; The authors should reflect on the factors that influence the performance of the approach.</p><p>For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. &#8226; The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. &#8226; If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. 
&#8226; While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Theory Assumptions and Proofs</head><p>Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Open Access to Data and Code</head><p>Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?</p><p>Answer: [Yes] Justification: Our code includes a README file with instructions to run the code.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not include experiments requiring code.</p><p>&#8226; Please see the NeurIPS code and data submission guidelines (<ref type="url">https://nips.cc/public/guides/CodeSubmissionPolicy</ref>) for more details. &#8226; While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). &#8226; The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (<ref type="url">https://nips.cc/public/guides/CodeSubmissionPolicy</ref>) for more details. &#8226; The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. &#8226; The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. &#8226; At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). 
&#8226; Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Experimental Setting/Details</head><p>Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?</p><p>Answer: [Yes] Justification: Please see Appendix C.3 and the source code.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not include experiments.</p><p>&#8226; The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. &#8226; The full details can be provided either with the code, in appendix, or as supplemental material.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Experiment Statistical Significance</head><p>Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?</p><p>Answer: [Yes] Justification: We ran our experiments for 8 trials and considered the statistical numbers as final criteria.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not include experiments.</p><p>&#8226; The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. &#8226; The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).</p><p>&#8226; The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) &#8226; The assumptions made should be given (e.g., Normally distributed errors). &#8226; It should be clear whether the error bar is the standard deviation or the standard error of the mean. &#8226; It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. &#8226; For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). &#8226; If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Experiments Compute Resources</head><p>Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?</p><p>Answer: <ref type="bibr">[No]</ref> Justification: Our experiments do not require heavy computing resources; a normal PC with a GPU can run this code efficiently.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not include experiments.</p><p>&#8226; The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. &#8226; The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. &#8226; The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Code Of Ethics</head><p>Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics <ref type="url">https://neurips.cc/public/EthicsGuidelines</ref>?</p><p>Answer: [Yes] Justification: We anonymized our code in order to comply with the Code of Ethics.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.</p><p>&#8226; If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.</p><p>&#8226; The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10.">Broader Impacts</head><p>Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?</p><p>Answer: [No] Justification: This work focuses solely on methodology.</p><p>Guidelines:</p><p>&#8226; The answer NA means that there is no societal impact of the work performed.</p><p>&#8226; If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.</p><p>&#8226; Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.</p><p>&#8226; The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.</p><p>&#8226; The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.</p><p>&#8226; If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="11.">Safeguards</head><p>Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?</p><p>Answer: [NA] Justification: The paper poses no such risks.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper poses no such risks.</p><p>&#8226; Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.</p><p>&#8226; Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.</p><p>&#8226; We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="12.">Licenses for Existing Assets</head><p>Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?</p><p>Answer: [Yes] Justification: We have properly mentioned or cited assets owned by others.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not use existing assets.</p><p>&#8226; The authors should cite the original paper that produced the code package or dataset.</p><p>&#8226; The authors should state which version of the asset is used and, if possible, include a URL.</p><p>&#8226; The name of the license (e.g., CC-BY 4.0) should be included for each asset.</p><p>&#8226; For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.</p><p>&#8226; If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>Throughout the paper, better or superior solutions refer to those with either larger or smaller objective values, depending on whether the goal of optimization is maximizing or minimizing the objective function.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>We emphasize here that the idea of leveraging generative models for offline optimization is not entirely new, e.g. Brookes et al.<ref type="bibr">[11]</ref>, Krishnamoorthy et al.<ref type="bibr">[34,</ref><ref type="bibr">33]</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>Readers are referred to Section 3.1 of Kwon et al. <ref type="bibr">[38]</ref> for the assumptions under which inequality (7) holds.</p></note>
		</body>
		</text>
</TEI>
