<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>The variational method of moments</title></titleStmt>
			<publicationStmt>
				<publisher>Oxford</publisher>
				<date>04/27/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10467027</idno>
					<idno type="doi">10.1093/jrsssb/qkad025</idno>
					<title level='j'>Journal of the Royal Statistical Society Series B: Statistical Methodology</title>
<idno>1369-7412</idno>
<biblScope unit="volume">85</biblScope>
<biblScope unit="issue">3</biblScope>					

					<author>Andrew Bennett</author><author>Nathan Kallus</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<title>Abstract</title> <p>The conditional moment problem is a powerful formulation for describing structural causal parameters in terms of observables, a prominent example being instrumental variable regression. We introduce a very general class of estimators called the variational method of moments (VMM), motivated by a variational minimax reformulation of optimally weighted generalized method of moments for finite sets of moments. VMM controls infinitely many moments characterized by flexible function classes such as neural nets and kernel methods, while provably maintaining statistical efficiency, unlike existing related minimax estimators. We also develop inference algorithms and demonstrate the empirical strengths of VMM estimation and inference in experiments.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>For many problems in fields such as economics, sociology, or epidemiology, we seek to use observational data to estimate structural parameters, which often describe some causal relationship. A common framework that unifies many such problems is the conditional moment problem, which assumes that the parameter of interest &#952; 0 is the unique element of some parameter space &#920; such that <formula notation="tex">\mathbb{E}[\rho(X; \theta_0) \mid Z] = 0, \qquad (1)</formula></p><p>where X &#8712; X denotes the observed data, Z &#8712; Z is a random variable that is measurable with respect to X, and &#961; : X &#8594; R m is a vector-valued function indexed by &#920;. Note that Equation ( <ref type="formula">1</ref>) is an identity of random variables, not of numbers; that is, it holds almost surely with respect to the random Z.</p><p>That Z is measurable with respect to X is without loss of generality, since given any X&#771; and Z we may define X = (X&#771;, Z) as the observed data; thus &#961; may potentially depend on all data. Note also that at this point we let &#920; be general; for example, it may be finite dimensional or it may be a class of functions.</p><p>Example 1 Perhaps the most common example of a conditional moment problem is the instrumental variable regression problem (see e.g., <ref type="bibr">Angrist &amp; Pischke, 2008, and citations therein)</ref>, where we seek to estimate the causal effect of some treatment T on an outcome Y, where the observed relationship between T and Y may be confounded by some unobserved variables, but we have an instrumental variable Z that affects T but only affects Y via its effect on T. Given some regression function g parameterized by &#952; &#8712; &#920;, the value &#952; 0 corresponding to the true regression function is assumed to be the unique solution to <formula notation="tex">\mathbb{E}[Y - g(T; \theta_0) \mid Z] = 0.</formula></p><p>This is an example of Equation ( <ref type="formula">1</ref>) with X = (T, Y, Z) and &#961;(X; &#952;) = Y &#8722; g(T; &#952;). 
More intricate variants of this include, for example, <ref type="bibr">Berry et al. (1995)</ref>, which incorporates discrete choice and is widely used to formulate structural demand parameters in industrial organization.</p><p>Example 2 A second example is instrumental quantile regression <ref type="bibr">(Chernozhukov et al., 2007;</ref><ref type="bibr">Horowitz &amp; Lee, 2007)</ref>. Here, we again assume a treatment T, outcome Y, and instrumental variable Z, but now we seek to estimate the causal effect of T on the pth quantile of Y, for some 0 &lt; p &lt; 1. In this case, given a quantile regression function g parameterized by &#952; &#8712; &#920;, the value &#952; 0 corresponding to the true quantile regression function is assumed to be the unique solution to <formula notation="tex">\mathbb{E}[\mathbf{1}\{Y \le g(T; \theta_0)\} - p \mid Z] = 0.</formula></p><p>This is an example of Equation ( <ref type="formula">1</ref>) with X = (T, Y, Z) and &#961;(X; &#952;) = 1{Y &#8804; g(T; &#952;)} &#8722; p.</p><p>Example 3 A third example is estimating the stationary state density ratio between two policies in offline reinforcement learning <ref type="bibr">(Bennett et al., 2021;</ref><ref type="bibr">Kallus &amp; Uehara, 2022;</ref><ref type="bibr">Liu et al., 2018)</ref>. Consider a Markov decision process given by an unknown transition kernel p(S &#8242; | S, A) describing the distribution of next state S &#8242; when action A is taken in previous state S. Suppose we observe transitions (S, A, S &#8242; ) generated under the stationary distribution of a behaviour policy &#960; b (A | S), and let d(S) denote the ratio of the stationary state density under an evaluation policy &#960; e (A | S) to that under &#960; b . This ratio satisfies the conditional moment restriction <formula notation="tex">\mathbb{E}[d(S) \pi_e(A \mid S) \pi_b^{-1}(A \mid S) - d(S') \mid S'] = 0.</formula> Then, for example, if &#920; satisfies &#8747; d(s; &#952;)d&#956;(s) = 1 &#8704;&#952; &#8712; &#920; for some fixed measure &#956;, and there exists some &#952; 0 &#8712; &#920; such that d(S; &#952; 0 ) &#8733; d(S), then this conditional moment restriction will identify &#952; 0 . 
Note that although d(S; &#952; 0 ) &#8800; d(S) in general, estimates of &#952; 0 are still of interest, as they could be used to estimate d(S) in downstream tasks, for example by dividing by a plug-in estimate of E[d(S; &#952; 0 )] using estimates of &#952; 0 and E (since d(S) is known to satisfy the normalization constraint E[d(S)] = 1). This is an example of Equation ( <ref type="formula">1</ref>) with X = (S, A, S &#8242; ), Z = S &#8242; , and &#961;(X; &#952;) = d(S; &#952;)&#960; e (A | S)&#960; -1 b (A | S) &#8722; d(S &#8242; ; &#952;).</p><p>The classic approach to the conditional moment problem is to reduce it to a system of k marginal moments, E[F(Z)&#961;(X; &#952; 0 )] = 0, where F : Z &#8594; R k&#215;m is a chosen matrix-valued function. Then, we can apply the optimally weighted generalized method of moments (OWGMM; <ref type="bibr">Hansen, 1982)</ref>, which we present in detail in Section 2.1. Since this marginal moment formulation is implied by Equation (1) but not necessarily vice versa, this requires us to find a sufficiently rich F(Z) such that the marginal moment problem still identifies &#952; 0 , that is, &#952; 0 is still the unique solution in &#920;. Moreover, even if this identifies &#952; 0 and even though OWGMM is efficient in the model implied by E[F(Z)&#961;(X; &#952; 0 )] = 0, the result may not be efficient in the model implied by Equation (1).</p><p>There are a few general approaches to deal with this. There are classic nonparametric approaches that are sieve-based and simply grow k, the output dimension of F(Z), with n by including additional functions from a basis for L 2 such as power series <ref type="bibr">(Chamberlain, 1987)</ref>. There are also classic nonparametric approaches that directly estimate some special identifying F * (Z) that also induces an efficient OWGMM <ref type="bibr">(Newey, 1990</ref><ref type="bibr">, 1993)</ref>. 
For example, in Example 1 with g(T; &#952;) = &#952; &#8868; T, we have F * (Z) = E[T | Z], which can be nonparametrically estimated and plugged into OWGMM. Furthermore, there are approaches that use sieve-based methods to simultaneously estimate E[&#961;(X; &#952;) | Z] for every &#952; &#8712; &#920;, and pick &#952; to minimize some weighted empirical norm of these estimated conditional expectations <ref type="bibr">(Ai &amp; Chen, 2003;</ref><ref type="bibr">Chen &amp; Pouzo, 2009</ref><ref type="bibr">, 2012;</ref><ref type="bibr">Newey &amp; Powell, 2003)</ref>.</p><p>A recent line of work instead focuses on tackling this with machine-learning-based approaches <ref type="bibr">(Bennett et al., 2019;</ref><ref type="bibr">Dikkala et al., 2020;</ref><ref type="bibr">Hartford et al., 2017;</ref><ref type="bibr">Kallus et al., 2021;</ref><ref type="bibr">Lewis &amp; Syrgkanis, 2018;</ref><ref type="bibr">Muandet et al., 2019;</ref><ref type="bibr">Singh et al., 2019;</ref><ref type="bibr">Uehara et al., 2021)</ref>. These approaches are varied, with some solving the general problem in Equation (1) and others solving the more specific instrumental variable regression problem or other specific problems, with approaches based on deep learning, kernel methods, or both. Most of these are based on an adversarial/minimax/saddle-point approach <ref type="bibr">(Bennett et al., 2019;</ref><ref type="bibr">Dikkala et al., 2020;</ref><ref type="bibr">Kallus et al., 2021;</ref><ref type="bibr">Lewis &amp; Syrgkanis, 2018;</ref><ref type="bibr">Muandet et al., 2019;</ref><ref type="bibr">Uehara et al., 2021)</ref>.</p><p>Currently, there is a disconnect between these two lines of approaches. On the one hand, the more classical approaches are well motivated by efficiency theory when we impose certain smoothness assumptions. 
This is in contrast with the recent machine-learning-based approaches; while some provide consistency guarantees <ref type="bibr">(Bennett et al., 2019)</ref> and even rates <ref type="bibr">(Dikkala et al., 2020;</ref><ref type="bibr">Kallus et al., 2021;</ref><ref type="bibr">Singh et al., 2019;</ref><ref type="bibr">Uehara et al., 2021)</ref>, none of these approaches has been shown to be semiparametrically efficient for Equation (1) or to facilitate inference on &#952; 0 . On the other hand, however, the more recent line of work leverages modern machine learning approaches, which are commonly believed to have superior practical properties. For example, they have been empirically observed to be more stable, have easier parameter tuning, or be better able to adapt to the low-dimensional latent structure of complex data. Although our experiments do indeed seem to support this thesis, especially in more challenging settings, we emphasise that the point of this paper is not to demonstrate that modern machine-learning-based approaches are superior to classical ones. Rather, we observe that for various reasons there is significant, growing interest in machine-learning-based approaches to these problems within the community, and therefore extending this line of work to be semiparametrically efficient and to perform inference is of great importance.</p><p>In this paper, we study a general class of minimax approaches, which we call the variational method of moments (VMM). This generalizes the method of <ref type="bibr">Bennett et al. (2019)</ref>, who presented an estimator for instrumental variable regression using adversarial training of neural networks. Their proposal was motivated by a variational reformulation of OWGMM, aiming to combine the efficiency of more classical approaches with the flexibility of machine learning methods. 
This style of estimator has since been applied to a variety of other conditional moment problems including policy learning from observational data <ref type="bibr">(Bennett &amp; Kallus, 2020)</ref> and estimating stationary state density ratios <ref type="bibr">(Bennett et al., 2021)</ref>. However, this past work did not provide a general formulation of VMM and a detailed theoretical analysis. And, although they are motivated by efficiency considerations, it is not immediately clear that this actually leads to efficient estimators.</p><p>We present a unified theory for a general class of VMM estimators. In particular, for some specific versions of these estimators based on either deep learning or kernel methods, we provide appropriate assumptions under which these methods are consistent, asymptotically normal, and semiparametrically efficient. In addition, we provide inference algorithms for these estimators, which can be used to construct confidence intervals for the estimated parameters. These inference algorithms are based on the same kind of variational reformulation as the estimation algorithms themselves, again with varieties based on both kernel methods and deep learning. 
Finally, we provide a detailed series of experiments that demonstrate that these VMM algorithms obtain very good finite-sample estimation performance and that the corresponding inference algorithms produce high-quality confidence intervals.</p><p>The rest of this paper is structured as follows: in Section 2, we define the VMM estimator and provide motivation for it by interpreting OWGMM as a specific case thereof; in Section 3, we provide our theory for kernel VMM estimators, which are a specific instance of VMM estimators based on kernel methods; in Section 4, we provide our theory for neural VMM estimators, which are an alternative instance of VMM based on deep learning methods; in Section 5, we present our inference theory, with proposed kernel- and neural-net-based algorithms; in Section 7, we provide a detailed empirical evaluation of our proposed estimation and inference methods; and in Section 8 we provide a detailed discussion of past work on solving conditional moment problems and how these approaches relate to our VMM estimators.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Notation</head><p>We use uppercase letters such as X to denote random variables and lowercase ones to denote nonrandom quantities. The set of positive integers is N, and for any n &#8712; N we use [n] to refer to the set {1, . . . , n}. We denote by &#8214; &#8226; &#8214; Lp the usual L p functional norm, defined as <formula notation="tex">\|f\|_{L_p} = \mathbb{E}[|f(X)|^p]^{1/p},</formula></p><p>where the probability measure is implicit from context.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Variational method of moments</head><p>We now define the class of VMM estimators. We consider data consisting of n independent and identically distributed observations of X, namely, X 1 , . . . , X n &#8764; P, where P denotes the data distribution. Let some sequence of function classes F n be given, such that each f &#8712; F n has signature f : Z &#8594; R m . Let a 'prior estimate' &#952;&#771; n &#8712; &#920; be given. In general, this may be any data-driven choice from &#920; and need not necessarily be consistent for &#952; 0 ; in the theory that follows we will elaborate on what conditions &#952;&#771; n needs to satisfy for our respective results. Furthermore, let R n : F n &#8594; [0, &#8734;] be some optional regularizer, which measures the complexity of f &#8712; F n . Then, we define the VMM estimate &#952;&#770; VMM n corresponding to these choices as follows: <formula notation="tex">\hat{\theta}^{\mathrm{VMM}}_n = \arg\min_{\theta \in \Theta} \sup_{f \in \mathcal{F}_n} \mathbb{E}_n[f(Z)^\top \rho(X; \theta)] - \frac{1}{4} \mathbb{E}_n[(f(Z)^\top \rho(X; \tilde{\theta}_n))^2] - R_n(f), \qquad (2)</formula></p><p>where E n is an empirical average over the n data points.</p><p>In Section 3, we study the instantiation of this with F n being a reproducing kernel Hilbert space (RKHS). In Section 4, we study the instantiation with F n being a class of neural networks.</p><p>Before proceeding to study these new machine-learning-based instantiations of the VMM estimator with flexible choices for F n , we discuss a very simple instantiation that recovers OWGMM, which provides motivation and interpretation for each of the terms in Equation (2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">The optimally weighted generalized method of moments</head><p>First, we present the classic OWGMM method. Given F(Z) = (f 1 (Z), . . . , f k (Z)), we obtain the marginal moment conditions E[f i (Z) &#8868; &#961;(X; &#952; 0 )] = 0 &#8704;i &#8712; [k]. Let a 'prior estimate' &#952;&#771; n be given and define the k &#215; k matrix &#915; as <formula notation="tex">\Gamma_{i,j} = \mathbb{E}_n[f_i(Z)^\top \rho(X; \tilde{\theta}_n) \, \rho(X; \tilde{\theta}_n)^\top f_j(Z)].</formula></p><p>Then, the OWGMM estimate is <formula notation="tex">\hat{\theta}^{\mathrm{OWGMM}}_n = \arg\min_{\theta \in \Theta} \sum_{i,j \in [k]} \mathbb{E}_n[f_i(Z)^\top \rho(X; \theta)] \, (\Gamma^{-1})_{i,j} \, \mathbb{E}_n[f_j(Z)^\top \rho(X; \theta)]. \qquad (3)</formula></p><p>Given certain regularity conditions and assuming the choice of functions f 1 , . . . , f k is sufficient such that the corresponding k moment conditions uniquely identify &#952; 0 , standard GMM theory says that &#952;&#770; OWGMM n is consistent for &#952; 0 . Furthermore, if the prior estimate &#952;&#771; n is consistent for &#952; 0 , then this estimator is efficient with respect to the model defined by these k moment conditions <ref type="bibr">(Hansen, 1982)</ref>. OWGMM generalizes the method of moments, which solves <formula notation="tex">\mathbb{E}_n[f_i(Z)^\top \rho(X; \theta)] = 0 \quad \forall i \in [k].</formula></p><p>When there are many moments, we cannot make all of them zero due to finite-sample noise and instead we seek to make them near zero. But it is not clear which moments are more important; for example, there may be duplicate or near-duplicate moments. The key to OWGMM's efficiency is to optimally combine the k objectives of making each moment near zero into a single objective function. To get a consistent prior estimate, we can for example let &#952;&#771; n itself be an OWGMM estimate computed with any fixed prior estimate, leading to the two-step GMM estimator. This can be repeated, leading to the multi-step GMM estimator.</p><p>Unfortunately, estimators of this kind have many limitations. For one, in practice, it is difficult or impossible to verify that any such set of functions f 1 , . . . , f k is sufficient for identification. 
In addition, while such an estimator is efficient with respect to the model imposed by these k moment conditions, ideally we would like to be efficient with respect to the model given by Equation (1); that is, we would wish to be efficient with respect to the model given by all moment conditions of the form E[f (Z) &#8868; &#961;(X; &#952; 0 )] = 0 for square integrable f. Finally, if k is very large and grows with n, as would be required to (at least approximately) alleviate the prior two concerns, the corresponding sieve-based estimator would require impractical tuning to select which basis of L 2 to use and to choose k as a function of n. As will be seen below, such an approach may be seen as equivalent to estimating the optimal instruments over a linear sieve, but unlike the variational approach we propose below, it is unclear how to appropriately regularize this sieve estimation and take advantage of modern machine-learning advances in non-parametric function approximation.</p></div>
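<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two-step procedure described above can be sketched for the linear instrumental variable case of Example 1, where the residual is Y minus a linear function of T and each GMM step therefore has a closed form. This is a minimal illustration under stated assumptions, not an implementation taken from the paper: the simulated data, the particular instrument functions, and all function names are hypothetical choices made for the example.</p>
<eg><![CDATA[
```python
import numpy as np


def simulate_linear_iv(n, seed=0):
    """Toy confounded data (illustrative assumption): the true effect of T on Y is 2.0."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 2))                    # instruments Z
    u = rng.normal(size=n)                         # unobserved confounder
    t = z @ np.array([1.0, 0.5]) + u + 0.5 * rng.normal(size=n)
    y = 2.0 * t + u + 0.5 * rng.normal(size=n)
    f = np.column_stack([z, z[:, 0] * z[:, 1]])    # k = 3 chosen functions f_i(Z)
    return y, t[:, None], f


def linear_gmm(y, t, f, weight):
    """Minimise u(theta)^T W u(theta), where u(theta) = E_n[F(Z) (Y - T theta)]."""
    n = len(y)
    a = f.T @ t / n                                # E_n[F(Z) T^T], shape (k, b)
    c = f.T @ y / n                                # E_n[F(Z) Y],   shape (k,)
    return np.linalg.solve(a.T @ weight @ a, a.T @ weight @ c)


def two_step_gmm(y, t, f):
    """Step 1: identity weighting gives a prior estimate. Step 2: reweight by Gamma^{-1}."""
    theta_prior = linear_gmm(y, t, f, np.eye(f.shape[1]))
    moments = f * (y - t @ theta_prior)[:, None]   # rows are F(Z_i) * rho(X_i; prior)
    gamma = moments.T @ moments / len(y)           # empirical second-moment matrix
    return linear_gmm(y, t, f, np.linalg.inv(gamma))


def ols(y, t):
    """Ordinary least squares of Y on T, shown only to exhibit the confounding bias."""
    return np.linalg.lstsq(t, y, rcond=None)[0]
```
]]></eg>
<p>On data drawn from simulate_linear_iv, two_step_gmm should land close to the true coefficient while ols is biased upwards by the confounder. Repeating the second step with the new estimate as the prior gives the multi-step variant.</p></div>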
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Variational reformulation of OWGMM</head><p>One motivation for our VMM class of estimators, Equation ( <ref type="formula">2</ref>), is that it recovers OWGMM with its efficient weighting.</p><p>The following result simply appeals to the optimization structures of Equations ( <ref type="formula">2</ref>) and (3) and generalizes <ref type="bibr">Bennett et al. (2019, Lemma 1)</ref>. We include its proof as it is short and instructive.</p></div>
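<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 1</head><p>Let F n = span({f 1 , . . . , f k }) and R n (f ) = 0. Then &#952;&#770; OWGMM n = &#952;&#770; VMM n .</p><p>Proof. Let u n (&#952;) &#8712; R k denote the vector with entries E n [f i (Z) &#8868; &#961;(X; &#952;)] for i &#8712; [k], and let &#915; be the matrix defined in Section 2.1. The OWGMM objective in Equation ( <ref type="formula">3</ref>) satisfies <formula notation="tex">u_n(\theta)^\top \Gamma^{-1} u_n(\theta) = \|u_n(\theta)\|_{\Gamma^{-1}}^2 = \sup_{c \in \mathbb{R}^k} c^\top u_n(\theta) - \frac{1}{4} c^\top \Gamma c,</formula></p><p>where the second equality is a reformulation of the rotated Euclidean norm (see Online Supplementary Material, Lemma 15 for the general Hilbert-space version). The conclusion follows by noting that, for f = &#8721; i c i f i , <formula notation="tex">c^\top u_n(\theta) = \mathbb{E}_n[f(Z)^\top \rho(X; \theta)], \qquad c^\top \Gamma c = \mathbb{E}_n[(f(Z)^\top \rho(X; \tilde{\theta}_n))^2],</formula> so that the supremum over c &#8712; R k coincides with the supremum over f &#8712; F n in Equation ( <ref type="formula">2</ref>).</p><p>Through the lens of Lemma 1, we can understand each term of Equation ( <ref type="formula">2</ref>) as follows. The first term pushes &#952; to make E n [f (Z) &#8868; &#961;(X; &#952;)] near zero for each f &#8712; F n . The second term, which is quadratic in f and evaluated at the prior estimate, appropriately weights the relative importance of making each of these near zero. Finally, varying F n and/or R n (f ) with n allows us to control the richness of moments that we consider, in analogy to sieve-based methods that grow the dimension of the space span({f 1 , . . . , f k }) but admitting more flexible machine-learning approaches. This motivation is similar to <ref type="bibr">Bennett and Kallus (2020)</ref> and <ref type="bibr">Bennett et al. (2019</ref><ref type="bibr">, 2021)</ref>, but these did not study the problem in generality or establish properties such as asymptotic normality or efficiency.</p></div>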
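<div xmlns="http://www.tei-c.org/ns/1.0"><p>The reformulation of the rotated Euclidean norm used in the proof can be checked numerically. The sketch below assumes a symmetric positive definite matrix in place of the OWGMM weighting matrix and an arbitrary vector in place of the vector of empirical moments; it verifies that the quadratic form with the inverse matrix equals the supremum of a linear term minus one quarter of a quadratic term, attained at twice the solution of the corresponding linear system. The function names are illustrative only.</p>
<eg><![CDATA[
```python
import numpy as np


def quadratic_form(v, gamma):
    """v^T Gamma^{-1} v, the finite-dimensional optimally weighted objective."""
    return float(v @ np.linalg.solve(gamma, v))


def variational_value(v, gamma, c):
    """c^T v - (1/4) c^T Gamma c, the inner objective of the variational form."""
    return float(c @ v - 0.25 * c @ gamma @ c)


def variational_maximiser(v, gamma):
    """Setting the gradient v - (1/2) Gamma c to zero gives c* = 2 Gamma^{-1} v."""
    return 2.0 * np.linalg.solve(gamma, v)
```
]]></eg>
<p>Evaluating variational_value at variational_maximiser(v, gamma) reproduces quadratic_form(v, gamma), and any other choice of c gives a value that is no larger, because the inner objective is concave in c.</p></div>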
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Kernel VMM</head><p>First, we consider a class of VMM estimators where for every n we have F n = F , where F = &#8853; m i=1 F i and each F i is an RKHS of functions Z &#8594; R given by a symmetric positive definite kernel K i : Z &#215; Z &#8594; R, and regularization is performed using the RKHS norm of F , which we denote by</p><p>We will call these estimators kernel VMM estimators, which we concretely define according to</p><p>and &#945; n is some non-negative sequence of regularization coefficients. Explicitly, this fits into our general VMM definition with F n = F for every n, and R n (f ) = &#945; n &#8214;f &#8214; 2 . Before we provide our main theory for kernel VMM estimators, we provide a convenient reformulation of Equation ( <ref type="formula">4</ref>). Let H be the dual space of F (that is, the space of all bounded linear functionals of the form F &#8594; R) and for each &#952; &#8712; &#920; define the element h&#773; n (&#952;) &#8712; H according to</p><p>Furthermore, define the linear operator C n : H &#8594; H according to</p><p>where &#966; : H &#8594; F maps any element in H to its Riesz representer in F such that h(f ) = &#9001;&#966;(h), f &#9002;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 2</head><p>The kernel VMM estimator defined in Equation ( <ref type="formula">4</ref>) is equivalent to</p><p>where I is the identity operator Ih = h.</p><p>Comparing this result to Equation (3), we see that it is a clear infinite-dimensional generalization of the OWGMM objective, where the matrix &#915; defined there is replaced with a linear operator, and the inversion is performed using Tikhonov regularization. This re-framing of our kernel VMM estimator also shows a connection to the continuum GMM estimators considered by <ref type="bibr">Carrasco and Florens (2000)</ref>. However, our estimator does not strictly fit within their framework. We discuss this in more detail in Section 8.</p></div>
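<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the Tikhonov-regularized inversion concrete, the following sketch evaluates a kernel VMM inner supremum in closed form on a sample. It rests on several assumptions that are not taken from the paper: a scalar moment (m = 1), a Gaussian kernel, and a regularizer scaled as one quarter of alpha times the squared RKHS norm, a scaling chosen here only because it makes the closed form tidy (other scalings amount to rescaling alpha). Under these assumptions the representer theorem restricts the supremum to kernel expansions over the observed points, and the supremum reduces to solving one regularized linear system. All names are hypothetical.</p>
<eg><![CDATA[
```python
import numpy as np


def gaussian_kernel(z1, z2, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel between the rows of z1 and z2."""
    sq_dist = np.sum((z1[:, None, :] - z2[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def inner_objective(beta, rho, rho_prior, gram, alpha):
    """E_n[f rho] - (1/4) E_n[(f rho_prior)^2] - (alpha/4) ||f||^2 for f = sum_j beta_j K(z_j, .)."""
    f_vals = gram @ beta
    penalty = 0.25 * alpha * float(beta @ gram @ beta)
    return float(np.mean(f_vals * rho) - 0.25 * np.mean((f_vals * rho_prior) ** 2) - penalty)


def kernel_vmm_objective(rho, rho_prior, gram, alpha):
    """Closed-form supremum over f: (1/n) rho^T K (D K + n alpha I)^{-1} rho, D = diag(rho_prior^2)."""
    n = len(rho)
    system = (rho_prior ** 2)[:, None] * gram + n * alpha * np.eye(n)
    beta = 2.0 * np.linalg.solve(system, rho)      # coefficients of the maximising f
    value = float(rho @ gram @ beta) / (2.0 * n)
    return value, beta
```
]]></eg>
<p>An estimate would then be obtained by minimising the returned value over the parameter, with rho recomputed at each candidate and rho_prior held fixed at the prior estimate. The added multiple of the identity is what plays the role of Tikhonov regularization: it keeps the linear system well conditioned even when the Gram matrix is nearly singular.</p></div>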
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Consistency</head><p>We first provide some sufficient assumptions in order to ensure that our kernel VMM estimator is consistent; that is, &#952;&#770; K-VMM n &#8594; &#952; 0 in probability. Before we present these assumptions, we define the conditional covariance function of the moment problem <formula notation="tex">V(Z; \theta) = \mathbb{E}[\rho(X; \theta) \rho(X; \theta)^\top \mid Z]. \qquad (5)</formula></p><p>For our first assumption, we require each F i to be universally approximating with a smooth kernel. Recall that a function is C &#8734; -smooth if it is n-times continuously differentiable for every positive integer n. In addition, we recall that a kernel is universal if the corresponding RKHS is dense in the space of continuous real-valued functions on Z under the supremum norm <ref type="bibr">(Sriperumbudur et al., 2011)</ref>. Note that all of the properties of the following assumption hold, for example, for the commonly used Gaussian kernel.</p><p>Assumption 1 (Universal RKHS). For each i &#8712; [m], K i is C &#8734; -smooth in both arguments and F i is universal.</p><p>Next, we require a basic regularity condition on the observed data distribution. This, together with Assumption 1, ensures that F is well-behaved with respect to &#961; and satisfies some nice properties in terms of boundedness and metric entropy, as formalized by Online Supplementary Material, Lemma 16 in the Appendix.</p><p>Assumption 2 (Regularity). Z is a bounded subset of R dz for some positive integer d z .</p><p>Next, we require that the set of possible functions {&#961;( &#8226; ; &#952;) : &#952; &#8712; &#920;} satisfies some basic boundedness, smoothness, and complexity properties. A simple example satisfying the below is for &#920; to be a compact set in some finite-dimensional Euclidean space, and for &#961;(x; &#952;) to be equi-Lipschitz continuous in &#952; for every x. 
Other examples that easily satisfy the second part of the below include {&#961;( &#8226; ; &#952;) : &#952; &#8712; &#920;} having finite Vapnik-Chervonenkis dimension (see e.g., <ref type="bibr">Kosorok, 2007, Theorem 8.19 and Corollary 9.5)</ref>, or being a bounded-norm subset of an RKHS (see Online Supplementary Material, Lemma 17 in the Appendix for details). This assumption ensures that consistent estimation of &#952; 0 is possible, even though inversion of the conditional moment operator could be ill-posed.</p><p>Assumption 3 (Moment Class Complexity). sup x&#8712;X ,&#952;&#8712;&#920; |&#961;(x; &#952;)| &lt; &#8734;, and &#961;(X; &#952;) is Lipschitz continuous in &#952; under the L 1 norm. Also, for each i &#8712; [m], the function set {&#961; i ( &#8226; ; &#952;) : &#952; &#8712; &#920;} is P-Donsker.</p><p>We also assume that the prior estimate &#952;&#771; n is well-behaved, meaning that it converges sufficiently fast to some limit in probability. This limit need not be &#952; 0 for our consistency results. This will be used to ensure the convergence of the linear operator C n defined above to some limiting operator C.</p><p>Assumption 4 (Convergent Prior Estimate). The prior estimate &#952;&#771; n has a limit &#952;&#771; in probability, and satisfies &#8214;&#961; i (X; &#952;&#771; n ) &#8722; &#961; i (X; &#952;&#771;)&#8214; 2 = O p (n -p ) for every i &#8712; [m] and some 0 &lt; p &#8804; 1/2.</p><p>Finally, we assume a nonsingular covariance with bounded inverse moments.</p><p>Assumption 5 (Non-Degenerate Moments). 
For each &#952; &#8712; {&#952;&#771;, &#952; 0 }, we have that V(Z; &#952;) is invertible almost surely, and also that &#8214;&#963; min (Z; &#952;) -1 &#8214; &#8734; &lt; &#8734;, where &#963; min (Z; &#952;) denotes the minimum eigenvalue of V(Z; &#952;).</p><p>Assumption 5 is slightly subtle and is used to ensure that the objective J n defined above converges to a well-behaved limiting objective J that is uniquely minimized by &#952; 0 , which is central to our consistency proof. In the absence of this assumption, it is possible that the limiting objective may diverge. We note that in the case of m = 1, the second part of the assumption is equivalent to requiring that &#8214;V(Z; &#952;&#771;) -1 &#8214; &#8734; , &#8214;V(Z; &#952; 0 ) -1 &#8214; &#8734; &lt; &#8734;, and in the case that the prior estimate &#952;&#771; n is consistent we only need this condition to hold at &#952; 0 = &#952;&#771;. In general, it can be viewed as a boundedness condition on certain moments defined in terms of the data distribution and &#961;.</p><p>With these assumptions, we are prepared to state our consistency result.</p><p>Theorem 1 (Consistency). Let Assumptions 1-5 be given and suppose the regularization coefficient satisfies &#945; n = o(1) and &#945; n = &#969;(n -p ), where p is the constant referenced in Assumption 4. Then, for any &#952;&#770; n that satisfies J n (&#952;&#770; n ) = inf &#952;&#8712;&#920; J n (&#952;) + o p (1), we have &#952;&#770; n &#8594; &#952; 0 in probability.</p><p>Comparing this result to the corresponding consistency result given by <ref type="bibr">Bennett et al. (2019, Theorem 2)</ref>, we note that this result does not rely on any specific identification assumptions beyond Equation (1). Conversely, <ref type="bibr">Bennett et al. (2019)</ref> assume that the class F of neural nets that they take a supremum over is sufficient to uniquely identify &#952; 0 , which is a questionable assumption since this class is assumed to be fixed and not growing with n. 
Therefore, we argue that our VMM consistency here is given under much more reasonable assumptions.</p><p>Next, we make some observations about how this result compares with consistency results in the literature that tackles nonlinearities using sieves. First, note that Assumption 3 is weaker than the corresponding assumptions in <ref type="bibr">Ai and Chen (2003)</ref> and <ref type="bibr">Newey and Powell (2003)</ref>, who assume that &#961;(x; &#952;) is point-wise H&#246;lder-continuous and &#920; is compact. Instead, we require the more general assumption of continuity in the L 1 -norm, along with a Donsker condition. Conversely, <ref type="bibr">Chen and Pouzo (2009)</ref> and <ref type="bibr">Chen and Pouzo (2012)</ref> similarly allow for non-smooth &#961;, but they consider the setting where &#920; can be non-compact, introducing ill-posedness issues that they tackle in their work. Rather, we specifically consider metrics on &#920; under which ill-posedness is not an issue, given our Donsker assumption on {&#961;( &#8226; ; &#952;) : &#952; &#8712; &#920;} and L 1 -continuity. Furthermore, we note that assumptions similar to Assumptions 2 and 5 are standard in these past works, and Assumptions 1 and 4 are straightforward technical conditions related to implementation choices for our method.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Asymptotic normality</head><p>We now present our theory for the asymptotic normality of kernel VMM estimates. Here, we consider the special case where &#920; is a compact subset of R b for some positive integer b. We note that in this case, as discussed above, Assumption 3 follows under very simple additional conditions; e.g., &#961;(x; &#952;) being equi-Lipschitz continuous in &#952; for every x &#8712; X . Under this setting, we will characterize the asymptotic distribution of &#8730;n (&#952;&#770; n &#8722; &#952; 0 ). First, we require that &#961;(X; &#952;) satisfies the following differentiability condition.</p><p>Assumption 6 (&#961; Differentiable in Absolute Mean). For each i &#8712; [m], there exists some vector-valued function D i (X; &#952;) &#8712; R b indexed by &#952;, and some neighbourhood &#920; 0 of &#952; 0 , such that, for every &#952; &#8712; &#920; 0 , we have <formula notation="tex">\lim_{\theta' \to \theta} \frac{\mathbb{E}[|\rho_i(X; \theta') - \rho_i(X; \theta) - D_i(X; \theta)^\top (\theta' - \theta)|]}{\|\theta' - \theta\|} = 0.</formula></p><p>In other words, D i is a gradient-like function such that the first-order Taylor error decays to zero at a o(&#8214;&#952; &#8242; &#8722; &#952;&#8214;) rate under the L 1 norm. For example, in the case that &#961; i (x; &#952;) is continuously differentiable in &#952; within some neighbourhood of &#952; 0 for all x &#8712; X , then Assumption 6 trivially follows from Taylor's theorem. Furthermore, it is easy to see that for any x, &#952; where &#961; i (x; &#952;) is differentiable w.r.t. &#952;, we must have D i (x; &#952;) = &#8711;&#961; i (x; &#952;). However, the above is more general and allows for situations where &#961; i (x; &#952;) is non-differentiable at some values of x and &#952;. In particular, the following lemma allows us to establish this assumption under more general conditions. 
Lemma 3 Suppose there exist &#981; i : X &#8594; R indexed by &#920;, and 'gradient-like' and 'Hessian-like' functions &#961; &#8242; i and &#961; &#8242;&#8242; i such that: (1) &#961; i (X; &#952;) is twice differentiable in &#952; with gradient &#961; &#8242; i (X; &#952;) and Hessian &#961; &#8242;&#8242; i (X; &#952;) whenever &#981;(X; &#952;)</p><p>for some L &#981; (x) such that the probability density of the random variable L &#981; (X) -1 &#981;(X; &#952;) is bounded within some neighbourhood of zero. Then, we have that Assumption 6 holds with</p><p>This lemma allows us to establish Assumption 6 for a range of problems where &#961;(X; &#952;) has some points of non-smoothness. Intuitively, the boundedness condition on &#961; &#8242;&#8242; allows us to bound the first-order Taylor error whenever &#961; is smooth, and the Lipschitz and bounded density assumptions on &#981; near &#981; = 0 prevents non-smoothness from impacting the first-order Taylor expansion, up to an additional o(&#8214;&#952; &#8242; -&#952;&#8214;) factor.</p><p>Next, we let D(X; &#952;) &#8712; R m&#215;b denote the Jacobian-like function given by concatenating</p><p>where H is the dual space of F , as above. We also define the analogue of the gradient of the objective</p><p>and note that in the case that &#961;(X</p><p>In addition, we define linear operators C : H &#8594; H and C 0 :</p><p>where &#952; is the probability limit of &#952;n as specified by Assumption 4, and &#966; is defined as in the definition of C n above. Given these definitions, we can now specify our additional assumptions and the asymptotic normality result.</p><p>This next additional assumption is a regularity condition on D(X; &#952;), which extends the properties of &#961;(X; &#952;) specified in Assumption 3 to D(X; &#952;) j for each j &#8712; [b].</p><p>Assumption 7 (Gradient Complexity). 
Let &#920; 0 be the neighbourhood of &#952; 0 from Assumption 6.</p><p>Next, we assume a certain non-degeneracy in the parametrization of the problem, locally near &#952; 0 .</p><p>Assumption 8 is needed to ensure that the limiting asymptotic variance is finite and that the matrix &#937; defined in the theorem statement below is invertible. It can be interpreted as the assumption that the parametrization of &#920; is non-degenerate, since it requires that the functions E[&#961; &#8242; i (X; &#952; 0 ) | Z] are linearly independent. Note that this assumption is somewhat lax, since if it were violated, it is likely possible we could re-parameterize the problem with a lower-dimensional &#920; in order to avoid this issue.</p><p>Finally, we will need to introduce a couple of important definitions. We say that an estimator &#952;n for &#952; 0 is asymptotically linear if &#952;n = E n [&#968;(X)] + o p (n -1/2 ), for some &#968; satisfying E[&#968;] = &#952; 0 . In addition, we say that such an estimator is asymptotically normal if &#8730;n ( &#952;n -&#952; 0 ) converges in distribution to a mean-zero Gaussian random variable, with some fixed covariance matrix.</p><p>With these additional assumptions and definitions, we are prepared to present our asymptotic normality result.</p><p>Theorem 2 (Asymptotic Normality). Let Assumptions 1-8 be given, and suppose the regularization coefficient satisfies &#945; n = o(1) and &#945; n = &#969;(n -p ), where p is the constant defined in Assumption 4. Then, for any &#952;n that satisfies</p><p>is asymptotically linear and asymptotically normal, with covariance matrix &#937; -1 &#916;&#937; -1 , where &#916; and &#937; are defined according to</p><p>Note that this theorem requires an approximate first-order optimality condition, namely, that the gradient-like element</p><p>), which is stronger than the approximate optimality condition in Theorem 1. 
Although this condition may be difficult to interpret or verify in general, the following lemma provides some sufficient conditions.</p><p>Lemma 4 (Sufficient Conditions for Approximate First-Order Optimality). Suppose that either (1) &#952;n &#8712; arg min &#952; J n (&#952;); or (2) &#961;(x; &#952;) is twice continuously differentiable in &#952; for every x &#8712; X , and</p><p>Then, given the other conditions of Theorem 2, we have</p><p>Comparing this result with analogous results in the literature that leverage more classical nonparametric approaches, we note that our differentiability condition in Assumption 6 is weaker than the point-wise differentiability of &#961;(X; &#952;) assumed by <ref type="bibr">Ai and Chen (2003)</ref>, but stronger than that of <ref type="bibr">Chen and Pouzo (2009)</ref>, who only require differentiability of E[&#961;(X; &#952;) | Z]. We also note that these two works further allow for non-parametric nuisance functions in addition to the asymptotically normal parametric component. Furthermore, we note that Assumption 8 is a standard condition in all of these works.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Efficiency</head><p>Next, we address the question of efficiency of these kernel VMM estimators. In order to present this theory, we first need to introduce the notions of regularity and semiparametric efficiency; we refer the reader to <ref type="bibr">Van der Vaart (2000)</ref> for precise definitions. Roughly speaking, we say that an estimator &#952;n is regular with respect to some model of distributions if it is sufficiently well behaved such that its asymptotic behaviour is invariant to small perturbations [of size O p (n -1/2 )] to the data-generating distribution that remain inside the model. In addition, we say that &#952;n is semiparametrically efficient with respect to a model of distributions if it is regular and achieves the minimum asymptotic variance among all regular estimators (with respect to that model).</p><p>Given the complex form of the limiting covariance in Theorem 2 in terms of linear operators and inner products on H, it is not immediately clear how large this covariance is and whether it is efficient under any conditions. Fortunately, the following theorem, which holds under no additional assumptions, justifies efficiency in the case that our prior estimate for &#952; 0 is consistent.</p><p>Theorem 3 (Efficiency). Let the assumptions of Theorem 2 be given with &#952; = &#952; 0 , and let &#952;n be any estimator that satisfies the conditions of Theorem 2. Then, &#952;n is semiparametrically efficient with respect to the model given by Equation ( <ref type="formula">1</ref>) and &#8730;n ( &#952;n -&#952; 0 ) is asymptotically normal with asymptotic covariance matrix &#937; -1 0 , where &#937; 0 is defined according to</p><p>This theorem immediately implies that such a kernel VMM estimator is not only efficient with respect to the class of all kernel VMM estimators, but that it achieves the semiparametric efficiency bound for solving Equation (1). 
This is a very strong result, which ensures that these kernel VMM estimators inherit the efficiency properties that OWGMM estimators possess for standard moment problems, as was hoped.</p><p>Compared with the efficiency results for the continuum GMM estimators of <ref type="bibr">Carrasco and Florens (2000)</ref>, which is the approach most similar to kernel VMM, our result is stronger. Specifically, they only justified that their estimator is efficient compared with other estimators in their class of continuum GMM estimators, while we have proven efficiency relative to all possible regular estimators. That is, we achieve the same semiparametric efficiency as, e.g., <ref type="bibr">Ai and Chen (2003)</ref> and <ref type="bibr">Chen and Pouzo (2009)</ref>, who use fundamentally different sieve-based approaches.</p><p>Finally, we note that although this limiting covariance matrix has a somewhat complicated form, this form has a variational interpretation similar to the kernel VMM estimator itself. We discuss this interpretation and how to use it to estimate the efficient asymptotic variance in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Implementing kernel VMM estimators</head><p>Finally, we address some implementation considerations for kernel VMM estimators.</p><p>Firstly, we note that the above theory does not provide any guidance on how to actually construct a prior estimate &#952;n that has the required properties described in Assumption 4. In order to address this issue, we now present a concrete method for constructing such a &#952;n , which allows us to avoid explicitly assuming Assumption 4. Let us use the terminology that &#952;n is a 0-step kernel VMM estimate if &#952;n is chosen as some arbitrary fixed value, which does not depend on the observed data. Then, for any integer k &gt; 0, we say that &#952;n is a k-step kernel VMM estimate if &#952;n is computed by approximately solving Equation (4) according to J n ( &#952;n ) = inf &#952;&#8712;&#920; J n (&#952;) + o p (1/n), with &#952;n chosen as a (k -1)-step kernel VMM estimate. In other words, &#952;n is a k-step kernel VMM estimate if it is computed by iteratively approximately solving Equation ( <ref type="formula">4</ref>) k times, with &#952;n chosen as the previous iterate solution, starting from some arbitrary constant value. This scheme is analogous to that of the k-step GMM estimator <ref type="bibr">(Hansen et al., 1996)</ref>. Given this definition, we have the following lemma:</p><p>Lemma 5 Suppose that &#952;n is a k-step kernel VMM estimate for some k &gt; 0. Then, given all assumptions of Theorem 2 except for Assumption 4, it follows that &#952;n satisfies the conditions of Assumption 4 with p = 1/2, and &#952; = &#952; 0 .</p><p>Therefore, as long as we construct &#952;n as a k-step kernel VMM estimator as described above for some k &gt; 1, we are assured that Assumption 4 will be met with p = 1/2 and &#952; = &#952; 0 . 
Given this and Theorem 3, we immediately have the following corollary for k-step kernel VMM estimators.</p><p>Corollary 1 Suppose that &#952;n is calculated as a k-step kernel VMM estimate for some k &gt; 1.</p><p>Then given Assumptions 1-3 and 5-8, and assuming that the regularization coefficient satisfies &#945; n = o(1) and &#945; n = &#969;(n -1/2 ), it follows that &#952;n is semiparametrically efficient for &#952; 0 .</p><p>This corollary ensures that, given our regularity assumptions about F and the conditional moment problem itself, we can construct a specific k-step kernel VMM estimator that is semiparametrically efficient. The above also provides a valid specific choice of the regularization coefficient &#945; n that does not depend on unknown parameters.</p><p>Secondly, we address the fact that the cost function described in Equation ( <ref type="formula">4</ref>) is given by a supremum over the infinite F and provide a closed form for the objective. By appealing to the representer theorem, and the factorization of F into the direct sum of m RKHSs, we can establish the following lemma.</p><p>Then, the cost function J n (&#952;) being minimized by Equation ( <ref type="formula">4</ref>) is equivalent to</p><p>In other words, the kernel VMM estimator can be computed by minimizing a simple closed-form cost function, which is given by a particular convex quadratic form on the terms of the form</p><p>In the special case of instrumental variable regression, where we are fitting the regression function within an RKHS ball, we can not only find a closed-form solution for the cost function J n (&#952;) to be minimized, but for the kernel VMM estimator itself. 
Specifically, we provide the following lemma, which follows by applying the representer theorem again.</p><p>Lemma 7 Consider the instrumental variable regression problem, where m = 1, &#961;(X; &#952;) = Y -&#952;(T), F is the RKHS with kernel K f , and &#920; is a ball of the RKHS with kernel K g with radius r and centred at zero. In addition, let Y denote the vector of outcomes (Y 1 , . . . , Y n ), let L f and L g denote the kernel Gram matrices of K f and K g on the data Z 1 , . . . , Z n and T 1 , . . . , T n , respectively, and define the n &#215; n matrices Q(&#952;) and M according to</p><p>Then, we have</p><p>where</p><p>for some &#955; n &#8805; 0 which depends implicitly on r, K f , K g , &#952;n , and the observed data.</p><p>The term &#955; n enters into the above equation via Lagrangian duality, since minimizing J n (&#952;) over the RKHS ball with radius r is mathematically equivalent to minimizing J n (&#952;) + &#955; n &#8214;&#952;&#8214; 2 over the entire RKHS, for some implicitly defined &#955; n &#8805; 0. In practice, however, when performing IV regression according to Lemma 7 we could freely select &#955; n as a hyperparameter instead of r. Superficially, the form of this estimator is similar to that of other recently proposed kernel-based estimators for IV regression <ref type="bibr">(Muandet et al., 2019;</ref><ref type="bibr">Singh et al., 2019)</ref>. However, unlike those estimators, ours incorporates optimal weighting using the prior estimate &#952;n .</p></div>
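<div xmlns="http://www.tei-c.org/ns/1.0"><p>The closed form and the k-step scheme described above can be illustrated numerically. The following is a minimal sketch for m = 1, assuming that the objective in Equation (4) has the form sup over f of E n [f(Z)&#961;(X; &#952;)] - (1/4)E n [(f(Z)&#961;(X; &#952;n )) 2 ] - (&#945; n /4)&#8214;f&#8214; 2 ; the &#945; n /4 scaling of the norm penalty is an assumption made here. Writing f as a kernel expansion with coefficients beta, the supremum is a concave quadratic problem in beta, which the code solves directly. All function names, the Gaussian kernel, the synthetic data, and the unconstrained scalar &#952; with &#961; = Y - &#952;T are illustrative choices, and the matrices used may differ in notation from those of Lemmas 6 and 7.</p>

```python
import numpy as np

def gaussian_gram(z, length_scale=1.0):
    """Gaussian-kernel Gram matrix L for a 1-D array of instruments."""
    d = z[:, None] - z[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def _system_matrix(L, rho_prior, alpha):
    """D L / n + alpha I, with D = diag(rho_prior ** 2)."""
    n = len(rho_prior)
    return (rho_prior[:, None] ** 2) * L / n + alpha * np.eye(n)

def inner_objective(beta, L, rho, rho_prior, alpha):
    """Assumed game objective at f = sum_i beta_i K(Z_i, .)."""
    f = L @ beta  # values f(Z_1), ..., f(Z_n)
    penalty = 0.25 * alpha * (beta @ L @ beta)  # (alpha / 4) * squared RKHS norm
    return np.mean(f * rho) - 0.25 * np.mean((f * rho_prior) ** 2) - penalty

def best_response(L, rho, rho_prior, alpha):
    """Coefficients of a maximizing f, from the first-order condition in beta."""
    n = len(rho)
    return (2.0 / n) * np.linalg.solve(_system_matrix(L, rho_prior, alpha), rho)

def closed_form_objective(L, rho, rho_prior, alpha):
    """Supremum over f, evaluated without any search over f."""
    n = len(rho)
    A = _system_matrix(L, rho_prior, alpha)
    return rho @ L @ np.linalg.solve(A, rho) / n ** 2

def k_step_linear_iv(y, t, L, alpha, k, theta_init=0.0):
    """k-step estimate for rho = y - theta * t with scalar, unconstrained theta."""
    theta = theta_init
    for _ in range(k):
        A = _system_matrix(L, y - theta * t, alpha)  # weights use the previous iterate
        M_y = L @ np.linalg.solve(A, y)
        M_t = L @ np.linalg.solve(A, t)
        theta = (t @ M_y) / (t @ M_t)  # exact minimizer of the quadratic in theta
    return theta

# Synthetic confounded data (illustrative only): the true coefficient is 2.
rng = np.random.default_rng(0)
n = 40
z = rng.normal(size=n)
u = rng.normal(size=n)  # unobserved confounder
t = z + u
y = 2.0 * t + u + 0.5 * rng.normal(size=n)
L = gaussian_gram(z)
theta_2 = k_step_linear_iv(y, t, L, alpha=0.1, k=2)
```

<p>Because the residual is linear in &#952; in this toy problem, each step reduces to a weighted least-squares formula; for a general &#961; the closed-form objective would instead be passed to a numerical optimizer.</p></div>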
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Neural VMM estimators</head><p>We now consider a different class of VMM estimators, where the sequence of function classes F n is given by a class of neural networks with growing depth and width. We will refer to estimators in this class as neural VMM (N-VMM). Most generally, we will define the class of N-VMM estimators according to</p><p>where R n (f ) is some regularizer. In this section, we analyze N-VMM for different choices of R n .</p><p>For simplicity, we will restrict our theoretical analysis to the case where F n is a fully connected neural network with ReLU activations and a common width in all layers, which allows us to use the universal approximation result of Yarotsky (2017, Theorem 1). Specifically, we fix a network architecture with D n hidden layers, each with W n neurons, with the final fully connected layer connecting to the m outputs. Then, the class F n is given by varying the weights on this network. We note that this choice is made for simplicity of exposition, but similar bounds could be given for different kinds of architectures, using other universal approximation results as in, e.g., <ref type="bibr">Yarotsky (2017</ref><ref type="bibr">, 2018)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Neural VMM with kernel regularizer</head><p>First, we consider the case where we regularize using some RKHS norm. Specifically, let F K be a product of m RKHSs satisfying Assumption 1, and</p><p>where</p><p>, and K k is the kernel Gram matrix on the data Z 1 , . . . , Z n using the kernel for the k th dimension of F K . Then, we will consider estimators of the form</p><p>We note that if we were to replace F n with F K in equation ( <ref type="formula">7</ref>), then this equation would be equivalent to equation ( <ref type="formula">4</ref>), since by the representer theorem regularizing by &#8214;f &#8214; F K gives the same supremum over f as regularizing by &#8214;f &#8214; n,K . Given this and the known universal approximation properties of neural networks, it may be hoped that if we grow the class F n sufficiently fast, then the objective we are minimizing over &#952; in equation ( <ref type="formula">7</ref>) is approximately equal to that of equation (4) in a uniform sense over &#952; &#8712; &#920;. This, then, would hopefully imply that this neural VMM estimator is able to achieve the same desirable properties, in terms of consistency, asymptotic normality, and efficiency, as our kernel VMM estimators.</p><p>In order to formalize the above intuition, we first require the following assumption, which allows us to account for the rate of growth of the kernel Gram matrix inverses K -1 k in the results we give below.</p><p>Assumption 9 (Inverse Kernel Growth). There exists some deterministic positive sequence</p><p>In addition, we require the following assumption on the rate of growth on the width W n and depth D n of F n , in order to ensure that we can approximate equation (4) sufficiently well.</p><p>Assumption 10 (Neural Network Size). 
There exist constants q &#8805; 0, 0 &lt; a &lt; 1/2 and a sequence</p><p>Finally, in our results and discussion below, we will define J n (&#952;) to be the loss in &#952; minimized by &#952;NK-VMM n , and J * n (&#952;) to be the corresponding oracle loss if we were to replace F n with F K .</p><p>Lemma 8 Let Assumptions 1-3, 9, and 10 be given. Then, we have</p><p>This lemma follows by applying recent results on the size of a neural network required to uniformly approximate all functions of a given Sobolev norm <ref type="bibr">(Yarotsky, 2017)</ref>, and also older results that show that, under the conditions of Assumption 1, any RKHS ball has bounded Sobolev norm for any Sobolev space using more than d z /2 derivatives <ref type="bibr">(Cucker &amp; Smale, 2002)</ref>.</p><p>Given this, we can immediately state the following theorem, which ensures that the theoretical results of our kernel VMM estimators carry over to our neural VMM estimators with kernel regularization.</p><p>Theorem 4 Let the assumptions of Theorem 1 and Assumptions 9 and 10 be given. In addition, let &#952;n be any sequence that satisfies J n ( &#952;n ) = inf &#952;&#8712;&#920; J n (&#952;) + o p (n -q ), where q is the constant referenced in Assumption 10. Then, in the case that these assumptions hold with q = 0, we have &#952;n &#8594; &#952; 0 in probability. Furthermore, suppose in addition that the assumptions of Theorem 2 hold, and the above assumptions are strengthened to hold with q = 1. Then, we have that &#8730;n ( &#952;n -&#952; 0 ) converges in distribution to a mean-zero Gaussian random variable, with covariance as given by Theorem 2.</p><p>Finally, assume that in addition &#952; = &#952; 0 . Then the asymptotic variance of &#952;n is given by Theorem 3, and the estimator is semiparametrically efficient.</p><p>The proof of this theorem follows immediately from Lemma 8, since this Lemma and the Theorem's conditions ensure that J * n ( &#952;n</p><p>). 
Therefore, we can directly apply Theorems 1-3 to obtain these three results.</p><p>An immediate observation given this theorem is that, if we define k-step estimators as in Section 3.4, then by applying an identical argument we can construct efficient neural VMM estimators without having to explicitly make Assumption 4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Neural VMM with other regularizers</head><p>Motivated by our theory above using kernel-based regularizers, we now provide some discussion of general neural VMM estimators of the form given by Equation ( <ref type="formula">6</ref>) for other choices of R n (f ), and in particular we discuss how these estimators may be justified.</p><p>First, consider the case where the kernel Gram matrices K i for i &#8712; [m] are approximately equal to &#963; i I, where &#963; i is some scalar and I is the identity matrix. For example, this is the case if we use a Gaussian kernel with very small length scale parameter. In this case, we may reasonably approximate</p><p>That is, we could justify instead regularizing using some (possibly weighted) Frobenius norm of the matrix given by the values of the vector-valued f at the n data points. This form of regularization is much more attractive than that given by &#8214;f &#8214; n,K , since it does not involve the computation of inverse kernel Gram matrices, and it more naturally fits into estimators for Equation ( <ref type="formula">6</ref>) given by some form of alternating stochastic gradient descent. We also note that this form of regularization, based on the Frobenius norm of f, is similar to that used by <ref type="bibr">Dikkala et al. (2020)</ref>, although with some important differences; their proposed estimators do not include the -(1/4)E n [(f (Z) &#8868; &#961;(X; &#952;n )) 2 ] term motivated by efficiency theory, and they only present theory on bounding the risk of their learned function given by &#952;n , not on the consistency or semiparametric efficiency of the estimated &#952;n . We discuss this comparison in more detail in Section 8. 
Alternatively, we may heuristically justify leaving out the R n (f ) term altogether, under the argument that neural network function classes naturally impose some smoothness constraints, and therefore optimizing over F n is morally similar to optimizing over F K with some norm constraint. This intuition can be made more concrete by noting that there is a rich literature showing equivalence between optimizing loss functions over neural network function classes, and optimizing the same loss over some norm-bounded RKHS class whose kernel is implicitly defined by the neural network architecture [see e.g., <ref type="bibr">Shankar et al. (2020)</ref> and citations therein]. However, we leave more specific non-heuristic claims on the performance of our neural VMM algorithms with R n (f ) = 0 to future work.</p></div>
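<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Frobenius-norm approximation discussed in this section can be checked numerically. The sketch below assumes that, for a single output dimension, the squared empirical kernel norm takes the form f &#8868; K -1 f, where f collects the values of the function at the n data points; this form is an assumption made for illustration. With a Gaussian kernel whose length scale is far smaller than the spacing of the data, the Gram matrix is numerically the identity (so &#963; = 1), and the kernel penalty coincides with the squared Euclidean norm of the function values. The function names and the test function are hypothetical.</p>

```python
import numpy as np

def gaussian_gram(z, length_scale):
    """Gaussian-kernel Gram matrix on a 1-D array of points."""
    d = z[:, None] - z[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def kernel_penalty(f_vals, K):
    """Assumed empirical kernel penalty f^T K^{-1} f for one output dimension."""
    return float(f_vals @ np.linalg.solve(K, f_vals))

def frobenius_penalty(f_vals, sigma=1.0):
    """Inverse-free surrogate: squared Euclidean norm of the values, scaled by 1 / sigma."""
    return float(f_vals @ f_vals) / sigma

z = np.linspace(0.0, 1.0, 50)           # data spacing is about 0.02
f_vals = np.sin(2.0 * np.pi * z)        # arbitrary function values f(Z_1), ..., f(Z_n)
K_narrow = gaussian_gram(z, length_scale=1e-3)

narrow_kernel = kernel_penalty(f_vals, K_narrow)
frobenius = frobenius_penalty(f_vals)   # sigma = 1 because K(z, z) = 1
```

<p>The surrogate needs only the function values at the data points, which is why it fits more naturally into mini-batch gradient updates than a penalty requiring a linear solve against the full Gram matrix.</p></div>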
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Implementing neural VMM estimators</head><p>Regardless of the choice of the regularization term R n , the question remains of how to actually solve Equation ( <ref type="formula">6</ref>). Past work <ref type="bibr">(Bennett &amp; Kallus, 2020;</ref><ref type="bibr">Bennett et al., 2019)</ref> has solved this problem using the Optimistic Adam (OAdam) algorithm, which is a form of alternating stochastic gradient descent (that is, alternating between first-order gradient steps minimizing the game objective with respect to &#952; and maximizing the game objective with respect to f) that has been designed to have good properties for solving minimax problems <ref type="bibr">(Daskalakis et al., 2017)</ref>. These past works have proposed to do this by continuously updating &#952;n ; that is, at each iteration of alternating stochastic gradient descent they set &#952;n as the previous iterate solution.</p><p>Alternatively, there is a rich recent literature on other, potentially more efficient, methods for solving smooth game optimization problems such as Equation (4). For example, see <ref type="bibr">Fiez et al. (2020)</ref>; <ref type="bibr">Gidel et al. (2019)</ref>; <ref type="bibr">Lin et al. (2020a</ref><ref type="bibr">, 2020b)</ref>; <ref type="bibr">Loizou et al. (2020)</ref>; <ref type="bibr">Thekumparampil et al. (2019)</ref>, and references therein. Some or all of the approaches suggested in these recent works may lead to successful neural VMM implementations. However, we leave this more empirical investigation to future work, and in our experiments we focus on approaches based on OAdam with continuously updated &#952;n , as discussed above.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Inference</head><p>So far, we have developed both theory and algorithms for kernel and neural VMM estimators, providing conditions under which such estimators are consistent, asymptotically normal, and/or efficient. We now extend our efficient estimation theory to efficient inferential theory, focusing on the case of &#920; &#8838; R b . Now, suppose we want to construct confidence intervals for &#968;( &#952;n ), for some &#968; : R b &#8594; R. This is a very general kind of quantity to consider, since, for example, if we were interested in ( &#952;n ) i for some i &#8712; [b], we could define &#968;(&#952;) = &#952; i . By the delta method, if &#952;n were an efficient estimate then the asymptotic variance of &#968;( &#952;n ) would be &#8711;&#968;(&#952; 0 ) &#8868; &#937; -1 0 &#8711;&#968;(&#952; 0 ), where &#937; -1 0 is the efficient covariance matrix defined in Theorem 3. Therefore, this suggests that we could construct asymptotically calibrated Wald confidence intervals by estimating &#946;&#8868; n &#937; -1 0 &#946;n , for some data-driven &#946;n . In particular, if &#8711;&#968;(&#952; 0 ) were known, which would be the case if &#968; were linear, then we could do this with &#946;n = &#8711;&#968;(&#952; 0 ). Otherwise, we could do this using &#946;n = &#8711;&#968;( &#952;n ), where &#952;n is some consistent estimate of &#952; 0 (such as a VMM estimate), which would be consistent for &#8711;&#968;(&#952; 0 ) given Assumption 7.</p><p>In this section, we provide consistent algorithms for estimating &#946; &#8868; &#937; -1 0 &#946; for arbitrary &#946; &#8712; R b , with analogous kernel and neural varieties of our algorithms. 
These consistent variance estimators can then immediately be used with the delta method, as discussed above, to construct asymptotically calibrated Wald confidence intervals for our efficient VMM estimators.</p><p>Our algorithms, which are presented in the next subsections, are motivated by the following key lemma.</p><p>Lemma 9 Let &#937; 0 be defined as in Theorem 3, let the conditions of Theorem 3 hold, and let &#8711;&#961;(X; &#952;) &#8712; R m&#215;b denote the Jacobian of &#961;(X; &#952;) with respect to &#952;. Then, for any vector &#946; &#8712; R b , we have</p><p>The first part of this lemma follows by applying a similar variational reformulation argument as in the proof of Lemma 1, and the second part follows by applying a similar argument again on the &#947; &#8868; &#937; 0 &#947; term, given the definition of &#937; 0 from Theorem 3. More details are given in the Appendix.</p><p>We note that the right hand side of Lemma 9 has a very similar structure to the game objective of our VMM algorithms. Given this, the previous argument suggests that the asymptotic variance of any such &#968;( &#952;n ) could be estimated using approaches similar to our kernel and neural VMM estimation algorithms presented previously. In the remainder of this section, we build on this intuition and present kernel-and neural-based algorithms for inference.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Kernel inference algorithm</head><p>First, we present an inference algorithm along the lines of our kernel VMM estimator. This algorithm is summarized by the following theorem.</p><p>Theorem 5 Let the conditions of Theorem 3 be given, and let &#952;n be any corresponding efficient estimate of &#952; 0 . In addition, let L and Q(&#952;) be defined as in Lemma 6, and define D(&#952;) &#8712; R (n&#8226;m)&#215;b and &#937; n &#8712; R b&#215;b according to</p><p>where &#945; n is any sequence satisfying the assumptions of Theorem 3. Then &#937; n &#8594; &#937; 0 in probability.</p><p>We note that an immediate corollary of this theorem is that, for any continuously differentiable &#968; and an efficient &#952;n such as our VMM estimators, &#8711;&#968;( &#952;n ) &#8868; &#937; - n &#8711;&#968;( &#952;n ) is consistent for the asymptotic variance of &#968;( &#952;n ), which is an efficient estimate of &#968;(&#952; 0 ), where &#937; - n denotes the pseudo-inverse of &#937; n . This follows trivially by the continuous mapping theorem and Slutsky's theorem, since by assumption &#937; 0 is invertible. An advantage of this algorithm is that it allows easy estimation of the entire covariance matrix &#937; -1 0 , from which the asymptotic variance of any single-dimensional function of &#952;n can instantly be estimated without applying any additional variational algorithms.</p></div>
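<div xmlns="http://www.tei-c.org/ns/1.0"><p>Once an estimate of &#937; 0 is available, the Wald interval described above takes only a few lines to compute. The following is a minimal sketch, assuming that the estimate &#937; n , the point estimate, the function &#968;, and its gradient are supplied by the user; the function name wald_interval and its arguments are hypothetical. The pseudo-inverse is used as in the corollary above, and the variance is divided by n because the asymptotic covariance refers to the root-n scaled estimator.</p>

```python
import numpy as np
from statistics import NormalDist

def wald_interval(psi, grad_psi, theta_hat, omega_n, n, level=0.95):
    """Delta-method Wald interval for psi(theta_0).

    omega_n is an estimate of Omega_0, so pinv(omega_n) estimates the
    asymptotic covariance of sqrt(n) * (theta_hat - theta_0).
    """
    beta = grad_psi(theta_hat)                             # data-driven beta_n
    asymptotic_var = beta @ np.linalg.pinv(omega_n) @ beta
    std_err = np.sqrt(asymptotic_var / n)                  # undo the sqrt(n) scaling
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)       # two-sided normal quantile
    centre = psi(theta_hat)
    return centre - z_crit * std_err, centre + z_crit * std_err

# Illustrative numbers only: interval for the first coordinate of theta.
theta_hat = np.array([1.0, 2.0])
omega_n = np.diag([4.0, 1.0])
lower, upper = wald_interval(
    psi=lambda th: th[0],
    grad_psi=lambda th: np.array([1.0, 0.0]),
    theta_hat=theta_hat,
    omega_n=omega_n,
    n=100,
)
```

<p>Since the whole matrix is estimated once, intervals for further choices of &#968; reuse the same &#937; n and require no additional optimization.</p></div>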
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Neural inference algorithm</head><p>Our neural inference algorithm is similar in nature to our neural VMM estimator and is given by the following smooth game:</p><p>where R n is a regularizer for f and F n is a sequence of neural net classes. Then, given Lemma 9, we expect v n (&#946;) to be a reasonable estimator for &#946; &#8868; &#937; -1 0 &#946;. Furthermore, following the argument presented at the beginning of Section 5, we expect v n ( &#946;n ) to be a reasonable estimator for the (efficient) asymptotic variance of &#968;( &#952;n ) if &#946;n is consistent for &#8711;&#968;(&#952; 0 ). We note that, unlike for our neural VMM estimator, we do not provide any theoretical guarantees for this algorithm, due to some additional technical complications; unlike the game objective being solved by neural VMM, the space being minimized over for &#947; is unbounded, which complicates the technical argument by universal approximation we used for neural VMM. We leave this theoretical question to future work. However, we note that in our inference experiments in Section 7 this method seems to work well.</p><p>Unlike our kernel inference algorithm, this approach has the disadvantage that it requires solving a separate optimization problem for every given scalar parameter &#968;. In practice, though, this may be alleviated by the practical strengths of neural methods, as discussed previously. In addition, as with our neural VMM algorithm, we may regularize for example by using a kernel-based norm or the Frobenius norm of {f (Z 1 ), . . . , f (Z n )}, or we may omit this regularization term entirely.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Examples</head><p>Next, let us provide some concrete examples of our theory, in order to demonstrate how the assumptions for our consistency and asymptotic normality theory may be satisfied. For each example, we do not discuss Assumptions 1, 4, 9, and 10 explicitly, as these govern design choices for the algorithm that can be generically satisfied given the other assumptions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Nonparametric instrumental-variable regression</head><p>First, let us consider a specific nonparametric instrumental regression example, which instantiates Example 1 from Section 1. Specifically, we will consider conditions under which our consistency result Theorem 1 applies. Let us consider the data generating process Y = g(T; &#952; 0 ) + &#1013;, where</p><p>for some fixed &#955; &gt; 0. Let us also suppose that g(T; &#952;) is L(T)-Lipschitz continuous in &#952;, where E[L(T) 2 ] &lt; &#8734;, that sup &#952;&#8712;&#920; &#8214;g(T; &#952;)&#8214; &#8734; &lt; &#8734;, and that G = {g( &#8226; ; &#952;) : &#952; &#8712; &#920;} is a Donsker class. As one example, these conditions would be satisfied if G were given by the class of all monotonic functions on T such that &#8214;g(X; &#952;)&#8214; &#8734; &#8804; b for some fixed b &lt; &#8734;, with the norm on &#920; given by &#8214;&#952; &#8242; -&#952;&#8214; = &#8214;g(T; &#952; &#8242; ) -g(T; &#952;)&#8214; &#8734; . As a second example, G could be a norm-bounded RKHS satisfying the conditions of Assumption 1, with the norm on &#920; given by the corresponding RKHS norm.</p><p>First, given the conditions on Z in this example, Assumption 2 is trivial. Second, given the conditions on the regression class G, along with the assumption that &#8214;Y&#8214; &#8734; &lt; &#8734;, Assumption 3 trivially follows by applying Lemma 9.14 in <ref type="bibr">Kosorok (2007)</ref>. Finally, assuming that the prior estimate &#952;n comes from some arbitrary consistent methodology, then Assumption 5 only needs to hold for &#952; = &#952; 0 . In this case, this is ensured under the above condition on the conditional variance of &#1013;, since</p><p>Given this, we have consistency via Theorem 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Nonparametric instrumental-variable quantile regression</head><p>Next, let us consider a specific nonparametric instrumental quantile regression example, which instantiates Example 2 from Section 1. Again, we will consider conditions under which consistency holds. Let us consider the data generating process Y = g(T; &#952; 0 ) + &#1013;, where Prob(&#1013; &#8804; 0 | Z) = p for almost every Z, and &#961;(X; &#952;) = 1{Y &#8804; g(X; &#952;)} -p. Again, we assume that Z &#8712; R dz , and &#8214;Z&#8214; &#8734; &lt; &#8734;. In this case, we will assume that G = {g( &#8226; ; &#952;) : &#952; &#8712; &#920;} is some regression class that is Donsker under the supremum norm &#8214;&#952; &#8242; -&#952;&#8214; = &#8214;g(X; &#952; &#8242; ) -g(X; &#952;)&#8214; &#8734; , and that Y has bounded density.</p><p>Again, given the conditions on Z in this example, Assumption 2 is trivial, and the Donsker part of Assumption 3 follows from the fact that G is Donsker by Lemma 9.14 of <ref type="bibr">Kosorok (2007)</ref>. Also, we have E[|&#961;(X; &#952; &#8242; ) -&#961;(X; &#952;)|] = Prob( min (g(X; &#952; &#8242; ), g(X; &#952;)) &#8804; Y &#8804; max (g(X; &#952; &#8242; ), g(X; &#952;))). Now, since by assumption Y has bounded density, it easily follows that there exists some constant L such that Prob( min (g(X; &#952; &#8242; ), g(X; &#952;)) &#8804; Y &#8804; max (g(X; &#952; &#8242; ), g(X; &#952;))) &#8804; L&#8214;g(X; &#952; &#8242; ) -g(X; &#952;)&#8214; &#8734; , which gives us the required Lipschitz continuity under L 1 norm. Also, the required boundedness is trivial since &#961;(X; &#952;) &#8712; { -p, 1 -p}, so we have Assumption 3. Finally, we have</p><p>surely, and therefore &#8214;V(Z; &#952; 0 ) -1 &#8214; &#8734; &#8804; (p -p 2 ) -1 &lt; &#8734;, which gives us Assumption 5, again as long as the prior estimate &#952;n is consistent. Therefore, again we have consistency via Theorem 1.</p></div>
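<div xmlns="http://www.tei-c.org/ns/1.0"><p>The bound on V(Z; &#952; 0 ) -1 in the quantile example can be verified with a short calculation. The following sketch assumes that, for m = 1, V(Z; &#952;) denotes the conditional second moment of &#961;(X; &#952;) given Z, which is how Assumption 5 is read here. Conditional on Z, the indicator of the event &#1013; &#8804; 0 is a Bernoulli variable with success probability p, so the residual takes the value 1 - p with probability p and the value -p with probability 1 - p.</p>

```latex
\[
\rho(X;\theta_0) = \mathbf{1}\{Y \le g(X;\theta_0)\} - p = \mathbf{1}\{\epsilon \le 0\} - p ,
\]
\[
E[\rho(X;\theta_0) \mid Z] = p\,(1-p) + (1-p)\,(-p) = 0 ,
\]
\[
V(Z;\theta_0) = E[\rho(X;\theta_0)^2 \mid Z] = p\,(1-p)^2 + (1-p)\,p^2 = p\,(1-p) = p - p^2 .
\]
```

<p>Since p lies strictly between 0 and 1, the quantity p - p 2 is a positive constant that does not depend on Z, so its inverse is bounded.</p></div>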
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Parametric instrumental-variable mean and expectile regression</head><p>Next, we will consider a parametric expectile (including mean) regression example <ref type="bibr">(Newey &amp; Powell, 1987;</ref><ref type="bibr">Sobotka et al., 2013)</ref>, where we can establish consistency, asymptotic normality, and efficiency. For this example, we will assume that the data generating process is again given by Y = g(T; &#952; 0 ) + &#1013;, where &#1013; instead satisfies pE[&#1013;1{&#1013;</p><p>for some p &#8712; (0, 1). For p = 0.5, we get the usual mean regression. Here, we let &#961;(X; &#952;) = w(X; &#952;)(Y -g(T; &#952;)), where w(X; &#952;) = p1{Y &#8805; g(T; &#952;)} + (1 -p)1{Y &lt; g(T; &#952;)}, and the goal is to find the unique &#952; 0 &#8712; &#920; such that E[&#961;(X; &#952; 0 ) | Z] = 0. Note that this problem can be seen as a mid-point between standard instrumental variable regression and instrumented quantile regression. Similar to the above example, let us suppose that Z &#8712; R d z , &#8214;Z&#8214; &#8734; &lt; &#8734;, &#8214;Y&#8214; &#8734; &lt; &#8734;, and V[&#1013; | Z] &#8805; &#955; almost surely, for some fixed &#955; &gt; 0. For this example, we will further assume that T &#8712; R dt , &#8214;T&#8214; &#8734; &lt; &#8734;, the regression class is given by g(t; &#952;) = &#952; &#8868; t, where &#920; = {&#952; &#8712; R dt : &#8214;&#952;&#8214; 2 &#8804; b} for some b &lt; &#8734;, and that the matrix</p><p>Again, given the conditions on Z in this example, Assumption 2 is trivial. Similarly, given the boundedness of &#920; and T, and the fact that &#8214;w(X; &#952;)&#8214; &#8734; &#8804; 1, along with Lemma 9.14 of <ref type="bibr">Kosorok (2007)</ref>, we easily have that Assumption 3 holds. 
In addition, we have V(Z; &#952; 0 ) &#8805; min (p, 1 -p) 2 V[&#1013; | Z], and so Assumption 5 follows from our minimum conditional variance assumption as in the previous example. Therefore, we can establish consistency via Theorem 1.</p><p>Next, under the additional assumption that Y and T both have bounded probability densities, so does Y -g(T; &#952;) for every &#952; &#8712; &#920;. Therefore, we can apply Lemma 3 with &#981;(X; &#952;) = Y -g(T; &#952;) in order to establish Assumption 6 with D(X; &#952;) = T, which we note trivially satisfies the conditions of Assumption 7. Finally, since</p><p>&gt; 0 for every non-zero &#946;, which establishes Assumption 8. Therefore, we also have asymptotic normality via Theorem 2, and under the condition that the prior estimate &#952;n is consistent we have semiparametric efficiency via Theorem 3.</p></div>
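<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the moment functions &#961;(X; &#952;) used in Sections 6.2 and 6.3, the following is a minimal Python sketch rather than the authors' implementation. The function names rho_quantile and rho_expectile are hypothetical, and the argument g_val is assumed to hold precomputed values of the regression function at the observed data. It illustrates that the quantile moment function only takes the values -p and 1 -p, and that for p = 0.5 the expectile moment function is one half of the usual mean-regression residual.</p>
<p><![CDATA[
```python
import numpy as np


def rho_quantile(y, g_val, p):
    """Quantile moment function 1{Y <= g} - p (Section 6.2)."""
    y = np.asarray(y, dtype=float)
    g_val = np.asarray(g_val, dtype=float)
    return (y <= g_val).astype(float) - p


def rho_expectile(y, g_val, p):
    """Expectile moment function w * (Y - g) (Section 6.3), where
    w = p * 1{Y >= g} + (1 - p) * 1{Y < g}."""
    y = np.asarray(y, dtype=float)
    g_val = np.asarray(g_val, dtype=float)
    w = np.where(y >= g_val, p, 1.0 - p)
    return w * (y - g_val)


# Hypothetical toy inputs, for illustration only.
y_demo = np.array([0.0, 1.0, 2.0])
g_demo = np.array([1.0, 1.0, 1.0])
print(rho_quantile(y_demo, g_demo, 0.3))
print(rho_expectile(y_demo, g_demo, 0.5))
```
]]></p></div>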
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Experiments</head><p>We now present a series of experiments to demonstrate our proposed methodologies. We present two kinds of experiments. First, we test the finite-sample performance of our kernel and neural VMM algorithms on a range of synthetic conditional moment problems. In this experiment, we compare their performance with the classical sieve minimum distance (SMD) approach of <ref type="bibr">Ai and Chen (2003)</ref>, which is a sieve-based method that has previously been proposed as a semiparametrically efficient approach to solving generic conditional moment problems. In addition, we compare their performance with the recently proposed maximum moment restriction (MMR) algorithm of <ref type="bibr">Zhang et al. (2020)</ref>, which as discussed in Section 8 is equivalent to our kernel VMM algorithm in the limit as &#945; n &#8594; &#8734;. Second, we test our proposed inference algorithms on a subset of these scenarios, evaluating the quality of the resulting confidence intervals for different variations of our estimation and inference algorithms. Code for reproducing all experiments is available at <ref type="url">https://github.com/CausalML/VMM</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Estimation experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.1">Estimation scenarios</head><p>SimpleIV. This is a simple parametric instrumental variable regression scenario, based on a simple data generating process where Z = sin (&#960;U/10), T = -0.75U + 3.5H + 0.14&#951; -0.6, Y = g(T; &#952; 0 ) -10H + &#1013;.</p><p>In this setup, &#951; and H are exogenous iid N (0, 1) variables, U is an exogenous iid Uniform ( -5, 5) random variable, and each of T, Z, and Y is a scalar. We note that the random variable H introduces endogeneity. Furthermore, we have g(t; &#952;) = &#952; 1 + &#952; 2 t + &#952; 3 t 2 where &#952; &#8712; R 3 , with the true parameter value given by &#952; 0 = [0.5, 3.0, -0.5]. In this scenario, the conditional moment equation to be solved is E[Y -g(T; &#952; 0 ) | Z] = 0; that is, we have X = (T, Y, Z), and &#961;(X; &#952;) = Y -g(T; &#952;). Note that in this scenario the relationship between treatment and instruments is nonlinear.</p><p>HeteroskedasticIV. This is a more challenging instrumental variable regression scenario, which introduces a more complex nonlinear regression function class and heteroskedastic noise. It follows a similar data generating process to the prior SimpleIV scenario, except here we have</p><p>where again &#951; and H are iid N (0, 1) distributed, and each of U 1 and U 2 are iid Uniform( -5, 5) distributed. We also note the 'softplus' activation function is defined according to softplus(x) = log (1 + exp (x)). In this case, we have &#952; &#8712; R 4 , and our regression class is defined according to</p><p>That is, our regression class is a smoothed version of a hinge function with slopes &#952; 3 and &#952; 4 and hinge point at (&#952; 1 , &#952; 2 ). The true parameter value is given by &#952; 0 = [2.0, 3.0, -0.5, 3.0]. 
As with our SimpleIV scenario, the conditional moment restriction is given by</p><p>We note that although the regression residual is not independent of the instruments Z in this setting, it is mean-independent, since</p><p>That is, we have heteroskedastic noise with respect to our instruments, which makes achieving efficiency more challenging.</p><p>PolicyLearning. Finally, this scenario is based on learning optimal binary treatment policies from surrogate loss reductions, following <ref type="bibr">Bennett and Kallus (2020)</ref>. Let T &#8712; { -1, 1} denote the binary treatment variable, Z denote individual covariates, Y(t) denote the potential outcome for the individual that would occur if (possibly counter to fact) treatment t were assigned, and Y = Y(T) denote the actual outcome. Then, given logged data where treatments were decided using some randomized policy, and some well-specified parametric class of deterministic treatment policies &#928; = {&#960; &#952; : &#952; &#8712; &#920;}, the task is to estimate the parameters of the optimal policy within &#928;. That is, we wish to estimate &#952; 0 = arg max &#952;&#8712;&#920; E[Y(&#960;(Z; &#952;))], where &#960;(z; &#952;) denotes the treatment assigned by policy &#960; &#952; given Z = z. For this problem, we assume the following data generating process:</p><p>for q n (Z; &#952;), but is not prescriptive about the methodology for computing &#915; n . Given this, we experimented with various SMD estimators, using B-splines for q n (Z; &#952;), and multiple approaches for &#915; n : (1) Identity, in which we simply set &#915; n (z) = I &#8704;z; (2) Homoskedastic, in which we set &#915; n = E n [&#961;(X; &#952;n )&#961;(X; &#952;n ) &#8868; ] &#8704;z; and (3) Heteroskedastic, in which we fit a diagonal &#915; n (Z) by regressing &#961;(X; &#952;n ) 2 i on Z for each i &#8712; [m] using neural networks. We provide additional details in the Online Supplementary Material.</p></div>
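<div xmlns="http://www.tei-c.org/ns/1.0"><p>The SimpleIV data generating process from the estimation scenarios can be simulated along the following lines. This is a minimal Python sketch under stated assumptions, not the code from the authors' repository: the names sample_simple_iv, g, and THETA_0 are hypothetical, and because the distribution of the outcome noise &#1013; is not stated in the text above, the sketch assumes it is standard normal. The residual Y -g(T; &#952; 0 ) is correlated with T through H, which is the endogeneity the instrument Z is meant to address.</p>
<p><![CDATA[
```python
import numpy as np

# True parameter value of the SimpleIV scenario, as given in the text.
THETA_0 = np.array([0.5, 3.0, -0.5])


def g(t, theta):
    """Quadratic regression function g(t; theta) = theta_1 + theta_2 t + theta_3 t^2."""
    return theta[0] + theta[1] * t + theta[2] * t ** 2


def sample_simple_iv(n, rng):
    """Draw n iid samples (Z, T, Y) from the SimpleIV scenario."""
    u = rng.uniform(-5.0, 5.0, size=n)   # U ~ Uniform(-5, 5)
    h = rng.standard_normal(n)           # H ~ N(0, 1), the unobserved confounder
    eta = rng.standard_normal(n)         # eta ~ N(0, 1)
    eps = rng.standard_normal(n)         # ASSUMPTION: eps ~ N(0, 1), not stated in the text
    z = np.sin(np.pi * u / 10.0)
    t = -0.75 * u + 3.5 * h + 0.14 * eta - 0.6
    y = g(t, THETA_0) - 10.0 * h + eps
    return z, t, y


z, t, y = sample_simple_iv(1000, np.random.default_rng(0))
residual = y - g(t, THETA_0)
print("cov(T, residual):", np.cov(t, residual)[0, 1])
print("cov(Z, residual):", np.cov(z, residual)[0, 1])
```
]]></p></div>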
<div xmlns="http://www.tei-c.org/ns/1.0"><head>OWGMM.</head><p>The OWGMM estimator follows the method described in Section 2.1, for a flexible set of basis functions f 1 , . . . , f k . As with the SMD method we chose these sets of basis functions using B-splines, as this allowed for a very rich and flexible class of moment conditions. Again, we provide additional details in the Online Supplementary Material.</p><p>NCB. Finally, we implemented a simple non-causal baseline (NCB) that estimates &#952; 0 by ignoring Z and instead trying to solve E[&#961;(X; &#952; 0 ) | X] = 0. For example, for our instrumental variable regression scenarios, this corresponds to the assumption that there is no endogeneity in the treatments T. For this baseline, we simply minimize the objective</p><p>, which we implement using L-BFGS.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.3">Estimation results</head><p>For each scenario and each n &#8712; {200, 500, 1,000, 2,000, 5,000, 10,000}, we repeated the following process 50 times: we drew a training set of n random iid data points using the respective scenario's data generating process as well as an additional dataset of n random dev data points for early stopping, hyperparameter tuning, etc., and then we estimated &#952;n using all of our methods and baselines using the sampled dataset. Then, for each combination of scenario and n we computed the mean squared error (MSE) of the estimated &#952;n across these 50 replications. We summarize the results of this process in Table <ref type="table">1</ref>. We also computed results based on the risk (SimpleIV and HeteroskedasticIV) or regret (PolicyLearning) of the estimated g( &#8226; ; &#952;n ). However, these broadly followed the same trend as the main results here, so we leave them to the Online Supplementary Material. We also provide tables of results that break down the MSE in terms of bias and standard deviation in the Online Supplementary Material.</p><p>Overall, we can see that in all scenarios the best performing methods are our VMM methods, with the neural VMM method performing best in the SimpleIV and PolicyLearning scenarios, and the kernel VMM method performing best in the HeteroskedasticIV scenario. Moreover, in all cases both the kernel and neural VMM methods significantly outperform the baselines. In particular, apart from the easy SimpleIV scenario, in the more complex HeteroskedasticIV and PolicyLearning scenarios, our VMM methods yield errors that are orders of magnitude smaller.</p><p>In terms of the values of the regularization hyperparameters, we note that kernel VMM can be sensitive to the choice of &#945; n when it takes extreme values. 
When &#945; n is too small, the algorithm appears to suffer from high variance and occasional catastrophically bad results, whereas when &#945; n is too large the estimation becomes very biased, with performance converging to that of MMR. However, for &#945; n in the range of 10 -6 to 10 -2 , performance is good across all scenarios and n. It remains a question for future work how to automatically select this hyperparameter using observed data. However, we suspect that approaches based on the eigenvalues of (Q( &#952;n ) + &#945; n L) -1 appearing in Lemma 6 might be productive.</p><p>Conversely, we note that our neural VMM algorithm is generally very insensitive to the choice of &#955; n , with very little change in performance even for relatively large values of &#955; n , and very strong and stable performance even when &#955; n = 0. The one minor exception to this is in the challenging PolicyLearning scenario, where using the largest value of &#955; n results in somewhat better performance than other choices for low values of n, but worse performance for large n. This reinforces the notion that the neural network function class and optimization algorithms are naturally regularizing, and that explicit regularization is not necessarily important.</p><p>In general, for both VMM algorithms, we note that there is a wide range of regularization hyperparameter values where performance is generally very good. Furthermore, we note that for both cases the choices of F used were very generic and the same across all scenarios; either an RKHS with a completely generic data-driven kernel, or a very generic shallow MLP. 
Together, this suggests that VMM can generally do very well with generic choices for all hyperparameters and is not very sensitive to these choices as long as they do not take extreme values.</p><p>&#945; n = 10 -6 10.7 &#177; 13.6 1.8 &#177; 1.7 1.6 &#177; 1.0 1.6 &#177; .78 1.9 &#177; .57 2.1 &#177; .47</p><p>4 &#177; .93 2.6 &#177; .65 2.8 &#177; .53 &#945; n = 10 -2 4.2 &#177; 3.5 3.9 &#177; 1.9 4.3 &#177; 1.6 4.1 &#177; 1.0 4.6 &#177; .74 4.8 &#177; .71 &#945; n = 1 6.9 &#177; 4.8 8.2 &#177; 2.8 8.7 &#177; 2.1 8.4 &#177; 1.8 8.6 &#177; 1.1 8.6 &#177; .99 N-VMM &#955; n = 0 &gt;100 53.4 &#177; 88.6 6.6 &#177; 12.0 1.1 &#177; .77 .50 &#177; .33 .92 &#177; .39 &#955; n = 10 -4 &gt;100 &gt;100 7.9 &#177; 18.0 1.1 &#177; .70 .54 &#177; .46 .92 &#177; .47 (continued)</p><p>In the SimpleIV scenario, where E[&#961;(X; &#952;) | Z] is very simple and easy to fit uniformly over &#920;, the SMD and OWGMM baselines performed competitively with our VMM algorithms. However, in the other more challenging scenarios, their behaviour was generally inconsistent and poor. We note that although the average squared error obtained by these methods was extremely high, this seems to be mostly dominated by some outliers, and the typical performance was much more reasonable. For example, in the HeteroskedasticIV scenario when n = 10,000, the median squared errors of the Identity, Homoskedastic, and Heteroskedastic versions of SMD were 48.3, 10.7, and 0.22, respectively, which are much lower than the corresponding average squared errors. This is also evident, for example, from the separate bias and standard deviation results in the Online Supplementary Material. We also note that for both SMD and OWGMM algorithms, we experimented with a wide range of choices for the underlying sieve basis sets, including the number of knots and polynomial degree for the B-splines that we used, as well as ridge-regularization values, and the results presented are for the best-performing choices. 
We speculate that the superior performance of our approach is due to the kernel-based regularization of the critic class, which in practice is better able to approximate the efficient instruments with good accuracy and stability. Indeed, it is plausible that sieve-based approaches could also achieve competitive performance using better choices of sieves, with appropriate regularization. In general, however, the use of such sieve spaces, rather than simple linear sieves with optional ridge regularization, as we experimented with, is either intractable or redundant. In the case of SMD, the corresponding sieve estimates q n (z; &#952;) for E[&#961;(X; &#952;) | Z = z] would no longer have closed-form solutions in &#952; in general, and we would somehow have to solve a bi-level optimization problem. On the other hand, if we were to introduce such regularization to the sieve space that implicitly arises from the variational reformulation of OWGMM, we would just end up with our VMM approach as in Equation ( <ref type="formula">2</ref>).</p><p>On the other hand, we see that the MMR baseline was relatively stable, but consistently sub-optimal. The results of MMR were in general similar to kernel VMM with the largest choices for &#945; n , which is expected given that it is equivalent to kernel VMM with &#945; n &#8594; &#8734;. In addition, as expected, the non-causal baseline is consistently biased with very poor performance.</p><p>Finally, we provide a breakdown of these mean squared error results in terms of bias and variance in the Online Supplementary Material. One interesting observation there is that some methods, in particular our neural VMM algorithm and the OWGMM baseline, do not display the expected behaviour of bias vanishing at a more rapid rate than standard deviation; rather, even though both shrink, their ratio often remains approximately constant. This could be explained by a couple of factors. 
First, in the case of neural VMM we are not exactly optimizing the minimax optimization problem; rather, we are trying to approximate this using an alternating gradient ascent/descent approach. Therefore, such discrepancies may be explained by this deviation from theory in the practical implementation of the algorithm. Note that this issue does not exist for kernel VMM, which performs the optimization over F n analytically. Second, in the case of OWGMM, this discrepancy seems to be explained by the instability and poor performance described above. This could be interpreted as 'finite sample' behaviour, reflecting the fact that we are not yet in the asymptotic regime for this method. Alternatively, it may reflect intractable bias due to approximation errors of the sieve basis for the efficient instruments.</p><p>Note. We write &gt;100 whenever the MSE or standard error was greater than 100.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Inference experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.1">Inference scenarios</head><p>SimpleIV. Our first inference scenario is based on the same SimpleIV scenario as in our estimation experiments. For this scenario, our target for inference is the instantaneous treatment effect at T = 0; that is, we wish to estimate &#968;(&#952; 0 ) where</p><p>HeteroskedasticIV. For our second inference scenario, we consider again the same HeteroskedasticIV scenario from our prior estimation experiments. Here, our target for inference is the change in slope in the true hinge function g( &#8226; ; &#952; 0 ). This corresponds to the function &#968;(&#952;) = &#952; 4 -&#952; 3 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.2">Inference methods</head><p>Kernel inference. For our kernel inference method, we implemented the algorithm described by Theorem 5. We used the same kernel function as for our Kernel VMM estimation algorithm in our prior estimation experiments, and we present results for a variety of values of &#945; n .</p><p>Neural inference. For our neural inference method, we solved the game objective described by Equation ( <ref type="formula">9</ref>). We used the same choice of F n and a similar alternating SGD optimization procedure as for our NeuralVMM estimation method. We provide additional details in the Online Supplementary Material. As in our estimation experiments, we used Frobenius norm regularization, and we present results for varying values of &#955; n .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.3">Inference results</head><p>For each scenario and each n &#8712; {200, 2,000}, we repeated the following procedure 200 times: (1) we drew a training set of n random iid data points using the respective scenario's data generating process; (2) we estimated &#952;n using each of our VMM methods; and (3) we estimated the efficient asymptotic variance using each of our inference methods and each of the estimated &#952;n as plug-ins. That is, for each random draw of data, we estimated the efficient asymptotic variance using each combination of VMM estimation method and inference method. In all cases, we computed an estimated 95% confidence interval as &#968;( &#952;n ) &#177; 1.96 &#8730;(v/n), where v is the estimated asymptotic variance of &#968;( &#952;n ) via the delta method, which we computed using the corresponding inference method as detailed in Section 5. In addition, for each combination of n, scenario, estimation method, and inference method, we computed the following summary statistics: (1) the coverage rate of our estimated confidence intervals; (2) the corresponding coverage when we adjust the confidence intervals by subtracting the bias of &#968;( &#952;n ) (which we estimated by (1/200) &#8721; i &#968;( &#952;(i) n ) -&#968;(&#952; 0 ), where &#952;(i) n denotes the estimate from the i'th replication); and (3) the 5%, 50%, and 95% percentiles of the estimated standard deviation of &#968;( &#952;n ) (given by &#8730;(v/n)) across the 200 replications. Given our previous results that kernel VMM performed very consistently with &#945; n in the range of 10 -6 to 10 -2 , for brevity, we only present results in the main paper for when the estimation method is kernel VMM with &#945; n = 10 -4 . However, we present additional results using other estimation methods in the Appendix. 
We summarize the results from this procedure in Table <ref type="table">2</ref>.</p><p>Overall, we see that in both scenarios, the results are very good when n = 2,000, with very accurate estimates of the standard deviation of &#968;( &#952;n ), and high coverage. For the HeteroskedasticIV scenario, all inference methods produce almost perfect (95%) coverage when n = 2,000, and for the SimpleIV scenario the coverage is only slightly lower, and becomes very close to 95% when the bias of &#952;n is taken into account.</p><p>Note. For each inference method and value of n, we list: Cov the coverage of the respective 95% confidence intervals; CovBC the corresponding bias-corrected coverage, by subtracting the bias of &#968;( &#952;n ) from the confidence intervals; and PredSD(q) the q'th percentile of the estimated standard deviation of &#968;( &#952;n ), for q &#8712; {5, 50, 95}.</p><p>When n = 200, our inference results are slightly poorer. This likely reflects several distinct issues when n is small: the bias of &#952;n may be significant, the variance may not be well characterized by the asymptotic variance and the tails by normal tails, and the estimates of the asymptotic variance of &#968;( &#952;n ) may be poor. Any of these issues may lead to invalid confidence intervals and lower than expected coverage. Indeed, we can see some or all of these issues at play in our results. In Table <ref type="table">2a</ref> we see that coverage is very good when we account for bias, and that the range of the predicted standard deviation of &#952;n is reasonably close to the empirically observed standard deviation of 0.34, which suggests we are suffering from the first issue. 
Conversely, in Table <ref type="table">2b</ref> we see that, even accounting for bias, the coverage is lower than expected when n = 200, and that the range of predicted standard deviations of &#968;( &#952;n ) is low compared to the empirically observed standard deviation of 1.9, which suggests that we are suffering from the second and/or third issues.</p><p>Regarding the difference in performance between our inference methods, we observe that, as expected given Theorem 5, larger values of &#945; n for our kernel method lead to wider confidence intervals. For n = 2,000, where our asymptotic theory seems to be more relevant, we see overly wide confidence intervals for our kernel method when &#945; n is very large, with typically good results when &#945; n takes the same range of values that worked well for estimation in our prior experiments (i.e., in the range of 10 -6 to 10 -2 ). This suggests that we can tune &#945; n for estimation, and use similar values for inference, and also that we can err on the side of caution and wider confidence intervals by using larger values of &#945; n . Conversely, we found our neural inference method to be very insensitive to &#955; n , and in general we found that it produced relatively narrow confidence intervals, with widths similar to those from our kernel method using the smallest values of &#945; n .</p><p>Finally, we emphasize that bias-corrected coverage values are listed merely so we can analyze, in cases where coverage is poor, to what extent this is due to bias in the estimate &#952;n , versus due to poor estimates of the standard deviation of &#952;n . Indeed, the bias-correction we perform is not something that can be done in practice, and these bias-corrected coverages should not be interpreted as actual coverages that can be obtained.</p></div>
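<div xmlns="http://www.tei-c.org/ns/1.0"><p>The summary statistics reported for the inference experiments (coverage, bias-corrected coverage, and percentiles of the predicted standard deviation) could be computed from replicated estimates roughly as follows. This is a hedged Python sketch, not the authors' code: the function name coverage_summary and its arguments are hypothetical, and it assumes normal-approximation intervals with critical value 1.96 and a predicted standard deviation of &#8730;(v/n). As noted above, the bias correction needs the true value &#968;(&#952; 0 ), so it is only available in simulations.</p>
<p><![CDATA[
```python
import numpy as np


def coverage_summary(psi_hats, v_hats, psi_true, n, z_crit=1.96):
    """Summarize confidence-interval quality across replications.

    psi_hats: estimates of psi(theta_0), one per replication.
    v_hats:   estimated asymptotic variances of psi(theta_hat), one per replication.
    psi_true: true value psi(theta_0), known only in simulation.
    n:        sample size used in each replication.
    """
    psi_hats = np.asarray(psi_hats, dtype=float)
    pred_sd = np.sqrt(np.asarray(v_hats, dtype=float) / n)
    half_width = z_crit * pred_sd
    # Bias estimated as the mean estimate minus the truth.
    bias = psi_hats.mean() - psi_true
    cov = np.mean(np.abs(psi_hats - psi_true) <= half_width)
    # Bias-corrected coverage: shift each interval by the estimated bias.
    cov_bc = np.mean(np.abs(psi_hats - bias - psi_true) <= half_width)
    return {
        "Cov": float(cov),
        "CovBC": float(cov_bc),
        "PredSD": np.percentile(pred_sd, [5, 50, 95]),
        "bias": float(bias),
    }


# Hypothetical toy inputs, for illustration only.
print(coverage_summary([1.0, 1.2, 0.8, 1.1], [4.0, 4.0, 4.0, 4.0], 0.5, 100))
```
]]></p></div>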
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Related work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1">Methods for solving conditional moment problems</head><p>For the general conditional moment problem, one classical approach is to solve Equation (3) using a growing sieve basis expansion for {f 1 , . . . , f k } based on, e.g., splines, Fourier series, or power series <ref type="bibr">(Chamberlain, 1987)</ref>. It would be expected, however, that such methods would suffer from the curse of dimensionality and therefore their application would be limited to low-dimensional settings. Furthermore, it has been observed in past work <ref type="bibr">(Bennett &amp; Kallus, 2020;</ref><ref type="bibr">Bennett et al., 2019)</ref> that methods of this kind can be very unstable and perform very poorly in comparison to VMM estimators.</p><p>A very similar method to this is the SMD approach of <ref type="bibr">Ai and Chen (2003)</ref>, which instead uses a growing sieve basis expansion to approximate the conditional function E[&#961;(X; &#952;) | Z = z] for every &#952; &#8712; &#920;. They propose to minimize a loss of the form J n (&#952;) = E n [ q(Z; &#952;) &#8868; &#915;(Z) -1 q(Z; &#952;)], where q(z; &#952;) is the sieve estimate for E[&#961;(X; &#952;)</p><p>One nice feature of this kind of approach is that it can readily handle infinite-dimensional nuisance components. In the case that &#952; can be partitioned as &#952; = (&#946;, &#947;), where &#946; is a finite-dimensional parameter of interest and &#947; is an infinite-dimensional functional nuisance component, <ref type="bibr">Ai and Chen (2003)</ref> propose to model &#947; using a second growing sieve basis expansion, and minimize J n (&#952;) over both &#946; and the sieve coefficients for &#947;. 
There is a long line of work on the theoretical efficiency of this kind of approach, even in the presence of infinite-dimensional nuisance components <ref type="bibr">(Ai &amp; Chen, 2003;</ref><ref type="bibr">Chen &amp; Pouzo, 2009</ref><ref type="bibr">, 2012)</ref>, which is something that our theory does not address. However, these methods have similar practical drawbacks to using a sieve basis expansion for OWGMM, which seems to particularly be the case when the conditional expectation function q(z; &#952;) = E[&#961;(X; &#952;) | Z = z] is complex, as highlighted by the experimental results in this paper. A very similar approach was also proposed concurrently by <ref type="bibr">Newey and Powell (2003)</ref>; however, their approach has the same drawbacks, and furthermore they do not address efficiency.</p><p>Another related classical approach is to solve Equation (3) using estimates of the efficient instruments, which are the set of b functions {f * 1 , . . . , f * b } mapping Z to R b , given by f * i (z) j = F * (z) i,j , where</p><p>Past work such as <ref type="bibr">Newey (1990</ref><ref type="bibr">, 1993)</ref> provides sufficient conditions for such estimators to be efficient. However, since &#952; 0 is unknown, such methods require some other method for first-stage estimation of &#952; 0 , and are likely sensitive to the quality of this method; indeed, if the estimates of f * i are heavily biased due to poor first-stage estimation, it is unclear whether the corresponding moments will be sufficient for identification, let alone efficiency. By contrast, our method is guaranteed to be well behaved as long as our regularized critic class F n can approximate the optimal instruments, regardless of the quality of our first-stage estimate. 
Furthermore, estimators that have been previously proposed based on this approach <ref type="bibr">(Newey, 1990</ref><ref type="bibr">, 1993)</ref> employ nearest neighbour or sieve methods with similar weaknesses as discussed above.</p><p>The continuum GMM estimators of <ref type="bibr">Carrasco and Florens (2000)</ref> are theoretically closely related to our proposed kernel VMM estimators. However, the form of their proposed estimators is very different. Suppose that we define some set of functions {f ( &#8226; ; t) : t &#8712; T} of the form Z &#8594; R m indexed by set T, and we let H T be some Hilbert space of functions in the form T &#8594; R. In addition, define h</p><p>and the linear operator</p><p>and &#952;n is some prior estimate for &#952; 0 . Then, <ref type="bibr">Carrasco and Florens (2000)</ref> study estimators of the form arg min &#952;&#8712;&#920; &#8214;((</p><p>In the case that we choose T to be an RKHS class F , with functions indexed by themselves, and H T to be the dual of this RKHS, it easily follows that the terms C &#8242; n and h &#8242; n defined here are equivalent to the terms C n and h n defined in Section 3. However, the form of Tikhonov regularization applied in the inversion of C 1/2 n is slightly different; by Lemma 2 we regularize using (C n + &#945; n I) -1/2 , whereas they regularized using (C 2 n + &#945; n I) -1/2 C 1/2 n . This difference is significant, since our form of regularization gives rise to the simple minimax VMM-style interpretation, whereas theirs does not. Furthermore, their proposed estimators use the index set T = [0, t max ] for some t max &gt; 0, with H T chosen as the L 2 space on T. This choice is much less flexible than ours of using a function class as the index set and makes it more difficult to guarantee that &#952; 0 is uniquely identified or to guarantee semiparametric efficiency, which they do not. 
More concretely, the main efficiency claim they provide is that their estimator is efficient compared to other estimators of the form sup &#952;&#8712;&#920; &#8214;B &#8242; n h n (&#952;)&#8214; 2 , for any choice of bounded linear operator B &#8242; n . Finally, they propose to solve their optimization problem by computing an explicit rank-n eigenvalue-eigenvector decomposition of C &#8242; n , and constructing a cost function to minimize based on this decomposition. In particular, if we define g i (&#952;) &#8712; H T according to</p><p>and &#952; &#8712; &#920;, then their objective function is given by a quadratic form on all terms of the form &#9001;g i (&#952;), g j (&#952;)&#9002; H T for i, j &#8712; [n]. This involves n 4 terms in total, and is therefore computationally very expensive for large n. In comparison, the cost function in &#952; implied by our kernel VMM estimator could be calculated analytically as a quadratic form in n 2 terms based on the representer theorem. Furthermore, our variational reformulation allows for estimators based on alternating stochastic gradient descent, which may be more practical in some situations, for example when n is large.</p><p>Another recently proposed and related class of estimators is given by the adversarial GMM estimators of <ref type="bibr">Lewis and Syrgkanis (2018)</ref>, which were recently extended to the more general class of minimax GMM estimators by <ref type="bibr">Dikkala et al. (2020)</ref>. In general, these estimators are defined according to arg min &#952;&#8712;&#920; sup</p><p>, where R n is some regularizer on f, and &#936; n is some regularizer on &#952;. In particular, <ref type="bibr">Dikkala et al. 
(2020)</ref> analyze estimators where F and &#920; are both normed function spaces, and the regularizers take the form R n</p><p>On the theoretical side, they provide general results bounding the L 2 distance between E[&#961;(X; &#952;n ) | Z] and E[&#961;(X; &#952; 0 ) | Z] for this form of estimator. Furthermore, they propose various specific estimators of this kind, for example with F chosen as an RKHS or a class of neural networks. We note that these are similar to our proposed kernel and neural VMM estimators, with the difference that they do not include the -(1/4)E n [(f (Z) &#8868; &#961;(X; &#952;n )) 2 ] term motivated by optimal weighting, and that they explicitly regularize &#952;. In a sense, the focus of their estimators and theory is very different from ours; we focus on the question of efficiency, and provide theoretical guarantees of efficiency when &#920; is finite-dimensional, whereas they focus on the case where &#920; is a function space, but restrict their analysis to providing finite-sample bounds rather than addressing efficiency. We speculate that the benefits of both kinds of approaches could be combined, and by using both the optimal weighting-based term and regularizing &#952; one could construct estimators that are semiparametrically efficient when &#952; 0 is finite-dimensional, and have explicit risk guarantees in the more general setting. However, we leave this question to future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.2">Methods for solving the instrumental variable regression problem</head><p>Recall that for the instrumental variable regression problem we have X = (Z, T, Y), where T is the treatment we are regressing on, Y is the outcome, and Z is the instrumental variable, and &#961;(X; &#952;) = Y -g(T; &#952;), for some regression function g parameterized by &#952;. In this setup, &#920; may either be a finite-dimensional parameter space, which corresponds to having a parametric model for g, or alternatively we may allow &#920; to be some infinite-dimensional function space and simply define g(z; &#952;) = &#952;(z), which corresponds to performing nonparametric regression.</p><p>Perhaps the most classic method for instrumental variable regression is two-stage least squares (2SLS). First, we perform least-squares linear regression of &#981;(T) on &#968;(Z), where &#981; and &#968; are finite-dimensional feature maps on T and Z, respectively. That is, we learn some linear model h( &#8226; ; &#947;n ), where h(z; &#947;) = &#947; &#8868; &#968;(z), and &#947;n = arg min &#947; &#8721; n i=1 &#8214;&#981;(T i ) -&#947; &#8868; &#968;(Z i )&#8214; 2 . Then, we again perform least-squares linear regression, this time of Y on h(Z; &#947;n ). That is, we learn a linear model g( &#8226; ; &#952;n ), where g(t; &#952;) = &#952; &#8868; &#981;(t), and &#952;n = arg min &#952; &#8721; n i=1 (Y i -&#952; &#8868; h(Z i ; &#947;n )) 2 . Under the assumption that these linear models are correctly specified, the resulting 2SLS estimator is known to be consistent for &#952; 0 <ref type="bibr">(Angrist &amp; Pischke, 2008, Section 4.1.1)</ref>. However, such estimators are limited in that they require finding some finite-dimensional feature map &#981; such that the linear model given above is well-specified, which in practice may be infeasible. 
The sieve methods of <ref type="bibr">Newey and Powell (2003)</ref> and <ref type="bibr">Ai and Chen (2003)</ref> discussed in Section 8.1, applied specifically to the instrumental variable regression problem, could be viewed as similar approaches that use growing sieve basis expansions for &#981; and &#968;. However, as discussed already, these methods may be problematic in practice.</p><p>Alternatively, a couple of recent works propose extending the 2SLS method to the case where both stages are performed using infinite-dimensional feature maps and ridge regularization; i.e., both stages are performed using kernel ridge regression. The Kernel IV method of <ref type="bibr">Singh et al. (2019)</ref> proposes to do this in a very direct way, by regressing &#981;(T) on &#968;(Z), and then regressing Y on h(Z), where both the feature maps &#981; and &#968; are infinite-dimensional, and implicitly defined by some kernels K Z and K T under Mercer's theorem. In the case of learning h, this corresponds to solving for a linear operator between two RKHSs and in general is ill-posed, so this regression is performed using Tikhonov regularization. Then, the second-stage problem corresponds to performing RKHS regression using some implicit kernel depending on h, and is performed again using Tikhonov regularization. Ultimately, however, by appealing to the representer theorem, the regressions do not need to be performed separately, and there is a simple closed-form solution. Similarly, the Dual IV method of <ref type="bibr">Muandet et al. (2019)</ref> considers 2SLS using RKHSs for each stage and formulates this as a minimax problem of the form arg min &#952;&#8712;&#920; sup</p><p>Ultimately, both this work and that of <ref type="bibr">Singh et al. (2019)</ref> propose closed-form estimators that are superficially similar to ours in Lemma 7, but without any terms corresponding to optimal weighting. 
However, their focus is slightly different from ours; their theoretical analysis, where present, is in terms of consistency or regret, whereas the focus of our theoretical analysis is semiparametric efficiency.</p></div>
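<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the two-stage least squares procedure described above, the following NumPy sketch fits both stages with ordinary least squares. The simulated design, the feature maps phi(t) = (1, t) and psi(z) = (1, z), and the function names are assumptions made for this example only and do not come from the paper or from any of the cited works.</p>
```python
import numpy as np


def two_stage_least_squares(phi_t, psi_z, y):
    """2SLS from precomputed features phi(T_i) (n, p), psi(Z_i) (n, q) and outcomes y (n,)."""
    # Stage 1: least-squares regression of phi(T) on psi(Z); gamma has shape (q, p).
    gamma, *_ = np.linalg.lstsq(psi_z, phi_t, rcond=None)
    fitted = psi_z @ gamma  # h(Z_i; gamma_n), shape (n, p)
    # Stage 2: least-squares regression of Y on the first-stage fitted values.
    theta, *_ = np.linalg.lstsq(fitted, y, rcond=None)
    return theta


def simulate_linear_iv(n=20000, seed=0):
    """Hypothetical design: U confounds T and Y, Z affects Y only through T."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    u = rng.normal(size=n)  # unobserved confounder
    t = z + u + 0.5 * rng.normal(size=n)
    y = 2.0 * t + 3.0 * u + 0.5 * rng.normal(size=n)  # true slope on T is 2
    return z, t, y


z, t, y = simulate_linear_iv()
phi_t = np.column_stack([np.ones_like(t), t])  # phi(t) = (1, t)
psi_z = np.column_stack([np.ones_like(z), z])  # psi(z) = (1, z)
theta_2sls = two_stage_least_squares(phi_t, psi_z, y)
theta_ols, *_ = np.linalg.lstsq(phi_t, y, rcond=None)  # naive regression of Y on T
print("2SLS slope:", theta_2sls[1], "OLS slope:", theta_ols[1])
```
<p>In this simulated design the naive regression of Y on T is pulled away from the true slope by the unobserved confounder, whereas the 2SLS slope should land close to 2, since both linear models are correctly specified by construction.</p></div>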
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The recent Deep IV method of <ref type="bibr">Hartford et al. (2017)</ref> proposes to extend 2SLS using deep learning. Specifically, they propose in the first stage to fit the conditional distribution of T given Z, for example using a mixture of Gaussians parametrized by neural networks, or by fitting a generative model using some other methodology such as generative adversarial networks or variational autoencoders. Then, in the second stage, they propose to minimize</p><p>, where the conditional expectation &#202;[&#8226; | z] is estimated using the model from the first stage, and g is parameterized using some neural network architecture. This approach has the advantage of being flexible and building on recent advances in deep learning. However, the authors do not provide any concrete theoretical characterizations and, since the first stage is bound to be imperfectly specified, the approach can suffer from the 'forbidden regression' issue <ref type="bibr">(Angrist &amp; Pischke, 2008, Section 4.6.1)</ref>. <ref type="bibr">Zhang et al. (2020)</ref> recently proposed the maximum moment restriction (MMR) instrumental variable algorithm. They present multiple estimators for approximately solving arg min &#952;&#8712;&#920; sup</p><p>, where F is an RKHS, and &#936; n is an optional regularizer on &#952; in the case that it is infinite-dimensional (although they also analyze the case where &#952; is finite-dimensional). Of particular note, the 'V-statistic' version of their algorithm is equivalent to minimizing</p><p>, where &#961;(&#952;) and L are defined as in Lemma 6. Letting J K-VMM n (&#952;; &#945;) denote our kernel VMM objective with regularization strength &#945;, and assuming &#936; n (&#952;) = 0, Lemma 6 immediately implies that &#945;J K-VMM n (&#952;; &#945;) &#8594; J MMR n (&#952;) as &#945; &#8594; &#8734;. 
In other words, there is an equivalence between MMR and kernel VMM with infinite regularization. <ref type="bibr">Zhang et al. (2020)</ref> provide theory showing that their estimators are consistent and asymptotically normal under various assumptions. However, unlike us, they do not establish efficiency.</p></div>
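<div xmlns="http://www.tei-c.org/ns/1.0"><p>To give a concrete feel for a V-statistic objective of the MMR kind, the sketch below assumes the commonly stated form (1/n^2) r' K r, with residuals r_i = Y_i - g(T_i; theta) and K a Gram matrix on the instruments, together with a linear model for g so that the minimizer is available by solving a linear system. The Gaussian kernel, its bandwidth, the simulated data, and all function names are assumptions for illustration; the exact objective and the matrix L of Lemma 6 are not reproduced here.</p>
```python
import numpy as np


def gaussian_gram(z, bandwidth=1.0):
    """Gram matrix with entries exp(-(z_i - z_j)^2 / (2 * bandwidth^2)) for 1-d z."""
    diff = z[:, None] - z[None, :]
    return np.exp(-0.5 * (diff / bandwidth) ** 2)


def mmr_objective(theta, phi_t, y, gram):
    """V-statistic (1/n^2) * r' K r with residuals r_i = Y_i - theta' phi(T_i)."""
    resid = y - phi_t @ theta
    return resid @ gram @ resid / len(y) ** 2


def mmr_linear_fit(phi_t, y, gram):
    """Minimizer for linear g: solves (Phi' K Phi) theta = Phi' K y."""
    return np.linalg.solve(phi_t.T @ gram @ phi_t, phi_t.T @ gram @ y)


# Hypothetical confounded design with true slope 2 (not from the paper).
rng = np.random.default_rng(0)
n = 1500
z = rng.normal(size=n)
u = rng.normal(size=n)  # unobserved confounder
t = z + u + 0.5 * rng.normal(size=n)
y = 2.0 * t + 3.0 * u + 0.5 * rng.normal(size=n)

phi_t = np.column_stack([np.ones(n), t])  # g(t; theta) = theta_0 + theta_1 * t
gram = gaussian_gram(z)
theta_mmr = mmr_linear_fit(phi_t, y, gram)
print("MMR slope:", theta_mmr[1])
```
<p>Because the Gram matrix is positive semidefinite, the objective is a convex quadratic in theta for a linear g, which is why a single linear solve suffices in this sketch. Note that no weighting by an estimate of the residual variance appears, which mirrors the absence of an optimal weighting term discussed above.</p></div>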
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.3">Applications of VMM estimators</head><p>Finally, we discuss some past work where VMM estimators have been applied. The original such work was by <ref type="bibr">Bennett et al. (2019)</ref>, who proposed the DeepGMM estimator for the problem of instrumental variable regression. Specifically, the proposed estimator takes the form arg min &#952; sup f &#8712;F E n [f (Z) &#8868; (Y - g(T; &#952;))] - (1/4)E n [f (Z) 2 (Y - g(T; &#952;n )) 2 ], where {g( &#8226; ; &#952;) : &#952; &#8712; &#920;} and F are both given by neural network function classes. That is, the DeepGMM estimator can be interpreted as a neural VMM estimator for the instrumental variable problem in the form of equation ( <ref type="formula">6</ref>) with R n (f ) = 0 and fixed F n that does not grow with n. In their experiments, DeepGMM consistently outperformed other recently proposed methods <ref type="bibr">(Hartford et al., 2017;</ref><ref type="bibr">Lewis &amp; Syrgkanis, 2018)</ref> across a variety of simple low-dimensional scenarios, and it was the only method to continue working when using high-dimensional data where the treatments and instruments were images. In addition, DeepGMM has continued to perform competitively in more recent experimental comparisons <ref type="bibr">(Muandet et al., 2019;</ref><ref type="bibr">Singh et al., 2019)</ref>. <ref type="bibr">Bennett et al. (2019, Theorem 2)</ref> provided conditions under which DeepGMM is consistent. In addition, we could also justify that it is asymptotically normal and semiparametrically efficient by Theorem 4, under some additional assumptions and by introducing kernel-based regularization.</p><p>In addition, this style of estimator was applied to the problem of policy learning from convex surrogate loss reductions by <ref type="bibr">Bennett and Kallus (2020)</ref>. 
A common approach for optimizing binary treatment decision policies from logged cross-sectional data is to minimize a surrogate cost function of the form E n [|&#968;|l(g(X; &#952;), sign(&#968;))], where X denotes observed pretreatment information about the individual, &#968; is some weighting variable depending on all observed pre- and post-treatment information about the individual, the function g( &#8226; ; &#952;) encodes the policy we are optimizing, which we assume is parameterized by &#952; &#8712; &#920;, and l is some smooth convex loss function such as logistic regression loss. <ref type="bibr">Bennett and Kallus (2020)</ref> showed that the model where this surrogate loss is correctly specified is given by the conditional moment problem E[|&#968;|l &#8242; (g(X; &#952;), sign(&#968;)) | X] = 0, where l &#8242; is the derivative of l with respect to its first argument. Consequently, they proposed the empirical surrogate loss policy risk minimization (ESPRM) estimator, defined as arg min &#952;&#8712;&#920; sup f &#8712;F E n [f (X)&#961;(X, &#968;; &#952;)] - (1/4)E n [f (X) 2 &#961;(X, &#968;; &#952;n ) 2 ], where &#961;(X, &#968;; &#952;) = |&#968;|l &#8242; (g(X; &#952;), sign(&#968;)), and F is a neural network function class. That is, again this estimator can be interpreted as a neural VMM estimator as in equation ( <ref type="formula">6</ref>), with R n (f ) = 0. Not only did the authors demonstrate that this algorithm led to consistently improved empirical performance over the standard approach of empirical risk minimization using the surrogate loss, but they also proved that if the resulting estimator &#952;n is semiparametrically efficient, then this implies optimal asymptotic regret for the learnt policy compared with any policy identified by the model given by correct specification. 
We note that, although the authors did not address the question of how to guarantee such efficiency for &#952;n , we could guarantee it by Theorem 4 under some additional assumptions and kernel-based regularization, or under Theorem 3 by instead using a kernel VMM estimator.</p><p>Finally, <ref type="bibr">Bennett et al. (2021)</ref> applied this style of estimator to the problem of reinforcement learning using offline data logged from some fixed behaviour policy, also known as the problem of off-policy evaluation (OPE). They proposed an algorithm for the OPE problem under unmeasured confounding, which requires as an input an estimate of the state density ratio d between the behaviour policy and the target policy they are evaluating. As stated in Section 1, d can be identified by a conditional moment problem, up to a constant factor, with the normalization constraint E[d(S)] = 1. <ref type="bibr">Bennett et al. (2021)</ref> proposed a VMM-style estimator for d, using both the conditional moment condition E[d(S)&#946;(A, S) - d(S &#8242; ) | S &#8242; ] = 0 and the marginal moment condition E[d(S) - 1] = 0, based on a slightly more general form of Lemma 1 where the vector of conditional moment restrictions can depend on different random variables to be conditioned on. That is, the more general problem is given by the m moment conditions E[&#961; i (X; &#952; 0 ) | Z i ] = 0 for i &#8712; [m], for some set of random variables Z 1 &#8712; Z 1 , . . . , Z m &#8712; Z m . Specifically, they propose a kernel VMM-style estimator, where both d and f are optimized over balls in RKHSs. In practice, by successively applying the representer theorem to this two-stage optimization problem, they presented a closed-form solution for the estimate dn (in a similar vein to Lemma 7). 
Note that since their kernel VMM estimator is based on a slightly more intricate conditional moment formulation than we considered, with varying conditioning sets, our theoretical analysis may not apply to it. We leave the question of extending our theoretical analysis to this more general problem to future work.</p></div>
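<div xmlns="http://www.tei-c.org/ns/1.0"><p>The game objectives of the estimators discussed in this section share the form E_n[f rho(theta)] - (1/4) E_n[f^2 rho(theta_prior)^2]. As a small numerical check of the variational reformulation that motivates this form, the sketch below restricts f to a finite-dimensional linear class f(z) = beta' psi(z), which is an assumption made purely for illustration. In that case the supremum over beta equals the optimally weighted GMM objective m' Omega^{-1} m, attained at beta = 2 Omega^{-1} m. The data, feature maps, parameter values, and function names are hypothetical.</p>
```python
import numpy as np


def moment_terms(theta, theta_prior, phi_t, psi_z, y):
    """m = E_n[psi(Z) rho(X; theta)] and Omega = E_n[psi(Z) psi(Z)' rho(X; theta_prior)^2]."""
    rho = y - phi_t @ theta
    rho_prior = y - phi_t @ theta_prior
    m = psi_z.T @ rho / len(y)
    omega = (psi_z * rho_prior[:, None] ** 2).T @ psi_z / len(y)
    return m, omega


def game_value(beta, m, omega):
    """E_n[f rho(theta)] - (1/4) E_n[f^2 rho(theta_prior)^2] for f(z) = beta' psi(z)."""
    return beta @ m - 0.25 * beta @ omega @ beta


def owgmm_value(m, omega):
    """Optimally weighted GMM objective m' Omega^{-1} m."""
    return m @ np.linalg.solve(omega, m)


# Hypothetical data and parameter values, chosen only for illustration.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
u = rng.normal(size=n)  # unobserved confounder
t = z + u + 0.5 * rng.normal(size=n)
y = 2.0 * t + 3.0 * u + 0.5 * rng.normal(size=n)
phi_t = np.column_stack([np.ones(n), t])          # g(t; theta) = theta' phi(t)
psi_z = np.column_stack([np.ones(n), z, z ** 2])  # f(z) = beta' psi(z)

m, omega = moment_terms(np.array([0.0, 1.5]), np.array([0.0, 2.0]), phi_t, psi_z, y)
beta_star = 2.0 * np.linalg.solve(omega, m)  # maximizer of the concave quadratic in beta
sup_value = game_value(beta_star, m, omega)
random_values = [game_value(rng.normal(size=3), m, omega) for _ in range(100)]
print(sup_value, owgmm_value(m, omega), max(random_values))
```
<p>The first two printed numbers should agree up to floating-point error, and no randomly drawn beta should exceed them. With richer classes for f, such as RKHS balls or neural networks, no such closed form is assumed here and the inner supremum would generally need to be handled as described in the main text.</p></div>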
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">Conclusion</head><p>In this paper, we presented a detailed theoretical analysis for the class of VMM estimators, which are motivated by a variational reformulation of the optimally weighted generalized method of moments and which encompass several recently proposed estimators for solving conditional moment problems. We studied multiple varieties of these estimators based on kernel methods or deep learning, and provided appropriate conditions under which these estimators are consistent, asymptotically normal, and semiparametrically efficient. This is in contrast to other recently proposed approaches for solving conditional moment problems using machine learning tools, which do not provide any results regarding efficiency. In addition, we proposed inference algorithms based on the same kind of variational reformulation, again with specific algorithms based on both kernel methods and deep learning. Finally, we demonstrated in a detailed series of experiments that our VMM estimators achieve very strong estimation performance in comparison to relevant baselines and that the confidence intervals we generate are reliable.</p><p>Our paper suggests a few immediate directions for future work. First, unlike, e.g., the sieve minimum distance approaches of <ref type="bibr">Ai and Chen (2003)</ref>; <ref type="bibr">Chen and</ref><ref type="bibr">Pouzo (2009, 2012)</ref>, our efficiency theory when &#952; 0 is finite-dimensional does not accommodate possible infinite-dimensional nuisance components. Furthermore, as discussed in Section 3, the latter two works allow for weaker assumptions on the smoothness and complexity of &#961;(X; &#952;). We suspect that our theory could be extended accordingly without fundamentally changing the VMM algorithm, but this is left to future work.</p><p>Second, we only consider conditional moment restrictions using a single conditioning variable Z. 
In some settings, such as longitudinal studies or the RL application discussed in Section 8.3, one faces conditional moment problems with different, nested conditioning variables for each conditional moment restriction, and our current theory does not accommodate such formulations. Again, we believe that our theory could naturally be extended to this kind of setting.</p><p>Third, we only present theory for neural VMM estimators using a kernel-based regularizer, yet we see compelling empirical results for simpler regularizers. We speculate that under appropriate conditions on the neural net classes F n , our efficiency result in Theorem 4 could be extended to neural VMM estimators with such regularizers.</p><p>Next, an important further direction is the automatic selection of the hyperparameter &#945; n for our kernel VMM method and corresponding inference algorithm. We speculate, for instance, that it may be possible to approximate the resulting bias and variance for different values of &#945; n and optimize a bias-variance trade-off. At the same time, work on approximating the bias of our estimator could be helpful for improving the quality of confidence intervals from our proposed inference algorithm, as we observed that in many cases coverage of our confidence intervals significantly improved when they were corrected for bias. 
Similarly, it is known that continuously updating GMM can have lower bias than k-step GMM algorithms <ref type="bibr">(Hansen et al., 1996)</ref>, which suggests that we may be able to reduce bias using a continuously updating VMM where instead of using a prior estimate &#952;n in the second term of the game objective we use the same &#952; that we are optimizing over.</p><p>Finally, we hope that this work will help motivate the construction of efficient VMM estimators for other conditional moment problems.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>JR Stat Soc Series B: Statistical Methodology, 2023, Vol. 85, No. 3  </p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>Downloaded from https://academic.oup.com/jrsssb/article/85/3/810/7146121 by Cornell University user on 02 October 2023</p></note>
		</body>
		</text>
</TEI>
