<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Double Machine Learning Density Estimation for Local Treatment Effects with Instruments</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10318188</idno>
					<idno type="doi"></idno>
					<title level='j'>Advances in Neural Information Processing Systems</title>
<idno type="issn">1049-5258</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Y. Jung</author><author>J. Tian</author><author>E. Bareinboim</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[It is common to quantify causal effects with mean values, which, however, may fail to capture significant distribution differences of the outcome under different treatments. We study the problem of estimating the density of the causal effect of a binary treatment on a continuous outcome given a binary instrumental variable in the presence of covariates. Specifically, we consider the local treatment effect, which measures the effect of treatment among those who comply with the assignment under the assumption of monotonicity (only the ones who were offered the treatment take it). We develop two families of methods for this task, kernel-smoothing and model-based approximations -- the former smooths the density by convolving it with a smooth kernel function; the latter projects the density onto a finite-dimensional density class. For both approaches, we derive double/debiased machine learning (DML) based estimators. We study the asymptotic convergence rates of the estimators and show that they are robust to the biases in nuisance function estimation. We illustrate the proposed methods on synthetic data and a real dataset called 401(k).]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Controlled experimentation is a powerful tool used throughout the empirical sciences to infer the effect of a certain treatment on a given outcome. The idea is to randomize the treatment assignment so as to neutralize the effect of unobserved confounders. However, in some practical settings, it may be challenging to ensure that individuals who are selected for treatment will follow their recommendations. Issues of non-compliance and unmeasured confounding are quite common and lead to the non-identification of treatment effects in many real-world cases <ref type="bibr">[29,</ref><ref type="bibr">50,</ref><ref type="bibr">32,</ref><ref type="bibr">56</ref>].</p><p>An approach known as instrumental variables (IVs) has been proposed to try to circumvent this issue <ref type="bibr">[68]</ref>. The idea is to find a set of variables (possibly a singleton) that are not themselves the target of the analysis but that will help to control for the unobserved confounding between the treatment and the outcome. In particular, IVs are special variables that (i) are correlated with the treatment, (ii) do not directly influence the outcome, and (iii) are not affected by certain unmeasured confounders. For concreteness, consider a study of the effect of 401(k) participation (X) on the distribution of net financial assets (Y) <ref type="bibr">[2]</ref>. This setting is represented by the causal graph in Fig. <ref type="figure">1</ref>. Note that a dashed-bidirected arrow exists between X and Y, which in graphical language represents unobserved confounding affecting both X and Y. The variable Z in this model represents eligibility for a 401(k). We note that Z qualifies as an instrument in this case: (i) it affects participation in the 401(k) program (X), (ii) it has no direct influence on net financial assets (Y), and (iii) it is not affected by the unmeasured confounders between X and Y. 
The variable W represents observed covariates (e.g., gender, age, ethnicity, income, family size).</p><p>We are interested in the particular setting where only individuals who were offered the treatment may have access to it <ref type="bibr">[31]</ref>. For instance, in the case of 401(k) participation (X = 1), only eligible individuals (Z = 1) would be allowed to join the program. This assumption is known in the literature as monotonicity, which rules out the possibility that any unit responds contrary to the instrument. Under monotonicity, the causal effect in the subpopulation whose actual treatment X coincides with the assigned treatment Z (called compliers) is identifiable <ref type="bibr">[31,</ref><ref type="bibr">2]</ref>. The average treatment effect (ATE) for the compliers is called the 'local ATE' (LATE), or the complier average causal effect (CACE) <ref type="bibr">[31]</ref>.</p><p>The most common quantification of these effects in IV settings found in practice is the average (e.g., the LATE). The average is certainly an informative summary; however, it may fail to capture significant differences in the causal distributions of the outcome. For instance, consider Fig. <ref type="figure">2</ref>, which shows the densities of the outcome Y under treatment X = 1 among compliers, generated from samples drawn from four synthetic data-generating processes represented by the IV graph in Fig. <ref type="figure">1</ref> (further discussed in Sec. 5). All four distributions have the same mean 0 and variance 2. However, the differences among the LTE distributions are self-evident. In this paper, our goal is to provide methods to estimate densities of local treatment effects in IV settings under the monotonicity assumption. We develop two families of methods for this task based on kernel-smoothing and model-based approximations. 
The former smooths the density by convolution with a kernel function; the latter projects the density onto a finite-dimensional density class based on a distributional distance measure. For both approaches, we construct double/debiased machine learning (DML) style density estimators <ref type="bibr">[43,</ref><ref type="bibr">54,</ref><ref type="bibr">52,</ref><ref type="bibr">70,</ref><ref type="bibr">13]</ref>. We analyze the asymptotic convergence properties of the estimators, showing that they can converge fast (i.e., at a √n-rate) even when the nuisance estimates converge slowly (e.g., at an n^{-1/4} rate), a property called 'debiasedness'<ref type="foot">foot_0</ref>. We illustrate the proposed methods on synthetic and real data.</p></div>
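The observation motivating density estimation, that distributions can share the same mean and variance while differing markedly in shape, can be checked with a short numerical sketch. The constants below are ours for illustration (not the paper's data-generating processes): a Gaussian and an equal-weight two-component Gaussian mixture, both with mean 0 and variance 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# N(0, 2): mean 0, variance 2.
gaussian = rng.normal(0.0, np.sqrt(2.0), n)

# Equal-weight mixture of N(-1.2, 0.56) and N(+1.2, 0.56):
# mean 0, variance 0.56 + 1.2**2 = 2, but bimodal.
signs = np.where(rng.random(n) < 0.5, -1.0, 1.0)
mixture = rng.normal(1.2 * signs, np.sqrt(0.56))

# The first two moments agree, yet the mass near the mean does not:
frac_gauss = np.mean(np.abs(gaussian) < 0.5)  # about 0.28
frac_mix = np.mean(np.abs(mixture) < 0.5)     # about 0.16
```

A mean-based summary such as the LATE cannot distinguish these two outcome distributions, which is exactly the gap the density estimators in this paper address.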
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Related work</head><p>Our work touches on several areas, which we discuss next. Double/Debiased Machine Learning (DML) <ref type="bibr">[13]</ref>-based causal effect estimators. The DML framework has been adapted for estimating the average causal effect in settings where the back-door criterion <ref type="bibr">[50,</ref><ref type="bibr">Sec. 3.3.1]</ref> (also known as ignorability <ref type="bibr">[57]</ref>) holds (e.g., <ref type="bibr">[12,</ref><ref type="bibr">19]</ref>). Recently, DML-based causal effect estimators have been developed for any identifiable causal functional in a given causal graph and equivalence classes thereof <ref type="bibr">[33,</ref><ref type="bibr">34]</ref>.</p><p>Local average &amp; quantile treatment effects. The formal identification results for the LATE under the monotonicity assumption in IV settings were developed by <ref type="bibr">[31,</ref><ref type="bibr">3]</ref>. Building on these results, semiparametric estimation of the LATE has received remarkable attention <ref type="bibr">[2,</ref><ref type="bibr">60,</ref><ref type="bibr">23,</ref><ref type="bibr">62,</ref><ref type="bibr">48]</ref>, including robust LATE estimators that achieve debiasedness <ref type="bibr">[47,</ref><ref type="bibr">40,</ref><ref type="bibr">38,</ref><ref type="bibr">64]</ref>. As shown in Fig. <ref type="figure">2</ref>, however, the average is sometimes insufficient to capture the effects of the treatment on the distribution of outcomes. To address this issue, the problem of estimating quantiles or CDFs has attracted attention. A common approach to estimating quantiles or CDFs is based on LATE estimation. 
Since the expectation of the indicator 𝟙(Y ≤ y) that the outcome Y falls below a threshold y reduces to the CDF (i.e., replacing Y in the LATE with 𝟙(Y ≤ y)), estimators for the LATE can be used to estimate quantiles or CDFs <ref type="bibr">[1,</ref><ref type="bibr">2,</ref><ref type="bibr">15,</ref><ref type="bibr">24,</ref><ref type="bibr">16,</ref><ref type="bibr">30,</ref><ref type="bibr">45,</ref><ref type="bibr">18,</ref><ref type="bibr">69]</ref>. Non-regular target estimands. An estimand that possesses neither an influence function nor a √n-rate estimator is called 'non-regular'. Densities are an example of non-regular target estimands <ref type="bibr">[7,</ref><ref type="bibr">Chap. 3]</ref>. One can approximate a non-regular target with a smooth one such that an influence function and √n-rate estimators can be derived. Two broadly used approaches are kernel-smoothing-based (e.g., <ref type="bibr">[52,</ref><ref type="bibr">6,</ref><ref type="bibr">42,</ref><ref type="bibr">19,</ref><ref type="bibr">35]</ref>) and model-based (e.g., <ref type="bibr">[46,</ref><ref type="bibr">52,</ref><ref type="bibr">21,</ref><ref type="bibr">41,</ref><ref type="bibr">40,</ref><ref type="bibr">39]</ref>). Causal density estimation. There is limited literature on estimating the density of treatment effects. Most results assume that ignorability/back-door admissibility holds <ref type="bibr">[55,</ref><ref type="bibr">49]</ref>. <ref type="bibr">[22]</ref> used the kernel-smoothing technique to estimate the density of a treatment effect, and <ref type="bibr">[42]</ref> provided a kernel-smoothing-based density estimator that achieves double robustness and debiasedness, building on top of the work in <ref type="bibr">[53]</ref>. Recently, <ref type="bibr">[39]</ref> investigated a model-based approach and developed estimators that achieve debiasedness. 
Under the IV setting, <ref type="bibr">[10]</ref> provided a local polynomial regression-based density estimator for local treatment effects; we are not aware of any work studying debiased density estimators. As mentioned, this paper investigates both kernel-smoothing and model-based approaches for estimating local treatment effects under IV settings and develops DML-style density estimators for both.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">LTE Estimation -Problem setup</head><p>In our analysis, each variable is represented with a capital letter (X) and its realized value with a lowercase letter (x). For a discrete (e.g., binary) random variable X, we use 𝟙_x(X) to denote the indicator function such that 𝟙_x(X) = 1 if X = x and 𝟙_x(X) = 0 otherwise. For a continuous variable X with probability density p(x) of a distribution P and a function f, we write P[f(X)] ≔ ∫_𝒳 f(x) p(x) dx, where 𝒳 is the domain of X, and ‖f(X)‖ ≔ √(P[f(X)²]).</p><p>Structural Causal Models (SCMs). We use the language of SCMs as our basic semantic and inferential framework <ref type="bibr">[50,</ref><ref type="bibr">4]</ref>. An SCM M is a quadruple M = ⟨U, V, P(U), F⟩, where U is a set of exogenous (latent) variables following a joint distribution P(u), and V is a set of endogenous (observable) variables whose values are determined by functions F = {f_{V_i}}_{V_i ∈ V} such that V_i ← f_{V_i}(pa_i, u_i), where PA_i ⊆ V and U_i ⊆ U. Each SCM M induces a distribution P(v) and a causal graph G = G(M) over V in which there exists a directed edge from every variable in PA_i to V_i, and dashed-bidirected arrows encode common latent variables (e.g., see Fig. <ref type="figure">1</ref>). Within the structural semantics, performing an intervention that sets X = x is represented through the do-operator, do(X = x), which encodes the operation of replacing the original equation of X (i.e., f_X(pa_X, u_X)) with the constant x and induces a submodel M_x and an interventional distribution P(v|do(x)). For any variable Y ∈ V, the potential response Y_x(u) is defined as the solution of Y in the submodel M_x given U = u, which induces a counterfactual variable Y_x. Local Treatment Effect (LTE) with IV. We consider the IV setting represented by the causal graph G in Fig. 
<ref type="figure">1</ref><ref type="foot">foot_1</ref>, where Z is a binary instrument with domain {0, 1}, X is a binary treatment with domain {0, 1}, Y is a (set of) continuous outcomes with bounded domain 𝒴 ⊆ ℝ^d, and W is a set of covariates (continuous, discrete, or mixed). G satisfies the IV assumption that Z has no direct influence on the outcome Y and is not affected by unmeasured confounders between X and Y.</p><p>The causal density p(y_x) is not identifiable from the observed density p(x, y, z, w) due to the unobserved confounders between X and Y. However, the effect can be recovered for a certain subpopulation under additional assumptions. Formally, a unit in the population is an always-taker if X_{Z=0} = X_{Z=1} = 1, a never-taker if X_{Z=0} = X_{Z=1} = 0, a complier if X_{Z=0} = 0 and X_{Z=1} = 1, and a defier if X_{Z=0} = 1 and X_{Z=1} = 0 <ref type="bibr">[3,</ref><ref type="bibr">2]</ref>. We will make the following assumptions, which are standard in the literature. Assumption 1 (Monotonicity). There are no defiers: X_{Z=1} ≥ X_{Z=0}. Assumption 2 (Positivity). P(x|z, w) &gt; 0 and P(z|w) &gt; 0 for any x, z, w.</p><p>Let C denote the event that a unit is a complier (i.e., a unit such that X_{Z=0} = 0 and X_{Z=1} = 1). For a given constant a and a variable X, let x_a denote the event X = a. The LTE p(y_x|C) is identifiable under monotonicity and is given by <ref type="bibr">[31,</ref><ref type="bibr">2]</ref>:</p><p>where the expectation is over W. In this paper, we aim to estimate the LTE density p(y_x|C) in Eq. (<ref type="formula">1</ref>). We will make the following mild assumption on some densities, commonly employed in the density estimation literature (e.g., <ref type="bibr">[44,</ref><ref type="bibr">25,</ref><ref type="bibr">27,</ref><ref type="bibr">61,</ref><ref type="bibr">26,</ref><ref type="bibr">42]</ref>). Assumption 3. For any x, z, w, y, the densities p(y|w, z, x), p(y|z, x), and p(y_x|C) are bounded, and p(y_x|C) is twice differentiable.</p><p>DML method. Let θ_{P_0} denote a functional of an arbitrary distribution P_0. We use P to denote the true distribution, such that D ∼ P. 
Let θ_0 ≔ θ_P denote the true parameter to be estimated. To estimate θ_0, DML-based estimators use a Neyman orthogonal score φ(V; θ_0, η) (where η ≔ η_{P_0} is a set of nuisance parameters and η_0 ≔ η_P denotes the true nuisances), a function such that E_P[φ(V; θ_0, η_0)] = 0 and (∂/∂η)|_{η=η_0} E_P[φ(V; θ_0, η)] = 0. Given φ, a DML estimator is constructed using the cross-fitting technique: for randomly split halves of the samples, the nuisances are estimated on one half, the empirical score equation is solved on the other, and the two resulting estimates are averaged.</p><p>In addition to being consistent, the estimator θ̂ exhibits a robustness property called debiasedness: θ̂ converges to θ_0 at the root-N rate even when η̂ converges to η_0 at the slower N^{-1/4} rate [13, Thm. 3.1]. A Neyman orthogonal score can be derived by adding a moment score function to the influence function of its expectation <ref type="bibr">[14,</ref><ref type="bibr">Thm. 1</ref>]. An influence function ψ of the functional θ_P is defined as a solution satisfying E_P[ψ] = 0, E_P[ψ²] &lt; ∞, and (∂/∂t) θ_{P_t}|_{t=0} = E_P[ψ(V; θ, η) S_t(V; t = 0)], where P_t ≔ P(v)(1 + t g(v)) for t ∈ ℝ and any bounded mean-zero function g(·) over V, and S_t(v; t = 0) ≔ (∂/∂t) log P_t(v)|_{t=0} <ref type="bibr">[63,</ref><ref type="bibr">Chap. 25]</ref>.</p><p>Due to space constraints, all proofs are provided in Appendix B of the supplementary material.</p></div>
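The cross-fitting construction can be sketched in a few lines of code. The skeleton below is our own illustration (all function names are ours), and the toy score targets a simple mean rather than the LTE functional: nuisances are fit on one half of the data, the score is solved on the other half, the roles are swapped, and the two estimates are averaged.

```python
import numpy as np

def cross_fit(data, fit_nuisance, solve_score, seed=0):
    """Generic two-fold cross-fitting: fit nuisances on one half,
    solve the empirical score equation on the other, swap, average."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    d1, d2 = idx[: len(data) // 2], idx[len(data) // 2 :]
    estimates = []
    for train, evaluate in ((d2, d1), (d1, d2)):
        eta_hat = fit_nuisance(data[train])                     # nuisance fit on D'
        estimates.append(solve_score(data[evaluate], eta_hat))  # score solved on D
    return 0.5 * (estimates[0] + estimates[1])

# Toy example: estimate theta0 = E[Y] via the orthogonal-style score
# phi(Y; theta, eta) = (Y - eta) + eta - theta, solved for theta.
y = np.random.default_rng(1).normal(loc=1.0, scale=1.0, size=2000)
theta_hat = cross_fit(y, fit_nuisance=np.mean,
                      solve_score=lambda d, eta: np.mean(d - eta) + eta)
```

The same skeleton applies to the KLTE and MLTE estimators once `fit_nuisance` returns the fitted {π̂, δ̂, ν̂} and `solve_score` solves the corresponding orthogonal score equation.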
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Kernel-smoothing-based approach</head><p>In this section, we develop a kernel-smoothing-based approach for estimating the LTE density. The kernel-smoothing technique approximates a non-pathwise-differentiable target estimand with a differentiable one by convolving the density with a kernel function K(y). For convenience, we denote the target estimand by θ(y) ≔ p(y_x|C). In the kernel-smoothing-based approach, we aim to estimate a kernel-smoothed approximation θ_h of θ(y), defined as follows:</p><p>where θ[f(Y)] is the expectation of a function f(Y) w.r.t. θ(y), which is specified as</p><p>The second equality in Eq. (<ref type="formula">2</ref>) follows from Eq. (<ref type="formula">1</ref>). For a target estimand θ[f(Y)], we denote the nuisances by π_z(w) ≔ P(z|w), δ_x(z, w) ≔ P(x|z, w), and the outcome regression ν.</p><p>We aim to construct a DML estimator for the estimand θ_h. Toward this goal, we first derive a Neyman orthogonal score for θ_h. Since a Neyman orthogonal score can be constructed from a moment score function (a function of the parameters whose expectation is 0 at the true parameters) [14, Thm. 1], we start by defining the moment score function. Let</p><p>Then, the following is a moment score function for θ_h:</p><p>where θ_h is given in Eq. (<ref type="formula">2</ref>) and θ_0 denotes a candidate value of θ_h.</p><p>Next, we derive an influence function for the moment score function m(θ_0; θ_h). We first define the following function: for a bounded function</p><p>and</p><p>where V_X is defined in Eq. (<ref type="formula">5</ref>). Then, the influence function for the expectation of the moment score function m(θ_0; θ_h) in Eq. (<ref type="formula">6</ref>) is given as follows:</p><p>Lemma 1 (Influence function for m(θ_0; θ_h)). Let m(θ_0; θ_h) be the score defined in Eq. (<ref type="formula">6</ref>). Then, the influence function for E_P[m(θ_0; θ_h)], denoted ψ_m, is given by</p><p>where ψ is given in Eq. (<ref type="formula">9</ref>).</p><p>For any score function (e.g., m in Eq. (<ref type="formula">6</ref>)), its addition to the influence function of the expected score (e.g., ψ_m) is a Neyman orthogonal score<ref type="foot">foot_2</ref> ([14, Thm. 1], [13, Sec. 2.2.5]). Specifically, Lemma 2 (Neyman orthogonal score for θ_h). Let m(θ_0; θ_h) be the score function in Eq. (<ref type="formula">6</ref>), and let ψ_m(η = {π, δ, ν}, θ_h) be the influence function for E_P[m(θ_0; θ_h)] given in Eq. <ref type="bibr">(10)</ref>. Then, a Neyman orthogonal score for θ_h is given as φ(θ_0; η = {π, δ, ν}) ≔ m(θ_0; θ_h) + ψ_m(η, θ_0); specifically,</p><p>Given the Neyman orthogonal score φ(θ_0; η), an estimate θ̂_h that solves the empirical score equation up to o_P(n^{-1/2}) gives a DML estimator. Specifically, we propose the following kernel-smoothing-based estimator for the LTE density, named 'KLTE' (kernel-based estimator for the LTE): Definition 1 (KLTE estimator for θ_h). Let φ(θ_0; η = {π, δ, ν}) be the Neyman orthogonal score for θ_h given in Eq. <ref type="bibr">(11)</ref>. Let {D, D′} denote the randomly split halves of the samples, where |D| = |D′| = n. Let η̂ = {π̂, δ̂, ν̂} denote the estimates of the nuisances η obtained using D′. Then, the KLTE estimator for θ_h(y) for all y ∈ 𝒴, denoted θ̂_h(y), is given by</p><p>where V_X and V_{Y_X} are given in Eqs. <ref type="bibr">(5,</ref><ref type="bibr">8)</ref>, respectively.</p><p>We will show that KLTE is a DML estimator exhibiting the debiasedness property. Detailed asymptotic properties are discussed next.</p></div>
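The smoothing operation behind Eq. (2) replaces the density at y with the expectation of K_h(Y − y), where K_h(u) = K(u/h)/h. The following minimal sketch (ours; a plain sample average in one dimension, without the nuisance weighting or orthogonal correction that KLTE adds) shows that step in isolation:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def smoothed_density(samples, grid, h):
    """Evaluate (1/n) * sum_i K_h(Y_i - y) over a grid of points y,
    with K_h(u) = K(u / h) / h."""
    u = (samples[:, None] - grid[None, :]) / h
    return gaussian_kernel(u).mean(axis=0) / h

rng = np.random.default_rng(0)
y_samples = rng.normal(size=2000)
grid = np.linspace(-4.0, 4.0, 201)
density = smoothed_density(y_samples, grid, h=0.5)
```

Because the kernel integrates to one, the smoothed curve is itself a density (up to the h-dependent bias B_y analyzed in Sec. 3.1); KLTE replaces the plain sample average above with the cross-fitted orthogonal-score average.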
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Asymptotic convergence</head><p>We now study the convergence rate of the estimator θ̂_h(y). For any fixed y ∈ 𝒴, the error θ̂_h(y) − θ(y) is analyzed in two parts: we first analyze the error between the estimator in Eq. (<ref type="formula">12</ref>) and the smoothed estimand in Eq. (2) (i.e., θ̂_h(y) − θ_h(y)), and then the error between the smoothed estimand and the true estimand (i.e., θ_h(y) − θ(y)).</p><p>The following result gives the error analysis for θ̂_h(y) − θ_h(y): Lemma 3 (Convergence rate of θ̂_h to θ_h). For any fixed y ∈ 𝒴, suppose the estimators for the nuisances are consistent, i.e., ‖τ̂ − τ‖ = o_P(1) for each τ ∈ η = {π, δ, ν} for all (w, z, x). Suppose h &lt; ∞ and nh^d → ∞ as n → ∞. Then,</p><p>where</p><p>The error analysis in Lemma 3 implies the following: Corollary 1 (Debiasedness property of θ̂_h to θ_h). If all nuisances {π, δ, ν} for any given (w, z, x, y) converge at the {nh^d}^{-1/4} rate, then the target estimator θ̂_h(y) achieves √(nh^d)-rate convergence to θ_h.</p><p>We now analyze the gap between the smoothed estimand θ_h and the true estimand θ, i.e., θ_h − θ: Lemma 4 ([66, Thm. 6.28]). The following holds:</p><p>Combining the results of Lemmas <ref type="bibr">(3,</ref><ref type="bibr">4)</ref>, we have the following result: Theorem 1 (Convergence rate of θ̂_h to θ). For any fixed y ∈ 𝒴, suppose the estimators for the nuisances are consistent, i.e., ‖τ̂ − τ‖ = o_P(1) for each τ ∈ η = {π, δ, ν} for all (w, z, x). Suppose h &lt; ∞ and nh^d → ∞ as n → ∞. Then,</p><p>where B_y is defined in Eq. (<ref type="formula">14</ref>) and R_{k²} is defined in Eq. (<ref type="formula">13</ref>).</p><p>Thm. 1 implies that θ̂_h(y) converges fast (see Corol. 1) to θ(y) + B_y. A natural question is then how to choose the bandwidth h that minimizes the gap in Eq. <ref type="bibr">(15)</ref>. The following provides a guideline for choosing the bandwidth h: Lemma 5 (Data-adaptive bandwidth selection). The bandwidth h that minimizes the error in Eq. (<ref type="formula">15</ref>) is h = O(n^{-1/(d+4)}). This choice of h satisfies the assumption in Lemma 3 (i.e., nh^d → ∞). So far, we have analyzed the error θ̂_h(y) − θ(y) pointwise for fixed y ∈ 𝒴. To analyze the difference between the two densities θ̂_h(y) and θ(y) over all y ∈ 𝒴, we consider the following divergence function between two densities: Definition 2 (f-divergence D_f <ref type="bibr">[20]</ref>). Let f denote a convex function with f(1) = 0. Then D_f(p, q) ≔ ∫_𝒴 f(p(y), q(y)) q(y) dy is an f-divergence between two densities p and q. The f-divergence covers many well-known divergences; for example, D_f reduces to the KL divergence with f(p, q) = (p/q) log(p/q). We will assume that the function f(p, q) in D_f is differentiable w.r.t. p and q.</p></div>
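Lemma 5's rate can be turned into a one-line bandwidth rule. The sketch below is ours; the constant c is a tuning choice (the experiments in Sec. 5 use c = 0.5 with d = 1), and the last line checks the side condition nh^d → ∞ required by Lemma 3:

```python
def bandwidth(n, d, c=1.0):
    """Bandwidth of order n**(-1/(d+4)), following Lemma 5's rate.
    The constant c is a tuning choice, not prescribed by the rate."""
    return c * n ** (-1.0 / (d + 4))

# With d = 1 and c = 0.5, this matches the choice used in the experiments:
h = bandwidth(100_000, d=1, c=0.5)   # 0.5 * 100000**(-1/5) = 0.05

# The effective sample size n * h**d still diverges under this rule:
effective = [n * bandwidth(n, 1) ** 1 for n in (10**3, 10**4, 10**5)]
```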
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Recall that</head><p>We now analyze the distance between θ̂_h and θ w.r.t. D_f. The following result provides an upper bound on D_f. Lemma 6 (Upper bound on the divergence)</p><p>where w(y) ≔ f′_2(θ(y), θ̃_h(y)) θ̂_h(y), f′_2(p, q) ≔ (∂/∂q) f(p, q), and θ̃_h(y) ≔ t θ̂_h(y) + (1 − t) θ(y) for some fixed t ∈ [0, 1].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>By invoking Thm. 1, we derive an upper bound for D_f(θ, θ̂_h) as follows:</head><p>Theorem 2 (Convergence rate of θ̂_h). Suppose the estimators for the nuisances are consistent, i.e., ‖τ̂ − τ‖ = o_P(1) for each τ ∈ η = {π, δ, ν} for all (w, z, x, y). Suppose D_f is an f-divergence such that f(p, q) = 0 if p = q. Suppose w(y) in Lemma 6 is finite. Then,</p><p>where R_{k²} is defined in Eq. (<ref type="formula">13</ref>) and B_y is defined in Eq. (<ref type="formula">14</ref>).</p><p>The following result asserts that the debiasedness property also holds w.r.t. D_f: Suppose w(y) in Lemma 6 is finite. If the nuisances {π, δ, ν} converge at the {nh^d}^{-1/4} rate for any (w, z, x, y), then D_f(θ, θ̂_h) converges to 0 at the √(nh^d)-rate.</p></div>
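Definition 2 can be evaluated numerically on a grid. The sketch below (ours, for illustration) discretizes D_f(p, q) = ∫ f(p(y), q(y)) q(y) dy and instantiates f(p, q) = (p/q) log(p/q), which recovers KL(p ∥ q); for two unit-variance Gaussians whose means differ by Δ, the closed form is Δ²/2, which the Riemann sum should reproduce.

```python
import numpy as np

def f_divergence(p, q, dy, f):
    """Riemann-sum approximation of D_f(p, q) = ∫ f(p(y), q(y)) q(y) dy."""
    return float(np.sum(f(p, q) * q) * dy)

def kl_f(p, q):
    """f(p, q) = (p/q) log(p/q); with this choice, D_f(p, q) = KL(p || q)."""
    r = p / q
    return r * np.log(r)

grid = np.linspace(-8.0, 8.0, 3201)
dy = grid[1] - grid[0]
norm_pdf = lambda mu: np.exp(-0.5 * (grid - mu) ** 2) / np.sqrt(2.0 * np.pi)

# KL between N(0, 1) and N(0.5, 1); closed form is 0.5**2 / 2 = 0.125.
kl = f_divergence(norm_pdf(0.0), norm_pdf(0.5), dy, kl_f)
```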
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Model-based approach</head><p>In this section, we develop a model-based approach for estimating the LTE density θ(y) = p(y_x|C).</p><p>We approximate θ with a class of distributions, or density model, G = {g(y; β) : β ∈ ℝ^b}, where each g(y; β) ∈ G is differentiable w.r.t. β. Example density models include the exponential family (e.g., the Gaussian distribution), mixtures of Gaussians, or, more generally, mixtures of exponential families. The choice of density model may depend on domain knowledge. Alternatively, one may choose among a set of candidate density families using separate validation data or cross-validation. We adapt the model-based approach developed in <ref type="bibr">[39]</ref> for estimating the causal density under the no-unmeasured-confounders assumption.</p><p>Given a density model G, the best approximation of θ(y) is defined as the g(y; β_0) ∈ G that achieves the minimum f-divergence to θ:</p><p>where D_f is the f-divergence defined in Def. 2. Our goal is to estimate β_0.</p><p>Consider m(β; θ) ≔ (∂/∂β) D_f(θ(y), g(y; β)). The definition of β_0 in Eq. (<ref type="formula">17</ref>) implies that m(β; θ) = 0 at β = β_0. We note that m(β; θ) serves as a moment score function. The closed-form expression of the score is given by <ref type="bibr">[39]</ref>:</p><p>where g′(y; β) = (∂/∂β) g(y; β) and f′_2(p, q) ≔ (∂/∂q) f(p, q). To construct a DML estimator based on the score function m(β; θ), we first derive an influence function for the score: Lemma 7 (Influence function for m(β, θ)). An influence function for m(β; θ) in Eq. (<ref type="formula">18</ref>), denoted ψ_m, is given by</p><p>where ψ(η, θ)[·] is defined in Eq. (<ref type="formula">9</ref>), and</p><p>where g′(y; β) ≔ (∂/∂β) g(y; β), f′_1(p, q) ≔ (∂/∂p) f(p, q), and f″_21(p, q) ≔ (∂/∂p) f′_2(p, q).</p><p>We derive a Neyman orthogonal score from the moment score m(β, θ) and its influence function ψ_m(β, η, θ): Lemma 8 (Neyman orthogonal score for β). A Neyman orthogonal score for estimating β, denoted φ(β_0; (η = {π, δ, ν}, θ)), is given by</p><p>where ψ_m(β, η, θ) is defined in Eq. <ref type="bibr">(19)</ref>.</p><p>Given the orthogonal score φ(β_0; (η, θ)) in Eq. (<ref type="formula">20</ref>), the MLTE estimator β̂ is defined as the solution of the empirical orthogonal score equation (Def. 3). To illustrate, we instantiate Eq. (<ref type="formula">18</ref>) and Lemmas <ref type="bibr">(7,</ref><ref type="bibr">8)</ref> for the case where D_f is the KL divergence and g(y; β = {μ, σ²}) is a normal distribution. First, m(β; θ) = {m_μ(μ; θ), m_σ(σ²; θ, μ)}, whose solutions are estimators for β_0 = {μ_0, σ²_0} under the score m(β; θ). Also, R_f(Y; β, θ) ≔ (∂/∂β) log g(Y; β)<ref type="foot">foot_3</ref>. Then, the Neyman orthogonal score follows accordingly, and the solutions of φ(μ; σ², η, θ) and φ(σ²; μ, η, θ) are given by (μ̂, σ̂²), where ψ[·] is as defined in Eq. (<ref type="formula">9</ref>).</p><p>The MLTE estimator in Def. 3 is consistent provided that the nuisance estimates η̂ are consistent <ref type="bibr">[14,</ref><ref type="bibr">Thm.4]</ref>. Such a β̂ is known to achieve debiasedness <ref type="bibr">[13]</ref>, since β̂ is a DML estimator. Specifically, Theorem 3 (Convergence rate of β̂). Let φ(β_0; (η = {π, δ, ν}, θ)) be given in Eq. <ref type="bibr">(20)</ref>. Let ψ_m(β, η, θ) be given in Eq. <ref type="bibr">(19)</ref>. Let β_0, η_0, θ_0 denote the true parameters. Let β̂ be the MLTE estimator for β defined in Def. 3. Suppose (1) R_f(y; β, θ) is bounded and R′_f(y; β, θ) ≔ (∂/∂β) R_f(y; β, θ) &lt; ∞ for all (η, θ), where M(β_0, (η̂, θ̂)) </p></div>
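The Gaussian/KL worked example has a familiar plug-in counterpart: minimizing KL(target ∥ N(μ, σ²)) over μ and σ² matches the first two moments of the target. The sketch below (ours) drops the instrument, the nuisances, and the orthogonal correction entirely, and simply illustrates that projection and the misspecification effect seen later in Sec. 5.1:

```python
import numpy as np

def kl_projection_gaussian(samples):
    """Project an empirical distribution onto {N(mu, sigma^2)} by
    minimizing KL(target || model); for an exponential family, the
    minimizer matches the target's sufficient statistics, here the
    first two moments."""
    return samples.mean(), samples.var()

rng = np.random.default_rng(0)
# Bimodal target with mean 0 and variance 1 + 2**2 = 5; the best
# Gaussian approximation is unimodal but still moment-matched.
y = np.concatenate([rng.normal(-2.0, 1.0, 5000), rng.normal(2.0, 1.0, 5000)])
mu_hat, sigma2_hat = kl_projection_gaussian(y)
```

Even with the correct mean and variance, the projected Gaussian cannot reproduce the two modes, which is exactly the model-misspecification behavior reported for the MLTE and Moment estimators on the synthetic data.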
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Empirical applications</head><p>In this section, we apply the proposed methods to synthetic and real datasets. For the kernel-smoothing-based approach, we compare KLTE with a baseline plug-in estimator ('kernel-smoothing'), in which the estimates of the nuisances η = {π, δ, ν} are plugged into the estimand in Eq. (<ref type="formula">2</ref>). We use the Gaussian kernel. The bandwidth is set to h = 0.5 n^{-1/5}. In estimating the density, we choose 200 equi-spaced points {y^{(i)}}_{i=1}^{200} in 𝒴 and evaluate both estimators at K_{h, y^{(i)}} for i = 1, ..., 200. For the model-based approach, we compare MLTE (e.g., (μ̂, σ̂²)) with a moment-score-based estimator (called 'moment'), defined as the β̂_m satisfying m(β̂_m; θ̂) = o_P(n^{-1/2}) (e.g., {μ̂_m, σ̂²_m}). We use the KL divergence for D_f and the normal distribution for g(y; β). For both approaches, the nuisances are estimated with the gradient boosting model XGBoost <ref type="bibr">[11]</ref>, which is known to be flexible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Synthetic dataset</head><p>We applied the proposed estimators to estimate the LTE p(y_x|C), where the true densities are given as in the 4th plot in Fig. <ref type="figure">2</ref>. As shown in the ground truth in Fig. <ref type="figure">3a</ref>, the true densities p(y_{x_0}|C) and p(y_{x_1}|C) are mixtures of four Gaussians. Estimated densities for Moment and MLTE are given in Figs. <ref type="figure">(3b,</ref><ref type="figure">3c</ref>). We note that the model-based approaches fail to capture important characteristics (such as the number of modes) of the true density ('ground-truth' in Fig. <ref type="figure">3a</ref>) because the assumed density class is misspecified. The 'kernel-smoothing' baseline (Fig. <ref type="figure">3d</ref>) captures only the mode with the highest density, which leads to a misinterpretation of the true densities. KLTE (Fig. <ref type="figure">3e</ref>) is able to capture the number, locations, and scales of the modes correctly. Figure <ref type="figure">3</ref>: LTE estimation with a synthetic dataset. The ground-truth density is in (a). Red and green are for x_0 and x_1, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Application to 401(k) data</head><p>We applied the proposed estimators (KLTE and MLTE) to the 401(k) data, where the data-generating process corroborates Fig. <ref type="figure">1</ref>. The monotonicity assumption holds naturally, since ineligible units (Z = 0) cannot participate (X = 1) in a 401(k). In our analysis, we used the dataset introduced by <ref type="bibr">[2]</ref> containing 9275 individuals, which has been studied in <ref type="bibr">[2,</ref><ref type="bibr">17,</ref><ref type="bibr">5,</ref><ref type="bibr">47,</ref><ref type="bibr">58,</ref><ref type="bibr">64]</ref>, to cite a few. Model-based approaches (Moment in Fig. <ref type="figure">4a</ref> and MLTE in Fig. <ref type="figure">4b</ref>) and kernel-smoothing-based approaches (kernel-smoothing in Fig. <ref type="figure">4c</ref> and KLTE in Fig. <ref type="figure">4d</ref>) are implemented to analyze the data.</p><p>The model-based (Figs. <ref type="figure">(4a,</ref><ref type="figure">4b</ref>)) and kernel-smoothing-based (Figs. <ref type="figure">(4c,</ref><ref type="figure">4d</ref>)) estimates both capture important characteristics of the distribution, such as its mode, location, and scale. The results of the proposed estimators (MLTE and KLTE in Figs. <ref type="figure">(4b,</ref><ref type="figure">4d</ref>)) are consistent with findings from previous analyses <ref type="bibr">[2,</ref><ref type="bibr">17,</ref><ref type="bibr">5,</ref><ref type="bibr">58]</ref>: the effects of 401(k) participation (i.e., X = 1) on net financial assets are positive over the whole range of the asset distribution. To connect with CDF-based methods, we provide in Fig. <ref type="figure">4e</ref> the CDF estimate induced by the KLTE density estimate (Fig. <ref type="figure">4a</ref>). We note that the CDF in Fig. <ref type="figure">4e</ref> captures the non-constant impact of 401(k) participation on net financial assets, which has also been described in the previous analyses <ref type="bibr">[2,</ref><ref type="bibr">17,</ref><ref type="bibr">5,</ref><ref type="bibr">58]</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this paper, we developed kernel-smoothing-based and model-based approaches for estimating the LTE density in the presence of instruments. For each approach, we gave Neyman orthogonal scores (Lemmas (2, 8)) and constructed corresponding DML estimators (KLTE in Def. 1 and MLTE in Def. 3) that exhibit the debiasedness property (Corols. <ref type="bibr">(3,</ref><ref type="bibr">4)</ref>). We demonstrated our work on synthetic and real datasets. The performance of model-based estimators depends critically on the choice of the density class. Kernel-based estimators need not make assumptions about the true density class but suffer from the curse of dimensionality. This work is limited to settings where the monotonicity assumption holds, i.e., there are no defiers. One could perform sensitivity analyses of the impact of potential defiers on the estimates, as conducted in <ref type="bibr">[65,</ref><ref type="bibr">36]</ref>.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Also known as 'nonparametric doubly robust'<ref type="bibr">[37]</ref> or 'rate doubly robust'<ref type="bibr">[59]</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>It is common in the literature to define the IV assumptions in terms of conditional independences among counterfactuals<ref type="bibr">[51,</ref><ref type="bibr">9,</ref><ref type="bibr">8,</ref><ref type="bibr">2,</ref><ref type="bibr">60,</ref><ref type="bibr">47,</ref><ref type="bibr">64]</ref>, whose connection with the causal graph in Fig. 1 is discussed in Assumption A.1.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>A Neyman orthogonal score is a function satisfying E_P[φ(θ, η_0)] = 0 and (∂/∂η) E_P[φ(V; θ, η)]|_{η=η_0} = 0, where η_0 denotes the true nuisance <ref type="bibr">[13,</ref> Def. 2.2]. In words, it is a score function that is not sensitive to local errors in the nuisance models.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>A function class whose complexity is restricted; see Def. S.1 in the Appendix for the definition. Donsker classes include the Sobolev, bounded monotone, and Lipschitz classes, among others.</p></note>
		</body>
		</text>
</TEI>
