<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Causal Matching using Random Hyperplane Tessellations</title></titleStmt>
			<publicationStmt>
				<publisher>Proceedings of Machine Learning Research</publisher>
				<date>04/30/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10549074</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of Machine Learning Research: Proceedings of the Third Conference on Causal Learning and Reasoning</title>
<idno type="ISSN">2640-3498</idno>
<biblScope unit="volume">236</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Abhishek Dalvi</author><author>Neil Ashtekar</author><author>Vasant G Honavar</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Matching is one of the simplest approaches for estimating causal effects from observational data. Matching techniques compare the observed outcomes across pairs of individuals with similar covariate values but different treatment statuses in order to estimate causal effects. However, traditional matching techniques are unreliable given high-dimensional covariates due to the infamous curse of dimensionality. To overcome this challenge, we propose a simple, fast, yet highly effective approach to matching using Random Hyperplane Tessellations (RHPT). First, we prove that the RHPT representation is an approximate balancing score – thus maintaining the strong ignorability assumption – and provide empirical evidence for this claim. Second, we report results of extensive experiments showing that matching using RHPT outperforms traditional matching techniques and is competitive with state-of-the-art deep learning methods for causal effect estimation. In addition, RHPT avoids the need for computationally expensive training of deep neural networks.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Randomized experiments, whenever feasible, provide the most reliable estimates of causal effects. In a randomized experiment, each individual is randomly assigned to either the treatment or control group, thereby ensuring covariate balance: the two groups will be only randomly different from each other with respect to all covariates, both observed and unobserved. In this case, it is straightforward to show that the average treatment effect ATE = E[Y | T = 1] - E[Y | T = 0], where Y is the observed outcome and T is the treatment status. In other words, the ATE is simply the difference in average observed outcomes across the treatment and control groups. In the case of a randomized experiment, association is causation.</p><p>However, randomized experiments are not always feasible due to their high costs, ethical concerns, or other fundamental limitations. As such, there is high interest in methods capable of reliably estimating causal effects from observational data. The following assumptions suffice for estimating causal effects from observational data: Ignorability, Positivity, Non-interference, and Consistency <ref type="bibr">(Rosenbaum and Rubin, 1983;</ref><ref type="bibr">Cox, 1958;</ref><ref type="bibr">Rubin, 1980)</ref>. Under these conditions, it can be shown that ATE = E_x[E[Y | T = 1, x] - E[Y | T = 0, x]], i.e., the causal effect is identified by adjusting for the covariates x.</p><p>Because treatment assignment is not randomized in observational data, some of the covariates may influence both the treatment assignment and the outcome, thus introducing confounding bias into estimates of the causal effect. In the presence of confounding, causation is not association. One way to eliminate confounding bias under the standard assumptions is to simulate a randomized experiment such that the treatment and control groups have similar covariate distributions. 
This can be accomplished by matching individuals with similar covariates across the treated and control groups, thereby reducing bias in treatment assignment <ref type="bibr">(Hernan and Robins, 2023;</ref><ref type="bibr">Stuart, 2010)</ref>. Note that matching on the observed covariates also matches on the unobserved covariates, insofar as they are correlated with those that are observed<ref type="foot">foot_1</ref>.</p><p>Comparing the observed outcome for an individual with covariates x_i and treatment status T_i = 1 against the individual they are matched with, m(x_i) with treatment status T_m(x_i) = 0, yields an estimate of the individual treatment effect ITE(x_i). Matching techniques have been extensively studied in the literature on causal inference - see <ref type="bibr">Stuart (2010)</ref> for a review.</p><p>Because perfect matches can seldom be found, matching in practice entails nearest neighbor search. However, matching techniques typically perform poorly when the number of covariates is large, since all points tend to appear dissimilar in high-dimensional spaces. This phenomenon is known as the curse of dimensionality <ref type="bibr">(Beyer et al., 1999)</ref>. Empirical results from <ref type="bibr">Beyer et al. (1999);</ref><ref type="bibr">Aggarwal et al. (2001)</ref> suggest that distance metrics become unreliable given as few as 10-20 dimensions.</p><p>A simple, popular workaround is to match individuals on their propensity scores - an individual's probability of being treated - and to rely on the theoretical result that, conditioned on the propensity score, treatment is independent of the covariates, thereby satisfying the identifiability conditions. In practice, however, matching on propensity scores seldom achieves its intended objective <ref type="bibr">(King and Nielsen, 2019)</ref>. 
Recently, there has been increased interest in deep neural networks that learn low-dimensional, "treatment-agnostic" representations of high-dimensional data using non-linear mappings <ref type="bibr">(Johansson et al., 2016;</ref><ref type="bibr">Shalit et al., 2017;</ref><ref type="bibr">Louizos et al., 2017;</ref><ref type="bibr">Koch et al., 2021)</ref>. While deep learning approaches offer strong performance, they require large amounts of training data, incur substantial computational overhead for training, and require expensive hyperparameter tuning.</p><p>In this work, we explore Random Hyperplane Tessellations (RHPT), an alternative approach that overcomes the limitations of both existing matching techniques and deep neural networks for causal effect estimation. RHPT maps covariates to a high-dimensional, binary space based on the relationship between covariate values and randomly chosen hyperplanes <ref type="bibr">(Plan and Vershynin, 2014;</ref><ref type="bibr">Dirksen et al., 2022)</ref>. The RHPT representation is attractive for causal effect estimation because it offers a compromise between information-poor (and potentially inaccurate) predicted propensity scores and information-rich (but potentially noisy) matching on raw covariates. RHPT can be viewed as a denoising, information-reduction step performed prior to matching.</p><p>The key contributions of this paper are as follows:</p><p>&#8226; We present a novel application of Random Hyperplane Tessellations for estimating causal effects from observational data and explain, via theory, why it works.</p><p>&#8226; We show that our method preserves strong ignorability of treatment assignments, as our RHPT representations are approximate balancing scores.</p><p>&#8226; We present experiments comparing treatment effect estimates obtained using our proposed method to estimates obtained using traditional matching approaches as well as state-of-the-art deep learning techniques. 
The results of these experiments on several standard benchmark data sets show that our proposed method outperforms traditional matching approaches and is competitive with the state-of-the-art deep neural networks in terms of accuracy while avoiding the substantial computational cost of training deep neural networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Preliminaries</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Problem Definition</head><p>We are given data on n individuals, each represented as a tuple (x_i, T_i, Y_i), where x_i denotes the (pre-treatment) covariates, T_i denotes treatment status (T_i = 1 indicating treatment and T_i = 0 indicating control), and Y_i denotes the observed outcome. Formally, x_i &#8712; X with X &#8838; R^d, where d is the dimensionality of the covariates, T_i &#8712; {0, 1}, and Y_i &#8712; Y with Y &#8838; R. Our goal is to estimate the ATE across all individuals and the ITE(x_i) for each individual i under the strong ignorability, consistency, and non-interference assumptions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Locality-Sensitive Hashing</head><p>Locality-Sensitive Hashing (LSH) <ref type="bibr">(Indyk and Motwani, 1998)</ref> was one of the first works related to hyperplane tessellations and is a widely used approach for approximately finding nearest neighbors in high-dimensional spaces. A locality-sensitive hashing scheme is a distribution on a family H of hash functions<ref type="foot">foot_2</ref> h operating on a collection of objects, such that for two objects x_i and x_j, Pr_{h &#8712; H}[h(x_i) = h(x_j)] = sim(x_i, x_j), where sim(x_i, x_j) is a similarity function. Such a scheme leads to a compact representation of objects such that the similarity of objects can be estimated from their compact sketches. Given covariates x_i &#8712; R^d, SimHash <ref type="bibr">(Charikar, 2002)</ref> generates a random hyperplane with normal vector a &#8712; R^d drawn from a standard Gaussian, a &#8764; N(0, I_d). Using the random vector a and the function sign(&#8226;) : R &#8594; {0, 1}, we generate a hash value for x_i: h_a(x_i) = sign(a^T x_i). For two covariate vectors, Pr[h_a(x_i) = h_a(x_j)] = 1 - &#952;_ij/&#960;, where &#952;_ij is the angle between x_i and x_j. A binary sketch can be obtained by performing hashing &#946; times. Let A : R^d &#8594; R^&#946; be a matrix with rows a_1^T, ..., a_&#946;^T, each drawn from N(0, I_d). For conciseness, denote A &#8764; N_&#946;(0, I_d) and apply the sign function elementwise, i.e. sign(&#8226;) : R^&#946; &#8594; {0, 1}^&#946;. The binary sketch obtained using SimHash is h_A(x_i) = sign(Ax_i).</p></div>
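As a concrete illustration, the SimHash sketch above can be written in a few lines of NumPy. This is our own minimal sketch, not the authors' code; the function name `simhash` and the parameter choices are ours. The Hamming fraction between two sketches, multiplied by &#960;, estimates the angle between the original vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

def simhash(X, beta, rng):
    """SimHash binary sketches: each row of A is a Gaussian hyperplane
    normal a_k ~ N(0, I_d); sign(.) is mapped to {0, 1} as in the text."""
    A = rng.standard_normal((beta, X.shape[1]))   # A ~ N_beta(0, I_d)
    return (X @ A.T > 0).astype(np.uint8)         # h_A(x) = sign(Ax)

# Per-bit collision probability is 1 - theta_ij/pi, so the Hamming
# fraction between two sketches estimates theta_ij/pi.
n, d, beta = 2, 50, 20000
X = rng.standard_normal((n, d))
H = simhash(X, beta, rng)
est_angle = np.pi * np.mean(H[0] != H[1])
true_angle = np.arccos(X[0] @ X[1] / (np.linalg.norm(X[0]) * np.linalg.norm(X[1])))
```

With &#946; = 20000 bits the estimated angle typically lands within a few hundredths of a radian of the true angle, illustrating why larger sketches preserve angular distance more faithfully.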
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Random Hyperplane Tessellations (RHPT)</head><p>Plan and Vershynin (2014) showed that SimHash results in a representation in which angular distances are preserved with error at most &#948;, provided that the points are a subset of the unit Euclidean sphere S^{d-1}. Formally, we are given X &#8838; S^{d-1}, positive constants c_1 and c_2, &#8467;*(X) = E sup_{x &#8712; X} |&#10216;a, x&#10217;|, and &#948; &gt; 0. If &#946; &#8805; c_1 &#948;^{-6} &#8467;*(X)^2 holds, then with probability at least 1 - 2 exp(-c_2 &#946; &#948;^2), for every x_i, x_j &#8712; X: |(1/&#946;) d_H[h_A(x_i), h_A(x_j)] - d_ang(x_i, x_j)| &#8804; &#948;, (1) where d_H is the Hamming distance and d_ang(x_i, x_j) = &#952;_ij/&#960; is the angular distance. This is consistent with the findings in <ref type="bibr">Charikar (2002)</ref>; <ref type="bibr">Pratap et al. (2020)</ref>, which state that E[d_H[h_A(x_i), h_A(x_j)]] = &#946; &#952;_ij/&#960;. To summarize, &#948; &#8594; 0 with probability &#8594; 1 as &#946; &#8594; &#8734;. In other words, distances are approximately preserved as the dimensionality of the embedding is increased. <ref type="bibr">Dirksen et al. (2022)</ref> extend the results of <ref type="bibr">Plan and Vershynin (2014)</ref> beyond the unit Euclidean sphere S^{d-1} to arbitrary subsets of R^d by introducing shifts on the hyperplanes, represented by &#947;. Let A &#8764; N_&#946;(0, I_d) and let the entries of &#947; be uniformly distributed on [-&#955;, &#955;]. For conciseness, we represent this shifted tessellation as h_{A,&#947;}(x_i) = sign(Ax_i + &#947;). Given sufficiently large values of &#955; and &#946;, the following holds probabilistically, for an absolute constant c: |(c&#955;/&#946;) d_H[h_{A,&#947;}(x_i), h_{A,&#947;}(x_j)] - ||x_i - x_j||_2| &#8804; &#948;. (2) See Theorem 1.7 of <ref type="bibr">Dirksen et al. (2022)</ref> for details. Similar to inequality 1, distances are preserved as the dimensionality of the embedding increases, assuming an appropriate choice of hyperplane shift. Note that <ref type="bibr">Plan and Vershynin (2014)</ref> showed only the existence of such a tessellation over arbitrary sets in R^d, whereas <ref type="bibr">Dirksen et al. (2022)</ref> provide an estimate with an explicit bound.</p></div>
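The shifted (dithered) tessellation of Dirksen et al. (2022) is equally short to sketch. Again this is our own illustration under assumed parameter values (the name `rhpt_shifted`, the choices of &#955;, &#946;, and the test points are ours): off the unit sphere, the normalized Hamming distance between shifted sketches grows with the Euclidean distance between the original points, up to the constants in inequality 2.

```python
import numpy as np

rng = np.random.default_rng(1)

def rhpt_shifted(X, beta, lam, rng):
    """Shifted tessellation h_{A,gamma}(x) = sign(Ax + gamma), with each
    shift gamma_k drawn uniformly from [-lam, lam] (Dirksen et al., 2022)."""
    A = rng.standard_normal((beta, X.shape[1]))
    gamma = rng.uniform(-lam, lam, size=beta)
    return (X @ A.T + gamma > 0).astype(np.uint8)

# A point, a nearby perturbation, and a far perturbation: the nearer pair
# should collide on more bits, i.e. have a smaller Hamming fraction.
d, beta, lam = 20, 20000, 10.0
x = rng.standard_normal(d)
near = x + 0.1 * rng.standard_normal(d)
far = x + 3.0 * rng.standard_normal(d)
H = rhpt_shifted(np.vstack([x, near, far]), beta, lam, rng)
d_near = np.mean(H[0] != H[1])   # normalized Hamming distance to the near point
d_far = np.mean(H[0] != H[2])    # normalized Hamming distance to the far point
```

The shift is what makes this work away from the sphere: without it, sign(Ax) is invariant to the scale of x, so two points on the same ray would be indistinguishable.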
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Controlling for Confounding Covariates</head><p>As noted earlier, matching offers a way to reduce the bias introduced by confounding when estimating causal effects from observational data <ref type="bibr">(Rubin, 1980)</ref>. By comparing the outcomes of pairs of treated and untreated individuals that are nearly identical (i.e. matched), we can eliminate any bias arising from confounding due to observed covariates (as well as any unobserved covariates that are sufficiently correlated with the observed covariates). The estimated individual treatment effect ITE(x_i) for individual i is given by: ITE(x_i) = T_i (Y_i - Y_{Match_0(x_i)}) + (1 - T_i)(Y_{Match_1(x_i)} - Y_i), where Match_0(x_i) and Match_1(x_i) denote the indices of untreated and treated individuals that match x_i. The estimated unobserved outcome is denoted by Y_{Match_T(x_i)}. If the covariates are perfectly matched, then the ignorability condition clearly holds, i.e. Y_i(1), Y_i(0) &#8869;&#8869; T_i | x_i <ref type="bibr">(Rosenbaum and Rubin, 1983)</ref>.</p><p>In practice, matches are seldom perfect and we must settle for nearest neighbor matching. In the context of this paper, matching is done on the &#946;-dimensional hashed sketches of the covariate vectors, i.e., their RHPT representations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Causal Effect Estimation Using Random Hyperplane Tessellations</head><p>We project the original d-dimensional covariates x_i into the Hamming space to get our &#946;-dimensional hashed sketch. In our experiments, we use both the <ref type="bibr">Dirksen et al. (2022)</ref> and <ref type="bibr">Plan and Vershynin (2014)</ref> methods in order to approximately preserve both the angle and the Euclidean distance between vectors. We concatenate the results from both methods: h_{A,&#915;}(x_i) = sign(Ax_i + &#915;) and h_A(x_i) = sign(Ax_i). For brevity, from this point onward the concatenated embedding is denoted h_&#946;(x_i). To summarize, we first use RHPT to obtain a binary representation of the covariates, and then use nearest-neighbor matching in the binary space in order to estimate causal effects. Specifically, we use 1-Nearest Neighbor (1-NN) matching with replacement, with the Hamming distance used to determine nearest neighbors.</p><p>Matching is a form of covariate adjustment, and covariate adjustment eliminates confounding under the standard assumptions stated in Section 2. To justify matching using RHPT representations of covariate vectors, it suffices to show that the RHPT embeddings achieve covariate balance. We discuss this property of the RHPT embedding in the following section.</p></div>
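The full pipeline just described - concatenated sketches, then 1-NN matching with replacement under Hamming distance - can be sketched end to end. This is our own toy reconstruction, not the released codebase; the data generating process (a single confounder and a known constant effect of 2.0) and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def rhpt_embed(X, beta, lam, rng):
    """Concatenated embedding h_beta(x): sign(Ax) next to sign(Ax + gamma)."""
    A = rng.standard_normal((beta, X.shape[1]))
    gamma = rng.uniform(-lam, lam, size=beta)
    return np.hstack([X @ A.T > 0, X @ A.T + gamma > 0]).astype(np.uint8)

def match_ate(H, T, Y):
    """1-NN matching with replacement in Hamming space; returns the ATE estimate."""
    treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
    ites = []
    for i in range(len(T)):
        pool = control if T[i] == 1 else treated          # match across groups
        j = pool[np.argmin((H[pool] != H[i]).sum(axis=1))]  # nearest Hamming neighbor
        ites.append(Y[i] - Y[j] if T[i] == 1 else Y[j] - Y[i])
    return float(np.mean(ites))

# Toy confounded data: x_0 drives both treatment and outcome; true ATE = 2.0.
n, d = 500, 5
X = rng.standard_normal((n, d))
T = (rng.uniform(size=n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)
Y = X[:, 0] + 2.0 * T + 0.1 * rng.standard_normal(n)
H = rhpt_embed(X, beta=1000, lam=8.0, rng=rng)
ate_matched = match_ate(H, T, Y)
ate_naive = Y[T == 1].mean() - Y[T == 0].mean()   # biased by confounding
```

On data like this the naive difference in means is inflated by the confounder, while matching on the RHPT sketches recovers an estimate much closer to the true effect.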
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Approximate Balancing Scores using RHPT</head><p>A function b(x_i) is a balancing score if there exists a mapping from b(x_i) to the propensity score e(x_i) = E[T_i | x_i], for all x_i &#8712; X. Let this mapping be g(&#8226;), where g(b(x_i)) = e(x_i). Matching can be performed on the balancing scores rather than the raw covariates, as implied by Theorems 2 and 3 of <ref type="bibr">Rosenbaum and Rubin (1983)</ref>, i.e. x_i &#8869;&#8869; T_i | b(x_i) and Y_i(1), Y_i(0) &#8869;&#8869; T_i | b(x_i), assuming the Ignorability, Positivity, Non-interference and Consistency assumptions hold on the original data.</p><p>If the embedding b(x_i) is a one-to-one mapping from X, and there is a mapping from the original covariates to the propensity scores, then there must exist a mapping from the embedding to the propensity scores, g(b(x_i)) = e(x_i). Because the function b(&#8226;) is a one-to-one mapping, an inverse function exists; therefore, we can express the function g(b(x_i)) as e(b^{-1}(b(x_i))).</p><p>Unfortunately, for our method, which uses RHPT embeddings of covariate vectors, we cannot guarantee that h_&#946;(&#8226;) : X &#8594; {0, 1}^&#946; is one-to-one. Let x_i &#8712; X and x_j &#8712; X be two covariate vectors that are mapped to the same binary representation using RHPT. In this case, the Hamming distance d_H[h_&#946;(x_i), h_&#946;(x_j)] = 0. Substituting this into inequalities 1 and 2, we get d_ang(x_i, x_j) &#8804; &#948; and ||x_i - x_j||_2 &#8804; &#948;.</p><p>Therefore, two covariate vectors that are within distance &#948; (angular and L2) of each other may be mapped to the same Hamming code. Because h_&#946;(&#8226;) is not a one-to-one mapping, we cannot exactly recover these covariate vectors by "inverting" their representations. In particular, if d_H[h_&#946;(x_i), h_&#946;(x_j)] = 0, then the "reconstruction" h^{-1}(h_&#946;(x_i)) could be x_i or x_j. 
For simplicity, we say that the reconstruction maps to one arbitrarily chosen point among those with the same binary code (i.e. either x_i or x_j).</p><p>The results in Section 2 state that as &#946; increases, &#948; &#8594; 0 with high probability. This means that if the Hamming codes are high-dimensional and the Hamming codes of two points are the same, then the original points will be similar with high probability. In such a scenario, the propensity scores predicted from the reconstruction will be approximately equal, provided that the propensity score function is smooth.</p><p>We assume that e(&#8226;) is a smooth function, i.e. e(x_i) &#8776; e(x_i &#177; &#1013;), where &#1013; is a small perturbation close to 0. It is reasonable to assume that the propensity score function of real-world data is smooth - this means that the probability of treatment does not change sharply across individuals with slightly different covariate vectors.</p><p>Therefore, RHPT embeddings are approximate balancing scores, given that &#946; is sufficiently large and e(&#8226;) is a smooth function. Note that there is no way of knowing a priori the value of &#946; for which h_&#946;(x_i) is an approximate balancing score, since calculating such a &#946; involves the unknown constants mentioned in Section 2.</p></div>
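The argument above can be condensed into a single chain of approximations (our paraphrase of the text's reasoning, not a formal proof):

```latex
% If h_beta(x_i) = h_beta(x_j), inequalities (1) and (2) give
% d_ang(x_i, x_j) <= delta and ||x_i - x_j||_2 <= delta, so any
% reconstruction differs from x_i by a perturbation of norm at most delta.
g\big(h_\beta(x_i)\big)
  \;=\; e\big(h^{-1}(h_\beta(x_i))\big)
  \;=\; e(x_i \pm \epsilon_\delta)
  \;\approx\; e(x_i),
  \qquad \|\epsilon_\delta\|_2 \le \delta \xrightarrow{\;\beta \to \infty\;} 0.
```

The smoothness assumption on e(&#8226;) is exactly what licenses the final "&#8776;" step; without it, an arbitrarily small perturbation could move the propensity score arbitrarily far.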
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Empirical evidence that RHPT yields approximate balancing scores</head><p>An empirical approach to verifying whether a representation is an approximate balancing score involves fitting a function to predict the propensity score given the representation, then comparing this predicted score to the true propensity score. In real-world data sets, we do not have access to true propensity scores. To address this limitation, we generate a semi-synthetic data set using the method proposed by <ref type="bibr">Jesson et al. (2021)</ref> for the MNIST data set. For the purpose of evaluating the proposed method, we obtain the true propensity scores by simulating the data-generation process.</p><p>To assess whether our RHPT embeddings are balancing scores, we predict propensity scores using logistic regression learned on our embeddings and compare our predictions to the true propensity scores. Specifically, we calculate the mean absolute difference between the true propensity scores and our predicted propensity scores, &#968; = (1/n) &#931;_i |e(x_i) - &#234;(h_&#946;(x_i))|, where &#234;(&#8226;) is the logistic regression model trained on h_&#946;(x_i). Low values of &#968; indicate that our embeddings h_&#946;(x_i) are approximate balancing scores. Figure <ref type="figure">2</ref> illustrates our empirical results across varying embedding dimensions &#946;. We see that as &#946; increases, the mean absolute difference between predicted and true propensity scores decreases. This evaluation was performed to complement our theoretical arguments and provide empirical evidence that RHPT embeddings are (approximate) balancing scores. As discussed in Section 3.1, causal effects can be estimated by matching on balancing scores rather than by matching on the raw covariates. The prediction error in Figure <ref type="figure">2</ref> is relatively low, and decreases as the RHPT dimensionality increases. 
These results suggest that RHPT embeddings are approximate balancing scores and that matching using RHPT is a valid method of estimating causal effects.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Uncertainty in causal estimates using RHPT matching</head><p>Our proposed method makes use of a randomized algorithm to map covariates to our RHPT representation. As such, conducting experiments with different random seeds will yield different results. Specifically, different random mappings will result in different nearest-neighbor matches, thus affecting ATE and ITE predictions. Given the theoretical results from Section 2, we would expect more variation in our results for lower-dimensional RHPT embeddings, because the distance-preservation error &#948; shrinks as the embedding dimensionality &#946; increases.</p><p>To address this, we perform experiments over 100 randomized runs of RHPT with various choices of &#946;, with results and variability illustrated in Figure <ref type="figure">3</ref>. We observe that as we increase the value of &#946;, the variability of the ATE decreases until it eventually levels off when &#946; reaches 10000.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>We conduct experiments to compare the performance of RHPT with both classic matching techniques and state-of-the-art deep learning techniques. Baseline 1-Nearest Neighbor (1-NN) matching techniques include matching on the raw covariates X and matching on the PCA-transformed covariates Z. For matching on PCA-transformed covariates we use 5 principal components, following the experiments of <ref type="bibr">Beyer et al. (1999)</ref>, which showed that nearest neighbor queries become unreliable when the number of principal components used exceeds 10.</p><p>Recognizing the potential limitations of PCA as a dimensionality reduction technique, we incorporate additional 1-NN matching methods using alternative dimensionality reduction techniques. Specifically, we utilize Locality Preserving Projections (LPP), proposed in <ref type="bibr">He and Niyogi (2003)</ref>, as well as randomized projections based on the Johnson-Lindenstrauss lemma, as outlined in <ref type="bibr">Li et al. (2016)</ref>. We denote 1-NN matching based on Locality Preserving Projections as LPP and matching based on randomized projections as JL-Lemma.</p><p>We also benchmark against propensity scores calculated using logistic regression over the raw covariate space X and the PCA features Z, which we refer to as &#234;(X) and &#234;(Z), respectively. In addition, we consider matching uniformly at random (random matching) as a simple baseline. Finally, we compare our results with those obtained using state-of-the-art deep learning techniques: CFRNet with Maximum Mean Discrepancy (MMD) <ref type="bibr">(Johansson et al., 2016)</ref> and DragonNet with Targeted Regularization <ref type="bibr">(Shi et al., 2019)</ref>. The codebase for our experiments is available at <ref type="url">https://github.com/Abhishek-Dalvi410/RHPT_matching</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Evaluation Metrics</head><p>We compare RHPT with other methods using two different classes of performance metrics: (1) error of estimated causal effects (relative to their true value); and (2) computational cost of causal effect estimation (training as well as inference).</p><p>We evaluate the error of causal effect estimates in two different settings: (1) within-sample, where the task is to estimate the causal effect where the factual outcome is observed, and (2) out-of-sample, where the goal is to estimate the causal effect where the factual outcome is unobserved. For within-sample evaluation we estimate the true ITE(x_i) by direct modelling. Given an individual with observation (x_i, T_i, Y_i), the transductive ITE <ref type="bibr">Johansson et al. (2016)</ref> is given by: ITE_in(x_i) = T_i (Y_i - g(x_i, 0)) + (1 - T_i)(g(x_i, 1) - Y_i), where g(x_i, 1) and g(x_i, 0) are the estimates of the counterfactual outcomes for individual i. For CFRNet and DragonNet, g(x_i, 1) and g(x_i, 0) are neural networks, while for matching techniques they are &#374;_{NN_1(x_i)} and &#374;_{NN_0(x_i)}, respectively. We introduce two within-sample metrics for evaluation. The first is the within-sample &#1013;_ATE, the absolute difference between the true and the predicted ATE: &#1013;_ATE = |(1/n) &#931;_i ITE(x_i) - (1/n) &#931;_i ITE_in(x_i)|. Note that the true effect, i.e. ITE(x_i), considers noiseless outcomes, while the factual outcomes Y_i seen in the data set contain noise. We also use the RMSE of the ITE for within-sample evaluation: &#1013;_ITE = sqrt((1/n) &#931;_i (ITE(x_i) - ITE_in(x_i))^2).</p><p>For out-of-sample metrics, we use the inductive ITE: ITE_out(x_i) = g(x_i, 1) - g(x_i, 0), where both outcomes are predicted. For matching techniques, g(x_i, 1) and g(x_i, 0) are 1-NN matches from the within-sample individuals. For the deep learning techniques, CFRNet and DragonNet, g(x_i, 1) and g(x_i, 0) are predicted using neural networks. In this setting we also use the out-of-sample &#1013;_ATE - the absolute error in the ATE estimation. 
The final metric we use is the Precision in Estimation of Heterogeneous Effect (PEHE) from <ref type="bibr">Hill (2011)</ref>, given by: &#1013;_PEHE = sqrt((1/n) &#931;_i (ITE(x_i) - ITE_out(x_i))^2).</p><p>We compare the computational cost of the different methods using the run-time (training and inference) of each method. The deep learning methods are run on two types of machines: (1) a machine with an NVIDIA A-40 GPU and (2) the free version of Google Colab in CPU mode (with no GPU). Matching techniques are not evaluated using a GPU since matching does not require computationally intensive training. Note that the free version of Google Colab in CPU mode imposes a limit on execution time. If that limit is exceeded during execution, we report a run-time &#8805; 2000 minutes.</p></div>
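The metrics above reduce to a few lines of NumPy. This is our own sketch of the definitions (the function names are ours); note that &#1013;_ATE compares the means of the effects while &#1013;_ITE and &#1013;_PEHE share the same root-mean-square form, differing only in whether the transductive ITE_in or inductive ITE_out estimates are plugged in:

```python
import numpy as np

def eps_ate(ite_true, ite_pred):
    """|ATE - ATE_hat|: absolute difference between the two mean effects."""
    return float(abs(np.mean(ite_true) - np.mean(ite_pred)))

def eps_ite(ite_true, ite_pred):
    """RMSE between true and estimated individual effects. eps_PEHE
    (Hill, 2011) has the same form, computed from the ITE_out estimates."""
    diff = np.asarray(ite_true) - np.asarray(ite_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

ite_true = np.array([1.0, 2.0, 3.0])
ite_pred = np.array([1.5, 2.0, 2.5])
# Both sets of effects average to 2.0, so eps_ate is exactly 0 even though
# the individual effects disagree; eps_ite = sqrt((0.25 + 0 + 0.25) / 3).
```

This toy pair illustrates why both classes of metric are reported: a method can have a perfect ATE while still mis-ranking individuals, which only the squared-error metrics detect.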
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Data sets</head><p>The data sets used for evaluation are semi-synthetic, meaning that the covariates come from real-world data while the treatments and outcomes were generated through a user-specified data generating process. For each data set, we report the average and standard error of the within-sample and out-of-sample errors over different instantiations of the data generating process. The wall-time is reported as the total time required to run each method on all samples from the data generating process. For all data sets, 10% of the samples are used for out-of-sample evaluation. For the deep learning models, 20% of the in-sample data points are used as a validation set to avoid over-fitting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">News</head><p><ref type="bibr">Johansson et al. (2016)</ref> used a simulated data set to replicate how opinions and perceptions of news articles could differ based on whether they were accessed on a desktop (T=0) or a mobile device (T=1). In each draw of the data generating process, 5000 news articles are randomly sampled from the NY Times corpus <ref type="bibr">(Newman, 2008)</ref>. The covariates x_i of each news article are a bag-of-words representation with dimensionality 3477. Using the covariates, treatment assignments (desktop (T=0) or mobile device (T=1)) and outcomes (opinions) are generated for each news article. We use 50 draws of the data generating process provided by <ref type="bibr">Johansson et al. (2016)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2.">HC-MNIST</head><p><ref type="bibr">Jesson et al. (2021)</ref> introduced a semi-synthetic data set based on the MNIST data set <ref type="bibr">LeCun (1998)</ref>. The covariates are the 28 &#215; 28 pixel values of the images, with total dimensionality 784. Representations of the images are used to generate binary treatments and continuous outcomes.</p><p>The data generating process contains a factor &#915; &#8805; 1 which controls the degree of hidden confounding, with &#915; = 1 representing no unobserved confounding. For each of the 100 draws of the simulation, the treatments and outcomes are generated according to the <ref type="bibr">Jesson et al. (2021)</ref> method, and then 3000 images are randomly selected. Only 3000 samples are selected out of the complete data set (42000 images) in order to introduce sparsity in the data space, so that we can emulate the curse of dimensionality. We set &#915; = 1.1 to mimic a real-world scenario, since full unconfoundedness is not plausible in the real world. 
In addition to News and HC-MNIST dataset, we also perform experiment on 100 draws of Infant Health and Development Program (IHDP) dataset by <ref type="bibr">Hill (2011)</ref> which includes covariates from a randomized controlled trial devised to assess how cognitive test scores in premature infants are impacted by visits from specialized doctors. IHDP data set is a low dimensional dataset with 25 covariates of which 16 are binary and 9 are continuous. While this dataset falls outside the primary scope of our algorithm due to its low dimensionality, we have included its results in the appendix to demonstrate that our method remains functional even for low-dimensional data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Hyperparameters for Deep learning models</head><p>The hyperparameters for the deep learning models are presented in Table <ref type="table">3</ref>. The hyperparameters are found using the method described in <ref type="bibr">Shalit et al. (2017)</ref>. The loss functions of these deep learning models consist of multiple terms; all terms are weighted equally, i.e. the weight/imbalance parameters are set to 1. In all the models, the initial two layers do not have L2 regularization; every other layer has L2 regularization weighted by 0.01. One significant observation is that the hyperparameters for early stopping and learning rate were more important than the neural network architecture for achieving strong performance. For training CFRNet, we use the Adam optimizer <ref type="bibr">(Kingma and Ba, 2014)</ref>, and for DragonNet we use Nesterov accelerated gradient descent (NAG) <ref type="bibr">(Nesterov, 1983)</ref> with momentum.</p><p>Table <ref type="table">3</ref>: Hyperparameters for the deep learning models. Columns: Layers; Early Stopping Patience; Starting Learning Rate (LR); LR Reduction Patience. News - CFRNet: [400,400,[200,200]]; 40; 2.5e-3; 10. News - DragonNet: [200,200,[100,100]]; 40; 1e-7; 5. HCMNIST - CFRNet: [512,256,[256,128]]; 30; 2.5e-3; 10. HCMNIST - DragonNet: [512,256,128,[64,32]]; 20; 1e-7; 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Results</head><p>For the News data set, we observe that RHPT matching significantly outperforms traditional matching techniques on all metrics. Additionally, it is noteworthy that the within-sample &#1013;_ATE error for both RHPT matching and the deep learning models is almost the same, while the &#1013;_ITE error and the out-of-sample &#1013;_ATE error are slightly worse for RHPT, but still within the same ballpark as the deep learning models. There is a significant difference between RHPT and the deep learning models for the &#1013;_PEHE error. RHPT matching outperforms CFRNet and DragonNet in terms of the wall-time needed for estimating the causal effect. Due to the high dimensionality and large sample size of the News data set, the deep learning models take a significant amount of time to execute on a CPU.</p><p>Even when a GPU is used, the wall time for RHPT matching is less than half of the GPU wall time of the deep learning models. As seen from Table <ref type="table">2</ref>, the HCMNIST data set appears to be more challenging for causal effect estimation than the News data set, as the improvement from traditional matching techniques to RHPT and the deep learning models is not as pronounced as it is for the News data set. The in-sample &#1013;_ATE error for all methods is nearly the same, with RHPT performing the best. The out-of-sample &#1013;_ATE error is significantly better for RHPT matching when compared to traditional matching techniques, and on par with the deep learning models. The &#1013;_PEHE error is significantly better for RHPT matching than for traditional matching techniques, but the deep learning methods outperform all of the matching techniques.</p><p>RHPT outperforms traditional matching techniques in terms of both in-sample and out-of-sample metrics when the dimensionality of the RHPT representation is sufficiently large (&#8805; 10k), with only a modest increase in computational cost. 
We see that our matching technique significantly and consistently outperforms all traditional matching techniques in terms of &#1013;_ATE, &#1013;_ITE and &#1013;_PEHE, and is competitive with CFRNet and DragonNet. The deep learning methods slightly outperform RHPT in terms of out-of-sample &#1013;_PEHE error, and RHPT is competitive with the deep learning methods in terms of within-sample error. In both cases, RHPT matching incurs a computational cost that is one to two orders of magnitude lower than that of the deep learning methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion and Discussion</head><p>We have introduced a simple, fast, yet effective approach to estimating causal effects from high-dimensional observational data. The proposed method makes use of Random Hyperplane Tessellations (RHPT) to map covariates into a high-dimensional, binary space based on the relationship between covariate values and randomly chosen hyperplanes. We show that RHPT ensures approximate covariate balance, a theoretical requirement for accurate causal effect estimation from observational data. We report results of extensive experiments showing that matching using RHPT outperforms traditional matching techniques and is competitive with state-of-the-art deep learning methods for causal effect estimation. Matching using RHPT also provides several important benefits not offered by deep learning methods. Namely, RHPT avoids the need for computationally expensive training, does not require large amounts of data, and requires little hyperparameter tuning.</p><p>A promising direction for future research is the extension of the proposed method to causal effect estimation in settings where the observations are not independent and identically distributed, e.g., when individuals are linked by social or other ties. Another interesting direction is deeper theoretical analysis of the conditions under which RHPT or similar methods can be expected to yield accurate causal effect estimates. Finally, causal inference in the online learning and continual learning settings remains relatively unexplored. Matching-based approaches such as RHPT are well-suited for this setting, as they do not require iterative training and can be easily updated as new data becomes available.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>&#169; 2024 A. Dalvi, N. Ashtekar &amp; V. Honavar.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>Bias introduced by unobserved covariates that are uncorrelated with the observed covariates is assessed using sensitivity analyses<ref type="bibr">(Hernan and Robins, 2023)</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2"><p>A hash function maps input samples to fixed-size output hash codes. We consider Locality-Sensitive Hashing (LSH), a fuzzy hashing technique that hashes similar input items to the same or similar hash codes.</p></note>
		</body>
		</text>
</TEI>
