Title: FLAME: A Fast Large-scale Almost Matching Exactly Approach to Causal Inference
A classical problem in causal inference is that of matching, where treatment units need to be matched to control units based on covariate information. In this work, we propose a method that computes high-quality almost-exact matches for high-dimensional categorical datasets. This method, called FLAME (Fast Large-scale Almost Matching Exactly), learns a distance metric for matching using a hold-out training data set. In order to perform matching efficiently for large datasets, FLAME leverages techniques that are natural for query processing in the area of database management, and two implementations of FLAME are provided: the first uses SQL queries and the second uses bit-vector techniques. The algorithm starts by constructing matches of the highest quality (exact matches on all covariates), and successively eliminates variables in order to match exactly on as many variables as possible, while still maintaining interpretable high-quality matches and balance between treatment and control groups. We leverage these high-quality matches to estimate conditional average treatment effects (CATEs). Our experiments show that FLAME scales to huge datasets with millions of observations where existing state-of-the-art methods fail, and that it achieves significantly better performance than other matching methods.
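The match-then-eliminate loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' SQL or bit-vector implementation. It rests on several assumptions: the function name `flame_sketch`, the dictionary layout of each unit, and the `treated` and `outcome` keys are hypothetical, and the order in which covariates are dropped is passed in by the caller as `drop_order`, standing in for the hold-out-based choice that the abstract attributes to FLAME. The sketch also assumes that matched units are removed before the next round.

```python
from collections import defaultdict


def flame_sketch(units, covariates, drop_order):
    """Simplified almost-exact matching (illustrative only).

    units: list of dicts holding categorical covariates plus
        'treated' (0 or 1) and 'outcome' (hypothetical layout).
    covariates: names of the covariates to match on in the first round.
    drop_order: covariates to eliminate, one per round. ASSUMPTION: this
        fixed order replaces FLAME's learned, hold-out-based selection.
    Returns a list of (covariates_used, key, cate_estimate, member_indices).
    """
    remaining = set(range(len(units)))
    active = list(covariates)
    pending = list(drop_order)
    groups = []
    while active and remaining:
        # Bucket the unmatched units by their values on the active covariates.
        buckets = defaultdict(list)
        for i in sorted(remaining):
            buckets[tuple(units[i][c] for c in active)].append(i)
        for key, members in buckets.items():
            treated = [units[i]["outcome"] for i in members if units[i]["treated"] == 1]
            control = [units[i]["outcome"] for i in members if units[i]["treated"] == 0]
            # A bucket is a valid matched group only if it contains both arms.
            if treated and control:
                cate = sum(treated) / len(treated) - sum(control) / len(control)
                groups.append((tuple(active), key, cate, members))
                remaining -= set(members)  # ASSUMPTION: matched units leave the pool
        if not pending:
            break
        active.remove(pending.pop(0))  # eliminate one covariate and match again
    return groups


example_units = [
    {"a": 1, "b": 0, "treated": 1, "outcome": 5.0},
    {"a": 1, "b": 0, "treated": 0, "outcome": 3.0},
    {"a": 0, "b": 1, "treated": 1, "outcome": 4.0},
    {"a": 0, "b": 0, "treated": 0, "outcome": 1.0},
]
for used, key, cate, members in flame_sketch(example_units, ["a", "b"], ["b"]):
    print(used, key, cate, members)
```

In this toy data, the first two units match exactly on both covariates, while the last two are matched only after `b` is dropped. Each group's estimate is the difference between its mean treated and mean control outcomes, which is one simple way to read a CATE off a matched group.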
Award ID(s):
1703431
PAR ID:
10291692
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
Journal of machine learning research
Volume:
22
Issue:
31
ISSN:
1533-7928
Page Range / eLocation ID:
1-41
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We propose a matching method that recovers direct treatment effects from randomized experiments where units are connected in an observed network, and units that share edges can potentially influence each other's outcomes. Traditional treatment effect estimators for randomized experiments are biased and error-prone in this setting. Our method matches units almost exactly on counts of unique subgraphs within their neighborhood graphs. The matches that we construct are interpretable and high-quality. Our method can be extended easily to accommodate additional unit-level covariate information. We show empirically that our method performs better than other existing methodologies for this problem, while producing meaningful, interpretable results.
  2. We introduce a flexible framework that produces high-quality almost-exact matches for causal inference. Most prior work in matching uses ad-hoc distance metrics, often leading to poor quality matches, particularly when there are irrelevant covariates. In this work, we learn an interpretable distance metric for matching, which leads to substantially higher quality matches. The learned distance metric stretches the covariate space according to each covariate's contribution to outcome prediction: this stretching means that mismatches on important covariates carry a larger penalty than mismatches on irrelevant covariates. Our ability to learn flexible distance metrics leads to matches that are interpretable and useful for the estimation of conditional average treatment effects. 
  3. Adaptive mesh refinement (AMR) is the art of solving PDEs on a mesh hierarchy with increasing mesh refinement at each level of the hierarchy. Accurate treatment on AMR hierarchies requires accurate prolongation of the solution from a coarse mesh to a newly defined finer mesh. For scalar variables, suitably high-order finite volume WENO methods can carry out such a prolongation. However, classes of PDEs, such as computational electrodynamics (CED) and magnetohydrodynamics (MHD), require that vector fields preserve a divergence constraint. The primal variables in such schemes consist of normal components of the vector field that are collocated at the faces of the mesh. As a result, the reconstruction and prolongation strategies for divergence constraint-preserving vector fields are necessarily more intricate. In this paper we present a fourth-order divergence constraint-preserving prolongation strategy that is analytically exact. Extension to higher orders using analytically exact methods is very challenging. To overcome that challenge, a novel WENO-like reconstruction strategy is invented that matches the moments of the vector field in the faces, where the vector field components are collocated. This approach is almost divergence constraint-preserving; we therefore call it WENO-ADP. To make it exactly divergence constraint-preserving, a touch-up procedure is developed that is based on a constrained least squares (CLSQ) method for restoring the divergence constraint up to machine accuracy. With the touch-up, it is called WENO-ADPT. It is shown that refinement ratios of two and higher can be accommodated. An item of broader interest in this work is that we have also been able to invent very efficient finite volume WENO methods, where the coefficients are very easily obtained and the multidimensional smoothness indicators can be expressed as perfect squares.
    We demonstrate that the divergence constraint-preserving strategy works at several high orders for divergence-free vector fields as well as for vector fields whose divergence has to match a charge density and its higher moments. We also show that our methods overcome the late time instability that has been known to plague adaptive computations in CED.
  4. We study the top-k set similarity search problem using semantic overlap. While vanilla overlap requires exact matches between set elements, semantic overlap allows elements that are syntactically different but semantically related to increase the overlap. The semantic overlap is the maximum matching score of a bipartite graph, where an edge weight between two set elements is defined by a user-defined similarity function, e.g., cosine similarity between embeddings. Common techniques like token indexes fail for semantic search since similar elements may be unrelated at the character level. Further, verifying candidates is expensive (cubic versus linear for syntactic overlap), calling for highly selective filters. We propose Koios, the first exact and efficient algorithm for semantic overlap search. Koios leverages sophisticated filters to minimize the number of required graph-matching calculations. Our experiments show that, for medium to large sets, less than 5% of the candidate sets need verification, and more than half of those sets are further pruned without requiring the expensive graph matching. We show the efficiency of our algorithm on four real datasets and demonstrate the improved result quality of semantic over vanilla set similarity search.
  5. Instrumental variable analysis is a powerful tool for estimating causal effects when randomization or full control of confounders is not possible. The application of standard methods such as 2SLS, GMM, and more recent variants is significantly impeded when the causal effects are complex, the instruments are high-dimensional, and/or the treatment is high-dimensional. In this paper, we propose the DeepGMM algorithm to overcome this. Our algorithm is based on a new variational reformulation of GMM with optimal inverse-covariance weighting that allows us to efficiently control very many moment conditions. We further develop practical techniques for optimization and model selection that make it particularly successful in practice. Our algorithm is also computationally tractable and can handle large-scale datasets. Numerical results show our algorithm matches the performance of the best tuned methods in standard settings and continues to work in high-dimensional settings where even recent methods break.