

Title: Causal inference in the presence of missing data using a random forest‐based matching algorithm
Observational studies require matching across groups on multiple confounding variables. Across the literature, matching algorithms generally fail to handle missing data, so missing values are routinely imputed before entering the matching process. However, imputation is not always practical, and dropping observations because of the shortcomings of the chosen algorithm decreases the power of the study and may fail to capture crucial latent information. We propose a mechanism for handling missing data that can be incorporated within an iterative multivariate matching method. The underlying framework uses a random forest as a natural tool for constructing a distance matrix, implemented with surrogate splits wherever values are missing. The output is then easily fed into an optimal matching algorithm. We apply this method to evaluate the effectiveness of supplemental instruction (SI) sessions, a voluntary program in which students seek additional help, in a large-enrollment, bottleneck introductory business statistics course. This is an observational study with two groups, those who attend multiple SI sessions and those who do not, and, as is typical in educational data mining, it is challenged by missing data. Additionally, we perform a simulation study of missingness to further demonstrate the efficacy of the proposed approach.
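A minimal sketch of the general workflow is shown below, assuming a confounder matrix `X` with missing entries and a binary treatment indicator `treat`. It is not the authors' implementation: scikit-learn forests do not provide the surrogate splits used in the paper, so median imputation stands in for that step, and optimal 1:1 matching is done with a standard assignment solver.

```python
# Sketch: random-forest proximity as a matching distance, then optimal pairing.
# Assumptions: X holds confounders with NaNs, treat is a 0/1 treatment flag.
# Median imputation below is only a stand-in for the paper's surrogate splits.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
X[rng.random((n, p)) < 0.1] = np.nan          # inject ~10% missingness
treat = rng.integers(0, 2, size=n)

X_filled = SimpleImputer(strategy="median").fit_transform(X)

# Fit a forest to separate treated from control; leaf co-membership across
# trees defines a proximity (similarity) between observations.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_filled, treat)
leaves = rf.apply(X_filled)                    # (n_samples, n_trees) leaf ids
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
dist = 1.0 - prox                              # distance matrix fed to matching

# Optimal 1:1 matching of treated to control units on the forest distance.
t_idx, c_idx = np.flatnonzero(treat == 1), np.flatnonzero(treat == 0)
rows, cols = linear_sum_assignment(dist[np.ix_(t_idx, c_idx)])
pairs = list(zip(t_idx[rows], c_idx[cols]))
print(pairs[:5])
```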
Award ID(s):
1633130
PAR ID:
10445882
Author(s) / Creator(s):
 ;  ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Stat
Volume:
10
Issue:
1
ISSN:
2049-1573
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    The problem of missingness in observational data is ubiquitous. When the confounders are missing at random, multiple imputation is commonly used; however, the method requires congeniality conditions for valid inferences, which may not be satisfied when estimating average causal treatment effects. Alternatively, fractional imputation, proposed by Kim (2011), has been implemented to handle missing values in the regression context. In this article, we develop fractional imputation methods for estimating average treatment effects with confounders missing at random. We show that the fractional imputation estimator of the average treatment effect is asymptotically normal, which permits a consistent variance estimate. Via a simulation study, we compare fractional imputation's accuracy and precision with those of multiple imputation.
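A simplified sketch of the idea, not Kim's exact estimator, is given below: a single binary confounder missing at random is replaced by two fractional copies whose weights come from an estimated conditional probability, and the ATE is then computed with a weighted inverse-probability estimator. The data-generating process and all variable names are illustrative.

```python
# Sketch of fractional imputation for an ATE with one binary confounder X
# missing at random; each incomplete record becomes two weighted copies.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
Z = rng.normal(size=n)
X = (rng.random(n) < 1 / (1 + np.exp(-Z))).astype(int)          # confounder
A = (rng.random(n) < 1 / (1 + np.exp(-(0.5 * X + 0.3 * Z)))).astype(int)
Y = 1.0 * A + 0.8 * X + 0.5 * Z + rng.normal(size=n)            # true ATE = 1
miss = rng.random(n) < 0.3                                       # X goes missing

df = pd.DataFrame({"Z": Z, "A": A, "Y": Y, "X": X.astype(float)})
df.loc[miss, "X"] = np.nan

# Model P(X = 1 | Z, A, Y) on complete cases to build the fractional weights.
cc = df.dropna()
imp_model = LogisticRegression().fit(cc[["Z", "A", "Y"]], cc["X"])

complete = df[df["X"].notna()].assign(w=1.0)
inc = df[df["X"].isna()]
p1 = imp_model.predict_proba(inc[["Z", "A", "Y"]])[:, 1]
fi = pd.concat([complete, inc.assign(X=1.0, w=p1), inc.assign(X=0.0, w=1 - p1)],
               ignore_index=True)

# Weighted IPW (Hajek) estimate of the ATE on the fractionally imputed data.
ps = LogisticRegression().fit(fi[["X", "Z"]], fi["A"], sample_weight=fi["w"])
e = ps.predict_proba(fi[["X", "Z"]])[:, 1]
w, a, y = fi["w"].to_numpy(), fi["A"].to_numpy(), fi["Y"].to_numpy()
ate = (w * a * y / e).sum() / (w * a / e).sum() \
    - (w * (1 - a) * y / (1 - e)).sum() / (w * (1 - a) / (1 - e)).sum()
print(round(ate, 3))   # should land near the true effect of 1.0
```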
  2. We consider the problem of duplicate detection in noisy and incomplete data: given a large data set in which each record has multiple entries (attributes), detect which distinct records refer to the same real-world entity. This task is complicated by noise (such as misspellings) and missing data, which can lead to records differing despite referring to the same entity. Our method consists of three main steps: creating a similarity score between records, grouping records together into “unique entities”, and refining the groups. We compare various methods for creating similarity scores between noisy records, considering different combinations of string matching, term frequency-inverse document frequency methods, and n-gram techniques. In particular, we introduce a vectorized soft term frequency-inverse document frequency method, with an optional refinement step. We also discuss two methods to deal with missing data in computing similarity scores. We test our method on the Los Angeles Police Department Field Interview Card data set, the Cora Citation Matching data set, and two sets of restaurant review data. The results show that the methods that use words as the basic units are preferable to those that use 3-grams. Moreover, in some (but certainly not all) parameter ranges, soft term frequency-inverse document frequency methods can outperform the standard term frequency-inverse document frequency method. The results also confirm that our method for automatically determining the number of groups works well in many cases and allows for accurate results in the absence of a priori knowledge of the number of unique entities in the data set.
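The sketch below illustrates one ingredient of such a pipeline: a plain TF-IDF similarity between records whose non-missing attributes are concatenated into text. The records and the way missing fields are handled are illustrative, not the paper's soft TF-IDF variant.

```python
# Sketch: TF-IDF similarity between records with some attributes missing.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    {"name": "Joe's Pizza", "street": "7 Carmine St", "city": "New York"},
    {"name": "Joes Pizza",  "street": None,           "city": "New York"},
    {"name": "Lombardi's",  "street": "32 Spring St", "city": "New York"},
]

# Missing attributes are simply skipped when building each record's text.
docs = [" ".join(str(v) for v in r.values() if v is not None) for r in records]

tfidf = TfidfVectorizer(analyzer="word").fit_transform(docs)
sim = cosine_similarity(tfidf)          # pairwise record similarity in [0, 1]
print(sim.round(2))                     # records 0 and 1 form the closest pair
```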
  3. Satellite images using multiple wavelength channels provide crucial measurements over large areas, aiding the understanding of pollution generation and transport. However, these images often contain missing data due to cloud cover and algorithm limitations. In this paper, we introduce a novel method for interpolating missing values in satellite images by incorporating pollution transport dynamics influenced by wind patterns. Our approach utilizes a fundamental physics equation to structure the covariance of missing data, improving accuracy by considering pollution transport dynamics. To address computational challenges associated with large datasets, we implement a gradient ascent algorithm. We demonstrate the effectiveness of our method through a case study, showcasing its potential for accurate interpolation in high-resolution, spatio-temporal air pollution datasets. 
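A rough sketch of the general idea, not the paper's method, appears below: missing pixels are filled by conditioning a Gaussian field whose covariance is built on wind-shifted coordinates, so correlation stretches along the assumed transport direction. The grid, wind vector, and kernel are illustrative assumptions.

```python
# Sketch: fill missing pixels via the conditional mean of a Gaussian field
# whose covariance is anisotropic along an assumed wind (transport) direction.
import numpy as np

rng = np.random.default_rng(2)
grid = np.array([(i, j) for i in range(15) for j in range(15)], dtype=float)
wind = np.array([1.0, 0.5])                       # assumed transport direction

# Squared-exponential covariance on coordinates compressed along the wind,
# so points downwind of each other stay more strongly correlated.
shifted = grid - 0.5 * (grid @ np.outer(wind, wind)) / (wind @ wind)
d2 = ((shifted[:, None, :] - shifted[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * 3.0 ** 2)) + 1e-6 * np.eye(len(grid))

field = rng.multivariate_normal(np.zeros(len(grid)), K)    # synthetic "image"
obs = rng.random(len(grid)) > 0.3                          # ~30% pixels missing

# Conditional mean of missing pixels given observed ones: K_mo K_oo^{-1} y_obs.
K_oo, K_mo = K[np.ix_(obs, obs)], K[np.ix_(~obs, obs)]
filled = K_mo @ np.linalg.solve(K_oo, field[obs])
print(np.abs(filled - field[~obs]).mean())                 # interpolation error
```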
  4. Subgraph matching is a core primitive across a number of disciplines, ranging from data mining, databases, and information retrieval to computer vision and natural language processing. Despite decades of effort, it is still highly challenging to balance matching accuracy and computational efficiency, especially when the query graph and/or the data graph are large. In this paper, we propose an index-based algorithm (G-FINDER) to find the top-k approximate matching subgraphs. At the heart of the proposed algorithm are two techniques: (1) a novel auxiliary data structure (LOOKUP-TABLE) in conjunction with a neighborhood expansion method to effectively and efficiently index candidate vertices, and (2) a dynamic filtering and refinement strategy to prune false candidates at an early stage. The proposed G-FINDER bears some distinctive features, including (1) generality, being able to handle different types of inexact matching (e.g., missing nodes, missing edges, intermediate vertices) on node-attributed and/or edge-attributed graphs or multigraphs; (2) effectiveness, achieving up to 30% F1-score improvement over the best known competitor; and (3) efficiency, scaling near-linearly with the size of the data graph as well as the query graph.
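The sketch below shows a standard label- and degree-based candidate-filtering step of the kind many subgraph-matching systems rely on; it is not G-FINDER's LOOKUP-TABLE or refinement strategy, and the graphs and labels are illustrative.

```python
# Sketch: prune candidate data vertices for each query vertex by label match
# and a degree lower bound, a common filtering step before subgraph matching.
import networkx as nx

data_g = nx.Graph()
data_g.add_nodes_from([(0, {"label": "A"}), (1, {"label": "B"}),
                       (2, {"label": "A"}), (3, {"label": "B"})])
data_g.add_edges_from([(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)])

query_g = nx.Graph()
query_g.add_nodes_from([("u0", {"label": "A"}), ("u1", {"label": "B"})])
query_g.add_edge("u0", "u1")

def candidates(query, data):
    """Label- and degree-based pruning of candidate data vertices."""
    cand = {}
    for u, attrs in query.nodes(data=True):
        cand[u] = [v for v, a in data.nodes(data=True)
                   if a["label"] == attrs["label"]
                   and data.degree(v) >= query.degree(u)]
    return cand

print(candidates(query_g, data_g))   # {'u0': [0, 2], 'u1': [1, 3]}
```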
  5. Alzheimer's disease is the most common neurodegenerative disease. The aim of this study is to infer structural changes in brain connectivity resulting from disease progression using cortical thickness measurements from a cohort of participants who were healthy controls, participants with mild cognitive impairment, or Alzheimer's disease patients. For this purpose, we develop a novel approach for inference of multiple networks with related edge values across groups. Specifically, we infer a Gaussian graphical model for each group within a joint framework, where we rely on Bayesian hierarchical priors to link the precision matrix entries across groups. Our proposal differs from existing approaches in that it flexibly learns which groups have the most similar edge values, and accounts for the strength of connection (rather than only edge presence or absence) when sharing information across groups. Our results identify key alterations in structural connectivity that may reflect disruptions to the healthy brain, such as decreased connectivity within the occipital lobe with increasing disease severity. We also illustrate the proposed method through simulations, where we demonstrate its performance in structure learning and precision matrix estimation relative to alternative approaches.
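As a rough, non-Bayesian stand-in, the sketch below estimates a sparse precision matrix per group with off-the-shelf graphical lasso; the hierarchical prior that links edge values across groups is not reproduced, and the data dimensions and group labels are illustrative.

```python
# Sketch: per-group sparse precision (inverse covariance) estimation as a
# stand-in for joint Gaussian graphical model inference across groups.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(3)
groups = {"control": rng.normal(size=(120, 10)),
          "mci": rng.normal(size=(110, 10)),
          "ad": rng.normal(size=(90, 10))}       # e.g. cortical thickness ROIs

precisions = {}
for name, X in groups.items():
    model = GraphicalLasso(alpha=0.2).fit(X)     # sparse precision per group
    precisions[name] = model.precision_

# Edges are nonzero off-diagonal precision entries; compare counts by group.
edges = {name: int((np.abs(np.triu(P, 1)) > 1e-4).sum())
         for name, P in precisions.items()}
print(edges)
```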