

Title: Causal inference in the presence of missing data using a random forest‐based matching algorithm
Observational studies require matching across groups on multiple confounding variables. Across the literature, matching algorithms generally fail to handle missing data, so missing values are routinely imputed before entering the matching process. However, imputation is not always practical, and dropping observations because of the shortcomings of the chosen algorithm decreases the power of the study and may fail to capture crucial latent information. We propose a mechanism for handling missing data that can be incorporated within an iterative multivariate matching method. The underlying framework uses a random forest as a natural tool for constructing a distance matrix, implemented with surrogate splits wherever values are missing. The output is then easily fed into an optimal matching algorithm. We apply this method to evaluate the effectiveness of supplemental instruction (SI) sessions, a voluntary program in which students seek additional help, in a large-enrollment, bottleneck introductory business statistics course. This is an observational study with two groups, those who attend multiple SI sessions and those who do not, and, as is typical in educational data mining, it is challenged by missing data. Additionally, we perform a simulation study of missingness to further demonstrate the efficacy of the proposed approach.
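A minimal sketch of the general workflow is shown below, assuming a confounder matrix `X` with missing entries and a binary treatment indicator `treat`. It is not the authors' implementation: scikit-learn forests do not provide the surrogate splits used in the paper, so median imputation stands in for that step, and optimal 1:1 matching is done with a standard assignment solver.

```python
# Sketch: random-forest proximity as a matching distance, then optimal pairing.
# Assumptions: X holds confounders with NaNs, treat is a 0/1 treatment flag.
# Median imputation below is only a stand-in for the paper's surrogate splits.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
X[rng.random((n, p)) < 0.1] = np.nan          # inject ~10% missingness
treat = rng.integers(0, 2, size=n)

X_filled = SimpleImputer(strategy="median").fit_transform(X)

# Fit a forest to separate treated from control; leaf co-membership across
# trees defines a proximity (similarity) between observations.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_filled, treat)
leaves = rf.apply(X_filled)                    # (n_samples, n_trees) leaf ids
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
dist = 1.0 - prox                              # distance matrix fed to matching

# Optimal 1:1 matching of treated to control units on the forest distance.
t_idx, c_idx = np.flatnonzero(treat == 1), np.flatnonzero(treat == 0)
rows, cols = linear_sum_assignment(dist[np.ix_(t_idx, c_idx)])
pairs = list(zip(t_idx[rows], c_idx[cols]))
print(pairs[:5])
```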
Award ID(s):
1633130
PAR ID:
10445882
Author(s) / Creator(s):
 ;  ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Stat
Volume:
10
Issue:
1
ISSN:
2049-1573
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    The problem of missingness in observational data is ubiquitous. When the confounders are missing at random, multiple imputation is commonly used; however, the method requires congeniality conditions for valid inferences, which may not be satisfied when estimating average causal treatment effects. Alternatively, fractional imputation, proposed by Kim (2011), has been implemented to handle missing values in the regression context. In this article, we develop fractional imputation methods for estimating average treatment effects with confounders missing at random. We show that the fractional imputation estimator of the average treatment effect is asymptotically normal, which permits a consistent variance estimate. Via a simulation study, we compare fractional imputation's accuracy and precision with those of multiple imputation.
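A simplified sketch of the idea, not Kim's exact estimator, is given below: a single binary confounder missing at random is replaced by two fractional copies whose weights come from an estimated conditional probability, and the ATE is then computed with a weighted inverse-probability estimator. The data-generating process and all variable names are illustrative.

```python
# Sketch of fractional imputation for an ATE with one binary confounder X
# missing at random; each incomplete record becomes two weighted copies.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
Z = rng.normal(size=n)
X = (rng.random(n) < 1 / (1 + np.exp(-Z))).astype(int)          # confounder
A = (rng.random(n) < 1 / (1 + np.exp(-(0.5 * X + 0.3 * Z)))).astype(int)
Y = 1.0 * A + 0.8 * X + 0.5 * Z + rng.normal(size=n)            # true ATE = 1
miss = rng.random(n) < 0.3                                       # X goes missing

df = pd.DataFrame({"Z": Z, "A": A, "Y": Y, "X": X.astype(float)})
df.loc[miss, "X"] = np.nan

# Model P(X = 1 | Z, A, Y) on complete cases to build the fractional weights.
cc = df.dropna()
imp_model = LogisticRegression().fit(cc[["Z", "A", "Y"]], cc["X"])

complete = df[df["X"].notna()].assign(w=1.0)
inc = df[df["X"].isna()]
p1 = imp_model.predict_proba(inc[["Z", "A", "Y"]])[:, 1]
fi = pd.concat([complete, inc.assign(X=1.0, w=p1), inc.assign(X=0.0, w=1 - p1)],
               ignore_index=True)

# Weighted IPW (Hajek) estimate of the ATE on the fractionally imputed data.
ps = LogisticRegression().fit(fi[["X", "Z"]], fi["A"], sample_weight=fi["w"])
e = ps.predict_proba(fi[["X", "Z"]])[:, 1]
w, a, y = fi["w"].to_numpy(), fi["A"].to_numpy(), fi["Y"].to_numpy()
ate = (w * a * y / e).sum() / (w * a / e).sum() \
    - (w * (1 - a) * y / (1 - e)).sum() / (w * (1 - a) / (1 - e)).sum()
print(round(ate, 3))   # should land near the true effect of 1.0
```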
  2. We consider the problem of duplicate detection in noisy and incomplete data: given a large data set in which each record has multiple entries (attributes), detect which distinct records refer to the same real-world entity. This task is complicated by noise (such as misspellings) and missing data, which can lead to records differing despite referring to the same entity. Our method consists of three main steps: creating a similarity score between records, grouping records together into “unique entities”, and refining the groups. We compare various methods for creating similarity scores between noisy records, considering different combinations of string matching, term frequency-inverse document frequency methods, and n-gram techniques. In particular, we introduce a vectorized soft term frequency-inverse document frequency method, with an optional refinement step. We also discuss two methods to deal with missing data in computing similarity scores. We test our method on the Los Angeles Police Department Field Interview Card data set, the Cora Citation Matching data set, and two sets of restaurant review data. The results show that the methods that use words as the basic units are preferable to those that use 3-grams. Moreover, in some (but certainly not all) parameter ranges, soft term frequency-inverse document frequency methods can outperform the standard term frequency-inverse document frequency method. The results also confirm that our method for automatically determining the number of groups works well in many cases and allows for accurate results in the absence of a priori knowledge of the number of unique entities in the data set.
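The sketch below illustrates one ingredient of such a pipeline: a plain TF-IDF similarity between records whose non-missing attributes are concatenated into text. The records and the way missing fields are handled are illustrative, not the paper's soft TF-IDF variant.

```python
# Sketch: TF-IDF similarity between records with some attributes missing.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    {"name": "Joe's Pizza", "street": "7 Carmine St", "city": "New York"},
    {"name": "Joes Pizza",  "street": None,           "city": "New York"},
    {"name": "Lombardi's",  "street": "32 Spring St", "city": "New York"},
]

# Missing attributes are simply skipped when building each record's text.
docs = [" ".join(str(v) for v in r.values() if v is not None) for r in records]

tfidf = TfidfVectorizer(analyzer="word").fit_transform(docs)
sim = cosine_similarity(tfidf)          # pairwise record similarity in [0, 1]
print(sim.round(2))                     # records 0 and 1 form the closest pair
```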
  3. Satellite images using multiple wavelength channels provide crucial measurements over large areas, aiding the understanding of pollution generation and transport. However, these images often contain missing data due to cloud cover and algorithm limitations. In this paper, we introduce a novel method for interpolating missing values in satellite images by incorporating pollution transport dynamics influenced by wind patterns. Our approach utilizes a fundamental physics equation to structure the covariance of missing data, improving accuracy by considering pollution transport dynamics. To address computational challenges associated with large datasets, we implement a gradient ascent algorithm. We demonstrate the effectiveness of our method through a case study, showcasing its potential for accurate interpolation in high-resolution, spatio-temporal air pollution datasets. 
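A rough sketch of the general idea, not the paper's method, appears below: missing pixels are filled by conditioning a Gaussian field whose covariance is built on wind-shifted coordinates, so correlation stretches along the assumed transport direction. The grid, wind vector, and kernel are illustrative assumptions.

```python
# Sketch: fill missing pixels via the conditional mean of a Gaussian field
# whose covariance is anisotropic along an assumed wind (transport) direction.
import numpy as np

rng = np.random.default_rng(2)
grid = np.array([(i, j) for i in range(15) for j in range(15)], dtype=float)
wind = np.array([1.0, 0.5])                       # assumed transport direction

# Squared-exponential covariance on coordinates compressed along the wind,
# so points downwind of each other stay more strongly correlated.
shifted = grid - 0.5 * (grid @ np.outer(wind, wind)) / (wind @ wind)
d2 = ((shifted[:, None, :] - shifted[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * 3.0 ** 2)) + 1e-6 * np.eye(len(grid))

field = rng.multivariate_normal(np.zeros(len(grid)), K)    # synthetic "image"
obs = rng.random(len(grid)) > 0.3                          # ~30% pixels missing

# Conditional mean of missing pixels given observed ones: K_mo K_oo^{-1} y_obs.
K_oo, K_mo = K[np.ix_(obs, obs)], K[np.ix_(~obs, obs)]
filled = K_mo @ np.linalg.solve(K_oo, field[obs])
print(np.abs(filled - field[~obs]).mean())                 # interpolation error
```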
  4. Subgraph matching is a core primitive across a number of disciplines, ranging from data mining, databases, and information retrieval to computer vision and natural language processing. Despite decades of effort, it is still highly challenging to balance matching accuracy and computational efficiency, especially when the query graph and/or the data graph are large. In this paper, we propose an index-based algorithm (G-FINDER) to find the top-k approximate matching subgraphs. At the heart of the proposed algorithm are two techniques: (1) a novel auxiliary data structure (LOOKUP-TABLE) in conjunction with a neighborhood expansion method to effectively and efficiently index candidate vertices, and (2) a dynamic filtering and refinement strategy to prune false candidates at an early stage. The proposed G-FINDER bears some distinctive features, including (1) generality, being able to handle different types of inexact matching (e.g., missing nodes, missing edges, intermediate vertices) on node-attributed and/or edge-attributed graphs or multigraphs; (2) effectiveness, achieving up to 30% F1-score improvement over the best known competitor; and (3) efficiency, scaling near-linearly with the size of the data graph as well as the query graph.
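The sketch below shows a standard label- and degree-based candidate-filtering step of the kind many subgraph-matching systems rely on; it is not G-FINDER's LOOKUP-TABLE or refinement strategy, and the graphs and labels are illustrative.

```python
# Sketch: prune candidate data vertices for each query vertex by label match
# and a degree lower bound, a common filtering step before subgraph matching.
import networkx as nx

data_g = nx.Graph()
data_g.add_nodes_from([(0, {"label": "A"}), (1, {"label": "B"}),
                       (2, {"label": "A"}), (3, {"label": "B"})])
data_g.add_edges_from([(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)])

query_g = nx.Graph()
query_g.add_nodes_from([("u0", {"label": "A"}), ("u1", {"label": "B"})])
query_g.add_edge("u0", "u1")

def candidates(query, data):
    """Label- and degree-based pruning of candidate data vertices."""
    cand = {}
    for u, attrs in query.nodes(data=True):
        cand[u] = [v for v, a in data.nodes(data=True)
                   if a["label"] == attrs["label"]
                   and data.degree(v) >= query.degree(u)]
    return cand

print(candidates(query_g, data_g))   # {'u0': [0, 2], 'u1': [1, 3]}
```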
  5. Alzheimer's disease is the most common neurodegenerative disease. The aim of this study is to infer structural changes in brain connectivity resulting from disease progression using cortical thickness measurements from a cohort of participants who were healthy controls, participants with mild cognitive impairment, or Alzheimer's disease patients. For this purpose, we develop a novel approach for inference of multiple networks with related edge values across groups. Specifically, we infer a Gaussian graphical model for each group within a joint framework, where we rely on Bayesian hierarchical priors to link the precision matrix entries across groups. Our proposal differs from existing approaches in that it flexibly learns which groups have the most similar edge values, and accounts for the strength of connection (rather than only edge presence or absence) when sharing information across groups. Our results identify key alterations in structural connectivity that may reflect disruptions to the healthy brain, such as decreased connectivity within the occipital lobe with increasing disease severity. We also illustrate the proposed method through simulations, where we demonstrate its performance in structure learning and precision matrix estimation relative to alternative approaches.
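As a rough, non-Bayesian stand-in, the sketch below estimates a sparse precision matrix per group with off-the-shelf graphical lasso; the hierarchical prior that links edge values across groups is not reproduced, and the data dimensions and group labels are illustrative.

```python
# Sketch: per-group sparse precision (inverse covariance) estimation as a
# stand-in for joint Gaussian graphical model inference across groups.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(3)
groups = {"control": rng.normal(size=(120, 10)),
          "mci": rng.normal(size=(110, 10)),
          "ad": rng.normal(size=(90, 10))}       # e.g. cortical thickness ROIs

precisions = {}
for name, X in groups.items():
    model = GraphicalLasso(alpha=0.2).fit(X)     # sparse precision per group
    precisions[name] = model.precision_

# Edges are nonzero off-diagonal precision entries; compare counts by group.
edges = {name: int((np.abs(np.triu(P, 1)) > 1e-4).sum())
         for name, P in precisions.items()}
print(edges)
```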