Missing data are ubiquitous in many domains such as healthcare. When these data entries are not missing completely at random, the (conditional) independence relations in the observed data may be different from those in the complete data generated by the underlying causal process. Consequently, simply applying existing causal discovery methods to the observed data may lead to wrong conclusions. In this paper, we aim to develop a causal discovery method that recovers the underlying causal structure from observed data that are missing under different mechanisms, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). With missingness mechanisms represented by missingness graphs (m-graphs), we analyze conditions under which additional correction is needed to derive conditional independence/dependence relations in the complete data. Based on our analysis, we propose Missing Value PC (MVPC), which extends the PC algorithm to incorporate additional corrections. Our proposed MVPC is shown in theory to give asymptotically correct results even on data that are MAR or MNAR. Experimental results on both synthetic data and real healthcare applications illustrate that the proposed algorithm is able to find correct causal relations even in the general case of MNAR.
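The claim that independence relations in the observed data can differ from those in the complete data can be illustrated with a small simulation. The sketch below is not MVPC; it only assumes two independent standard normal variables and a hypothetical missingness rule (a row is kept only when `x + y > 0`), and the function names are illustrative.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def observed_correlations(n=20000, seed=0):
    """Compare corr(X, Y) in complete data and after two kinds of deletion."""
    rng = random.Random(seed)
    # X and Y are generated independently, so their true correlation is 0.
    rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]

    # MCAR (assumed rule): every row is dropped with probability 0.5,
    # regardless of its values.
    mcar = [row for row in rows if rng.random() < 0.5]

    # MNAR (assumed rule): a row is observed only when x + y > 0, so
    # missingness depends on the value of Y itself as well as on X.
    mnar = [(x, y) for x, y in rows if x + y > 0]

    def corr(sample):
        xs, ys = zip(*sample)
        return pearson(xs, ys)

    return {"complete": corr(rows), "mcar": corr(mcar), "mnar": corr(mnar)}


if __name__ == "__main__":
    for name, value in observed_correlations().items():
        print(f"{name:9s} corr(X, Y) = {value:+.3f}")
```

Under this assumed rule the rows that survive show a clearly negative correlation between X and Y, even though the variables are independent in the complete data. The MCAR sample stays close to zero. An independence test run naively on the surviving rows would therefore report a spurious dependence.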
Missing not at random and the nonparametric estimation of the spectral density
The aim of the article is twofold: (i) present a pivotal setting where using an extra experiment for restoring information lost due to missing not at random (MNAR) is practically feasible; (ii) attract attention to a wide spectrum of new research topics created by the proposed methodology of exploring the missing mechanism. It is well known that if the likelihood of missing an observation depends on its value, then the missingness is MNAR, no consistent estimation is possible, and the only way to recover destroyed information is to study the likelihood of missing via an extra experiment. One of the main practical issues with an extra-sample approach is as follows. Let $n$ and $m$ be the numbers of observations in a MNAR time series and in an extra sample exploring the likelihood of missing, respectively. An oracle that knows the likelihood of missing can estimate an ARMA-type spectral density with the MISE proportional to , while a differentiable likelihood may be estimated only with the MISE proportional to $m^{-2/3}$. At first glance, these familiar facts suggest that the proposed approach is impractical because $m$ must be in order larger than $n$ to match the oracle. Surprisingly, the article presents the theory and a numerical study indicating that $m$ may be in order smaller than $n$ and still the statistician can match the performance of the oracle. The proposed methodology is used for the analysis of a MNAR time series of systolic blood pressure of a person with immunoglobulin D multiple myeloma. A number of possible extensions and future research topics are outlined.
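To see why knowing the likelihood of missing matters, consider a much simpler target than a spectral density: the mean of the underlying values. The sketch below is not the article's estimator. It assumes an oracle that knows a hypothetical logistic availability function `availability(x)` and applies standard inverse-probability weighting to the observed values; all names and the choice of distribution are illustrative.

```python
import math
import random


def availability(x):
    """Hypothetical likelihood that a value x is observed (logistic in x)."""
    return 1.0 / (1.0 + math.exp(-x))


def naive_and_weighted_mean(n=50000, seed=1):
    """Return (naive mean, inverse-probability-weighted mean) of MNAR data."""
    rng = random.Random(seed)
    # Underlying values have true mean 0.
    full = [rng.gauss(0, 1) for _ in range(n)]
    # MNAR: the chance of observing a value depends on the value itself.
    observed = [x for x in full if rng.random() < availability(x)]

    # Naive estimate: average whatever was observed (biased upward here,
    # because large values are more likely to be observed).
    naive = sum(observed) / len(observed)

    # Oracle estimate: weight each observed value by 1 / availability(x),
    # then normalize by the sum of weights.
    weights = [1.0 / availability(x) for x in observed]
    weighted = sum(w * x for w, x in zip(weights, observed)) / sum(weights)
    return naive, weighted


if __name__ == "__main__":
    naive, weighted = naive_and_weighted_mean()
    print(f"naive mean    = {naive:+.3f}")
    print(f"weighted mean = {weighted:+.3f}")
```

In this toy setting the naive average is noticeably biased while the weighted average is close to the true mean of 0. In practice the availability function is unknown, which is where an extra sample for estimating it would come in.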
- Award ID(s): 1915845
- PAR ID: 10456450
- Publisher / Repository: Wiley-Blackwell
- Date Published:
- Journal Name: Journal of Time Series Analysis
- Volume: 41
- Issue: 5
- ISSN: 0143-9782
- Page Range / eLocation ID: p. 652-675
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Traditional methods for handling incomplete data, including Multiple Imputation and Maximum Likelihood, require that the data be Missing At Random (MAR). In most cases, however, missingness in a variable depends on the underlying value of that variable. In this work, we devise model-based methods to consistently estimate mean, variance and covariance given data that are Missing Not At Random (MNAR). While previous work on MNAR data requires variables to be discrete, we extend the analysis to continuous variables drawn from Gaussian distributions. We demonstrate the merits of our techniques by comparing them empirically to state-of-the-art software packages.
-
Aims. We investigate the previous microlensing data collected by the KMTNet survey in search of anomalous events for which no precise interpretations of the anomalies had been suggested. From this investigation, we find that the anomaly in the lensing light curve of the event KMT-2021-BLG-1547 is approximately described by a binary-lens (2L1S) model with a lens possessing a giant planet, but the model leaves unexplained residuals. Methods. We investigated the origin of the residuals by testing more sophisticated models that include either an extra lens component (3L1S model) or an extra source star (2L2S model) on top of the 2L1S configuration of the lens system. From these analyses, we find that the residuals from the 2L1S model originate from the existence of a faint companion to the source. The 2L2S solution substantially reduces the residuals and improves the model fit by $\Delta\chi^2 = 67.1$ with respect to the 2L1S solution. The 3L1S solution also improves the fit, but its fit is worse than that of the 2L2S solution by $\Delta\chi^2 = 24.7$. Results. According to the 2L2S solution, the lens of the event is a planetary system with planet and host masses $(M_{\rm p}/M_{\rm J}, M_{\rm h}/M_\odot) = (1.47^{+0.64}_{-0.77}, 0.72^{+0.32}_{-0.38})$ lying at a distance $D_{\rm L} = 5.07^{+0.98}_{-1.50}$ kpc, and the source is a binary composed of a subgiant primary of a late G or an early K spectral type and a main-sequence companion of a K spectral type. The event demonstrates the need for sophisticated modeling of unexplained anomalies if one wants to construct a complete microlensing planet sample.
-
Abstract. A search for long-lived heavy neutrinos (N) in the decays of B mesons produced in proton-proton collisions at $\sqrt{s} = 13$ TeV is presented. The data sample corresponds to an integrated luminosity of 41.6 fb$^{-1}$ collected in 2018 by the CMS experiment at the CERN LHC, using a dedicated data stream that enhances the number of recorded events containing B mesons. The search probes heavy neutrinos with masses in the range $1 < m_{\rm N} < 3$ GeV and decay lengths in the range $10^{-2} < c\tau_{\rm N} < 10^{4}$ mm, where $\tau_{\rm N}$ is the N proper mean lifetime. Signal events are defined by the signature $\mathrm{B} \to \ell_{\rm B}\mathrm{N}X$; $\mathrm{N} \to \ell^{\pm}\pi^{\mp}$, where the leptons $\ell_{\rm B}$ and $\ell$ can be either a muon or an electron, provided that at least one of them is a muon. The hadronic recoil system, $X$, is treated inclusively and is not reconstructed. No significant excess of events over the standard model background is observed in any of the $\ell^{\pm}\pi^{\mp}$ invariant mass distributions. Limits at 95% confidence level on the sum of the squares of the mixing amplitudes between heavy and light neutrinos, $|V_{\rm N}|^2$, and on $c\tau_{\rm N}$ are obtained in different mixing scenarios for both Majorana and Dirac-like N particles. The most stringent upper limit $|V_{\rm N}|^2 < 2.0 \times 10^{-5}$ is obtained at $m_{\rm N} = 1.95$ GeV for the Majorana case where N mixes exclusively with muon neutrinos. The limits on $|V_{\rm N}|^2$ for masses $1 < m_{\rm N} < 1.7$ GeV are the most stringent from a collider experiment to date.
-
Summary. We propose and prove the optimality of a Bayesian approach for estimating the latent positions in random dot product graphs, which we call posterior spectral embedding. Unlike classical spectral-based adjacency, or Laplacian spectral embedding, posterior spectral embedding is a fully likelihood-based graph estimation method that takes advantage of the Bernoulli likelihood information of the observed adjacency matrix. We develop a minimax lower bound for estimating the latent positions, and show that posterior spectral embedding achieves this lower bound in the following two senses: it both results in a minimax-optimal posterior contraction rate and yields a point estimator achieving the minimax risk asymptotically. The convergence results are subsequently applied to clustering in stochastic block models with positive semidefinite block probability matrices, strengthening an existing result concerning the number of misclustered vertices. We also study a spectral-based Gaussian spectral embedding as a natural Bayesian analogue of adjacency spectral embedding, but the resulting posterior contraction rate is suboptimal by an extra logarithmic factor. The practical performance of the proposed methodology is illustrated through extensive synthetic examples and the analysis of Wikipedia graph data.
