In medical studies, time-to-event outcomes such as time to death or relapse of a disease are routinely recorded along with longitudinal data that are observed intermittently during the follow-up period. Marginal approaches that model the event time and the longitudinal data separately tend to induce bias and lose efficiency. Instead, a joint modeling approach that brings the two types of data together can reduce or eliminate the bias and yield a more efficient estimation procedure. A well-established avenue for joint modeling is the joint likelihood approach, which often produces semiparametric efficient estimators for the finite-dimensional parameter vectors in both models. Through a transformation survival model with an unspecified baseline hazard function, this review introduces joint modeling that accommodates both baseline and time-varying covariates. The focus is on the major challenges faced by joint modeling and how they can be overcome. A review of available software implementations and a brief discussion of future directions of the field are also included.
Efficient Algorithms and Implementation of a Semiparametric Joint Model for Longitudinal and Competing Risk Data: With Applications to Massive Biobank Data
Semiparametric joint models of longitudinal and competing risk data are computationally costly, and their current implementations do not scale well to massive biobank data. This paper identifies and addresses some key computational barriers in a semiparametric joint model for longitudinal and competing risk survival data. By developing and implementing customized linear scan algorithms, we reduce the computational complexities from O(n^2) or O(n^3) to O(n) in various steps, including numerical integration, risk set calculation, and standard error estimation, where n is the number of subjects. Using both simulated and real-world biobank data, we demonstrate that these linear scan algorithms can speed up the existing methods by a factor of up to hundreds of thousands when n > 10^4, often reducing the runtime from days to minutes. We have developed an R package, FastJM, based on the proposed algorithms for joint modeling of longitudinal and competing risk time-to-event data and made it publicly available on the Comprehensive R Archive Network (CRAN).
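The risk set calculation mentioned above is the classic setting for a linear scan: naively, each subject's risk-set sum requires a pass over all n subjects, giving O(n^2) work, but if event times are sorted, a single backward cumulative sum recovers every risk-set sum in O(n). The sketch below is illustrative only (it is not the FastJM implementation, and the function names are made up for this example):

```python
def risk_set_sums_naive(times, weights):
    """O(n^2): for each subject i, sum the weights over the
    risk set {j : t_j >= t_i} with a full inner pass."""
    return [sum(w for t, w in zip(times, weights) if t >= ti)
            for ti in times]

def risk_set_sums_linear(times, weights):
    """O(n) linear scan: with times sorted in increasing order (and
    assumed distinct), scan backward from the largest time while
    accumulating a running sum; out[i] is the risk-set sum at t_i."""
    n = len(times)
    out = [0.0] * n
    running = 0.0
    for i in range(n - 1, -1, -1):
        running += weights[i]
        out[i] = running
    return out

times = [1.0, 2.0, 3.0, 4.0]
weights = [0.5, 1.0, 2.0, 4.0]
print(risk_set_sums_naive(times, weights))   # [7.5, 7.0, 6.0, 4.0]
print(risk_set_sums_linear(times, weights))  # [7.5, 7.0, 6.0, 4.0]
```

The same backward-accumulation idea extends to the weighted sums that appear in score and standard error calculations; tied event times need an extra grouping step not shown here.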
- Award ID(s):
- 2054253
- PAR ID:
- 10340223
- Editor(s):
- Maex, Reinoud
- Date Published:
- Journal Name:
- Computational and Mathematical Methods in Medicine
- Volume:
- 2022
- ISSN:
- 1748-670X
- Page Range / eLocation ID:
- 1 to 12
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
We generalize the spatial and subset scan statistics from the single to the multiple subset case. The two main approaches to defining the log-likelihood ratio statistic in the single subset case—the population-based and expectation-based scan statistics—are considered, leading to risk partitioning and multiple cluster detection scan statistics, respectively. We show that, for distributions in a separable exponential family, the risk partitioning scan statistic can be expressed as a scaled f-divergence of the normalized count and baseline vectors, and the multiple cluster detection scan statistic as a sum of scaled Bregman divergences. In either case, however, maximization of the scan statistic by exhaustive search over all partitionings of the data requires exponential time. To make this optimization computationally feasible, we prove sufficient conditions under which the optimal partitioning is guaranteed to be consecutive. This Consecutive Partitions Property generalizes the linear-time subset scanning property from two partitions (the detected subset and the remaining data elements) to the multiple partition case. While the number of consecutive partitionings of n elements into t partitions scales as O(n^(t−1)), making it computationally expensive for large t, we present a dynamic programming approach which identifies the optimal consecutive partitioning in O(n^2 t) time, thus allowing for the exact and efficient solution of large-scale risk partitioning and multiple cluster detection problems. Finally, we demonstrate the detection performance and practical utility of partition scan statistics using simulated and real-world data. Supplementary materials for this article are available online.
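The O(n^2 t) dynamic program over consecutive partitionings has a standard shape: the best score for covering the first j elements with k segments is the best over all split points i of (best for first i elements with k−1 segments) plus the score of segment [i, j). The sketch below is a generic illustration under a hypothetical additive segment score (the paper's actual scores are the divergence-based scan statistics described above), counting O(1) per score evaluation:

```python
def best_consecutive_partition(n, t, segment_score):
    """Maximize the sum of segment scores over all partitionings of
    elements 0..n-1 into t consecutive non-empty segments.
    Returns (best total score, interior cut points)."""
    NEG = float("-inf")
    # dp[k][j]: best score covering the first j elements with k segments
    dp = [[NEG] * (n + 1) for _ in range(t + 1)]
    arg = [[-1] * (n + 1) for _ in range(t + 1)]
    dp[0][0] = 0.0
    for k in range(1, t + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):          # last segment is [i, j)
                if dp[k - 1][i] == NEG:
                    continue
                s = dp[k - 1][i] + segment_score(i, j)
                if s > dp[k][j]:
                    dp[k][j], arg[k][j] = s, i
    # Walk back through arg to recover the chosen split points.
    cuts, j = [], n
    for k in range(t, 0, -1):
        cuts.append(arg[k][j])
        j = arg[k][j]
    return dp[t][n], sorted(cuts)[1:]          # drop the leading 0

# Toy example: split into 2 segments maximizing the sum of segment maxima.
data = [1, 9, 8, 2, 1]
score, cuts = best_consecutive_partition(5, 2, lambda i, j: max(data[i:j]))
print(score, cuts)  # 17, cut before index 2 (segments [1, 9] and [8, 2, 1])
```

The triple loop performs O(n^2 t) score evaluations; in practice per-segment scores are made O(1) with precomputed prefix sums of counts and baselines.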
-
We consider the classical Minimum Balanced Cut problem: given a graph $$G$$, compute a partition of its vertices into two subsets of roughly equal volume, while minimizing the number of edges connecting the subsets. We present the first {\em deterministic, almost-linear time} approximation algorithm for this problem. Specifically, our algorithm, given an $$n$$-vertex $$m$$-edge graph $$G$$ and any parameter $$1\leq r\leq O(\log n)$$, computes a $$(\log m)^{r^2}$$-approximation for Minimum Balanced Cut on $$G$$, in time $$O\left ( m^{1+O(1/r)+o(1)}\cdot (\log m)^{O(r^2)}\right )$$. In particular, we obtain a $$(\log m)^{1/\epsilon}$$-approximation in time $$m^{1+O(1/\sqrt{\epsilon})}$$ for any constant $$\epsilon$$, and a $$(\log m)^{f(m)}$$-approximation in time $$m^{1+o(1)}$$, for any slowly growing function $$f(m)$$. We obtain deterministic algorithms with similar guarantees for the Sparsest Cut and the Lowest-Conductance Cut problems. Our algorithm for the Minimum Balanced Cut problem in fact provides a stronger guarantee: it either returns a balanced cut whose value is close to a given target value, or it certifies that such a cut does not exist by exhibiting a large subgraph of $$G$$ that has high conductance. We use this algorithm to obtain deterministic algorithms for dynamic connectivity and minimum spanning forest, whose worst-case update time on an $$n$$-vertex graph is $$n^{o(1)}$$, thus resolving a major open problem in the area of dynamic graph algorithms. Our work also implies deterministic algorithms for a host of additional problems, whose time complexities match, up to subpolynomial in $$n$$ factors, those of known randomized algorithms. The implications include almost-linear time deterministic algorithms for solving Laplacian systems and for approximating maximum flows in undirected graphs.
-
Recent reporting has revealed that the UK Biobank (UKB)—a large, publicly-funded research database containing highly-sensitive health records of over half a million participants—has shared its data with private insurance companies seeking to develop actuarial AI systems for analyzing risk and predicting health. While news reports have characterized this as a significant breach of public trust, the UKB contends that insurance research is “in the public interest,” and that all research participants are adequately protected from the possibility of insurance discrimination via data de-identification. Here, we contest both of these claims. Insurers use population data to identify novel categories of risk, which become fodder in the production of black-boxed actuarial algorithms. The deployment of these algorithms, as we argue, has the potential to increase inequality in health and decrease access to insurance. Importantly, these types of harms are not limited just to UKB participants: instead, they are likely to proliferate unevenly across various populations within global insurance markets via practices of profiling and sorting based on the synthesis of multiple data sources, alongside advances in data analysis capabilities, over space/time. This necessitates a significantly expanded understanding of the publics who must be involved in biobank governance and data-sharing decisions involving insurers.
-
We propose a two-stage estimation procedure for a copula-based model with semi-competing risks data, where the non-terminal event is subject to dependent censoring by the terminal event, and both events are subject to independent censoring. With a copula-based model, the marginal survival functions of individual event times are specified by semiparametric transformation models, and the dependence between the bivariate event times is specified by a parametric copula function. In the first stage, the parameters associated with the marginal of the terminal event are estimated using only the corresponding observed outcomes; in the second stage, the marginal parameters for the non-terminal event time and the copula parameter are estimated together by maximizing a pseudo-likelihood function based on the joint distribution of the bivariate event times. We derive the asymptotic properties of the proposed estimator and provide an analytic variance estimator for inference. Through simulation studies, we show that our approach leads to consistent estimates with less computational cost and more robustness than the one-stage procedure developed in Chen (2012), where all parameters were estimated simultaneously. In addition, our approach demonstrates better finite-sample performance than the existing two-stage estimation method proposed in Zhu et al. (2021). An R package PMLE4SCR is developed to implement our proposed method.
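The second stage above reduces to maximizing a copula pseudo-likelihood with the marginals held fixed. As a hypothetical, stripped-down illustration (not the PMLE4SCR method, and ignoring censoring entirely): suppose stage 1 has already produced uniform pseudo-observations, and stage 2 fits a Clayton copula dependence parameter by a simple grid search over its log-likelihood.

```python
import math
import random

def clayton_logpdf(u, v, theta):
    """Log-density of the Clayton copula at (u, v), theta > 0."""
    return (math.log(1.0 + theta)
            - (theta + 1.0) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(u ** -theta + v ** -theta - 1.0))

def sample_clayton(n, theta, rng):
    """Draw n pairs from a Clayton copula via the conditional method,
    clamping the uniforms away from 0 and 1 for numerical safety."""
    pairs = []
    for _ in range(n):
        u = min(max(rng.random(), 1e-12), 1 - 1e-12)
        w = min(max(rng.random(), 1e-12), 1 - 1e-12)
        v = (u ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        pairs.append((u, v))
    return pairs

def stage2_grid_mle(pairs, grid):
    """Stage 2: maximize the copula pseudo-likelihood over a theta grid,
    with the (stage 1) marginals treated as fixed."""
    return max(grid, key=lambda th: sum(clayton_logpdf(u, v, th) for u, v in pairs))

rng = random.Random(42)
pairs = sample_clayton(2000, 2.0, rng)             # pretend stage 1 output
grid = [0.1 * k for k in range(1, 61)]             # theta in (0, 6]
theta_hat = stage2_grid_mle(pairs, grid)
print(round(theta_hat, 1))                         # close to the true theta = 2.0
```

In the actual semi-competing risks setting the pseudo-likelihood must account for dependent and independent censoring, and optimization would use a proper numerical maximizer rather than a grid; the sketch only shows why decoupling the stages makes the second-stage problem low-dimensional.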