- Award ID(s):
- 2054253
- PAR ID:
- 10340223
- Editor(s):
- Maex, Reinoud
- Date Published:
- Journal Name:
- Computational and Mathematical Methods in Medicine
- Volume:
- 2022
- ISSN:
- 1748-670X
- Page Range / eLocation ID:
- 1 to 12
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
null (Ed.)We consider the classical Minimum Balanced Cut problem: given a graph $G$, compute a partition of its vertices into two subsets of roughly equal volume, while minimizing the number of edges connecting the subsets. We present the first {\em deterministic, almost-linear time} approximation algorithm for this problem. Specifically, our algorithm, given an $n$-vertex $m$-edge graph $G$ and any parameter $1\leq r\leq O(\log n)$, computes a $(\log m)^{r^2}$-approximation for Minimum Balanced Cut on $G$, in time $O\left ( m^{1+O(1/r)+o(1)}\cdot (\log m)^{O(r^2)}\right )$. In particular, we obtain a $(\log m)^{1/\epsilon}$-approximation in time $m^{1+O(1/\sqrt{\epsilon})}$ for any constant $\epsilon$, and a $(\log m)^{f(m)}$-approximation in time $m^{1+o(1)}$, for any slowly growing function $m$. We obtain deterministic algorithms with similar guarantees for the Sparsest Cut and the Lowest-Conductance Cut problems. Our algorithm for the Minimum Balanced Cut problem in fact provides a stronger guarantee: it either returns a balanced cut whose value is close to a given target value, or it certifies that such a cut does not exist by exhibiting a large subgraph of $G$ that has high conductance. We use this algorithm to obtain deterministic algorithms for dynamic connectivity and minimum spanning forest, whose worst-case update time on an $n$-vertex graph is $n^{o(1)}$, thus resolving a major open problem in the area of dynamic graph algorithms. Our work also implies deterministic algorithms for a host of additional problems, whose time complexities match, up to subpolynomial in $n$ factors, those of known randomized algorithms. The implications include almost-linear time deterministic algorithms for solving Laplacian systems and for approximating maximum flows in undirected graphs.more » « less
-
We generalize the spatial and subset scan statistics from the single to the multiple subset case. The two main approaches to defining the log-likelihood ratio statistic in the single subset case—the population-based and expectation-based scan statistics—are considered, leading to risk partitioning and multiple cluster detection scan statistics, respectively. We show that, for distributions in a separable exponential family, the risk partitioning scan statistic can be expressed as a scaled f-divergence of the normalized count and baseline vectors, and the multiple cluster detection scan statistic as a sum of scaled Bregman divergences. In either case, however, maximization of the scan statistic by exhaustive search over all partitionings of the data requires exponential time. To make this optimization computationally feasible, we prove sufficient conditions under which the optimal partitioning is guaranteed to be consecutive. This Consecutive Partitions Property generalizes the linear-time subset scanning property from two partitions (the detected subset and the remaining data elements) to the multiple partition case. While the number of consecutive partitionings of n elements into t partitions scales as O(n^(t−1)), making it computationally expensive for large t, we present a dynamic programming approach which identifies the optimal consecutive partitioning in O(n^2 t) time, thus allowing for the exact and efficient solution of large-scale risk partitioning and multiple cluster detection problems. Finally, we demonstrate the detection performance and practical utility of partition scan statistics using simulated and real-world data. Supplementary materials for this article are available online.more » « less
-
Abstract We consider a regression analysis of longitudinal data in the presence of outcome‐dependent observation times and informative censoring. Existing approaches commonly require a correct specification of the joint distribution of longitudinal measurements, the observation time process, and informative censoring time under the joint modeling framework and can be computationally cumbersome due to the complex form of the likelihood function. In view of these issues, we propose a semiparametric joint regression model and construct a composite likelihood function based on a conditional order statistics argument. As a major feature of our proposed methods, the aforementioned joint distribution is not required to be specified, and the random effect in the proposed joint model is treated as a nuisance parameter. Consequently, the derived composite likelihood bypasses the need to integrate over the random effect and offers the advantage of easy computation. We show that the resulting estimators are consistent and asymptotically normal. We use simulation studies to evaluate the finite‐sample performance of the proposed method and apply it to a study of weight loss data that motivated our investigation.
-
Recent reporting has revealed that the UK Biobank (UKB)—a large, publicly-funded research database containing highly-sensitive health records of over half a million participants—has shared its data with private insurance companies seeking to develop actuarial AI systems for analyzing risk and predicting health. While news reports have characterized this as a significant breach of public trust, the UKB contends that insurance research is “in the public interest,” and that all research participants are adequately protected from the possibility of insurance discrimination via data de-identification. Here, we contest both of these claims. Insurers use population data to identify novel categories of risk, which become fodder in the production of black-boxed actuarial algorithms. The deployment of these algorithms, as we argue, has the potential to increase inequality in health and decrease access to insurance. Importantly, these types of harms are not limited just to UKB participants: instead, they are likely to proliferate unevenly across various populations within global insurance markets via practices of profiling and sorting based on the synthesis of multiple data sources, alongside advances in data analysis capabilities, over space/time. This necessitates a significantly expanded understanding of the publics who must be involved in biobank governance and data-sharing decisions involving insurers.
-
This paper discusses how two classes of approximate computation algorithms can be adapted, in a modular fashion, to achieve exact statistical inference from differentially private data products. Considered are approximate Bayesian computation for Bayesian inference, and Monte Carlo Expectation-Maximization for likelihood inference. Up to Monte Carlo error, inference from these algorithms is exact with respect to the joint specification of both the analyst's original data model, and the curator's differential privacy mechanism. Highlighted is a duality between approximate computation on exact data, and exact computation on approximate data, which can be leveraged by a well-designed computational procedure for statistical inference.more » « less