

Title: Bag of little bootstraps for massive and distributed longitudinal data
Abstract

Linear mixed models are widely used for analyzing longitudinal datasets, and the inference for variance component parameters relies on the bootstrap method. However, health systems and technology companies routinely generate massive longitudinal datasets that make the traditional bootstrap method infeasible. To solve this problem, we extend the highly scalable bag of little bootstraps method for independent data to longitudinal data and develop a highly efficient Julia package, MixedModelsBLB.jl. Simulation experiments and real data analysis demonstrate the favorable statistical performance and computational advantages of our method compared to the traditional bootstrap method. For the statistical inference of variance components, it achieves a 200-fold speedup on the scale of 1 million subjects (20 million total observations), and is the only currently available tool that can handle more than 10 million subjects (200 million total observations) using desktop computers.
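
The general bag of little bootstraps idea named in the abstract can be illustrated with a minimal Python sketch. This is not the MixedModelsBLB.jl implementation or its API: the function names, parameters, and simulated data below are hypothetical, and the estimator is a simple overall mean standing in for the variance components of a linear mixed model. The sketch assumes that whole subjects, not single observations, are the resampling unit, so that within-subject correlation is preserved.

```python
import random
import statistics
from collections import Counter


def blb_standard_error(subjects, n_subsets=5, subset_size=20,
                       n_resamples=50, seed=0):
    """Bag-of-little-bootstraps standard error of the overall mean response.

    `subjects` is a list of per-subject observation lists (hypothetical
    layout). Whole subjects are resampled, never single observations.
    """
    rng = random.Random(seed)
    n = len(subjects)
    subset_ses = []
    for _ in range(n_subsets):
        # Draw a small subset of subjects without replacement.
        subset = rng.sample(subjects, subset_size)
        # Per-subject sums and counts are all the mean estimator needs.
        sums = [sum(obs) for obs in subset]
        counts = [len(obs) for obs in subset]
        estimates = []
        for _ in range(n_resamples):
            # Multinomial weights: n draws spread over the subset's subjects,
            # so each resample mimics a full-size dataset of n subjects.
            weights = Counter(rng.choices(range(subset_size), k=n))
            total = sum(w * sums[i] for i, w in weights.items())
            n_obs = sum(w * counts[i] for i, w in weights.items())
            estimates.append(total / n_obs)
        # Standard error estimated within this subset.
        subset_ses.append(statistics.stdev(estimates))
    # Average the per-subset standard errors.
    return statistics.mean(subset_ses)


def simulate_subjects(n_subjects, n_obs, seed=1):
    """Toy random-intercept data: subject effect plus observation noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_subjects):
        intercept = rng.gauss(0.0, 1.0)
        data.append([intercept + rng.gauss(0.0, 0.5) for _ in range(n_obs)])
    return data


subjects = simulate_subjects(400, 5)
se = blb_standard_error(subjects)
print(f"BLB standard error of the mean: {se:.4f}")
```

Each resample only touches the `subset_size` distinct subjects in its subset, which is where the computational saving of the approach comes from: the estimator is fitted on weighted small subsets rather than on full-size bootstrap copies of the data.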

 
Award ID(s):
2054253 2205441
NSF-PAR ID:
10446128
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Statistical Analysis and Data Mining: The ASA Data Science Journal
Volume:
15
Issue:
3
ISSN:
1932-1864
Page Range / eLocation ID:
p. 314-321
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    Building reliable phylogenies from very large collections of sequences with a limited number of phylogenetically informative sites is challenging because sequencing errors and recurrent/backward mutations interfere with the phylogenetic signal, confounding true evolutionary relationships. Massive global efforts of sequencing genomes and reconstructing the phylogeny of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains exemplify these difficulties since there are only hundreds of phylogenetically informative sites but millions of genomes. For such datasets, we set out to develop a method for building the phylogenetic tree of genomic haplotypes consisting of positions harboring common variants to improve the signal-to-noise ratio for more accurate and fast phylogenetic inference of resolvable phylogenetic features.

    Results

    We present the TopHap approach that determines spatiotemporally common haplotypes of common variants and builds their phylogeny at a fraction of the computational time of traditional methods. We develop a bootstrap strategy that resamples genomes spatiotemporally to assess topological robustness. The application of TopHap to build a phylogeny of 68 057 SARS-CoV-2 genomes (68KG) from the first year of the pandemic produced an evolutionary tree of major SARS-CoV-2 haplotypes. This phylogeny is concordant with the mutation tree inferred using the co-occurrence pattern of mutations and recovers key phylogenetic relationships from more traditional analyses. We also evaluated alternative roots of the SARS-CoV-2 phylogeny and found that the earliest sampled genomes in 2019 likely evolved by four mutations of the most recent common ancestor of all SARS-CoV-2 genomes. An application of TopHap to more than 1 million SARS-CoV-2 genomes reconstructed the most comprehensive evolutionary relationships of major variants, which confirmed the 68KG phylogeny and provided evolutionary origins of major and recent variants of concern.

    Availability and implementation

    TopHap is available at https://github.com/SayakaMiura/TopHap.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  2. Background

    Cardiac MR fingerprinting (cMRF) is a novel technique for simultaneous T1 and T2 mapping.

    Purpose

    To compare T1/T2 measurements, repeatability, and map quality between cMRF and standard mapping techniques in healthy subjects.

    Study Type

    Prospective.

    Population

    In all, 58 subjects (ages 18–60).

    Field Strength/Sequence

    cMRF, modified Look–Locker inversion recovery (MOLLI), and T2‐prepared balanced steady‐state free precession (bSSFP) at 1.5T.

    Assessment

    T1/T2 values were measured in 16 myocardial segments at apical, medial, and basal slice positions. Test–retest and intrareader repeatability were assessed for the medial slice. cMRF and conventional mapping sequences were compared using ordinal and two alternative forced choice (2AFC) ratings.

    Statistical Tests

    Paired t‐tests, Bland–Altman analyses, intraclass correlation coefficient (ICC), linear regression, one‐way analysis of variance (ANOVA), and binomial tests.

    Results

    Average T1 measurements were: basal 1007.4±96.5 msec (cMRF), 990.0±45.3 msec (MOLLI); medial 995.0±101.7 msec (cMRF), 995.6±59.7 msec (MOLLI); apical 1006.6±111.2 msec (cMRF), 981.6±87.6 msec (MOLLI). Average T2 measurements were: basal 40.9±7.0 msec (cMRF), 46.1±3.5 msec (bSSFP); medial 41.0±6.4 msec (cMRF), 47.4±4.1 msec (bSSFP); apical 43.5±6.7 msec (cMRF), 48.0±4.0 msec (bSSFP). A statistically significant bias (cMRF T1 larger than MOLLI T1) was observed in basal (17.4 msec) and apical (25.0 msec) slices. For T2, a statistically significant bias (cMRF lower than bSSFP) was observed for basal (–5.2 msec), medial (–6.3 msec), and apical (–4.5 msec) slices. Precision was lower for cMRF: the average of the standard deviation measured within each slice was 102 msec for cMRF vs. 61 msec for MOLLI T1, and 6.4 msec for cMRF vs. 4.0 msec for bSSFP T2. cMRF and conventional techniques had similar test–retest repeatability as quantified by ICC (0.87 cMRF vs. 0.84 MOLLI for T1; 0.85 cMRF vs. 0.85 bSSFP for T2). In the ordinal image quality comparison, cMRF maps scored higher than conventional sequences for both T1 (all five features) and T2 (four features).

    Data Conclusion

    This work reports on myocardial T1/T2 measurements in healthy subjects using cMRF and standard mapping sequences. cMRF had slightly lower precision, similar test–retest and intrareader repeatability, and higher scores for map quality.

    Evidence Level

    2

    Technical Efficacy

    Stage 1

    J. Magn. Reson. Imaging 2020;52:1044–1052.

     
  3. Abstract

    The availability of vast amounts of longitudinal data from electronic health records (EHRs) and personal wearable devices opens the door to numerous new research questions. In many studies, individual variability of a longitudinal outcome is as important as the mean. Blood pressure fluctuations, glycemic variations, and mood swings are prime examples where it is critical to identify factors that affect the within‐individual variability. We propose a scalable method, within‐subject variance estimator by robust regression (WiSER), for the estimation and inference of the effects of both time‐varying and time‐invariant predictors on within‐subject variance. It is robust against the misspecification of the conditional distribution of responses or the distribution of random effects. It shows performance similar to that of the correctly specified likelihood methods but is 10^3 to 10^5 times faster. The estimation algorithm scales linearly in the total number of observations, making it applicable to massive longitudinal data sets. The effectiveness of WiSER is evaluated in extensive simulation studies. Its broad applicability is illustrated using the accelerometry data from the Women's Health Study and a clinical trial for longitudinal diabetes care.

     
  4. Abstract

    In causal inference problems, one is often tasked with estimating causal effects which are analytically intractable functionals of the data‐generating mechanism. Relevant settings include estimating intention‐to‐treat effects in longitudinal problems with missing data or computing direct and indirect effects in mediation analysis. One approach to computing these effects is to use the g‐formula implemented via Monte Carlo integration; when simulation‐based methods such as the nonparametric bootstrap or Markov chain Monte Carlo are used for inference, Monte Carlo integration must be nested within an already computationally intensive algorithm. We develop a widely‐applicable approach to accelerating this Monte Carlo integration step which greatly reduces the computational burden of existing g‐computation algorithms. We refer to our method as accelerated g‐computation (AGC). The algorithms we present are similar in spirit to multiple imputation, but require removing within‐imputation variance from the standard error rather than adding it. We illustrate the use of AGC on a mediation analysis problem using a beta regression model and in a longitudinal clinical trial subject to nonignorable missingness using a Bayesian additive regression trees model.

     
  5. Abstract

    In metagenomic studies, testing the association between microbiome composition and clinical outcomes translates to testing the nullity of variance components. Motivated by a lung human immunodeficiency virus (HIV) microbiome project, we study longitudinal microbiome data by using variance component models with more than two variance components. Current testing strategies only apply to models with exactly two variance components and when sample sizes are large. Therefore, they are not applicable to longitudinal microbiome studies. In this paper, we propose exact tests (score test, likelihood ratio test, and restricted likelihood ratio test) to (a) test the association of the overall microbiome composition in a longitudinal design and (b) detect the association of one specific microbiome cluster while adjusting for the effects from related clusters. Our approach combines the exact tests for null hypothesis with a single variance component with a strategy of reducing multiple variance components to a single one. Simulation studies demonstrate that our method has a correct type I error rate and superior power compared to existing methods at small sample sizes and weak signals. Finally, we apply our method to a longitudinal pulmonary microbiome study of HIV‐infected patients and reveal two interesting genera, Prevotella and Veillonella, associated with forced vital capacity. Our findings shed light on the impact of the lung microbiome on HIV complexities. The method is implemented in the open‐source, high‐performance computing language Julia and is freely available at https://github.com/JingZhai63/VCmicrobiome.

     