Abstract The availability of vast amounts of longitudinal data from electronic health records (EHRs) and personal wearable devices opens the door to numerous new research questions. In many studies, individual variability of a longitudinal outcome is as important as the mean. Blood pressure fluctuations, glycemic variations, and mood swings are prime examples where it is critical to identify factors that affect the within‐individual variability. We propose a scalable method, within‐subject variance estimator by robust regression (WiSER), for the estimation and inference of the effects of both time‐varying and time‐invariant predictors on within‐subject variance. It is robust against misspecification of the conditional distribution of responses or the distribution of random effects. Its performance is similar to that of correctly specified likelihood methods, but it is 10³ to 10⁵ times faster. The estimation algorithm scales linearly in the total number of observations, making it applicable to massive longitudinal data sets. The effectiveness of WiSER is evaluated in extensive simulation studies. Its broad applicability is illustrated using the accelerometry data from the Women's Health Study and a clinical trial for longitudinal diabetes care.
Bag of little bootstraps for massive and distributed longitudinal data
Abstract Linear mixed models are widely used for analyzing longitudinal datasets, and the inference for variance component parameters relies on the bootstrap method. However, health systems and technology companies routinely generate massive longitudinal datasets that make the traditional bootstrap method infeasible. To solve this problem, we extend the highly scalable bag of little bootstraps method for independent data to longitudinal data and develop a highly efficient Julia package, MixedModelsBLB.jl. Simulation experiments and real data analysis demonstrate the favorable statistical performance and computational advantages of our method compared to the traditional bootstrap method. For the statistical inference of variance components, it achieves a 200-fold speedup on the scale of 1 million subjects (20 million total observations), and is the only currently available tool that can handle more than 10 million subjects (200 million total observations) using desktop computers.
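To make the idea in the abstract concrete, the following is a minimal Python sketch of a bag of little bootstraps (BLB) applied to subject-level (clustered) data. It is not the MixedModelsBLB.jl API and not the authors' implementation: all function names and parameters (`simulate`, `within_stats`, `blb_interval`, the 0.7 subset exponent) are hypothetical, and a simple pooled moment estimator of the within-subject variance stands in for a full linear mixed model fit. The sketch assumes the usual BLB recipe: draw small subsets of subjects, resample each subset with multinomial weights that sum to the full number of subjects, form a percentile interval per subset, and average the interval endpoints.

```python
import random
import statistics


def simulate(n_subjects, n_obs, sigma_b, sigma_e, rng):
    """Simulate y_ij = b_i + e_ij with a random intercept per subject."""
    data = []
    for _ in range(n_subjects):
        b = rng.gauss(0.0, sigma_b)
        data.append([b + rng.gauss(0.0, sigma_e) for _ in range(n_obs)])
    return data


def within_stats(y):
    """Return (within-subject sum of squares, degrees of freedom)."""
    m = statistics.fmean(y)
    return sum((v - m) ** 2 for v in y), len(y) - 1


def multinomial_counts(n, k, rng):
    """Draw multinomial(n; 1/k, ..., 1/k) counts by n uniform draws."""
    counts = [0] * k
    for _ in range(n):
        counts[rng.randrange(k)] += 1
    return counts


def blb_interval(data, n_subsets=5, subset_size=None, n_boot=100,
                 alpha=0.05, seed=0):
    """Percentile interval for the within-subject variance via BLB.

    Subjects (not single observations) are the resampling unit, so the
    within-subject correlation is preserved in every replicate.
    """
    rng = random.Random(seed)
    n = len(data)
    b = subset_size or max(2, int(n ** 0.7))  # assumed subset size n^0.7
    lowers, uppers = [], []
    for _ in range(n_subsets):
        subset = rng.sample(range(n), b)  # subjects drawn without replacement
        stats = [within_stats(data[i]) for i in subset]
        estimates = []
        for _ in range(n_boot):
            # Weights sum to n, mimicking a full-size bootstrap sample
            # while only touching the b subjects in this subset.
            w = multinomial_counts(n, b, rng)
            ss = sum(wi * s for wi, (s, _) in zip(w, stats))
            df = sum(wi * d for wi, (_, d) in zip(w, stats))
            estimates.append(ss / df)
        estimates.sort()
        lowers.append(estimates[int((alpha / 2) * n_boot)])
        uppers.append(estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))])
    # Average the per-subset interval endpoints.
    return statistics.fmean(lowers), statistics.fmean(uppers)


if __name__ == "__main__":
    data = simulate(1000, 5, sigma_b=1.0, sigma_e=2.0, rng=random.Random(1))
    print(blb_interval(data))
```

Because each replicate only evaluates the estimator on the distinct subjects in one small subset, the cost per replicate depends on the subset size rather than on the full data size, which is the property that makes the approach attractive for massive or distributed data.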
- PAR ID: 10446128
- Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
- Date Published:
- Journal Name: Statistical Analysis and Data Mining: The ASA Data Science Journal
- Volume: 15
- Issue: 3
- ISSN: 1932-1864
- Page Range / eLocation ID: p. 314-321
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract Population dynamics play a central role in the historical and current development of fundamental and applied ecological science. The nascent culture of open data promises to increase the value of population dynamics studies to the field of ecology. However, synthesis of population data is constrained by the difficulty in identifying relevant datasets, by the heterogeneity of available data and by access to raw (as opposed to aggregated or derived) observations. To obviate these issues, we built a relational database, popler, and its R client, the library popler. popler accommodates the vast majority of population data under a common structure, and without the need for aggregating raw observations. The popler R library is designed for users unfamiliar with the structure of the database and with the SQL language. This R library allows users to identify, download, explore and cite datasets salient to their needs. We implemented popler as a PostgreSQL instance, where we stored population data originated by the United States Long Term Ecological Research (LTER) Network. Our focus on the US LTER data aims to leverage the potential of this vast open data resource. The database currently contains 305 datasets from 25 LTER sites. popler is designed to accommodate automatic updates of existing datasets, and to accommodate additional datasets from LTER as well as non‐LTER studies. The combination of the online database and the R library popler is a resource for data synthesis efforts in population ecology. The common structure of popler simplifies comparative analyses, and the availability of raw data confers flexibility in data analysis. The popler R library maximizes these opportunities by providing a user‐friendly interface to the online database.
-
Abstract The production mechanism for terrestrial gamma ray flashes (TGFs) is not entirely understood, and details of the corresponding lightning activity and thunderstorm charge structure have yet to be fully characterized. Here we examine sub‐microsecond VHF (14–88 MHz) radio interferometer observations of a 247‐kA peak‐current EIP, or energetic in‐cloud pulse, a reliable radio signature of a subset of TGFs. The EIP consisted of three high‐amplitude sferic pulses lasting ≃60 μs in total, which peaked during the second (main) pulse. The EIP occurred during a normal‐polarity intracloud lightning flash that was highly unusual, in that the initial upward negative leader was particularly fast propagating and discharged a highly concentrated region of upper‐positive storm charge. The flash was initiated by a high‐power (46 kW) narrow bipolar event (NBE), and the EIP occurred about 3 ms later after ≃3 km upward flash development. The EIP was preceded ≃200 μs earlier by a fast (6 × 10⁶ m/s) upward negative breakdown and immediately preceded and accompanied by repeated sequences of fast (10⁷–10⁸ m/s) downward then upward streamer events each lasting 10 to 20 μs, which repeatedly discharged a large volume of positive charge. Although the repeated streamer sequences appeared to be a characteristic feature of the EIP and were presumably involved in initiating it, the EIP sferic evolved independently of VHF‐producing activity, supporting the idea that the sferic was produced by relativistic discharge currents. Moreover, the relativistic currents during the main sferic pulse initiated a strong NBE‐like event comparable in VHF power (115 kW) to the highest‐power NBEs.
-
spOccupancy: An R package for single‐species, multi‐species, and integrated spatial occupancy models
Abstract Occupancy modelling is a common approach to assess species distribution patterns, while explicitly accounting for false absences in detection–nondetection data. Numerous extensions of the basic single‐species occupancy model exist to model multiple species, spatial autocorrelation and to integrate multiple data types. However, development of specialized and computationally efficient software to incorporate such extensions, especially for large datasets, is scarce or absent. We introduce the spOccupancy R package designed to fit single‐species and multi‐species spatially explicit occupancy models. We fit all models within a Bayesian framework using Pólya‐Gamma data augmentation, which results in fast and efficient inference. spOccupancy provides functionality for data integration of multiple single‐species detection–nondetection datasets via a joint likelihood framework. The package leverages Nearest Neighbour Gaussian Processes to account for spatial autocorrelation, which enables spatially explicit occupancy modelling for potentially massive datasets (e.g. 1,000s–100,000s of sites). spOccupancy provides user‐friendly functions for data simulation, model fitting, model validation (by posterior predictive checks), model comparison (using information criteria and k‐fold cross‐validation) and out‐of‐sample prediction. We illustrate the package's functionality via a vignette, simulated data analysis and two bird case studies. The spOccupancy package provides a user‐friendly platform to fit a variety of single and multi‐species occupancy models, making it straightforward to address detection biases and spatial autocorrelation in species distribution models even for large datasets.
-
Abstract Aim: The International Tree‐Ring Data Bank (ITRDB) is the most comprehensive database of tree growth. To evaluate its usefulness and improve its accessibility to the broad scientific community, we aimed to: (a) quantify its biases, (b) assess how well it represents global forests, (c) develop tools to identify priority areas to improve its representativity, and (d) make available the corrected database. Location: Worldwide. Time period: Contributed datasets between 1974 and 2017. Major taxa studied: Trees. Methods: We identified and corrected formatting issues in all individual datasets of the ITRDB. We then calculated the representativity of the ITRDB with respect to species, spatial coverage, climatic regions, elevations, need for data update, climatic limitations on growth, vascular plant diversity, and associated animal diversity. We combined these metrics into a global Priority Sampling Index (PSI) to highlight ways to improve ITRDB representativity. Results: Our refined dataset provides access to a network of >52 million growth data points worldwide. We found, however, that the database is dominated by trees from forests with low diversity, in semi‐arid climates, coniferous species, and in western North America. Conifers represented 81% of the ITRDB and even in well‐sampled areas, broadleaves were poorly represented. Our PSI stressed the need to increase the database diversity in terms of broadleaf species and identified poorly represented regions that require scientific attention. Great gains will be made by increasing research and data sharing in African, Asian, and South American forests. Main conclusions: The extensive data and coverage of the ITRDB show great promise to address macroecological questions. To achieve this, however, we have to overcome the significant gaps in the representativity of the ITRDB. A strategic and organized group effort is required, and we hope the tools and data provided here can guide the efforts to improve this invaluable database.
