
Award ID contains: 2054253


  1. Importance

    Body mass index (BMI; calculated as weight in kilograms divided by height in meters squared) is a commonly used estimate of obesity, which is a complex trait affected by genetic and lifestyle factors. Marked weight gain and loss could be associated with adverse biological processes.


    Objective

    To evaluate the association between BMI variability and incident cardiovascular disease (CVD) events in 2 distinct cohorts.

    Design, Setting, and Participants

    This cohort study used data from the Million Veteran Program (MVP) between 2011 and 2018 and participants in the UK Biobank (UKB) enrolled between 2006 and 2010. Participants were followed up for a median of 3.8 (5th-95th percentile, 3.5) years. Participants with baseline CVD or cancer were excluded. Data were analyzed from September 2022 to September 2023.


    Exposures

    BMI variability was calculated as the retrospective SD and coefficient of variation (CV) of multiple clinical BMI measurements up to baseline.
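    As a concrete illustration, the retrospective SD and CV can be computed directly from a patient's pre-baseline BMI history. A minimal pure-Python sketch (`bmi_variability` is a hypothetical helper, not the study's code):

```python
import statistics

def bmi_variability(bmi_history):
    """Return (SD, CV) for a list of clinical BMI measurements.

    SD is the sample standard deviation of the measurements; CV is
    the SD divided by the mean, making it dimensionless and
    comparable across patients with different mean BMI.
    """
    mean_bmi = statistics.mean(bmi_history)
    sd = statistics.stdev(bmi_history)  # sample SD (n - 1 denominator)
    cv = sd / mean_bmi
    return sd, cv

# Example: five pre-baseline BMI measurements (kg/m^2)
sd, cv = bmi_variability([28.0, 29.5, 27.2, 30.1, 28.6])
```

    Because the CV normalizes by the mean, it separates variability from mean BMI, which the models adjust for separately.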

    Main Outcomes and Measures

    The main outcome was incident composite CVD events (incident nonfatal myocardial infarction, acute ischemic stroke, and cardiovascular death), assessed using Cox proportional hazards modeling after adjustment for CVD risk factors, including age, sex, mean BMI, systolic blood pressure, total cholesterol, high-density lipoprotein cholesterol, smoking status, diabetes status, and statin use. Secondary analysis assessed whether associations were dependent on the polygenic score of BMI.
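    In a Cox proportional hazards model, the hazard ratio for a k-unit increase in a covariate is exp(k·β), so pre-scaling the exposure to SD units yields per-1-SD hazard ratios of the kind this abstract reports. A stdlib sketch with purely illustrative numbers (β and its standard error below are hypothetical, not the study's fitted values):

```python
import math

beta = 0.148  # log-hazard per 1-SD of BMI variability (hypothetical)
se = 0.013    # standard error of beta (hypothetical)

hr = math.exp(beta)                   # hazard ratio per 1-SD increase
ci_low = math.exp(beta - 1.96 * se)   # lower bound of the 95% CI
ci_high = math.exp(beta + 1.96 * se)  # upper bound of the 95% CI
```

    With these illustrative inputs the HR works out to roughly 1.16 with a 95% CI of about 1.13 to 1.19, the same scale as the MVP association reported below.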


    Results

    Among 92 363 US veterans in the MVP cohort (81 675 [88%] male; mean [SD] age, 56.7 [14.1] years), there were 9695 Hispanic participants, 22 488 non-Hispanic Black participants, and 60 180 non-Hispanic White participants. A total of 4811 composite CVD events were observed from 2011 to 2018. The CV of BMI was associated with a 16% higher risk for composite CVD across all groups (hazard ratio [HR], 1.16; 95% CI, 1.13-1.19). These associations were unchanged among subgroups and after adjustment for the polygenic score of BMI. The UKB cohort included 65 047 individuals (mean [SD] age, 57.30 [7.77] years; 38 065 [59%] female) and had 6934 composite CVD events. Each 1-SD increase in BMI variability in the UKB cohort was associated with an 8% increased risk of cardiovascular death (HR, 1.08; 95% CI, 1.04-1.11).

    Conclusions and Relevance

    This cohort study found that among US veterans, higher BMI variability was a significant risk marker associated with adverse cardiovascular events independent of mean BMI across major racial and ethnic groups. Results were consistent in the UKB for the cardiovascular death end point. Further studies should investigate the phenotype of high BMI variability.

    Free, publicly-accessible full text available March 4, 2025

  2. Objective

    To characterize high type 1 diabetes (T1D) genetic risk in a population where type 2 diabetes (T2D) predominates.


    Research Design and Methods

    Characteristics typically associated with T1D were assessed in 109,594 Million Veteran Program participants with adult-onset diabetes, 2011–2021, who had T1D genetic risk scores (GRS) defined as low (0 to <45%), medium (45 to <90%), high (90 to <95%), or highest (≥95%).
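    The four GRS bins are simple percentile cutoffs taken from the definitions above; a minimal sketch (`grs_category` is a hypothetical helper name):

```python
def grs_category(percentile):
    """Map a T1D genetic risk score percentile (0-100) to the
    study's four bins: low, medium, high, or highest."""
    if percentile < 45:
        return "low"
    elif percentile < 90:
        return "medium"
    elif percentile < 95:
        return "high"
    return "highest"
```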


    Results

    T1D characteristics increased progressively with higher genetic risk (P < 0.001 for trend). A GRS ≥ 90% was more common with diabetes diagnoses before age 40 years, but 95% of those participants were diagnosed at age ≥40 years, and they resembled T2D in mean age (64.3 years) and BMI (32.3 kg/m2). Compared with the low risk group, the highest-risk group was more likely to have diabetic ketoacidosis (low 0.9% vs. highest GRS 3.7%), hypoglycemia prompting emergency visits (3.7% vs. 5.8%), outpatient plasma glucose <50 mg/dL (7.5% vs. 13.4%), a shorter median time to start insulin (3.5 vs. 1.4 years), use of a T1D diagnostic code (16.3% vs. 28.1%), low C-peptide levels if tested (1.8% vs. 32.4%), and glutamic acid decarboxylase antibodies (6.9% vs. 45.2%), all P < 0.001.


    Conclusions

    Characteristics associated with T1D increased with higher genetic risk, especially in the top 10% of risk. However, the age and BMI of those participants resembled those of people with T2D, and a substantial proportion did not have diagnostic testing or a T1D diagnostic code. T1D genetic screening could aid identification of adult-onset T1D in settings in which T2D predominates.

    Free, publicly-accessible full text available April 12, 2025
  3. Marschall, Tobias (Ed.)
    Abstract

    Motivation

    In a genome-wide association study, analyzing multiple correlated traits simultaneously is potentially superior to analyzing the traits one by one. Standard methods for multivariate genome-wide association studies operate marker by marker and are computationally intensive.


    Results

    We present a sparsity-constrained regression algorithm for multivariate genome-wide association studies based on iterative hard thresholding and implement it in a convenient Julia package, MendelIHT.jl. In simulation studies with up to 100 quantitative traits, iterative hard thresholding achieves true positive rates similar to those of GEMMA's linear mixed models and mv-PLINK's canonical correlation analysis, with lower false positive rates and faster execution times. On UK Biobank data with 470 228 variants, MendelIHT completed a three-trait joint analysis (n = 185 656) in 20 h and an 18-trait joint analysis (n = 104 264) in 53 h with an 80 GB memory footprint. In short, MendelIHT enables geneticists to fit a single regression model that simultaneously considers the effect of all SNPs and dozens of traits.
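    The core of iterative hard thresholding is a gradient step on the least-squares loss followed by projection onto the sparsity constraint: keep the k largest-magnitude coefficients and zero the rest. A toy single-trait, pure-Python sketch of the idea (not MendelIHT.jl's actual Julia implementation, which is far more elaborate):

```python
def iht(X, y, k, step=0.01, iters=500):
    """Sparsity-constrained least squares by iterative hard thresholding.

    Alternates a gradient step on the squared-error loss with hard
    thresholding, which retains only the k largest-magnitude
    coefficients.  Toy illustration, not MendelIHT.jl itself.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        # Residual r = y - X @ beta
        r = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # Gradient step: beta += step * X^T @ r
        beta = [beta[j] + step * sum(X[i][j] * r[i] for i in range(n))
                for j in range(p)]
        # Hard threshold: zero all but the k largest |beta_j|
        keep = set(sorted(range(p), key=lambda j: -abs(beta[j]))[:k])
        beta = [beta[j] if j in keep else 0.0 for j in range(p)]
    return beta

# Toy data where y depends only on predictors 0 and 2
X = [[1.0, 0.5, 2.0], [2.0, 1.0, 0.0], [0.0, 0.3, 1.0], [1.5, 2.0, 1.0]]
y = [row[0] * 1.0 + row[2] * 2.0 for row in X]  # true beta = (1, 0, 2)
beta = iht(X, y, k=2)  # recovers approximately (1.0, 0.0, 2.0)
```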

    Availability and implementation

    Software, documentation, and scripts to reproduce our results are available from

  4. Abstract

    Allogeneic Vγ9Vδ2 (Vδ2) T cells have emerged as attractive candidates for developing cancer therapy due to their established safety in allogeneic contexts and inherent tumor-fighting capabilities. Nonetheless, the limited clinical success of Vδ2 T cell-based treatments may be attributed to donor variability, short-lived persistence, and tumor immune evasion. To address these constraints, we engineer Vδ2 T cells with enhanced attributes. By employing CD16 as a donor selection biomarker, we harness Vδ2 T cells characterized by heightened cytotoxicity and potent antibody-dependent cell-mediated cytotoxicity (ADCC) functionality. RNA sequencing analysis supports the augmented effector potential of Vδ2 T cells derived from CD16-high (CD16Hi) donors. Substantial enhancements are further achieved through CAR and IL-15 engineering methodologies. Preclinical investigations in two ovarian cancer models substantiate the effectiveness and safety of engineered CD16Hi Vδ2 T cells. These cells target tumors through multiple mechanisms, exhibit sustained in vivo persistence, and do not elicit graft-versus-host disease. These findings underscore the promise of engineered CD16Hi Vδ2 T cells as a viable therapeutic option for cancer treatment.

    Free, publicly-accessible full text available December 1, 2024
  5. Both long- and short-term glycemic variability have been associated with incident diabetes complications. We evaluated their relative and potential additive effects on incident renal complications in the Action to Control Cardiovascular Risk in Diabetes trial. A marker of short-term glycemic variability, 1,5-anhydroglucitol (1,5-AG), was measured in 4,000 random 12-month postrandomization plasma samples (when hemoglobin A1c [HbA1c] was stable). Visit-to-visit fasting plasma glucose coefficient of variation (CV-FPG) was determined from 4 months postrandomization until the end point of microalbuminuria or macroalbuminuria. Using Cox proportional hazards models, high CV-FPG and low 1,5-AG were independently associated with microalbuminuria after adjusting for clinical risk factors. However, only the CV-FPG association remained after additional adjustment for average HbA1c. Only CV-FPG was a significant risk factor for macroalbuminuria. This post hoc analysis indicates that long-term rather than short-term glycemic variability better predicts the risk of renal disease in type 2 diabetes.

    Article Highlights

    The relative and potential additive effects of long- and short-term glycemic variability on the development of diabetic complications are unknown. We aimed to assess the individual and combined relationships of long-term visit-to-visit glycemic variability, measured as the coefficient of variation of fasting plasma glucose, and short-term glucose fluctuation, estimated by the biomarker 1,5-anhydroglucitol, with the development of proteinuria. Both estimates of glycemic variability were independently associated with microalbuminuria, but only long-term glycemic variability remained significant after adjusting for average hemoglobin A1c. Our findings suggest that longer-term visit-to-visit glucose variability improves renal disease prediction in type 2 diabetes.

    Free, publicly-accessible full text available September 19, 2024

  6. Objective

    To determine the benefit of starting continuous glucose monitoring (CGM) in adult-onset type 1 diabetes (T1D) and type 2 diabetes (T2D) with regard to longer-term glucose control and serious clinical events.


    Research Design and Methods

    In a retrospective observational cohort study within the Veterans Affairs Health Care System, we compared glucose control, hypoglycemia- or hyperglycemia-related admission to an emergency room or hospital, and all-cause hospitalization over 12 months between propensity score overlap weighted initiators of CGM and nonusers.
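    Overlap weighting gives each CGM initiator the weight 1 − e(x) and each nonuser the weight e(x), where e(x) is the estimated propensity of initiating CGM; this concentrates the comparison on patients who plausibly could have gone either way. A minimal sketch (hypothetical helper; the study's propensity model itself is not shown here):

```python
def overlap_weight(initiated_cgm, propensity):
    """Propensity score overlap weight: treated units are weighted by
    1 - e(x), untreated units by e(x), down-weighting patients whose
    covariates make their treatment status nearly deterministic."""
    return (1.0 - propensity) if initiated_cgm else propensity

# A CGM initiator who was very likely to initiate is down-weighted;
# a comparable nonuser with the same propensity is up-weighted.
w_user = overlap_weight(True, 0.8)
w_nonuser = overlap_weight(False, 0.8)
```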


    Results

    CGM users receiving insulin (n = 5,015 with T1D and n = 15,706 with T2D) and similar numbers of nonusers were identified from 1 January 2015 to 31 December 2020. Declines in HbA1c were significantly greater in CGM users with T1D (−0.26%; 95% CI −0.33, −0.19%) and T2D (−0.35%; 95% CI −0.40, −0.31%) than in nonusers at 12 months. Percentages of patients achieving HbA1c <8% and <9% after 12 months were greater in CGM users. In T1D, CGM initiation was associated with significantly reduced risk of hypoglycemia (hazard ratio [HR] 0.69; 95% CI 0.48, 0.98) and all-cause hospitalization (HR 0.75; 95% CI 0.63, 0.90). In patients with T2D, there was a reduction in risk of hyperglycemia in CGM users (HR 0.87; 95% CI 0.77, 0.99) and all-cause hospitalization (HR 0.89; 95% CI 0.83, 0.97). Several subgroups (based on baseline age, HbA1c, hypoglycemic risk, or follow-up CGM use) had even greater responses.


    Conclusions

    In a large national cohort, initiation of CGM was associated with sustained improvement in HbA1c in patients with later-onset T1D and patients with T2D using insulin. This was accompanied by a clear pattern of reduced risk of admission to an emergency room or hospital for hypoglycemia or hyperglycemia and of all-cause hospitalization.

  7. Abstract

    Background

    Idiopathic pulmonary fibrosis (IPF) is a progressive, irreversible, and usually fatal lung disease of unknown cause that generally affects the elderly population. Early diagnosis of IPF is crucial for triaging patients into anti-fibrotic treatment or treatments for other causes of pulmonary fibrosis. However, the current IPF diagnosis workflow is complicated and time-consuming: it involves collaborative efforts from radiologists, pathologists, and clinicians, and is largely subject to inter-observer variability.


    Purpose

    The purpose of this work is to develop a deep learning-based automated system that can diagnose IPF among subjects with interstitial lung disease (ILD) using an axial chest computed tomography (CT) scan. This work can potentially enable timely diagnosis decisions and reduce inter-observer variability.


    Methods

    Our dataset contains CT scans from 349 IPF patients and 529 non-IPF ILD patients. We used 80% of the dataset for training and validation and held out 20% as the test set. We proposed a two-stage model: at stage one, we built a multi-scale, domain knowledge-guided attention model (MSGA), with both high- and medium-resolution attention, that encouraged the model to focus on specific areas of interest and enhanced explainability; at stage two, we fed the output of MSGA into a random forest (RF) classifier for patient-level diagnosis to further boost accuracy. The RF classifier serves as the final decision stage because it is interpretable, computationally fast, and handles correlated variables. Model utility was examined by (1) accuracy, measured as the area under the receiver operating characteristic curve (AUC) with standard deviation (SD), and (2) explainability, illustrated by visual examination of the estimated attention maps, which highlight the areas important for model diagnostics.


    Results

    During the training and validation stage, we observed that with no guidance from domain knowledge, the IPF diagnosis model reached acceptable performance (AUC±SD = 0.93±0.07) but lacked explainability; when including only guided high- or medium-resolution attention, the learned attention maps were not satisfactory; when including both high- and medium-resolution attention, under certain hyperparameter settings, the model reached the highest AUC among all experiments (AUC±SD = 0.99±0.01) and the estimated attention maps concentrated on the regions of interest for this task. The three best-performing hyperparameter selections for MSGA were applied to the holdout test set and reached model performance comparable to that of the validation set.


    Conclusions

    Our results suggest that, for a task with only scan-level labels available, MSGA+RF can use population-level domain knowledge to guide network training, increasing both model accuracy and explainability.

  8. Summary

    Nan Laird has an enormous and growing impact on computational statistics. Her paper with Dempster and Rubin on the expectation‐maximisation (EM) algorithm is the second most cited paper in statistics. Her papers and book on longitudinal modelling are nearly as impressive. In this brief survey, we revisit the derivation of some of her most useful algorithms from the perspective of the minorisation‐maximisation (MM) principle. The MM principle generalises the EM principle and frees it from the shackles of missing data and conditional expectations. Instead, the focus shifts to the construction of surrogate functions via standard mathematical inequalities. The MM principle can deliver a classical EM algorithm with less fuss or an entirely new algorithm with a faster rate of convergence. In any case, the MM principle enriches our understanding of the EM principle and suggests new algorithms of considerable potential in high‐dimensional settings where standard algorithms such as Newton's method and Fisher scoring falter.
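    The minorisation-maximisation principle described above rests on two properties of the surrogate; writing f for the objective and θₙ for the current iterate, the ascent guarantee is a two-line argument:

```latex
% Minorisation: the surrogate g lies below f everywhere and touches it
% at the current iterate \theta_n.
g(\theta \mid \theta_n) \le f(\theta) \ \ \text{for all } \theta,
\qquad
g(\theta_n \mid \theta_n) = f(\theta_n).

% Ascent property: maximising the surrogate cannot decrease f.
\theta_{n+1} = \operatorname*{arg\,max}_{\theta} g(\theta \mid \theta_n)
\;\Longrightarrow\;
f(\theta_{n+1}) \ge g(\theta_{n+1} \mid \theta_n)
\ge g(\theta_n \mid \theta_n) = f(\theta_n).
```

    The EM algorithm is the special case in which g is the conditional expectation of the complete-data log-likelihood, with Jensen's inequality supplying the minorisation.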

  9. Abstract

    Background

    Low-depth sequencing allows researchers to increase sample size at the expense of lower accuracy. To incorporate these uncertainties while maintaining statistical power, we introduce a method to analyze the population structure of low-depth sequencing data.


    The method optimizes the choice of nonlinear transformations of dosages to maximize the Ky Fan norm of the covariance matrix. The transformation incorporates the uncertainty in calling between heterozygotes and the common homozygotes for loci having a rare allele and is more linear when both variants are common.
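    The Ky Fan k-norm of a matrix is the sum of its k largest singular values; for a covariance matrix this equals the variance captured by the top k principal components, which is why maximising it sharpens the structure recovered by PCA. A toy pure-Python sketch for the symmetric 2×2 case (hypothetical helper; the method works on the full sample covariance matrix):

```python
import math

def ky_fan_norm_2x2(a, b, c, k):
    """Ky Fan k-norm of the symmetric PSD matrix [[a, b], [b, c]]:
    the sum of its k largest singular values.  For a symmetric PSD
    matrix the singular values equal the eigenvalues, which the
    2x2 closed form gives directly."""
    half_trace = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    eigvals = [half_trace + radius, half_trace - radius]  # descending
    return sum(eigvals[:k])

# Covariance matrix [[2, 1], [1, 2]] has eigenvalues 3 and 1
top1 = ky_fan_norm_2x2(2.0, 1.0, 2.0, k=1)  # 3.0
total = ky_fan_norm_2x2(2.0, 1.0, 2.0, k=2)  # 4.0 (the trace)
```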


    We apply the method to samples from two indigenous Siberian populations and accurately reveal hidden population structure using only a single chromosome. The package is available on

  10. Abstract

    Background

    Statistical geneticists employ simulation to estimate the power of proposed studies, test new analysis tools, and evaluate properties of causal models. Although there are existing trait simulators, there is ample room for modernization. For example, most phenotype simulators are limited to Gaussian traits or traits transformable to normality, while ignoring qualitative traits and realistic, non-normal trait distributions. Also, modern computer languages, such as Julia, that accommodate parallelization and cloud-based computing are now mainstream but rarely used in older applications. To meet the challenges of contemporary big studies, it is important for geneticists to adopt new computational tools.


    We present an open-source Julia package that makes it trivial to quickly simulate phenotypes under a variety of genetic architectures. The package is integrated into our OpenMendel suite for easy downstream analyses. Julia was purpose-built for scientific programming and provides tremendous speed and memory efficiency, with easy access to multi-CPU and GPU hardware and to distributed and cloud-based parallelization. The package is designed to encourage flexible trait simulation, including via the standard devices of applied statistics: generalized linear models (GLMs) and generalized linear mixed models (GLMMs). It also accommodates many study designs: unrelateds, sibships, pedigrees, or a mixture of all three. (Of course, for data with pedigrees or cryptic relationships, the simulation process must include the genetic dependencies among the individuals.) We consider an assortment of trait models and study designs to illustrate integrated simulation and analysis pipelines. Step-by-step instructions for these analyses are available in our electronic Jupyter notebooks on GitHub. These interactive notebooks are ideal for reproducible research.
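    A toy stand-in for GLM-style phenotype simulation (the helper below is hypothetical and not the package's API, which supports GLMs, GLMMs, and pedigree structure):

```python
import random

def simulate_trait(dosages, beta, intercept=0.0, sigma=1.0, seed=2024):
    """Simulate a Gaussian quantitative trait under a linear model:
    y_i = intercept + x_i . beta + e_i, with e_i ~ N(0, sigma^2).
    dosages is a list of per-individual genotype dosage vectors."""
    rng = random.Random(seed)  # seeded for reproducible simulations
    traits = []
    for x in dosages:
        mean = intercept + sum(d * b for d, b in zip(x, beta))
        traits.append(mean + rng.gauss(0.0, sigma))
    return traits

# Three individuals typed at two SNPs (dosages 0/1/2),
# with effect sizes 0.5 and -0.2 and little noise
y = simulate_trait([[0, 1], [2, 0], [1, 2]], beta=[0.5, -0.2], sigma=0.1)
```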


    The package has three main advantages. (1) It leverages the computational efficiency and ease of use of Julia to provide extremely fast, straightforward simulation of even the most complex genetic models, including GLMs and GLMMs. (2) It can be operated entirely within, but is not limited to, the integrated analysis pipeline of OpenMendel. And finally, (3) by allowing a wider range of more realistic phenotype models, it brings power calculations and diagnostic tools closer to what investigators might see in real-world analyses.
