skip to main content

Title: Bayesian inference of networks across multiple sample groups and data types
Summary In this article, we develop a graphical modeling framework for the inference of networks across multiple sample groups and data types. In medical studies, this setting arises whenever a set of subjects, which may be heterogeneous due to differing disease stage or subtype, is profiled across multiple platforms, such as metabolomics, proteomics, or transcriptomics data. Our proposed Bayesian hierarchical model first links the network structures within each platform using a Markov random field prior to relate edge selection across sample groups, and then links the network similarity parameters across platforms. This enables joint estimation in a flexible manner, as we make no assumptions on the directionality of influence across the data types or the extent of network similarity across the sample groups and platforms. In addition, our model formulation allows the number of variables and number of subjects to differ across the data types, and only requires that we have data for the same set of groups. We illustrate the proposed approach through both simulation studies and an application to gene expression levels and metabolite abundances on subjects with varying severity levels of chronic obstructive pulmonary disease. Bayesian inference; Chronic obstructive pulmonary disease (COPD); Data integration; Gaussian graphical model; more » Markov random field prior; Spike and slab prior. « less
Authors:
; ; ; ; ; ; ;
Award ID(s):
1811568 1811445
Publication Date:
NSF-PAR ID:
10282715
Journal Name:
Biostatistics
Volume:
21
Issue:
3
Page Range or eLocation-ID:
561 to 576
ISSN:
1468-4357
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    Network-based analyses of high-throughput genomics data provide a holistic, systems-level understanding of various biological mechanisms for a common population. However, when estimating multiple networks across heterogeneous sub-populations, varying sample sizes pose a challenge in the estimation and inference, as network differences may be driven by differences in power. We are particularly interested in addressing this challenge in the context of proteomic networks for related cancers, as the number of subjects available for rare cancer (sub-)types is often limited.

    Results

    We develop NExUS (Network Estimation across Unequal Sample sizes), a Bayesian method that enables joint learning of multiple networks while avoiding artefactual relationship between sample size and network sparsity. We demonstrate through simulations that NExUS outperforms existing network estimation methods in this context, and apply it to learn network similarity and shared pathway activity for groups of cancers with related origins represented in The Cancer Genome Atlas (TCGA) proteomic data.

    Availability and implementation

    The NExUS source code is freely available for download at https://github.com/priyamdas2/NExUS.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  2. Abstract

    Entity resolution (record linkage or deduplication) is the process of identifying and linking duplicate records in databases. In this paper, we propose a Bayesian graphical approach for entity resolution that links records to latent entities, where the prior representation on the linkage structure is exchangeable. First, we adopt a flexible and tractable set of priors for the linkage structure, which corresponds to a special class of random partition models. Second, we propose a more realistic distortion model for categorical/discrete record attributes, which corrects a logical inconsistency with the standard hit-miss model. Third, we incorporate hyperpriors to improve flexibility. Fourth, we employ a partially collapsed Gibbs sampler for inferential speedups. Using a selection of private and nonprivate data sets, we investigate the impact of our modeling contributions and compare our model with two alternative Bayesian models. In addition, we conduct a simulation study for household survey data, where we vary distortion, duplication rates and data set size. We find that our model performs more consistently than the alternatives across a variety of scenarios and typically achieves the highest entity resolution accuracy (F1 score). Open source software is available for our proposed methodology, and we provide a discussion regarding our workmore »and future directions.

    « less
  3. Abstract Motivation

    Protein function prediction, based on the patterns of connection in a protein–protein interaction (or association) network, is perhaps the most studied of the classical, fundamental inference problems for biological networks. A highly successful set of recent approaches use random walk-based low-dimensional embeddings that tend to place functionally similar proteins into coherent spatial regions. However, these approaches lose valuable local graph structure from the network when considering only the embedding. We introduce GLIDER, a method that replaces a protein–protein interaction or association network with a new graph-based similarity network. GLIDER is based on a variant of our previous GLIDE method, which was designed to predict missing links in protein–protein association networks, capturing implicit local and global (i.e. embedding-based) graph properties.

    Results

    GLIDER outperforms competing methods on the task of predicting GO functional labels in cross-validation on a heterogeneous collection of four human protein–protein association networks derived from the 2016 DREAM Disease Module Identification Challenge, and also on three different protein–protein association networks built from the STRING database. We show that this is due to the strong functional enrichment that is present in the local GLIDER neighborhood in multiple different types of protein–protein association networks. Furthermore, we introduce the GLIDER graph neighborhoodmore »as a way for biologists to visualize the local neighborhood of a disease gene. As an application, we look at the local GLIDER neighborhoods of a set of known Parkinson’s Disease GWAS genes, rediscover many genes which have known involvement in Parkinson’s disease pathways, plus suggest some new genes to study.

    Availability and implementation

    All code is publicly available and can be accessed here: https://github.com/kap-devkota/GLIDER.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

    « less
  4. Chronic obstructive pulmonary disease (COPD) is the third leading cause of death worldwide with over 3 × 106deaths in 2019. Such an alarming figure becomes frightening when combined with the number of lost lives resulting from COVID-caused respiratory failure. Because COPD exacerbations identified early can commonly be treated at home, early symptom detections may enable a major reduction of COPD patient readmission and associated healthcare costs; this is particularly important during pandemics such as COVID-19 in which healthcare facilities are overwhelmed. The standard adjuncts used to assess lung function (e.g., spirometry, plethysmography, and CT scan) are expensive, time consuming, and cannot be used in remote patient monitoring of an acute exacerbation. In this paper, a wearable multi-modal system for breathing analysis is presented, which can be used in quantifying various airflow obstructions. The wearable multi-modal electroacoustic system employs a body area sensor network with each sensor-node having a multi-modal sensing capability, such as a digital stethoscope, electrocardiogram monitor, thermometer, and goniometer. The signal-to-noise ratio (SNR) of the resulting acoustic spectrum is used as a measure of breathing intensity. The results are shown from data collected from over 35 healthy subjects and 3 COPD subjects, demonstrating a positive correlation of SNR values tomore »the health-scale score.

    « less
  5. Alfonso, Valencia (Ed.)
    Abstract Motivation There is growing interest in the biomedical research community to incorporate retrospective data, available in healthcare systems, to shed light on associations between different biomarkers. Understanding the association between various types of biomedical data, such as genetic, blood biomarkers, imaging, etc. can provide a holistic understanding of human diseases. To formally test a hypothesized association between two types of data in Electronic Health Records (EHRs), one requires a substantial sample size with both data modalities to achieve a reasonable power. Current association test methods only allow using data from individuals who have both data modalities. Hence, researchers cannot take advantage of much larger EHR samples that includes individuals with at least one of the data types, which limits the power of the association test. Results We present a new method called the Semi-paired Association Test (SAT) that makes use of both paired and unpaired data. In contrast to classical approaches, incorporating unpaired data allows SAT to produce better control of false discovery and to improve the power of the association test. We study the properties of the new test theoretically and empirically, through a series of simulations and by applying our method on real studies in the contextmore »of Chronic Obstructive Pulmonary Disease. We are able to identify an association between the high-dimensional characterization of Computed Tomography chest images and several blood biomarkers as well as the expression of dozens of genes involved in the immune system. Availability and implementation Code is available on https://github.com/batmanlab/Semi-paired-Association-Test. Supplementary information Supplementary data are available at Bioinformatics online.« less