Abstract In many applications of hierarchical models, there is often interest in evaluating the inherent heterogeneity in view of observed data. When the underlying hypothesis involves parameters resting on the boundary of their support space such as variances and mixture proportions, it is a usual practice to entertain testing procedures that rely on common heterogeneity assumptions. Such procedures, albeit omnibus for general alternatives, may entail a substantial loss of power for specific alternatives such as heterogeneity varying with covariates. We introduce a novel and flexible approach that uses covariate information to improve the power to detect heterogeneity, without imposing unnecessary restrictions. With continuous covariates, the approach does not impose a regression model relating heterogeneity parameters to covariates or rely on arbitrary discretizations. Instead, a scanning approach requiring continuous dichotomizations of the covariates is proposed. Empirical processes resulting from these dichotomizations are then used to construct the test statistics, with limiting null distributions shown to be functionals of tight random processes. We illustrate our proposals and results on a popular class of two-component mixture models, followed by simulation studies and applications to two real datasets in cancer and caries research.
more »
« less
Dimension constraints improve hypothesis testing for large-scale, graph-associated, brain-image data
Summary For large-scale testing with graph-associated data, we present an empirical Bayes mixture technique to score local false-discovery rates (FDRs). Compared to procedures that ignore the graph, the proposed Graph-based Mixture Model (GraphMM) method gains power in settings where non-null cases form connected subgraphs, and it does so by regularizing parameter contrasts between testing units. Simulations show that GraphMM controls the FDR in a variety of settings, though it may lose control with excessive regularization. On magnetic resonance imaging data from a study of brain changes associated with the onset of Alzheimer’s disease, GraphMM produces greater yield than conventional large-scale testing procedures.
more »
« less
- PAR ID:
- 10280952
- Date Published:
- Journal Name:
- Biostatistics
- ISSN:
- 1465-4644
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
In many real-world applications, graph-structured data used for training and testing have differences in distribution, such as in high energy physics (HEP) where simulation data used for training may not match real experiments. Graph domain adaptation (GDA) is a method used to address these differences. However, current GDA primarily works by aligning the distributions of node representations output by a single graph neural network encoder shared across the training and testing domains, which may often yield sub-optimal solutions. This work examines different impacts of distribution shifts caused by either graph structure or node attributes and identifies a new type of shift, named conditional structure shift (CSS), which current GDA approaches are provably sub-optimal to deal with. A novel approach, called structural reweighting (StruRW), is proposed to address this issue and is tested on synthetic graphs, four benchmark datasets, and a new application in HEP. StruRW has shown significant performance improvement over the baselines in the settings with large graph structure shifts, and reasonable performance improvement when node attribute shift dominates.more » « less
-
Abstract The simultaneous testing of multiple hypotheses is common to the analysis of high‐dimensional data sets. The two‐group model, first proposed by Efron, identifies significant comparisons by allocating observations to a mixture of an empirical null and an alternative distribution. In the Bayesian nonparametrics literature, many approaches have suggested using mixtures of Dirichlet Processes in the two‐group model framework. Here, we investigate employing mixtures of two‐parameter Poisson‐Dirichlet Processes instead, and show how they provide a more flexible and effective tool for large‐scale hypothesis testing. Our model further employs nonlocal prior densities to allow separation between the two mixture components. We obtain a closed‐form expression for the exchangeable partition probability function of the two‐group model, which leads to a straightforward Markov Chain Monte Carlo implementation. We compare the performance of our method for large‐scale inference in a simulation study and illustrate its use on both a prostate cancer data set and a case‐control microbiome study of the gastrointestinal tracts in children from underdeveloped countries who have been recently diagnosed with moderate‐to‐severe diarrhea.more » « less
-
Singh, Mona (Ed.)Microbial associations are characterized by both direct and indirect interactions between the constituent taxa in a microbial community, and play an important role in determining the structure, organization, and function of the community. Microbial associations can be represented using a weighted graph (microbial network) whose nodes represent taxa and edges represent pairwise associations. A microbial network is typically inferred from a sample-taxa matrix that is obtained by sequencing multiple biological samples and identifying the taxa counts in each sample. However, it is known that microbial associations are impacted by environmental and/or host factors. Thus, a sample-taxa matrix generated in a microbiome study involving a wide range of values for the environmental and/or clinical metadata variables may in fact be associated with more than one microbial network. Here we consider the problem of inferring multiple microbial networks from a given sample-taxa count matrix. Each sample is a count vector assumed to be generated by a mixture model consisting of component distributions that are Multivariate Poisson Log-Normal. We present a variational Expectation Maximization algorithm for the model selection problem to infer the correct number of components of this mixture model. Our approach involves reframing the mixture model as a latent variable model, treating only the mixing coefficients as parameters, and subsequently approximating the marginal likelihood using an evidence lower bound framework. Our algorithm is evaluated on a large simulated dataset generated using a collection of different graph structures (band, hub, cluster, random, and scale-free).more » « less
-
Microbial communities are often composed of taxa from different taxonomic groups. The associations among the constituent members in a microbial community play an important role in determining the functional characteristics of the community, and these associations can be modeled using an edge weighted graph (microbial network). A microbial network is typically inferred from a sample–taxa matrix that is obtained by sequencing multiple biological samples and identifying the taxa abundance in each sample. Motivated by microbiome studies that involve a large number of samples collected across a range of study parameters, here we consider the computational problem of identifying the number of microbial networks underlying the observed sample-taxa abundance matrix. Specifically, we consider the problem of determing the number of sparse microbial networks in this setting. We use a mixture model framework to address this problem, and present formulations to model both count data and proportion data. We propose several variational approximation based algorithms that allow the incorporation of the sparsity constraint while estimating the number of components in the mixture model. We evaluate these algorithms on a large number of simulated datasets generated using a collection of different graph structures (band, hub, cluster, random, and scale-free).more » « less