Title: Monte Carlo goodness-of-fit tests for degree corrected and related stochastic blockmodels
Abstract We construct Bayesian and frequentist finite-sample goodness-of-fit tests for three different variants of the stochastic blockmodel for network data. Since all of the stochastic blockmodel variants are log-linear in form when block assignments are known, the tests for the latent block model versions combine a block membership estimator with the algebraic statistics machinery for testing goodness-of-fit in log-linear models. We describe Markov bases and marginal polytopes of the variants of the stochastic blockmodel and discuss how both facilitate the development of goodness-of-fit tests and understanding of model behaviour. The general testing methodology developed here extends to any finite mixture of log-linear models on discrete data, and as such is the first application of the algebraic statistics machinery to latent-variable models.
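As a concrete illustration of the algebraic-statistics machinery the abstract builds on, the sketch below runs a Monte Carlo goodness-of-fit test for the simplest log-linear model, independence in a two-way contingency table, whose Markov basis consists of the classical 2x2 "swap" moves; the blockmodel tests in the paper use richer bases and a block-membership estimator. All function names and defaults here are illustrative, not from the paper.

```python
# Hedged sketch: Monte Carlo goodness-of-fit test for the independence
# log-linear model on a two-way table, using the classical Markov basis
# of 2x2 swap moves (which preserve all row and column margins).
import random

def chi_sq(table, expected):
    return sum((o - e) ** 2 / e
               for row_o, row_e in zip(table, expected)
               for o, e in zip(row_o, row_e))

def expected_counts(table):
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return [[r * c / n for c in cols] for r in rows]

def markov_gof_test(table, n_steps=20000, seed=0):
    """Monte Carlo p-value for independence via a Markov-basis walk."""
    rng = random.Random(seed)
    cur = [list(r) for r in table]
    exp = expected_counts(table)   # margins are invariant along the chain
    t_obs = chi_sq(table, exp)
    I, J = len(table), len(table[0])
    hits = 0
    for _ in range(n_steps):
        (i1, i2), (j1, j2) = rng.sample(range(I), 2), rng.sample(range(J), 2)
        a, b = cur[i1][j1], cur[i2][j2]    # cells the move decrements
        c, d = cur[i1][j2], cur[i2][j1]    # cells the move increments
        if a > 0 and b > 0:
            # Metropolis ratio targeting the hypergeometric null
            if rng.random() < a * b / ((c + 1) * (d + 1)):
                cur[i1][j1], cur[i2][j2] = a - 1, b - 1
                cur[i1][j2], cur[i2][j1] = c + 1, d + 1
        if chi_sq(cur, exp) >= t_obs:
            hits += 1
    return hits / n_steps
```

A table concentrated on the diagonal yields a small p-value, while a perfectly balanced table yields p = 1, since its chi-square statistic is the fiber minimum.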
Award ID(s):
1947919
PAR ID:
10463118
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Volume:
86
Issue:
1
ISSN:
1369-7412
Format(s):
Medium: X
Size(s):
p. 90-121
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract In this paper, we consider data consisting of multiple networks, each composed of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multiview network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a stochastic block model, is there an association between the latent community memberships of the nodes in the two networks? To answer this question, we extend the stochastic block model for a single network view to the two-view setting, and develop a new hypothesis test for the null hypothesis that the latent community memberships in the two data views are independent. We apply our test to protein–protein interaction data from the HINT database. We find evidence of a weak association between the latent community memberships of proteins defined with respect to binary interaction data and the latent community memberships of proteins defined with respect to co-complex association data. We also extend this proposal to the setting of a network with node covariates. The proposed methods extend readily to three or more network/multivariate data views.
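A lighter-weight cousin of such a test is easy to sketch: given community labels estimated separately in each view, permute one label vector to simulate the independence null. This is not the authors' likelihood-based test, only an illustrative analogue; all names here are mine.

```python
# Hedged sketch (not the authors' test): a permutation test for
# association between community labels estimated in two network views.
import random

def cross_tab(a, b, k1, k2):
    t = [[0] * k2 for _ in range(k1)]
    for x, y in zip(a, b):
        t[x][y] += 1
    return t

def chi_sq_stat(a, b, k1, k2):
    n = len(a)
    t = cross_tab(a, b, k1, k2)
    rows = [sum(r) for r in t]
    cols = [sum(c) for c in zip(*t)]
    s = 0.0
    for i in range(k1):
        for j in range(k2):
            e = rows[i] * cols[j] / n
            if e > 0:
                s += (t[i][j] - e) ** 2 / e
    return s

def perm_test(a, b, k1, k2, n_perm=2000, seed=0):
    rng = random.Random(seed)
    t_obs = chi_sq_stat(a, b, k1, k2)
    b = list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(b)   # break any link between the two label vectors
        if chi_sq_stat(a, b, k1, k2) >= t_obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Identical label vectors give a tiny p-value; labels arranged to be exactly balanced across views give p = 1.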
  2. Abstract Environmental decisions with substantial social and environmental implications are regularly informed by model predictions, incurring inevitable uncertainty. The selection of a set of model predictions to inform a decision is usually based on model performance, measured by goodness-of-fit metrics. Yet goodness-of-fit metrics have a questionable relationship to a model's value to end users, particularly when validation data are themselves uncertain. For example, decisions based on flow frequency models are not necessarily improved by adopting models with the best overall goodness of fit. We propose an alternative model evaluation approach based on the conditional value of sample information, first defined in 1961, which has found extensive use in sampling design optimization but which has not previously been used for model evaluation. The metric uses observations from a validation set to estimate the expected monetary costs associated with model prediction uncertainties. A model is only considered superior to alternatives if (i) its predictions reduce these costs and (ii) sufficient validation data are available to distinguish its performance from alternative models. By describing prediction uncertainties in monetary terms, the metric facilitates the communication of prediction uncertainty to end users, supporting the inclusion of uncertainty analysis in decision making.
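To make the contrast with goodness-of-fit metrics concrete, the sketch below scores models by the expected monetary cost of their prediction errors on a validation set. The asymmetric cost function is an invented example (under-prediction of a design flow treated as costlier than over-prediction), not the paper's, and the names are mine.

```python
# Hedged sketch: comparing models by expected monetary cost of
# prediction errors, rather than by goodness of fit. The cost
# asymmetry (under_cost > over_cost) is an illustrative assumption.

def error_cost(pred, obs, under_cost=10.0, over_cost=1.0):
    gap = obs - pred
    # under-prediction (obs above pred) is penalized more heavily
    return under_cost * gap if gap > 0 else over_cost * (-gap)

def expected_cost(preds, obs):
    return sum(error_cost(p, o) for p, o in zip(preds, obs)) / len(obs)
```

With observations [100, 120, 90], a model predicting [95, 115, 85] (errors of 5, the better fit) incurs expected cost 50, while a model predicting [110, 130, 100] (errors of 10) incurs only 10: the better-fitting model is the costlier one, which is the point the abstract makes.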
  3. ABSTRACT A random algebraic graph is defined by a group G equipped with the uniform distribution over it and a connection σ : G → [0, 1] whose expectation over G is p. The random graph with vertex set [n] is formed as follows. First, n latent variables x_1, …, x_n are sampled independently and uniformly from G. Then, vertices i and j are connected with probability σ(x_i x_j^{-1}). This model captures random geometric graphs over the sphere, torus, and hypercube; certain instances of the stochastic block model; and random subgraphs of Cayley graphs. The main question of interest to the current paper is: when is a random algebraic graph statistically and/or computationally distinguishable from the Erdős–Rényi graph G(n, p)? Our results fall into two categories. (1) Geometric. We focus on the hypercube case G = {±1}^d and use Fourier-analytic tools, matching and extending results from the prior literature: for hard-threshold connections we match known detection thresholds, and for Lipschitz connections we extend earlier results to the non-monotone setting. (2) Algebraic. We provide evidence for an exponential statistical-computational gap. Consider any finite group G and let A ⊆ G be a random symmetric subset formed by including each pair of the form {g, g^{-1}} independently with probability 1/2. Let D be the distribution of random graphs formed by taking a uniformly random induced subgraph of size n of the Cayley graph Cay(G, A). Then D and G(n, 1/2) are statistically indistinguishable with high probability over A if and only if the group G is sufficiently large relative to n, whereas low-degree polynomial tests already fail to distinguish the two for far smaller groups.
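The generative model above is easy to make concrete for the hypercube case. The sketch below samples a random algebraic graph over G = {±1}^d with a hard-threshold connection, which is a random geometric graph on the hypercube; the function name and defaults are mine, not from the paper. Since every element of {±1}^d is its own inverse, x_i x_j^{-1} is the coordinatewise product, and thresholding its coordinate sum is thresholding the inner product of the latent vectors.

```python
# Hedged sketch: sampling a random algebraic graph over {+1,-1}^d with
# a hard-threshold connection (one instance the abstract mentions).
import random

def sample_rag(n, d, tau, seed=0):
    rng = random.Random(seed)
    xs = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            # group operation is coordinatewise product, so the
            # connection depends only on the inner product <x_i, x_j>
            inner = sum(a * b for a, b in zip(xs[i], xs[j]))
            if inner >= tau:   # hard-threshold connection sigma
                edges.add((i, j))
    return xs, edges
```

Setting tau = 0 gives edge density near 1/2 plus a small central-binomial correction; tau > d gives the empty graph.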
  4. Stein’s method compares probability distributions through the study of a class of linear operators called Stein operators. While mainly studied in probability and used to underpin theoretical statistics, Stein’s method has led to significant advances in computational statistics in recent years. The goal of this survey is to bring together some of these recent developments, and in doing so, to stimulate further research into the successful field of Stein’s method and statistics. The topics we discuss include tools to benchmark and compare sampling methods such as approximate Markov chain Monte Carlo, deterministic alternatives to sampling methods, control variate techniques, parameter estimation and goodness-of-fit testing. 
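One emblematic tool from this literature is the kernel Stein discrepancy, which benchmarks a sample against a target distribution using only the target's score function. A minimal one-dimensional sketch for a standard normal target (score s(t) = -t) with the inverse-multiquadric kernel k(x, y) = (1 + (x - y)^2)^(-1/2), under my own naming:

```python
# Hedged sketch: kernel Stein discrepancy for a standard normal target.
def stein_kernel(x, y):
    d = x - y
    u = 1.0 + d * d
    # k0 = dxdy k + s(y) dx k + s(x) dy k + s(x) s(y) k, with s(t) = -t;
    # the derivative terms of the IMQ kernel are written out explicitly
    return u ** -1.5 - 3 * d * d * u ** -2.5 - d * d * u ** -1.5 + x * y * u ** -0.5

def ksd_squared(sample):
    # V-statistic estimate: average of the Stein kernel over all pairs
    n = len(sample)
    return sum(stein_kernel(x, y) for x in sample for y in sample) / n ** 2
```

A well-placed sample from the target gives a KSD near zero (up to the O(1/n) diagonal bias of the V-statistic), while a shifted sample is flagged with a much larger value.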
  5. Various goodness-of-fit tests are designed based on the so-called information matrix equivalence: if the assumed model is correctly specified, two information matrices that are derived from the likelihood function are equivalent. In the literature, this principle has been established for the likelihood function with fully observed data, but it has not been verified under the likelihood for censored data. In this manuscript, we prove the information matrix equivalence in the framework of semiparametric copula models for multivariate censored survival data. Based on this equivalence, we propose an information ratio (IR) test for the specification of the copula function. The IR statistic is constructed by comparing consistent estimates of the two information matrices. We derive the asymptotic distribution of the IR statistic and propose a parametric bootstrap procedure for finite-sample p-value calculation. The performance of the IR test is investigated via a simulation study and a real data example.
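The parametric bootstrap step in such tests follows a generic recipe worth spelling out: fit the null model, simulate datasets from the fit, refit and recompute the statistic on each, and report the fraction at least as extreme as the observed value. The sketch below uses a toy coefficient-of-variation statistic for an exponential null in place of the IR statistic; all names are illustrative, not from the manuscript.

```python
# Hedged sketch: generic parametric-bootstrap p-value, with a toy
# statistic (the exponential distribution has coefficient of
# variation exactly 1, so |CV - 1| measures departure from it).
import random

def bootstrap_pvalue(data, fit, simulate, statistic, B=999, seed=0):
    rng = random.Random(seed)
    theta = fit(data)
    t_obs = statistic(data, theta)
    hits = 0
    for _ in range(B):
        boot = simulate(theta, len(data), rng)   # data under the null fit
        if statistic(boot, fit(boot)) >= t_obs:  # refit on each replicate
            hits += 1
    return (hits + 1) / (B + 1)

def fit_exp(data):                        # MLE for the exponential mean
    return sum(data) / len(data)

def sim_exp(mean, n, rng):
    return [rng.expovariate(1.0 / mean) for _ in range(n)]

def cv_stat(data, theta):
    m = sum(data) / len(data)
    s = (sum((x - m) ** 2 for x in data) / len(data)) ** 0.5
    return abs(s / m - 1.0)
```

Data shaped like exponential quantiles pass (large p-value); a uniform grid, whose CV is far from 1, is rejected.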