Title: Inference on the History of a Randomly Growing Tree
Abstract: The spread of infectious disease in a human community or the proliferation of fake news on social media can be modelled as a randomly growing tree-shaped graph. The history of the random growth process is often unobserved but contains important information such as the source of the infection. We consider the problem of statistical inference on aspects of the latent history using only a single snapshot of the final tree. Our approach is to apply random labels to the observed unlabelled tree and analyse the resulting distribution of the growth process, conditional on the final outcome. We show that this conditional distribution is tractable under a shape exchangeability condition, which we introduce here, and that this condition is satisfied for many popular models for randomly growing trees such as uniform attachment, linear preferential attachment and uniform attachment on a D-regular tree. For inference of the root under shape exchangeability, we propose O(n log n) time algorithms for constructing confidence sets with valid frequentist coverage as well as bounds on the expected size of the confidence sets. We also provide efficient sampling algorithms which extend our methods to a wide class of inference problems.
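As an illustration of the growth models named in the abstract, the following is a minimal sketch (not the authors' code) that simulates a uniform attachment tree and a linear preferential attachment tree. The function name grow_tree, the model labels "ua" and "lpa", and the convention that attachment probability under "lpa" is proportional to current degree are assumptions made for this example; the paper's inference algorithms are not reproduced here.

```python
import random


def grow_tree(n, model="ua", seed=None):
    """Grow a tree on nodes 0..n-1, one node at a time.

    Returns a list `parent` where parent[k] is the node that k attached to
    on arrival, and parent[0] == -1 marks the root (the first node).

    model="ua":  uniform attachment, the parent is uniform over existing nodes.
    model="lpa": linear preferential attachment, the parent is chosen with
                 probability proportional to its current degree (assumed
                 convention for this sketch).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if model not in ("ua", "lpa"):
        raise ValueError("model must be 'ua' or 'lpa'")

    rng = random.Random(seed)
    parent = [-1]
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = []
    for k in range(1, n):
        if model == "ua" or not endpoints:
            # With a single existing node, both models attach to node 0.
            p = rng.randrange(k)
        else:
            p = rng.choice(endpoints)
        parent.append(p)
        endpoints.extend((p, k))
    return parent


def unlabelled_snapshot(parent, seed=None):
    """Hide the arrival order: return the edge list under a random relabelling."""
    rng = random.Random(seed)
    n = len(parent)
    relabel = list(range(n))
    rng.shuffle(relabel)
    edges = [(relabel[k], relabel[parent[k]]) for k in range(1, n)]
    rng.shuffle(edges)
    return edges
```

The second helper mimics the observation setting described above, in which only a single snapshot of the final tree is available and the arrival order (and hence the root) is hidden.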
Award ID(s):
2113671
PAR ID:
10398642
Author(s) / Creator(s):
;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Volume:
83
Issue:
4
ISSN:
1369-7412
Format(s):
Medium: X
Size(s):
p. 639-668
Sponsoring Org:
National Science Foundation
More Like this
  1. Many statistical models for networks overlook the fact that most real-world networks are formed through a growth process. To address this, we introduce the Preferential Attachment Plus Erdős–Rényi model, where we let a random network G be the union of a preferential attachment (PA) tree T and additional Erdős–Rényi (ER) random edges. The PA tree captures the underlying growth process of a network where vertices/edges are added sequentially, while the ER component can be regarded as noise. Given only one snapshot of the final network G, we study the problem of constructing confidence sets for the root node of the unobserved growth process; the root node can be patient zero in an infection network or the source of fake news in a social network. We propose inference algorithms based on Gibbs sampling that scale to networks with millions of nodes and provide theoretical analysis showing that the size of the confidence set is small if the noise level of the ER edges is not too large. We also propose variations of the model in which multiple growth processes occur simultaneously, reflecting the growth of multiple communities; we use these models to provide a new approach to community detection.
  2. Construction of tight confidence sets and intervals is central to statistical inference and decision making. This paper develops new theory showing minimum average volume confidence sets for categorical data. More precisely, consider an empirical distribution p̂ generated from n iid realizations of a random variable that takes one of k possible values according to an unknown distribution p. This is analogous to a single draw from a multinomial distribution. A confidence set is a subset of the probability simplex that depends on p̂ and contains the unknown p with a specified confidence. This paper shows how one can construct minimum average volume confidence sets. The optimality of the sets translates to improved sample complexity for adaptive machine learning algorithms that rely on confidence sets, regions and intervals.
  3. We study the inference of network archaeology in growing random geometric graphs. We consider the root finding problem for a random nearest neighbor tree in dimension d∈N, generated by sequentially embedding vertices uniformly at random in the d-dimensional torus and connecting each new vertex to the nearest existing vertex. More precisely, given an error parameter ε>0 and the unlabeled tree, we want to efficiently find a small set of candidate vertices, such that the root is included in this set with probability at least 1−ε. We call such a candidate set a confidence set. We define several variations of the root finding problem in geometric settings (embedded, metric, and graph root finding), which differ in the type of metric information provided in addition to the graph structure (torus embedding, edge lengths, or no additional information, respectively). We show that there exist efficient root finding algorithms for embedded and metric root finding. For embedded root finding, we derive upper and lower bounds (uniformly bounded in n) on the size of the confidence set: the upper bound is subpolynomial in 1/ε and stems from an explicit efficient algorithm, and the information-theoretic lower bound is polylogarithmic in 1/ε. In particular, in d=1, we obtain matching upper and lower bounds for a confidence set of size Θ(log(1/ε)loglog(1/ε)).
  4. A flexible conformal inference method is developed to construct confidence intervals for the frequencies of queried objects in very large data sets, based on a much smaller sketch of those data. The approach is data-adaptive and requires no knowledge of the data distribution or of the details of the sketching algorithm; instead, it constructs provably valid frequentist confidence intervals under the sole assumption of data exchangeability. Although our solution is broadly applicable, this paper focuses on applications involving the count-min sketch algorithm and a non-linear variation thereof. The performance is compared to that of frequentist and Bayesian alternatives through simulations and experiments with data sets of SARS-CoV-2 DNA sequences and classic English literature. 
  5. Species tree inference from multilocus data has emerged as a powerful paradigm in the postgenomic era, both in terms of the accuracy of the species tree it produces as well as in terms of elucidating the processes that shaped the evolutionary history. Bayesian methods for species tree inference are desirable in this area as they have been shown not only to yield accurate estimates, but also to naturally provide measures of confidence in those estimates. However, the heavy computational requirements of Bayesian inference have limited the applicability of such methods to very small data sets. In this article, we show that the computational efficiency of Bayesian inference under the multispecies coalescent can be improved in practice by restricting the space of the gene trees explored during the random walk, without sacrificing accuracy as measured by various metrics. The idea is to first infer constraints on the trees of the individual loci in the form of unresolved gene trees, and then to restrict the sampler to consider only resolutions of the constrained trees. We demonstrate the improvements gained by such an approach on both simulated and biological data.
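The random nearest neighbor tree described in item 3 of the list above can be sketched in a few lines. This is an illustrative simulation only, assuming the unit torus [0,1)^d with Euclidean torus distance; the names torus_dist2 and nearest_neighbor_tree are hypothetical, the quadratic-time nearest-neighbor search is chosen for brevity, and none of the root finding algorithms from that work are implemented.

```python
import random


def torus_dist2(x, y):
    """Squared Euclidean distance between two points on the unit torus."""
    total = 0.0
    for a, b in zip(x, y):
        d = abs(a - b)
        d = min(d, 1.0 - d)  # wrap around in each coordinate
        total += d * d
    return total


def nearest_neighbor_tree(n, d=1, seed=None):
    """Embed n vertices uniformly at random in [0,1)^d, one at a time.

    Each new vertex connects to the nearest vertex already present.
    Returns (points, parent) with parent[0] == -1 for the root.
    """
    if n < 1 or d < 1:
        raise ValueError("n and d must be at least 1")
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(d))]
    parent = [-1]
    for k in range(1, n):
        x = tuple(rng.random() for _ in range(d))
        # Brute-force search over the k existing vertices.
        p = min(range(k), key=lambda j: torus_dist2(points[j], x))
        points.append(x)
        parent.append(p)
    return points, parent
```

In the terminology of that item, returning points alongside parent corresponds to the embedded setting; discarding points and keeping only the edges corresponds to the graph setting.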