skip to main content


Title: Down‐set thresholds
Abstract

We elucidate the relationship between the threshold and the expectation‐threshold of a down‐set. Qualitatively, our main result demonstrates that there exist down‐sets with polynomial gaps between their thresholds and expectation‐thresholds; in particular, the logarithmic gap predictions of Kahn–Kalai and Talagrand (recently proved by Park–Pham and Frankston–Kahn–Narayanan–Park) about up‐sets do not apply to down‐sets. Quantitatively, we show that any collection of graphs on that covers the family of all triangle‐free graphs on satisfies the inequality for some universal , and this is essentially best‐possible.

 
more » « less
Award ID(s):
2103154
PAR ID:
10419745
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Random Structures & Algorithms
Volume:
63
Issue:
2
ISSN:
1042-9832
Page Range / eLocation ID:
p. 442-456
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Genomic data are increasingly used for high resolution population genetic studies including those at the forefront of biological conservation. A key methodological challenge is determining sequence similarity clustering thresholds for RADseq data when no reference genome is available. These thresholds define the maximum permitted divergence among allelic variants and the minimum divergence among putative paralogues and are central to downstream population genomic analyses. Here we develop a novel set of metrics to determine sequence similarity thresholds that maximize the correct separation of paralogous regions and minimize oversplitting naturally occurring allelic variation within loci. These metrics empirically identify the threshold value at which true alleles at opposite ends of several major axes of genetic variation begin to incorrectly separate into distinct clusters, allowing researchers to choose thresholds just below this value. We test our approach on a recently published data set for the protected foothill yellow‐legged frog (Rana boylii). The metrics recover a consistent pattern of roughly 96% similarity as a threshold above which genetic divergence and data missingness become increasingly correlated. We provide scripts for assessing different clustering thresholds and discuss how this approach can be applied across a wide range of empirical data sets.

     
    more » « less
  2. Abstract

    Thresholds in the relationship between species richness and natural land cover can inform landscape‐level vegetation protection and restoration targets. However, landscapes differ considerably in composition and other environmental attributes. If the effect of natural land cover on species richness depends on (i.e., interacts with) these attributes, and this affects the value of thresholds in this relationship, such dependencies must be considered when using thresholds to guide landscape management.

    We hypothesized that the amount of natural land cover at which a threshold occurs would differ in predictable ways with particular anthropogenic, abiotic, and biotic attributes of landscapes. To test this, we related woodland bird species richness in 251 landscapes, each 100 km2, to natural land cover in south‐east Australia. We compared the fit of exponential and threshold models of the richness–natural land cover relationship, focussing on the extent of natural land cover at which thresholds presented among landscapes that differed in matrix land use intensity, heterogeneity, productivity, and the prevalence of strong biotic interactors. We used linear mixed modelling to examine how interactions between natural land cover and the various landscape attributes affected the fit of models of species richness.

    Threshold models of the richness–natural land cover relationship were always a better fit than exponential models. Threshold values did not vary consistently with specific landscape attributes, with the exception of landscapes that were classified by the prevalence of strong biotic interactors (hypercompetitive native birds of the genusManorina).

    Natural land cover had a more positive effect on species richness in landscapes whenManorinaprevalence was higher. This positive interaction provided the biggest improvement in explanatory power of models of species richness.

    Synthesis and applications. While we detected an interaction betweenManorinaprevalence and the area of natural land cover, generalities relating to the underlying nature of thresholds in the richness–natural land cover relationship remain elusive. Complex interactions, relating to various landscape attributes and associated ecological processes, likely underpin variation in threshold values. Until these complexities are better understood, the use of thresholds for informing landscape management and conservation target setting should be approached with caution.

     
    more » « less
  3. Many large-scale recommender systems consist of two stages. The first stage efficiently screens the complete pool of items for a small subset of promising candidates, from which the second-stage model curates the final recommendations. In this paper, we investigate how to ensure group fairness to the items in this two-stage architecture. In particular, we find that existing first-stage recommenders might select an irrecoverably unfair set of candidates such that there is no hope for the second-stage recommender to deliver fair recommendations. To this end, motivated by recent advances in uncertainty quantification, we propose two threshold-policy selection rules that can provide distribution-free and finite-sample guarantees on fairness in first-stage recommenders. More concretely, given any relevance model of queries and items and a point-wise lower confidence bound on the expected number of relevant items for each threshold-policy, the two rules find near-optimal sets of candidates that contain enough relevant items in expectation from each group of items. To instantiate the rules, we demonstrate how to derive such confidence bounds from potentially partial and biased user feedback data, which are abundant in many large-scale recommender systems. In addition, we provide both finite-sample and asymptotic analyses of how close the two threshold selection rules are to the optimal thresholds. Beyond this theoretical analysis, we show empirically that these two rules can consistently select enough relevant items from each group while minimizing the size of the candidate sets for a wide range of settings. 
    more » « less
  4. Abstract Motivation

    Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets.

    Results

    We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover’s Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upper-bound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. On simulated T-cell receptor sequences and actual Hepatitis B virus genomes, we show that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%.

    Availability and implementation

    Data and source code for reproducing the experiments are available at: https://github.com/Kingsford-Group/gtedemedtest/.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  5. Abstract

    We give polynomial time algorithms for the seminal results of Kahn, who showed that the Goldberg–Seymour and list‐coloring conjectures for (list‐)edge coloring multigraphs hold asymptotically. Kahn's arguments are based on the probabilistic method and are non‐constructive. Our key insight is that we can combine sophisticated techniques due to Achlioptas, Iliopoulos, and Kolmogorov for the analysis of local search algorithms with correlation decay properties of the probability spaces on matchings used by Kahn in order to construct efficient edge‐coloring algorithms.

     
    more » « less