skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Friday, September 13 until 2:00 AM ET on Saturday, September 14 due to maintenance. We apologize for the inconvenience.


Title: Proximity Search in the Greedy Tree
Over the last 50 years, there have been many data structures proposed to perform proximity search problems on metric data. Perhaps the simplest of these is the ball tree, which was independently discovered multiple times over the years. However, there is a lack of strong theoretical guarantees for standard ball trees, often leading to more complicated structures when guarantees are required. In this paper, we present the greedy tree, a simple ball tree construction for which we can prove strong theoretical guarantees for proximity search queries, matching the state of the art under reasonable assumptions. To our knowledge, this is the first ball tree construction providing such guarantees. Like a standard ball tree, it is a binary tree with the points stored in the leaves. Only a point, a radius, and an integer are stored for each node. The asymptotic running times of search algorithms in the greedy tree match those of more complicated structures regularly used in practice.  more » « less
Award ID(s):
2017980
NSF-PAR ID:
10422647
Author(s) / Creator(s):
; ; ;
Editor(s):
Telikepalli Kavitha; Kurt Mehlhorn
Date Published:
Journal Name:
SIAM Symposium on Simplicity in Algorithms
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Tree search algorithms, such as branch-and-bound, are the most widely used tools for solving combinatorial and non-convex problems. For example, they are the foremost method for solving (mixed) integer programs and constraint satisfaction problems. Tree search algorithms come with a variety of tunable parameters that are notoriously challenging to tune by hand. A growing body of research has demonstrated the power of using a data-driven approach to automatically optimize the parameters of tree search algorithms. These techniques use atraining setof integer programs sampled from an application-specific instance distribution to find a parameter setting that has strong average performance over the training set. However, with too few samples, a parameter setting may have strong average performance on the training set but poor expected performance on future integer programs from the same application. Our main contribution is to provide the firstsample complexity guaranteesfor tree search parameter tuning. These guarantees bound the number of samples sufficient to ensure that the average performance of tree search over the samples nearly matches its future expected performance on the unknown instance distribution. In particular, the parameters we analyze weightscoring rulesused for variable selection. Proving these guarantees is challenging because tree size is a volatile function of these parameters: we prove that, for any discretization (uniform or not) of the parameter space, there exists a distribution over integer programs such that every parameter setting in the discretization results in a tree with exponential expected size, yet there exist parameter settings between the discretized points that result in trees of constant size. In addition, we provide data-dependent guarantees that depend on the volatility of these tree-size functions: our guarantees improve if the tree-size functions can be well approximated by simpler functions. Finally, via experiments, we illustrate that learning an optimal weighting of scoring rules reduces tree size.

     
    more » « less
  2. null (Ed.)
    Abstract Directed acyclic graphical models are widely used to represent complex causal systems. Since the basic task of learning such a model from data is NP-hard, a standard approach is greedy search over the space of directed acyclic graphs or Markov equivalence classes of directed acyclic graphs. As the space of directed acyclic graphs on p nodes and the associated space of Markov equivalence classes are both much larger than the space of permutations, it is desirable to consider permutation-based greedy searches. Here, we provide the first consistency guarantees, both uniform and high-dimensional, of a greedy permutation-based search. This search corresponds to a simplex-like algorithm operating over the edge-graph of a subpolytope of the permutohedron, called a directed acyclic graph associahedron. Every vertex in this polytope is associated with a directed acyclic graph, and hence with a collection of permutations that are consistent with the directed acyclic graph ordering. A walk is performed on the edges of the polytope maximizing the sparsity of the associated directed acyclic graphs. We show via simulated and real data that this permutation search is competitive with current approaches. 
    more » « less
  3. Measuring corticosterone in feathers allows researchers to make long-term, retrospective assessments of physiology with non-invasive sampling. To date, there is little evidence that steroids degrade within the feather matrix, however this has yet to be determined from the same sample over many years. In 2009, we made a pool of European starling (Sturnus vulgaris) feathers that had been ground to a homogenous powder using a ball mill and stored on a laboratory bench. Over the past 14 years, a subset of this pooled sample has been assayed via radioimmunoassay (RIA) 19 times to quantify corticosterone. Despite high variability across time (though low variability within assays), there was no effect of time on measured feather corticosterone concentration. In contrast, two enzyme immunoassays (EIA) produced higher concentrations than the samples assayed with RIA, though this difference is likely due to different binding affinities of the antibodies used. The present study provides further support for researchers to use specimens stored long-term and from museums for feather corticosterone quantification, and likely applies to corticosteroid measurements in other keratinized tissues. 
    more » « less
  4. Tree search algorithms, such as branch-and-bound, are the most widely used tools for solving combinatorial and nonconvex problems. For example, they are the foremost method for solving (mixed) integer programs and constraint satisfaction problems. Tree search algorithms recursively partition the search space to find an optimal solution. In order to keep the tree size small, it is crucial to carefully decide, when expanding a tree node, which question (typically variable) to branch on at that node in order to partition the remaining space. Numerous partitioning techniques (e.g., variable selection) have been proposed, but there is no theory describing which technique is optimal. We show how to use machine learning to determine an optimal weighting of any set of partitioning procedures for the instance distribution at hand using samples from the distribution. We provide the first sample complexity guarantees for tree search algorithm configuration. These guarantees bound the number of samples sufficient to ensure that the empirical performance of an algorithm over the samples nearly matches its expected performance on the unknown instance distribution. This thorough theoretical investigation naturally gives rise to our learning algorithm. Via experiments, we show that learning an optimal weighting of partitioning procedures can dramatically reduce tree size, and we prove that this reduction can even be exponential. Through theory and experiments, we show that learning to branch is both practical and hugely beneficial. 
    more » « less
  5. Range closest-pair (RCP) search is a range-search variant of the classical closest-pair problem, which aims to store a given set S of points into some space-efficient data structure such that when a query range Q is specified, the closest pair in S∩Q can be reported quickly. RCP search has received attention over years, but the primary focus was only on R^2. In this paper, we study RCP search in higher dimensions. We give the first nontrivial RCP data structures for orthogonal, simplex, halfspace, and ball queries in R^d for any constant d. Furthermore, we prove a conditional lower bound for orthogonal RCP search for d≥3. 
    more » « less