skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Proximity Search in the Greedy Tree
Over the last 50 years, there have been many data structures proposed to perform proximity search problems on metric data. Perhaps the simplest of these is the ball tree, which was independently discovered multiple times over the years. However, there is a lack of strong theoretical guarantees for standard ball trees, often leading to more complicated structures when guarantees are required. In this paper, we present the greedy tree, a simple ball tree construction for which we can prove strong theoretical guarantees for proximity search queries, matching the state of the art under reasonable assumptions. To our knowledge, this is the first ball tree construction providing such guarantees. Like a standard ball tree, it is a binary tree with the points stored in the leaves. Only a point, a radius, and an integer are stored for each node. The asymptotic running times of search algorithms in the greedy tree match those of more complicated structures regularly used in practice.  more » « less
Award ID(s):
2017980
PAR ID:
10422647
Author(s) / Creator(s):
; ; ;
Editor(s):
Telikepalli Kavitha; Kurt Mehlhorn
Date Published:
Journal Name:
SIAM Symposium on Simplicity in Algorithms
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Tree search algorithms, such as branch-and-bound, are the most widely used tools for solving combinatorial and non-convex problems. For example, they are the foremost method for solving (mixed) integer programs and constraint satisfaction problems. Tree search algorithms come with a variety of tunable parameters that are notoriously challenging to tune by hand. A growing body of research has demonstrated the power of using a data-driven approach to automatically optimize the parameters of tree search algorithms. These techniques use atraining setof integer programs sampled from an application-specific instance distribution to find a parameter setting that has strong average performance over the training set. However, with too few samples, a parameter setting may have strong average performance on the training set but poor expected performance on future integer programs from the same application. Our main contribution is to provide the firstsample complexity guaranteesfor tree search parameter tuning. These guarantees bound the number of samples sufficient to ensure that the average performance of tree search over the samples nearly matches its future expected performance on the unknown instance distribution. In particular, the parameters we analyze weightscoring rulesused for variable selection. Proving these guarantees is challenging because tree size is a volatile function of these parameters: we prove that, for any discretization (uniform or not) of the parameter space, there exists a distribution over integer programs such that every parameter setting in the discretization results in a tree with exponential expected size, yet there exist parameter settings between the discretized points that result in trees of constant size. In addition, we provide data-dependent guarantees that depend on the volatility of these tree-size functions: our guarantees improve if the tree-size functions can be well approximated by simpler functions. Finally, via experiments, we illustrate that learning an optimal weighting of scoring rules reduces tree size. 
    more » « less
  2. null (Ed.)
    Abstract Directed acyclic graphical models are widely used to represent complex causal systems. Since the basic task of learning such a model from data is NP-hard, a standard approach is greedy search over the space of directed acyclic graphs or Markov equivalence classes of directed acyclic graphs. As the space of directed acyclic graphs on p nodes and the associated space of Markov equivalence classes are both much larger than the space of permutations, it is desirable to consider permutation-based greedy searches. Here, we provide the first consistency guarantees, both uniform and high-dimensional, of a greedy permutation-based search. This search corresponds to a simplex-like algorithm operating over the edge-graph of a subpolytope of the permutohedron, called a directed acyclic graph associahedron. Every vertex in this polytope is associated with a directed acyclic graph, and hence with a collection of permutations that are consistent with the directed acyclic graph ordering. A walk is performed on the edges of the polytope maximizing the sparsity of the associated directed acyclic graphs. We show via simulated and real data that this permutation search is competitive with current approaches. 
    more » « less
  3. The kd-tree is one of the most widely used data structures to manage multi-dimensional data. Due to the ever-growing data volume, it is imperative to consider parallelism in kd-trees. However, we observed challenges in existing parallel kd-tree implementations, for both constructions and updates. The goal of this paper is to develop efficient in-memory kd-trees by supporting high parallelism and cache-efficiency. We propose the Pkd-tree (Parallel kd-tree), a parallel kd-tree that is efficient both in theory and in practice. The Pkd-tree supports parallel tree construction, batch update (insertion and deletion), and various queries including k-nearest neighbor search, range query, and range count. We proved that our algorithms have strong theoretical bounds in work (sequential time complexity), span (parallelism), and cache complexity. Our key techniques include 1) an efficient construction algorithm that optimizes work, span, and cache complexity simultaneously, and 2) reconstruction-based update algorithms that guarantee the tree to be weight-balanced. With the new algorithmic insights and careful engineering effort, we achieved a highly optimized implementation of the Pkd-tree. We tested Pkd-tree with various synthetic and real-world datasets, including both uniform and highly skewed data. We compare the Pkd-tree with state-of-the-art parallel kd-tree implementations. In all tests, with better or competitive query performance, Pkd-tree is much faster in construction and updates consistently than all baselines. We released our code. 
    more » « less
  4. Measuring corticosterone in feathers allows researchers to make long-term, retrospective assessments of physiology with non-invasive sampling. To date, there is little evidence that steroids degrade within the feather matrix, however this has yet to be determined from the same sample over many years. In 2009, we made a pool of European starling (Sturnus vulgaris) feathers that had been ground to a homogenous powder using a ball mill and stored on a laboratory bench. Over the past 14 years, a subset of this pooled sample has been assayed via radioimmunoassay (RIA) 19 times to quantify corticosterone. Despite high variability across time (though low variability within assays), there was no effect of time on measured feather corticosterone concentration. In contrast, two enzyme immunoassays (EIA) produced higher concentrations than the samples assayed with RIA, though this difference is likely due to different binding affinities of the antibodies used. The present study provides further support for researchers to use specimens stored long-term and from museums for feather corticosterone quantification, and likely applies to corticosteroid measurements in other keratinized tissues. 
    more » « less
  5. A growing line of work shows how learned predictions can be used to break through worst-cast barriers to improve the running time of an algorithm. However, incorporating predictions into data structures with strong theoretical guarantees remains underdeveloped. This paper takes a step in this direction by showing that predictions can be leveraged in the fundamental online list labeling problem. In the problem, n items arrive over time and must be stored in sorted order in an array of size Θ(n). The array slot of an element is its label and the goal is to maintain sorted order while minimizing the total number of elements moved (i.e., relabeled). We design a new list labeling data structure and bound its performance in two models. In the worst-case learning-augmented model, we give guarantees in terms of the error in the predictions. Our data structure provides strong theoretical guarantees— it is optimal for any prediction error and guarantees the best-known worst-case bound even when the predictions are entirely erroneous. We also consider a stochastic error model and bound the performance in terms of the expectation and variance of the error. Finally, the theoretical results are demonstrated empirically. In particular, we show that our data structure performs well on numerous real datasets, including temporal datasets where predictions are constructed from elements that arrived in the past (as is typically done in a practical use case). 
    more » « less