skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: The Influence of Dimensions on the Complexity of Computing Decision Trees
A decision tree recursively splits a feature space Rd and then assigns class labels based on the resulting partition. Decision trees have been part of the basic machine- learning toolkit for decades. A large body of work treats heuristic algorithms to compute a decision tree from training data, usually aiming to minimize in particular the size of the resulting tree. In contrast, little is known about the complexity of the underlying computational problem of computing a minimum-size tree for the given training data. We study this problem with respect to the number d of dimensions of the feature space. We show that it can be solved in O(n2d+1d) time, but under reasonable complexity-theoretic assumptions it is not possible to achieve f (d) · no(d/ log d) running time, where n is the number of training examples. The problem is solvable in (dR)O(dR) · n1+o(1) time, if there are exactly two classes and R is an upper bound on the number of tree leaves labeled with the first class.  more » « less
Award ID(s):
2212130
PAR ID:
10420446
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
37th AAAI Conference on Artificial Intelligence (AAAI)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Data augmentation is widely used for training a neural network given little labeled data. A common practice of augmentation training is applying a composition of multiple transformations sequentially to the data. Existing augmentation methods such as RandAugment randomly sample from a list of pre-selected transformations, while methods such as AutoAugment apply advanced search to optimize over an augmentation set of size $k^d$, which is the number of transformation sequences of length $$d$$, given a list of $$k$$ transformations. In this paper, we design efficient algorithms whose running time complexity is much faster than the worst-case complexity of $O(k^d)$, provably. We propose a new algorithm to search for a binary tree-structured composition of $$k$$ transformations, where each tree node corresponds to one transformation. The binary tree generalizes sequential augmentations, such as the SimCLR augmentation scheme for contrastive learning. Using a top-down, recursive search procedure, our algorithm achieves a runtime complexity of $O(2^d k)$, which is much faster than $O(k^d)$ as $$k$$ increases above $$2$$. We apply our algorithm to tackle data distributions with heterogeneous subpopulations by searching for one tree in each subpopulation and then learning a weighted combination, resulting in a \emph{forest} of trees. We validate our proposed algorithms on numerous graph and image datasets, including a multi-label graph classification dataset we collected. The dataset exhibits significant variations in the sizes of graphs and their average degrees, making it ideal for studying data augmentation. We show that our approach can reduce the computation cost by 43% over existing search methods while improving performance by 4.3%. The tree structures can be used to interpret the relative importance of each transformation, such as identifying the important transformations on small vs. large graphs. 
    more » « less
  2. Kumar, Amit; Ron-Zewi, Noga (Ed.)
    Estimating the size of the union of a stream of sets S₁, S₂, …, S_M where each set is a subset of a known universe Ω is a fundamental problem in data streaming. This problem naturally generalizes the well-studied 𝖥₀ estimation problem in the streaming literature, where each set contains a single element from the universe. We consider the general case when the sets S_i can be succinctly represented and allow efficient membership, cardinality, and sampling queries (called a Delphic family of sets). A notable example in this framework is the Klee’s Measure Problem (KMP), where every set S_i is an axis-parallel rectangle in d-dimensional spaces (Ω = [Δ]^d where [Δ] := {1, … ,Δ} and Δ ∈ ℕ). Recently, Meel, Chakraborty, and Vinodchandran (PODS-21, PODS-22) designed a streaming algorithm for (ε,δ)-estimation of the size of the union of set streams over Delphic family with space and update time complexity O((log³|Ω|)/ε² ⋅ log 1/δ) and Õ((log⁴|Ω|)/ε² ⋅ log 1/(δ)), respectively. This work presents a new, sampling-based algorithm for estimating the size of the union of Delphic sets that has space and update time complexity Õ((log²|Ω|)/ε² ⋅ log 1/(δ)). This improves the space complexity bound by a log|Ω| factor and update time complexity bound by a log² |Ω| factor. A critical question is whether quadratic dependence of log|Ω| on space and update time complexities is necessary. Specifically, can we design a streaming algorithm for estimating the size of the union of sets over Delphic family with space and complexity linear in log|Ω| and update time poly(log|Ω|)? While this appears technically challenging, we show that establishing a lower bound of ω(log|Ω|) with poly(log|Ω|) update time is beyond the reach of current techniques. Specifically, we show that under certain hard-to-prove computational complexity hypothesis, there is a streaming algorithm for the problem with optimal space complexity O(log|Ω|) and update time poly(log(|Ω|)). Thus, establishing a space lower bound of ω(log|Ω|) will lead to break-through complexity class separation results. 
    more » « less
  3. Query-to-communication lifting theorems, which connect the query complexity of a Boolean function to the communication complexity of an associated `lifted' function obtained by composing the function with many copies of another function known as a gadget, have been instrumental in resolving many open questions in computational complexity. Several important complexity questions could be resolved if we could make substantial improvements in the input size required for lifting with the Index function, from its current near-linear size down to polylogarithmic in the number of inputs N of the original function or, ideally, constant. The near-linear size bound was shown by Lovett, Meka, Mertz, Pitassi and Zhang using a recent breakthrough improvement on the Sunflower Lemma to show that a certain graph associated with the Index function of near-linear size is a disperser. They also stated a conjecture about the Index function that is essential for further improvements in the size required for lifting with Index using current techniques. In this paper we prove the following; 1) The conjecture of Lovett et al. is false when the size of the Index gadget is logN−\omega(1). 2) Also, the Inner-Product function, which satisfies the disperser property at size O(logN), does not have this property when its size is log N−\omega(1). 3) Nonetheless, using Index gadgets of size at least 4, we prove a lifting theorem for a restricted class of communication protocols in which one of the players is limited to sending parities of its inputs. 4) Using the ideas from this lifting theorem, we derive a strong lifting theorem from decision tree size to parity decision tree size. We use this to derive a general lifting theorem in proof complexity from tree-resolution size to tree-like Res(\oplus) refutation size, which yields many new exponential lower bounds on such proofs. 
    more » « less
  4. Query-to-communication lifting theorems, which connect the query complexity of a Boolean function to the communication complexity of an associated ‘lifted’ function obtained by composing the function with many copies of another function known as a gadget, have been instrumental in resolving many open questions in computational complexity. A number of important complexity questions could be resolved if we could make substantial improvements in the input size required for lifting with the Index function, which is a universal gadget for lifting, from its current near-linear size down to polylogarithmic in the number of inputs N of the original function or, ideally, constant. The near-linear size bound was recently shown by Lovett, Meka, Mertz, Pitassi and Zhang using a recent breakthrough improvement on the Sunflower Lemma to show that a certain graph associated with an Index function of that size is a disperser. They also stated a conjecture about the Index function that is essential for further improvements in the size required for lifting with Index using current techniques. In this paper we prove the following; • The conjecture of Lovett et al. is false when the size of the Index gadget is less than logarithmic in N. • The same limitation applies to the Inner-Product function. More precisely, the Inner-Product function, which is known to satisfy the disperser property at size O(log N), also does not have this property when its size is less than log N. • Notwithstanding the above, we prove a lifting theorem that applies to Index gadgets of any size at least 4 and yields lower bounds for a restricted class of communication protocols in which one of the players is limited to sending parities of its inputs. • Using a modification of the same idea with improved lifting parameters we derive a strong lifting theorem from decision tree size to parity decision tree size. We use this, in turn, to derive a general lifting theorem in proof complexity from tree-resolution size to tree-like Res(⊕) refutation size, which yields many new exponential lower bounds on such proofs. 
    more » « less
  5. Query-to-communication lifting theorems, which connect the query complexity of a Boolean function to the communication complexity of an associated “lifted” function obtained by composing the function with many copies of another function known as a gadget, have been instrumental in resolving many open questions in computational complexity. A number of important complexity questions could be resolved if we could make substantial improvements in the input size required for lifting with the Index function, which is a universal gadget for lifting, from its current near-linear size down to polylogarithmic in the number of inputs N of the original function or, ideally, constant. The near-linear size bound was recently shown by Lovett, Meka, Mertz, Pitassi and Zhang [20] using a recent breakthrough improvement on the Sunflower Lemma to show that a certain graph associated with an Index function of that size is a disperser. They also stated a conjecture about the Index function that is essential for further improvements in the size required for lifting with Index using current techniques. In this paper we prove the following; - The conjecture of Lovett et al. is false when the size of the Index gadget is less than logarithmic in N . - The same limitation applies to the Inner-Product function. More precisely, the Inner-Product function, which is known to satisfy the disperser property at size O(log N ), also does not have this property when its size is less than log N . - Notwithstanding the above, we prove a lifting theorem that applies to Index gadgets of any size at least 4 and yields lower bounds for a restricted class of communication protocols in which one of the players is limited to sending parities of its inputs. - Using a modification of the same idea with improved lifting parameters we derive a strong lifting theorem from decision tree size to parity decision tree size. We use this, in turn, to derive a general lifting theorem in proof complexity from tree-resolution size to tree-like Res(⊕) refutation size, which yields many new exponential lower bounds on such proofs. 
    more » « less