skip to main content


Title: Approximation Algorithms for D -optimal Design
Experimental design is a classical statistics problem, and its aim is to estimate an unknown vector from linear measurements where a Gaussian noise is introduced in each measurement. For the combinatorial experimental design problem, the goal is to pick a subset of experiments so as to make the most accurate estimate of the unknown parameters. In this paper, we will study one of the most robust measures of error estimation—the D-optimality criterion, which corresponds to minimizing the volume of the confidence ellipsoid for the estimation error. The problem gives rise to two natural variants depending on whether repetitions of experiments are allowed or not. We first propose an approximation algorithm with a 1/e-approximation for the D-optimal design problem with and without repetitions, giving the first constant-factor approximation for the problem. We then analyze another sampling approximation algorithm and prove that it is asymptotically optimal. Finally, for D-optimal design with repetitions, we study a different algorithm proposed by the literature and show that it can improve this asymptotic approximation ratio. All the sampling algorithms studied in this paper are shown to admit polynomial-time deterministic implementations.  more » « less
Award ID(s):
1910423
NSF-PAR ID:
10300810
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Mathematics of Operations Research
Volume:
45
Issue:
4
ISSN:
0364-765X
Page Range / eLocation ID:
1512 to 1534
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Experimental design is a classical area in statistics and has also found new applications in machine learning. In the combinatorial experimental design problem, the aim is to estimate an unknown m-dimensional vector x from linear measurements where a Gaussian noise is introduced in each measurement. The goal is to pick k out of the given n experiments so as to make the most accurate estimate of the unknown parameter x. Given a set S of chosen experiments, the most likelihood estimate x0 can be obtained by a least squares computation. One of the robust measures of error estimation is the D-optimality criterion which aims to minimize the generalized variance of the estimator. This corresponds to minimizing the volume of the standard confidence ellipsoid for the estimation error x − x0. The problem gives rise to two natural variants depending on whether repetitions of experiments is allowed or not. The latter variant, while being more general, has also found applications in geographical location of sensors. We show a close connection between approximation algorithms for the D-optimal design problem and constructions of approximately m-wise positively correlated distributions. This connection allows us to obtain first approximation algorithms for the D-optimal design problem with and without repetitions. We then consider the case when the number of experiments chosen is much larger than the dimension m and show one can obtain asymptotically optimal algorithms in this case. 
    more » « less
  2. We study the A-optimal design problem where we are given vectors υ1, …, υn ∊ ℝd, an integer k ≥ d, and the goal is to select a set S of k vectors that minimizes the trace of (∑i∊Svivi⊺)−1. Traditionally, the problem is an instance of optimal design of experiments in statistics [35] where each vector corresponds to a linear measurement of an unknown vector and the goal is to pick k of them that minimize the average variance of the error in the maximum likelihood estimate of the vector being measured. The problem also finds applications in sensor placement in wireless networks [22], sparse least squares regression [8], feature selection for k-means clustering [9], and matrix approximation [13, 14, 5]. In this paper, we introduce proportional volume sampling to obtain improved approximation algorithms for A-optimal design. Given a matrix, proportional volume sampling involves picking a set of columns S of size k with probability proportional to µ(S) times det(∑i∊Svivi⊺) for some measure µ. Our main result is to show the approximability of the A-optimal design problem can be reduced to approximate independence properties of the measure µ. We appeal to hardcore distributions as candidate distributions µ that allow us to obtain improved approximation algorithms for the A-optimal design. Our results include a d-approximation when k = d, an (1 + ∊)-approximation when and -approximation when repetitions of vectors are allowed in the solution. We also consider generalization of the problem for k ≤ d and obtain a k-approximation. We also show that the proportional volume sampling algorithm gives approximation algorithms for other optimal design objectives (such as D-optimal design [36] and generalized ratio objective [27]) matching or improving previous best known results. Interestingly, we show that a similar guarantee cannot be obtained for the E-optimal design problem. We also show that the A-optimal design problem is NP-hard to approximate within a fixed constant when k = d. 
    more » « less
  3. We consider the problem of designing sublinear time algorithms for estimating the cost of minimum] metric traveling salesman (TSP) tour. Specifically, given access to a n × n distance matrix D that specifies pairwise distances between n points, the goal is to estimate the TSP cost by performing only sublinear (in the size of D) queries. For the closely related problem of estimating the weight of a metric minimum spanning tree (MST), it is known that for any epsilon > 0, there exists an O^~(n/epsilon^O(1))-time algorithm that returns a (1+epsilon)-approximate estimate of the MST cost. This result immediately implies an O^~(n/epsilon^O(1)) time algorithm to estimate the TSP cost to within a (2 + epsilon) factor for any epsilon > 0. However, no o(n^2)-time algorithms are known to approximate metric TSP to a factor that is strictly better than 2. On the other hand, there were also no known barriers that rule out existence of (1 + epsilon)-approximate estimation algorithms for metric TSP with O^~ (n) time for any fixed epsilon > 0. In this paper, we make progress on both algorithms and lower bounds for estimating metric TSP cost. On the algorithmic side, we first consider the graphic TSP problem where the metric D corresponds to shortest path distances in a connected unweighted undirected graph. We show that there exists an O^~(n) time algorithm that estimates the cost of graphic TSP to within a factor of (2 − epsilon_0) for some epsilon_0 > 0. This is the first sublinear cost estimation algorithm for graphic TSP that achieves an approximation factor less than 2. We also consider another well-studied special case of metric TSP, namely, (1, 2)-TSP where all distances are either 1 or 2, and give an O^~(n ^ 1.5) time algorithm to estimate optimal cost to within a factor of 1.625. Our estimation algorithms for graphic TSP as well as for (1, 2)-TSP naturally lend themselves to O^~(n) space streaming algorithms that give an 11/6-approximation for graphic TSP and a 1.625-approximation for (1, 2)-TSP. These results motivate the natural question if analogously to metric MST, for any epsilon > 0, (1 + epsilon)-approximate estimates can be obtained for graphic TSP and (1, 2)-TSP using O^~ (n) queries. We answer this question in the negative – there exists an epsilon_0 > 0, such that any algorithm that estimates the cost of graphic TSP ((1, 2)-TSP) to within a (1 + epsilon_0)-factor, necessarily requires (n^2) queries. This lower bound result highlights a sharp separation between the metric MST and metric TSP problems. Similarly to many classical approximation algorithms for TSP, our sublinear time estimation algorithms utilize subroutines for estimating the size of a maximum matching in the underlying graph. We show that this is not merely an artifact of our approach, and that for any epsilon > 0, any algorithm that estimates the cost of graphic TSP or (1, 2)-TSP to within a (1 + epsilon)-factor, can also be used to estimate the size of a maximum matching in a bipartite graph to within an epsilon n additive error. This connection allows us to translate known lower bounds for matching size estimation in various models to similar lower bounds for metric TSP cost estimation. 
    more » « less
  4. We study optimal design problems in which the goal is to choose a set of linear measurements to obtain the most accurate estimate of an unknown vector. We study the [Formula: see text]-optimal design variant where the objective is to minimize the average variance of the error in the maximum likelihood estimate of the vector being measured. We introduce the proportional volume sampling algorithm to obtain nearly optimal bounds in the asymptotic regime when the number [Formula: see text] of measurements made is significantly larger than the dimension [Formula: see text] and obtain the first approximation algorithms whose approximation factor does not degrade with the number of possible measurements when [Formula: see text] is small. The algorithm also gives approximation guarantees for other optimal design objectives such as [Formula: see text]-optimality and the generalized ratio objective, matching or improving the previously best-known results. We further show that bounds similar to ours cannot be obtained for [Formula: see text]-optimal design and that [Formula: see text]-optimal design is NP-hard to approximate within a fixed constant when [Formula: see text]. 
    more » « less
  5. Since Rendle and Krichene argued that commonly used sampling-based evaluation metrics are “inconsistent” with respect to the global metrics (even in expectation), there have been a few studies on the sampling-based recommender system evaluation. Existing methods try either mapping the sampling-based metrics to their global counterparts or more generally, learning the empirical rank distribution to estimate the top-K metrics. However, despite existing efforts, there is still a lack of rigorous theoretical understanding of the proposed metric estimators, and the basic item sampling also suffers from the “blind spot” issue, i.e., estimation accuracy to recover the top-K metrics when K is small can still be rather substantial. In this paper, we provide an in-depth investigation into these problems and make two innovative contributions. First, we propose a new item-sampling estimator that explicitly optimizes the error with respect to the ground truth, and theoretically highlights its subtle difference against prior work. Second, we propose a new adaptive sampling method that aims to deal with the “blind spot” problem and also demonstrate the expectation-maximization (EM) algorithm can be generalized for such a setting. Our experimental results confirm our statistical analysis and the superiority of the proposed works. This study helps lay the theoretical foundation for adopting item sampling metrics for recommendation evaluation and provides strong evidence for making item sampling a powerful and reliable tool for recommendation evaluation. 
    more » « less