Experimental design is a classical area in statistics and has also found new applications in machine learning. In the combinatorial experimental design problem, the aim is to estimate an unknown mdimensional vector x from linear measurements where a Gaussian noise is introduced in each measurement. The goal is to pick k out of the given n experiments so as to make the most accurate estimate of the unknown parameter x. Given a set S of chosen
experiments, the most likelihood estimate x0 can be obtained by a least squares computation. One of the robust measures of error estimation is the Doptimality criterion which aims to minimize the generalized variance of the estimator. This corresponds to minimizing the volume of the standard confidence ellipsoid for the estimation error x − x0. The problem gives rise to two natural variants depending on whether repetitions of experiments is allowed or not. The latter variant, while being more general, has also found
applications in geographical location of sensors.
We show a close connection between approximation algorithms for the Doptimal design problem and constructions of approximately mwise positively correlated distributions.
This connection allows us to obtain first approximation algorithms for the Doptimal design problem with and without repetitions. We then consider the case when the number of experiments chosen is much larger than the dimension m and show one can obtain asymptotically optimal algorithms in this case.
more »
« less
Approximation Algorithms for D optimal Design
Experimental design is a classical statistics problem, and its aim is to estimate an unknown vector from linear measurements where a Gaussian noise is introduced in each measurement. For the combinatorial experimental design problem, the goal is to pick a subset of experiments so as to make the most accurate estimate of the unknown parameters. In this paper, we will study one of the most robust measures of error estimation—the Doptimality criterion, which corresponds to minimizing the volume of the confidence ellipsoid for the estimation error. The problem gives rise to two natural variants depending on whether repetitions of experiments are allowed or not. We first propose an approximation algorithm with a 1/eapproximation for the Doptimal design problem with and without repetitions, giving the first constantfactor approximation for the problem. We then analyze another sampling approximation algorithm and prove that it is asymptotically optimal. Finally, for Doptimal design with repetitions, we study a different algorithm proposed by the literature and show that it can improve this asymptotic approximation ratio. All the sampling algorithms studied in this paper are shown to admit polynomialtime deterministic implementations.
more »
« less
 Award ID(s):
 1910423
 NSFPAR ID:
 10300810
 Date Published:
 Journal Name:
 Mathematics of Operations Research
 Volume:
 45
 Issue:
 4
 ISSN:
 0364765X
 Page Range / eLocation ID:
 1512 to 1534
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We study the Aoptimal design problem where we are given vectors υ1, …, υn ∊ ℝd, an integer k ≥ d, and the goal is to select a set S of k vectors that minimizes the trace of (∑i∊Svivi⊺)−1. Traditionally, the problem is an instance of optimal design of experiments in statistics [35] where each vector corresponds to a linear measurement of an unknown vector and the goal is to pick k of them that minimize the average variance of the error in the maximum likelihood estimate of the vector being measured. The problem also finds applications in sensor placement in wireless networks [22], sparse least squares regression [8], feature selection for kmeans clustering [9], and matrix approximation [13, 14, 5]. In this paper, we introduce proportional volume sampling to obtain improved approximation algorithms for Aoptimal design. Given a matrix, proportional volume sampling involves picking a set of columns S of size k with probability proportional to µ(S) times det(∑i∊Svivi⊺) for some measure µ. Our main result is to show the approximability of the Aoptimal design problem can be reduced to approximate independence properties of the measure µ. We appeal to hardcore distributions as candidate distributions µ that allow us to obtain improved approximation algorithms for the Aoptimal design. Our results include a dapproximation when k = d, an (1 + ∊)approximation when and approximation when repetitions of vectors are allowed in the solution. We also consider generalization of the problem for k ≤ d and obtain a kapproximation. We also show that the proportional volume sampling algorithm gives approximation algorithms for other optimal design objectives (such as Doptimal design [36] and generalized ratio objective [27]) matching or improving previous best known results. Interestingly, we show that a similar guarantee cannot be obtained for the Eoptimal design problem. We also show that the Aoptimal design problem is NPhard to approximate within a fixed constant when k = d.more » « less

We consider the problem of designing sublinear time algorithms for estimating the cost of minimum] metric traveling salesman (TSP) tour. Specifically, given access to a n × n distance matrix D that specifies pairwise distances between n points, the goal is to estimate the TSP cost by performing only sublinear (in the size of D) queries. For the closely related problem of estimating the weight of a metric minimum spanning tree (MST), it is known that for any epsilon > 0, there exists an O^~(n/epsilon^O(1))time algorithm that returns a (1+epsilon)approximate estimate of the MST cost. This result immediately implies an O^~(n/epsilon^O(1)) time algorithm to estimate the TSP cost to within a (2 + epsilon) factor for any epsilon > 0. However, no o(n^2)time algorithms are known to approximate metric TSP to a factor that is strictly better than 2. On the other hand, there were also no known barriers that rule out existence of (1 + epsilon)approximate estimation algorithms for metric TSP with O^~ (n) time for any fixed epsilon > 0. In this paper, we make progress on both algorithms and lower bounds for estimating metric TSP cost. On the algorithmic side, we first consider the graphic TSP problem where the metric D corresponds to shortest path distances in a connected unweighted undirected graph. We show that there exists an O^~(n) time algorithm that estimates the cost of graphic TSP to within a factor of (2 − epsilon_0) for some epsilon_0 > 0. This is the first sublinear cost estimation algorithm for graphic TSP that achieves an approximation factor less than 2. We also consider another wellstudied special case of metric TSP, namely, (1, 2)TSP where all distances are either 1 or 2, and give an O^~(n ^ 1.5) time algorithm to estimate optimal cost to within a factor of 1.625. Our estimation algorithms for graphic TSP as well as for (1, 2)TSP naturally lend themselves to O^~(n) space streaming algorithms that give an 11/6approximation for graphic TSP and a 1.625approximation for (1, 2)TSP. These results motivate the natural question if analogously to metric MST, for any epsilon > 0, (1 + epsilon)approximate estimates can be obtained for graphic TSP and (1, 2)TSP using O^~ (n) queries. We answer this question in the negative – there exists an epsilon_0 > 0, such that any algorithm that estimates the cost of graphic TSP ((1, 2)TSP) to within a (1 + epsilon_0)factor, necessarily requires (n^2) queries. This lower bound result highlights a sharp separation between the metric MST and metric TSP problems. Similarly to many classical approximation algorithms for TSP, our sublinear time estimation algorithms utilize subroutines for estimating the size of a maximum matching in the underlying graph. We show that this is not merely an artifact of our approach, and that for any epsilon > 0, any algorithm that estimates the cost of graphic TSP or (1, 2)TSP to within a (1 + epsilon)factor, can also be used to estimate the size of a maximum matching in a bipartite graph to within an epsilon n additive error. This connection allows us to translate known lower bounds for matching size estimation in various models to similar lower bounds for metric TSP cost estimation.more » « less

We study optimal design problems in which the goal is to choose a set of linear measurements to obtain the most accurate estimate of an unknown vector. We study the [Formula: see text]optimal design variant where the objective is to minimize the average variance of the error in the maximum likelihood estimate of the vector being measured. We introduce the proportional volume sampling algorithm to obtain nearly optimal bounds in the asymptotic regime when the number [Formula: see text] of measurements made is significantly larger than the dimension [Formula: see text] and obtain the first approximation algorithms whose approximation factor does not degrade with the number of possible measurements when [Formula: see text] is small. The algorithm also gives approximation guarantees for other optimal design objectives such as [Formula: see text]optimality and the generalized ratio objective, matching or improving the previously bestknown results. We further show that bounds similar to ours cannot be obtained for [Formula: see text]optimal design and that [Formula: see text]optimal design is NPhard to approximate within a fixed constant when [Formula: see text].more » « less

Since Rendle and Krichene argued that commonly used samplingbased evaluation metrics are “inconsistent” with respect to the global metrics (even in expectation), there have been a few studies on the samplingbased recommender system evaluation. Existing methods try either mapping the samplingbased metrics to their global counterparts or more generally, learning the empirical rank distribution to estimate the topK metrics. However, despite existing efforts, there is still a lack of rigorous theoretical understanding of the proposed metric estimators, and the basic item sampling also suffers from the “blind spot” issue, i.e., estimation accuracy to recover the topK metrics when K is small can still be rather substantial. In this paper, we provide an indepth investigation into these problems and make two innovative contributions. First, we propose a new itemsampling estimator that explicitly optimizes the error with respect to the ground truth, and theoretically highlights its subtle difference against prior work. Second, we propose a new adaptive sampling method that aims to deal with the “blind spot” problem and also demonstrate the expectationmaximization (EM) algorithm can be generalized for such a setting. Our experimental results confirm our statistical analysis and the superiority of the proposed works. This study helps lay the theoretical foundation for adopting item sampling metrics for recommendation evaluation and provides strong evidence for making item sampling a powerful and reliable tool for recommendation evaluation.more » « less