The feedback arc set problem is one of the most fundamental and well-studied ranking problems, in which n objects are to be ordered based on their pairwise comparisons. The problem admits several efficient approximation algorithms in the offline setting. In the online setting, unfortunately, strong lower bounds on the competitive ratio establish that no algorithm can perform well in the worst case. This paper introduces a new beyond-worst-case model for online feedback arc set. In this model, a sample of the input is given to the algorithm offline before the remaining instance is revealed online. This captures the practical situation where yesterday's data is available and is similar to today's online instance. The sample is drawn from a known distribution, which need not be uniform. We design an online algorithm with strong theoretical guarantees: it achieves a small constant competitive ratio when the sample is uniform, and when it is not, we show that the same guarantee can be recovered by adding a provably minimal sample. Empirical results validate the theory and show that such algorithms can be used on temporal data to obtain strong results.
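For reference on the objective being optimized, here is a minimal sketch in Python: rank objects by their number of pairwise wins and count the comparisons that disagree with the resulting order (the feedback arcs). Sorting by wins is a classic offline heuristic for tournaments, not the paper's sample-based online algorithm; the function names and the dictionary encoding of comparisons are illustrative choices.

```python
def rank_by_wins(n, beats):
    """Order n objects by their number of pairwise wins.
    `beats[(i, j)] == True` means object i beat object j.
    This is a simple offline baseline, not the paper's algorithm."""
    wins = [sum(beats[(i, j)] for j in range(n) if j != i) for i in range(n)]
    return sorted(range(n), key=lambda i: -wins[i])

def backward_arcs(order, beats):
    """Count comparisons violated by the ranking (the feedback arc objective)."""
    pos = {v: k for k, v in enumerate(order)}
    return sum(1 for (i, j), b in beats.items() if b and pos[i] > pos[j])

# Example: a 3-cycle (0 beats 1, 1 beats 2, 2 beats 0), so any
# ranking must violate at least one comparison.
beats = {(0, 1): True, (1, 0): False,
         (1, 2): True, (2, 1): False,
         (2, 0): True, (0, 2): False}
order = rank_by_wins(3, beats)
print(order, backward_arcs(order, beats))
```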
Worst-Case Analysis for Randomly Collected Data
We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements [n] = {1, ..., n} with corresponding data values x_1, ..., x_n. We observe the values for a "sample" set A ⊂ [n] and wish to estimate some statistic of the values for a "target" set B ⊂ [n], where B could be the entire set. Crucially, we assume that the sets A and B are drawn according to some known distribution P over pairs of subsets of [n]. A given estimation algorithm is evaluated based on its "worst-case, expected error," where the expectation is with respect to the distribution P from which the sample set A and target set B are drawn, and the worst case is with respect to the data values x_1, ..., x_n. Within this framework, we give an efficient algorithm for estimating the target mean that returns a weighted combination of the sample values, where the weights are functions of the distribution P and the sample and target sets A, B, and show that the worst-case expected error achieved by this algorithm is at most a multiplicative π/2 factor worse than the optimum over such algorithms. The algorithm and proof leverage a surprising connection to the Grothendieck problem. We also extend these results to the linear regression setting, where each data point is not a scalar but a labeled vector (x_i, y_i). This framework, which makes no distributional assumptions on the data values but rather relies on knowledge of the data collection process via the distribution P, is a significant departure from the typical statistical estimation framework. It provides a uniform analysis for the many natural settings where membership in a sample may be correlated with data values, such as when individuals are recruited into a sample through their social networks (as in "snowball/chain" sampling) or when samples have chronological structure (as in "selective prediction").
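To make the evaluation criterion concrete, here is a minimal sketch of the framework's interface. The `draw_AB` sampler stands in for the known distribution P, and the uniform-weight estimator is only a placeholder illustrating the interface; the paper's near-optimal estimator computes P-dependent weights, which are not reproduced here.

```python
import random

def sample_mean_estimator(A, x, B):
    """Placeholder estimator: ignore P and return the plain sample mean.
    The paper's algorithm instead weights x[i] by functions of P, A, B."""
    return sum(x[i] for i in A) / len(A)

def expected_error(estimator, draw_AB, x, trials=20000):
    """Monte Carlo estimate of E_{(A,B)~P}[(estimate - target mean)^2]
    for one fixed vector of data values x; the framework's criterion
    is the worst case of this quantity over all x."""
    total = 0.0
    for _ in range(trials):
        A, B = draw_AB()                      # one draw from the known P
        target = sum(x[i] for i in B) / len(B)
        total += (estimator(A, x, B) - target) ** 2
    return total / trials

# Example P: the sample is a uniform random half of [n], the target is all of [n].
n = 20
x = [random.random() for _ in range(n)]
draw_AB = lambda: (random.sample(range(n), n // 2), list(range(n)))
print(expected_error(sample_mean_estimator, draw_AB, x))
```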
- PAR ID: 10275102
- Journal Name: NeurIPS 2021
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Shapley value is a classic notion from game theory, historically used to quantify the contributions of individuals within groups, and more recently applied to assign values to data points when training machine learning models. Despite its foundational role, a key limitation of the data Shapley framework is that it only provides valuations for points within a fixed data set. It does not account for statistical aspects of the data and does not give a way to reason about points outside the data set. To address these limitations, we propose a novel framework -- distributional Shapley -- where the value of a point is defined in the context of an underlying data distribution. We prove that distributional Shapley has several desirable statistical properties; for example, the values are stable under perturbations to the data points themselves and to the underlying data distribution. We leverage these properties to develop a new algorithm for estimating values from data, which comes with formal guarantees and runs two orders of magnitude faster than state-of-the-art algorithms for computing the (non-distributional) data Shapley values.  We apply distributional Shapley to diverse data sets and demonstrate its utility in a data market setting.
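A sketch of the definition rather than of the paper's fast estimator: under the distributional view, a point's value is its expected marginal contribution to a random training set drawn from the underlying distribution. The helpers `sample_from_D` and `utility` (e.g., validation accuracy of a model trained on the given set) are assumed to be supplied by the caller.

```python
import random

def distributional_shapley(z, sample_from_D, utility, m, trials=200):
    """Monte Carlo sketch of a distributional Shapley value: the expected
    marginal contribution U(S + [z]) - U(S) when S is a set of i.i.d.
    draws from the data distribution D with a random size below m.
    Illustrative only; the paper's algorithm is far more efficient."""
    total = 0.0
    for _ in range(trials):
        k = random.randrange(m)        # random coalition size
        S = sample_from_D(k)           # list of k i.i.d. draws from D
        total += utility(S + [z]) - utility(S)
    return total / trials
```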
-
We consider the problem of learning function classes computed by neural networks with various activations (e.g., ReLU or sigmoid), a task believed to be computationally intractable in the worst case. A major open problem is to understand the minimal assumptions under which these classes admit provably efficient algorithms. In this work we show that a natural distributional assumption corresponding to eigenvalue decay of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g., feed-forward networks of ReLUs). We make no assumptions on the structure of the network or the labels. Given sufficiently strong polynomial eigenvalue decay, we obtain fully polynomial-time algorithms in all the relevant parameters with respect to square loss. Milder decay assumptions also lead to improved algorithms. This is the first purely distributional assumption that leads to polynomial-time algorithms for networks of ReLUs, even with one hidden layer. Further, unlike prior distributional assumptions (e.g., that the marginal distribution is Gaussian), eigenvalue decay has been observed in practice on common data sets.
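The assumption itself is easy to inspect empirically. The snippet below builds the Gram matrix of a data set under a chosen kernel and returns its eigenvalues in descending order, which is what one would plot to check for polynomial decay. It only visualizes the condition; it does not implement the learning algorithm.

```python
import numpy as np

def gram_eigenvalue_decay(X, kernel=lambda a, b: a @ b):
    """Eigenvalues of the Gram matrix K[i, j] = kernel(x_i, x_j),
    sorted in descending order, to inspect how fast they decay."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.eigvalsh(K)[::-1]     # eigvalsh: symmetric matrices

# Example: eigenvalue decay of Gaussian data under the linear kernel.
X = np.random.randn(200, 10)
print(gram_eigenvalue_decay(X)[:5])
```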
-
While widely used as a general method for uncertainty quantification, the bootstrap method encounters difficulties that raise concerns about its validity in practical applications. This paper introduces a new resampling-based method, termed the calibrated bootstrap, designed to generate finite-sample-valid parametric inference from a sample of size n. The central idea is to calibrate an m-out-of-n resampling scheme, where the calibration parameter m is determined against inferential pivotal quantities derived from the cumulative distribution functions of loss functions in parameter estimation. The method comprises two algorithms. The first, named resampling approximation (RA), employs a stochastic approximation algorithm to find the value of the calibration parameter m = m_α for a given α in a manner that ensures the resulting m-out-of-n bootstrapped 1 − α confidence set is valid. The second algorithm, termed distributional resampling (DR), is developed to further select samples of bootstrapped estimates from the RA step when constructing 1 − α confidence sets for a range of α values is of interest. The proposed method is illustrated and compared to existing methods using linear regression with and without an L1 penalty, in the context of a high-dimensional setting and a real-world data application. The paper concludes with remarks on a few open problems worthy of consideration.
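For reference, here is a minimal sketch of the resampling primitive that the calibrated bootstrap tunes: a plain m-out-of-n bootstrap for an arbitrary statistic. The search for the calibrated m via the RA and DR algorithms is the paper's contribution and is not reproduced here.

```python
import numpy as np

def m_out_of_n_bootstrap(data, stat, m, reps=2000, rng=None):
    """Plain m-out-of-n bootstrap: resample m of the n observations
    with replacement and recompute the statistic, reps times."""
    rng = rng or np.random.default_rng()
    data = np.asarray(data)
    return np.array([stat(rng.choice(data, size=m, replace=True))
                     for _ in range(reps)])

# Example: a 90% percentile interval for the mean with m = n // 2
# (an uncalibrated choice of m, used purely for illustration).
data = np.random.exponential(size=100)
boot = m_out_of_n_bootstrap(data, np.mean, m=50)
print(np.percentile(boot, [5, 95]))
```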
-
We consider a model of selective prediction, where the prediction algorithm is given a data sequence in an online fashion and asked to predict a pre-specified statistic of the upcoming data points. The algorithm is allowed to choose when to make the prediction as well as the length of the prediction window, possibly depending on the observations so far. We prove that, even without any distributional assumption on the input data stream, a large family of statistics can be estimated to non-trivial accuracy. To give one concrete example, suppose that we are given access to an arbitrary binary sequence x_1, ..., x_n of length n. Our goal is to accurately predict the average observation, and we are allowed to choose the window over which the prediction is made: for some t < n and m < n − t, after seeing t observations we predict the average of x_{t+1}, ..., x_{t+m}. This particular problem was first studied in [Dru13] and referred to as the "density prediction game". We show that the expected squared error of our prediction can be bounded by O(1/log n) and prove a matching lower bound, which resolves an open question raised in [Dru13]. This result holds for any sequence (that is not adaptive to when the prediction is made, or to the predicted value), and the expectation of the error is with respect to the randomness of the prediction algorithm. Our results apply to more general statistics of a sequence of observations, and we highlight several open directions for future work.
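To illustrate the flavor of such predictors, here is a toy randomized scheme for the binary density prediction game: pick a scale 2^k at random, observe that many bits, and forecast their average for an equally long window. This matches the problem setup above but is only a sketch under that assumption; the paper analyzes a more careful scheme to obtain the O(1/log n) bound.

```python
import random

def predict_window(stream, n):
    """Toy selective predictor: choose a random scale k, watch t = 2^k
    observations, and predict their average for the next m = 2^k ones.
    Randomizing the scale is the key idea; the choice of range keeps
    t + m <= n so the prediction window fits in the sequence."""
    k = random.randrange(max(1, n.bit_length() - 1))  # random scale
    t = m = 2 ** k
    seen = [next(stream) for _ in range(t)]           # observed prefix
    return t, m, sum(seen) / t                        # forecast for x_{t+1..t+m}

# Example on a fixed binary sequence of length n = 64.
bits = [1, 0, 1, 1] * 16
t, m, pred = predict_window(iter(bits), len(bits))
print(f"after {t} bits, predict mean {pred:.2f} for the next {m} bits")
```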