We consider a model of selective prediction, where the prediction algorithm is given a data sequence in an online fashion and asked to predict a prespecified statistic of the upcoming data points. The algorithm is allowed to choose when to make the prediction as well as the length of the prediction window, possibly depending on the observations so far. We prove that, even without any distributional assumption on the input data stream, a large family of statistics can be estimated to nontrivial accuracy. To give one concrete example, suppose that we are given access to an arbitrary binary sequence x1, . . . , xn of length n. Our goal is to accurately predict the average observation, and we are allowed to choose the window over which the prediction is made: for some t < n and m < n − t, after seeing t observations we predict the average of x_{t+1},..., x{t+m}. This particular problem was first studied in [Dru13] and referred to as the “density prediction game”. We show that the expected squared error of our prediction can be bounded by O(1/log n) and prove a matching lower bound, which resolves an open question raised in [Dru13]. Thismore »
WorstCase Analysis for Randomly Collected Data
We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements [n]={1,...,n} with corresponding data values x1,...,xn. We observe the values for a "sample" set A \subset [n] and wish to estimate some statistic of the values for a "target" set B \subset [n] where B could be the entire set. Crucially, we assume that the sets A and B are drawn according to some known distribution P over pairs of subsets of [n]. A given estimation algorithm is evaluated based on its "worstcase, expected error" where the expectation is with respect to the distribution P from which the sample A and target sets B are drawn, and the worstcase is with respect to the data values x1,...,xn. Within this framework, we give an efficient algorithm for estimating the target mean that returns a weighted combination of the sample values–where the weights are functions of the distribution P and the sample and target sets A, Band show that the worstcase expected error achieved by this algorithm is at most a multiplicative pi/2 factor worse than the optimal of such algorithms. more »
 Publication Date:
 NSFPAR ID:
 10275102
 Journal Name:
 NeurIPS 2021
 Sponsoring Org:
 National Science Foundation
More Like this


Obeid, I. (Ed.)The Neural Engineering Data Consortium (NEDC) is developing the Temple University Digital Pathology Corpus (TUDP), an open source database of highresolution images from scanned pathology samples [1], as part of its National Science Foundationfunded Major Research Instrumentation grant titled “MRI: High Performance Digital Pathology Using Big Data and Machine Learning” [2]. The longterm goal of this project is to release one million images. We have currently scanned over 100,000 images and are in the process of annotating breast tissue data for our first official corpus release, v1.0.0. This release contains 3,505 annotated images of breast tissue including 74 patients with cancerous diagnoses (out of a total of 296 patients). In this poster, we will present an analysis of this corpus and discuss the challenges we have faced in efficiently producing high quality annotations of breast tissue. It is well known that state of the art algorithms in machine learning require vast amounts of data. Fields such as speech recognition [3], image recognition [4] and text processing [5] are able to deliver impressive performance with complex deep learning models because they have developed large corpora to support training of extremely highdimensional models (e.g., billions of parameters). Other fields that do notmore »

We present approximation and exact algorithms for piecewise regression of univariate and bivariate data using fixeddegree polynomials. Specifically, given a set S of n data points (x1, y1), . . . , (xn, yn) ∈ Rd × R where d ∈ {1, 2}, the goal is to segment xi’s into some (arbitrary) number of disjoint pieces P1, . . . , Pk, where each piece Pj is associated with a fixeddegree polynomial fj : Rd → R, to minimize the total loss function λk+ni=1(yi −f(xi))2, where λ ≥ 0 is a regularization term that penalizes model complexity (number of pieces) and f : kj=1 Pj → R is the piecewise polynomial function defined as fPj = fj. The pieces P1,...,Pk are disjoint intervals of R in the case of univariate data and disjoint axisaligned rectangles in the case of bivariate data. Our error approximation allows use of any fixeddegree polynomial, not just linear functions. Our main results are the following. For univariate data, we present a (1 + ε)approximation algorithm with time complexity O(nε log1ε), assuming that data is presented in sorted order of xi’s. For bivariate data, we √ present three results: a subexponential exact algorithm with running timemore »

Abstract Estimating the mean of a probability distribution using i.i.d. samples is a classical problem in statistics, wherein finitesample optimal estimators are sought under various distributional assumptions. In this paper, we consider the problem of mean estimation when independent samples are drawn from $d$dimensional nonidentical distributions possessing a common mean. When the distributions are radially symmetric and unimodal, we propose a novel estimator, which is a hybrid of the modal interval, shorth and median estimators and whose performance adapts to the level of heterogeneity in the data. We show that our estimator is near optimal when data are i.i.d. and when the fraction of ‘lownoise’ distributions is as small as $\varOmega \left (\frac{d \log n}{n}\right )$, where $n$ is the number of samples. We also derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales of individual data points. Finally, we extend our theory to linear regression. In both the mean estimation and regression settings, we present computationally feasible versions of our estimators that run in time polynomial in the number of data points.

We consider the problem of learning function classes computed by neural networks with various activations (e.g. ReLU or Sigmoid), a task believed to be computationally intractable in the worstcase. A major open problem is to understand the minimal assumptions under which these classes admit provably efficient algorithms. In this work we show that a natural distributional assumption corresponding to {\em eigenvalue decay} of the Gram matrix yields polynomialtime algorithms in the nonrealizable setting for expressive classes of networks (e.g. feedforward networks of ReLUs). We make no assumptions on the structure of the network or the labels. Given sufficientlystrong polynomial eigenvalue decay, we obtain {\em fully}polynomial time algorithms in {\em all} the relevant parameters with respect to squareloss. Milder decay assumptions also lead to improved algorithms. This is the first purely distributional assumption that leads to polynomialtime algorithms for networks of ReLUs, even with one hidden layer. Further, unlike prior distributional assumptions (e.g., the marginal distribution is Gaussian), eigenvalue decay has been observed in practice on common data sets.