 Publication Date:
 NSFPAR ID:
 10275102
 Journal Name:
 NeurIPS 2021
 Sponsoring Org:
 National Science Foundation
More Like this

We consider a model of selective prediction, where the prediction algorithm is given a data sequence in an online fashion and asked to predict a prespecified statistic of the upcoming data points. The algorithm is allowed to choose when to make the prediction as well as the length of the prediction window, possibly depending on the observations so far. We prove that, even without any distributional assumption on the input data stream, a large family of statistics can be estimated to nontrivial accuracy. To give one concrete example, suppose that we are given access to an arbitrary binary sequence x_1, ..., x_n of length n. Our goal is to accurately predict the average observation, and we are allowed to choose the window over which the prediction is made: for some t < n and m < n − t, after seeing t observations we predict the average of x_{t+1}, ..., x_{t+m}. This particular problem was first studied in [Dru13] and referred to as the “density prediction game”. We show that the expected squared error of our prediction can be bounded by O(1/log n) and prove a matching lower bound, which resolves an open question raised in [Dru13]. This …
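As a rough illustration of the density prediction game described above, the following Python sketch scores a single prediction and implements one simple randomized strategy: pick a random dyadic block, observe its first half, and predict that the second half has the same average. The function names and the strategy itself are assumptions made for illustration; this is not claimed to be the algorithm analyzed in the paper, and it carries no guarantee here.

```python
import random


def squared_error(x, t, m, prediction):
    """Squared error of predicting the average of x_{t+1}, ..., x_{t+m}.

    With 0-based Python lists, that window is x[t:t + m].
    """
    window = x[t:t + m]
    return (prediction - sum(window) / m) ** 2


def random_scale_strategy(x, rng):
    """Illustrative strategy (an assumption, not the paper's method).

    Chooses a random power-of-two block inside the first n - 1 entries,
    so that the constraint m < n - t always holds, then uses the average
    of the block's first half to predict the average of its second half.
    Requires len(x) >= 3.
    """
    usable = len(x) - 1                   # keep t + m strictly below n
    max_level = usable.bit_length() - 1   # largest k with 2**k <= usable
    k = rng.randint(1, max_level)         # random scale, inclusive bounds
    block = 2 ** k
    start = block * rng.randrange(usable // block)
    half = block // 2
    t, m = start + half, half
    # Only x[start:t] is read here, i.e. data observed before time t.
    prediction = sum(x[start:t]) / half
    return t, m, prediction


if __name__ == "__main__":
    bits = [0, 1] * 8
    t, m, p = random_scale_strategy(bits, random.Random(0))
    print(t, m, p, squared_error(bits, t, m, p))
```

Averaging `squared_error` over many random draws gives an empirical estimate of the expected squared error for a fixed input sequence.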

Obeid, I. (Ed.) The Neural Engineering Data Consortium (NEDC) is developing the Temple University Digital Pathology Corpus (TUDP), an open source database of high-resolution images from scanned pathology samples [1], as part of its National Science Foundation-funded Major Research Instrumentation grant titled “MRI: High Performance Digital Pathology Using Big Data and Machine Learning” [2]. The long-term goal of this project is to release one million images. We have currently scanned over 100,000 images and are in the process of annotating breast tissue data for our first official corpus release, v1.0.0. This release contains 3,505 annotated images of breast tissue including 74 patients with cancerous diagnoses (out of a total of 296 patients). In this poster, we will present an analysis of this corpus and discuss the challenges we have faced in efficiently producing high-quality annotations of breast tissue. It is well known that state-of-the-art algorithms in machine learning require vast amounts of data. Fields such as speech recognition [3], image recognition [4] and text processing [5] are able to deliver impressive performance with complex deep learning models because they have developed large corpora to support training of extremely high-dimensional models (e.g., billions of parameters). Other fields that do not …

We present approximation and exact algorithms for piecewise regression of univariate and bivariate data using fixed-degree polynomials. Specifically, given a set $S$ of $n$ data points $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$ where $d \in \{1, 2\}$, the goal is to segment the $x_i$'s into some (arbitrary) number of disjoint pieces $P_1, \ldots, P_k$, where each piece $P_j$ is associated with a fixed-degree polynomial $f_j : \mathbb{R}^d \to \mathbb{R}$, to minimize the total loss function $\lambda k + \sum_{i=1}^{n} (y_i - f(x_i))^2$, where $\lambda \ge 0$ is a regularization term that penalizes model complexity (number of pieces) and $f : \bigcup_{j=1}^{k} P_j \to \mathbb{R}$ is the piecewise polynomial function defined as $f|_{P_j} = f_j$. The pieces $P_1, \ldots, P_k$ are disjoint intervals of $\mathbb{R}$ in the case of univariate data and disjoint axis-aligned rectangles in the case of bivariate data. Our error approximation allows use of any fixed-degree polynomial, not just linear functions. Our main results are the following. For univariate data, we present a $(1 + \varepsilon)$-approximation algorithm with time complexity $O\left(\frac{n}{\varepsilon} \log \frac{1}{\varepsilon}\right)$, assuming that data is presented in sorted order of the $x_i$'s. For bivariate data, we present three results: a subexponential exact algorithm with running time …

Abstract Estimating the mean of a probability distribution using i.i.d. samples is a classical problem in statistics, wherein finite-sample optimal estimators are sought under various distributional assumptions. In this paper, we consider the problem of mean estimation when independent samples are drawn from $d$-dimensional non-identical distributions possessing a common mean. When the distributions are radially symmetric and unimodal, we propose a novel estimator, which is a hybrid of the modal interval, shorth and median estimators and whose performance adapts to the level of heterogeneity in the data. We show that our estimator is near optimal when data are i.i.d. and when the fraction of ‘low-noise’ distributions is as small as $\varOmega \left (\frac{d \log n}{n}\right )$, where $n$ is the number of samples. We also derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales of individual data points. Finally, we extend our theory to linear regression. In both the mean estimation and regression settings, we present computationally feasible versions of our estimators that run in time polynomial in the number of data points.
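For readers unfamiliar with one of the ingredients named above, the classical one-dimensional shorth is the mean of the shortest interval containing about half of the sample. The Python sketch below shows only that classical building block, under the assumption of scalar data; it is not the hybrid estimator proposed in the abstract, and the function name is illustrative.

```python
def shorth(samples):
    """Classical 1-D shorth: mean of the shortest window of sorted data
    that contains floor(n / 2) + 1 points."""
    xs = sorted(samples)
    n = len(xs)
    h = n // 2 + 1
    # Slide a window of h consecutive order statistics; keep the narrowest.
    start = min(range(n - h + 1), key=lambda i: xs[i + h - 1] - xs[i])
    window = xs[start:start + h]
    return sum(window) / h


if __name__ == "__main__":
    # Three tightly clustered points plus two wild ones.
    print(shorth([0.9, 1.0, 1.1, 50.0, -40.0]))
```

Because only the tightest half of the data is averaged, a few very noisy samples have little influence on the result.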

Abstract We consider the multiclass classification problem when the training data and the out-of-sample test data may have different distributions and propose a method called BCOPS (balanced and conformal optimized prediction sets). BCOPS constructs a prediction set C(x) as a subset of class labels, possibly empty. It tries to optimize the out-of-sample performance, aiming to include the correct class and to detect outliers x as often as possible. BCOPS returns no prediction (corresponding to C(x) equal to the empty set) if it infers x to be an outlier. The proposed method combines supervised learning algorithms with conformal prediction to minimize a misclassification loss averaged over the out-of-sample distribution. The constructed prediction sets have a finite sample coverage guarantee without distributional assumptions. We also propose a method to estimate the outlier detection rate of a given procedure. We prove asymptotic consistency and optimality of our proposals under suitable assumptions and illustrate our methods on real data examples.
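The idea of a prediction set C(x) that may be empty can be sketched with generic class-wise split conformal prediction. The Python example below is such a generic sketch, not BCOPS itself: the conformity scores are assumed to come from some already-trained classifier (higher meaning more typical of the class), and all names are hypothetical.

```python
def conformal_prediction_set(test_scores, calibration_scores, alpha):
    """Generic class-wise split conformal prediction set.

    test_scores: dict mapping each label to the conformity score of the
        test point x under that label (higher = more typical).
    calibration_scores: dict mapping each label to the conformity scores
        of held-out examples whose true class is that label.
    alpha: target miscoverage level in (0, 1).

    Returns the set of labels whose conformal p-value exceeds alpha.
    An empty set means x looks atypical for every class, which can be
    read as flagging x as an outlier.
    """
    prediction_set = set()
    for label, score in test_scores.items():
        cal = calibration_scores[label]
        # Conformal p-value: rank of the test score among the calibration
        # scores, counting the test point itself.
        p_value = (1 + sum(1 for c in cal if c <= score)) / (len(cal) + 1)
        if p_value > alpha:
            prediction_set.add(label)
    return prediction_set


if __name__ == "__main__":
    calibration = {"a": list(range(1, 20)), "b": list(range(1, 20))}
    print(conformal_prediction_set({"a": 10, "b": 0}, calibration, alpha=0.1))
    print(conformal_prediction_set({"a": 0, "b": 0}, calibration, alpha=0.1))
```

In the first call only label "a" is retained; in the second, both p-values fall below alpha and the returned set is empty.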