
Larochelle, Hugo ; Hadsell, Raia ; Cho, Kyunghyun (Ed.) In deep learning, leveraging transfer learning has recently been shown to be an effective strategy for training large high-performance models with Differential Privacy (DP). Moreover, somewhat surprisingly, recent works have found that privately training just the last layer of a pretrained model provides the best utility with DP. While past studies largely rely on first-order differentially private training algorithms like DP-SGD for training large models, in the specific case of privately learning from features we observe that the computational burden is often low enough to allow for more sophisticated optimization schemes, including second-order methods. To that end, we systematically explore the effect of design parameters such as the loss function and optimization algorithm. We find that, while commonly used logistic regression performs better than linear regression in the non-private setting, the situation is reversed in the private setting. We find that least-squares linear regression is much more effective than logistic regression from both a privacy and a computational standpoint, especially at stricter epsilon values (ε < 1). On the optimization side, we also explore using Newton's method, and find that second-order information is quite helpful even with privacy, although the benefit significantly diminishes with stricter privacy guarantees. While both methods use second-order information, least squares is more effective at lower epsilon values while Newton's method is more effective at larger epsilon values. To combine the benefits of both methods, we propose a novel optimization algorithm called DP-FC, which leverages feature covariance instead of the Hessian of the logistic regression loss and performs well across all ε values we tried. With this, we obtain new SOTA results on ImageNet-1k, CIFAR-100, and CIFAR-10 across all values of ε typically considered. Most remarkably, on ImageNet-1k we obtain a top-1 accuracy of 88% under a DP guarantee of $(8, 8 \times 10^{-7})$ and 84.3% under $(0.1, 8 \times 10^{-7})$.
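To make the "learn a linear head privately from frozen features" recipe concrete, here is a minimal sketch that perturbs the sufficient statistics $X^\top X$ and $X^\top Y$ with Gaussian noise and solves a ridge system. The function name, noise calibration, and hyperparameters are illustrative assumptions; this is the generic sufficient-statistics approach, not the paper's exact DP-FC algorithm.

```python
import numpy as np

def dp_least_squares(X, Y, noise_mult=1.0, clip=1.0, reg=1e-2, rng=None):
    """Noisy-sufficient-statistics sketch of private linear regression on features."""
    rng = np.random.default_rng() if rng is None else rng
    # Clip each feature vector so a single example's contribution is bounded.
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True) / clip, 1.0)
    Xc = X / norms
    d, k = X.shape[1], Y.shape[1]
    # Gaussian-mechanism noise on X^T X and X^T Y (scale set by the clip bound).
    A = Xc.T @ Xc + rng.normal(scale=noise_mult * clip**2, size=(d, d))
    A = (A + A.T) / 2                      # keep the noisy covariance symmetric
    B = Xc.T @ Y + rng.normal(scale=noise_mult * clip, size=(d, k))
    # Solve the regularized normal equations for the private linear head.
    return np.linalg.solve(A + reg * np.eye(d), B)
```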

Larochelle, Hugo ; Kamath, Gautam ; Hadsell, Raia ; Cho, Kyunghyun (Ed.) Neural scene representations, both continuous and discrete, have recently emerged as a powerful new paradigm for 3D scene understanding. Recent efforts have tackled unsupervised discovery of object-centric neural scene representations. However, the high cost of ray-marching, exacerbated by the fact that each object representation has to be ray-marched separately, leads to insufficiently sampled radiance fields and thus noisy renderings, poor frame rates, and high memory and time complexity during training and rendering. Here, we propose to represent objects in an object-centric, compositional scene representation as light fields. We propose a novel light field compositor module that enables reconstructing the global light field from a set of object-centric light fields. Dubbed Compositional Object Light Fields (COLF), our method enables unsupervised learning of object-centric neural scene representations, state-of-the-art reconstruction and novel view synthesis performance on standard datasets, and rendering and training speeds orders of magnitude faster than existing 3D approaches.

Krause, Andreas ; Brunskill, Emma ; Cho, Kyunghyun ; Engelhardt, Barbara ; Sabato, Sivan ; Scarlett, Jonathan (Ed.) Transfer operators provide a rich framework for representing the dynamics of very general, nonlinear dynamical systems. When interacting with reproducing kernel Hilbert spaces (RKHS), descriptions of dynamics often incur prohibitive data storage requirements, motivating dataset sparsification as a precursory step to computation. Further, in practice, data is available in the form of trajectories, introducing correlation between samples. In this work, we present a method for sparse learning of transfer operators from $\beta$-mixing stochastic processes, in both discrete and continuous time, and provide a sample complexity analysis extending existing theoretical guarantees for learning from non-sparse, i.i.d. data. In addressing continuous-time settings, we develop precise descriptions using covariance-type operators for the infinitesimal generator that aid in the sample complexity analysis. We empirically illustrate the efficacy of our sparse embedding approach through deterministic and stochastic nonlinear system examples.
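As a point of reference for what "learning a transfer operator in an RKHS from trajectory data" means, here is a minimal kernel-embedding sketch fit by regularized least squares on consecutive snapshots. The RBF kernel, regularization, and function names are my own illustrative choices; the paper's sparse, $\beta$-mixing-aware estimator is not reproduced here.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_transfer_operator(X, Y, gamma=1.0, reg=1e-6):
    """Empirical transfer-operator estimate from snapshot pairs (x_t, x_{t+1}).

    Solves the regularized system G K = A with G = k(X, X) and A = k(X, Y),
    the generic kernel-embedding recipe for approximating the operator's
    action on kernel features.
    """
    G = rbf_kernel(X, X, gamma)            # Gram matrix on input states
    A = rbf_kernel(X, Y, gamma)            # cross-Gram to successor states
    n = X.shape[0]
    return np.linalg.solve(G + reg * n * np.eye(n), A)
```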

Krause, Andreas ; Brunskill, Emma ; Cho, Kyunghyun ; Engelhardt, Barbara ; Sabato, Sivan ; Scarlett, Jonathan (Ed.) Differentiable Search Index (DSI) is a recently proposed paradigm for document retrieval that encodes information about a corpus of documents within the parameters of a neural network and directly maps queries to corresponding documents. These models have achieved state-of-the-art performance for document retrieval across many benchmarks. However, these kinds of models have a significant limitation: it is not easy to add new documents after a model is trained. We propose IncDSI, a method to add documents in real time (about 20-50 ms per document) without retraining the model on the entire dataset (or even parts thereof). Instead, we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with retraining the model on the whole dataset and enables the development of document retrieval systems that can be updated with new information in real time. Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.
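To illustrate the flavor of adding a document without retraining, here is a simplified sketch: learn a single new classification row so that the new document's query embeddings score above all existing rows by a margin, leaving the rest of the index untouched. The hinge objective, warm start, and gradient loop are my own simplifications, not the constrained optimizer in the released IncDSI code.

```python
import numpy as np

def add_document(V, new_queries, margin=1.0, lr=0.1, steps=200):
    """Append one row to a DSI-style document classification matrix (sketch).

    V: (num_docs, d) matrix whose rows score documents against query embeddings.
    new_queries: (q, d) embeddings of queries that should retrieve the new doc.
    """
    v = new_queries.mean(axis=0).copy()          # warm start at the query centroid
    for _ in range(steps):
        grad = np.zeros(V.shape[1])
        for q in new_queries:
            best_existing = (V @ q).max()        # strongest competing document
            if v @ q < best_existing + margin:   # hinge: new doc must win by a margin
                grad -= q
        v -= lr * grad / max(len(new_queries), 1)
    return np.vstack([V, v])                     # index now routes these queries to the new row
```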

Krause, Andreas ; Brunskill, Emma ; Cho, Kyunghyun ; Engelhardt, Barbara ; Sabato, Sivan ; Scarlett, Jonathan (Ed.) We consider the problem of estimating the optimal transport map between two probability distributions, P and Q in $\mathbb{R}^d$, on the basis of i.i.d. samples. All existing statistical analyses of this problem require the assumption that the transport map is Lipschitz, a strong requirement that, in particular, excludes any examples where the transport map is discontinuous. As a first step towards developing estimation procedures for discontinuous maps, we consider the important special case where the data distribution Q is a discrete measure supported on a finite number of points in $\mathbb{R}^d$. We study a computationally efficient estimator initially proposed by Pooladian & Niles-Weed (2021), based on entropic optimal transport, and show in the semi-discrete setting that it converges at the minimax-optimal rate $n^{-1/2}$, independent of dimension. Other standard map estimation techniques both lack finite-sample guarantees in this setting and provably suffer from the curse of dimensionality. We confirm these results in numerical experiments, and provide experiments for other settings, not covered by our theory, which indicate that the entropic estimator is a promising methodology for other discontinuous transport map estimation problems.
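For concreteness, a minimal sketch of an entropic map estimator in this semi-discrete setting: run Sinkhorn between source samples and the finitely many target atoms, then evaluate the barycentric projection at a new point. Uniform atom weights, the squared-Euclidean cost, and the fixed iteration count are illustrative assumptions, not the tuned configuration analyzed in the paper.

```python
import numpy as np

def sinkhorn_potentials(X, Y, a, b, eps=0.1, iters=500):
    """Dual potentials for entropic OT between samples X ~ P and atoms Y of Q."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
    f, g = np.zeros(len(X)), np.zeros(len(Y))
    for _ in range(iters):
        f = -eps * np.log((b * np.exp((g - C) / eps)).sum(1))
        g = -eps * np.log((a * np.exp((f[:, None] - C) / eps).T).sum(1))
    return f, g

def entropic_map(x, Y, g, eps=0.1):
    """Barycentric-projection estimate of the transport map at a new point x.

    Weights on the target atoms are proportional to exp((g_j - |x - y_j|^2) / eps);
    the map value is the weighted average of the atoms.
    """
    c = ((x - Y) ** 2).sum(-1)
    w = np.exp((g - c) / eps)
    return (w[:, None] * Y).sum(0) / w.sum()
```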

Krause, Andreas ; Brunskill, Emma ; Cho, Kyunghyun ; Engelhardt, Barbara ; Sabato, Sivan ; Scarlett, Jonathan (Ed.) The parameter space for any fixed architecture of feedforward ReLU neural networks serves as a proxy during training for the associated class of functions, but how faithful is this representation? It is known that many different parameter settings $\theta$ can determine the same function $f$. Moreover, the degree of this redundancy is inhomogeneous: for some networks, the only symmetries are permutation of neurons in a layer and positive scaling of parameters at a neuron, while other networks admit additional hidden symmetries. In this work, we prove that, for any network architecture where no layer is narrower than the input, there exist parameter settings with no hidden symmetries. We also describe a number of mechanisms through which hidden symmetries can arise, and empirically approximate the functional dimension of different network architectures at initialization. These experiments indicate that the probability that a network has no hidden symmetries decreases towards 0 as depth increases, while increasing towards 1 as width and input dimension increase.
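As a quick illustration of the "non-hidden" symmetries the abstract refers to, the snippet below checks that positively rescaling one hidden neuron's incoming weights and inversely rescaling its outgoing weights leaves the realized function unchanged. The toy two-layer network and the scale factor are hypothetical, chosen only to demonstrate the identity relu(c z) = c relu(z) for c > 0.

```python
import numpy as np

def relu_net(x, W1, b1, W2, b2):
    """Two-layer ReLU network: x -> W2 @ relu(W1 @ x + b1) + b2."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

c = 3.7                                   # any positive scale factor
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[0] *= c; b1s[0] *= c                  # scale neuron 0's incoming parameters
W2s[:, 0] /= c                            # undo the scale on its outgoing weights

x = rng.normal(size=3)
# Same function, different parameters: a (non-hidden) scaling symmetry.
assert np.allclose(relu_net(x, W1, b1, W2, b2), relu_net(x, W1s, b1s, W2s, b2))
```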

Krause, Andreas ; Brunskill, Emma ; Cho, Kyunghyun ; Engelhardt, Barbara ; Sabato, Sivan ; Scarlett, Jonathan (Ed.) Real-world data can be multimodally distributed, e.g., data describing the opinion divergence in a community, the inter-spike interval distribution of neurons, and the oscillators' natural frequencies. Generating multimodally distributed real-world data has become a challenge for existing generative adversarial networks (GANs). For example, Neural SDEs have demonstrated successful performance mainly in generating unimodal time series datasets. In this paper, we propose a novel time series generator, named directed chain GANs (DC-GANs), which inserts a time series dataset (called a neighborhood process of the directed chain, or input) into the drift and diffusion coefficients of directed chain SDEs with distributional constraints. DC-GANs can generate new time series of the same distribution as the neighborhood process, and the neighborhood process provides the key step in learning and generating multimodally distributed time series. The proposed DC-GANs are examined on four datasets, including two stochastic models from the social sciences and computational neuroscience, and two real-world datasets on stock prices and energy consumption. To the best of our knowledge, DC-GANs are the first work that can generate multimodal time series data, and they consistently outperform state-of-the-art benchmarks with respect to measures of distribution, data similarity, and predictive ability.
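To give a feel for how a "neighborhood process" can enter the dynamics of a generated path, here is an Euler-Maruyama sketch of a directed-chain-style SDE whose drift pulls the new path toward a given neighbor path. The mean-reverting drift and constant diffusion are illustrative stand-ins, not the learned coefficients used by DC-GANs.

```python
import numpy as np

def simulate_directed_chain(neighbor_path, dt=0.01, kappa=2.0, sigma=0.3, rng=None):
    """Simulate one path whose drift depends on an input 'neighborhood' path."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty_like(neighbor_path)
    x[0] = neighbor_path[0]
    for t in range(len(neighbor_path) - 1):
        drift = kappa * (neighbor_path[t] - x[t])        # pull toward the neighbor
        x[t + 1] = x[t] + drift * dt + sigma * np.sqrt(dt) * rng.normal()
    return x
```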

Krause, Andreas ; Brunskill, Emma ; Cho, Kyunghyun ; Engelhardt, Barbara ; Sabato, Sivan ; Scarlett, Jonathan (Ed.) Robust Markov decision processes (MDPs) address the challenge of model uncertainty by optimizing the worst-case performance over an uncertainty set of MDPs. In this paper, we focus on robust average-reward MDPs under the model-free setting. We first theoretically characterize the structure of solutions to the robust average-reward Bellman equation, which is essential for our later convergence analysis. We then design two model-free algorithms, robust relative value iteration (RVI) TD and robust RVI Q-learning, and theoretically prove their convergence to the optimal solution. We provide several widely used uncertainty sets as examples, including those defined by the contamination model, total variation, Chi-squared divergence, Kullback-Leibler (KL) divergence, and Wasserstein distance.
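For intuition, here is a sketch of a single relative-value-iteration-style Q-learning update with a contamination-model worst case: the successor value is mixed with the value of the globally worst state, and a reference entry plays the role of the average-reward offset. The step size, mixing weight, and reference choice are my own simplified illustrations, not the paper's exact algorithm or analysis.

```python
import numpy as np

def robust_rvi_q_update(Q, s, a, r, s_next, alpha=0.1, delta=0.1, ref=(0, 0)):
    """One sketch update of robust RVI Q-learning under a contamination uncertainty set.

    Q: (num_states, num_actions) table; (s, a, r, s_next): one observed transition.
    """
    # Worst-case next-state value: mostly the observed successor, partly the worst state.
    v_next = (1 - delta) * Q[s_next].max() + delta * Q.max(axis=1).min()
    # Relative value iteration subtracts a reference entry instead of discounting.
    target = r - Q[ref] + v_next
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```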

Agarwal, Alekh ; Belgrave, Danielle ; Cho, Kyunghyun ; Oh, Alice (Ed.) We propose a new approach to automated theorem proving in which an AlphaZero-style agent self-trains to refine a generic high-level expert strategy expressed as a nondeterministic program. An analogous teacher agent self-trains to generate tasks of suitable relevance and difficulty for the learner. This allows leveraging minimal amounts of domain knowledge to tackle problems for which training data is unavailable or hard to synthesize. As a specific illustration, we consider loop invariant synthesis for imperative programs and use neural networks to refine both the teacher and solver strategies.
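To clarify what "loop invariant synthesis" asks for, the toy program below checks a candidate invariant that holds before every iteration and implies the postcondition on exit. The program, the invariant, and the assertions are hypothetical illustrations, not output of the paper's learned strategies.

```python
def triangular(n):
    """Sum 0 + 1 + ... + (n - 1) while asserting a candidate loop invariant."""
    total, i = 0, 0
    while i < n:
        assert total == i * (i - 1) // 2      # candidate loop invariant
        total += i
        i += 1
    assert total == n * (n - 1) // 2          # postcondition implied on exit
    return total

triangular(10)
```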

Krause, Andreas ; Brunskill, Emma ; Cho, Kyunghyun ; Engelhardt, Barbara ; Sabato, Sivan ; Scarlett, Jonathan (Ed.) We consider a deep matrix factorization model of covariance matrices trained with the Bures-Wasserstein distance. While recent works have made advances in the study of the optimization problem for overparametrized low-rank matrix approximation, much emphasis has been placed on discriminative settings and the square loss. In contrast, our model considers another type of loss and connects with the generative setting. We characterize the critical points and minimizers of the Bures-Wasserstein distance over the space of rank-bounded matrices. The Hessian of this loss at low-rank matrices can theoretically blow up, which creates challenges for analyzing the convergence of gradient optimization methods. We establish convergence results for gradient flow using a smooth perturbative version of the loss, as well as convergence results for finite-step-size gradient descent under certain assumptions on the initial weights.
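For reference, the squared Bures-Wasserstein distance between positive semidefinite matrices is $BW^2(A, B) = \operatorname{tr}(A) + \operatorname{tr}(B) - 2\operatorname{tr}\big((A^{1/2} B A^{1/2})^{1/2}\big)$; the short sketch below evaluates it numerically, with a factorized covariance $A = WW^\top$ as a hypothetical example. Only the loss is shown, not the deep factorization model or its optimization.

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein_sq(A, B):
    """Squared Bures-Wasserstein distance between PSD covariance matrices A and B."""
    root_A = sqrtm(A)
    cross = sqrtm(root_A @ B @ root_A)
    return np.trace(A) + np.trace(B) - 2 * np.real(np.trace(cross))

# Example: a rank-bounded covariance parameterized as A = W W^T (illustrative only).
W = np.random.default_rng(1).normal(size=(3, 2))
A = W @ W.T                                  # rank-2 covariance from the factorization
B = np.eye(3)
print(bures_wasserstein_sq(A, B))
```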