skip to main content

Search for: All records

Award ID contains: 1740822

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. In many information-theoretic channel coding problems, adding an input cost constraint to the operational setup amounts to restricting the optimization domain in the capacity formula. This paper shows that, in contrast to common belief, such a simple modification does not hold for the cost-constrained (CC) wiretap channel (WTC). The secrecy-capacity of the discrete memoryless (DM) WTC without cost constraints is described by a single auxiliary random variable. For the CC DM-WTC, however, we show that two auxiliaries are necessary to achieve capacity. Specifically, we first derive the secrecy-capacity formula, proving the direct part via superposition coding. Then, we provide anmore »example of a CC DM-WTC whose secrecy-capacity cannot be achieved using a single auxiliary. This establishes the fundamental role of superposition coding over CC WTCs.« less
    Free, publicly-accessible full text available January 1, 2023
  2. Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. We establish non-asymptotic absolute error bounds for a neural estimator realized by a shallow NN, focusing on four popular 𝖿-divergences---Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools frommore »empirical process theory to bound the two sources of error involved: function approximation and empirical estimation. The bounds characterize the effective error in terms of NN size and the number of samples, and reveal scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth-rate are minimax rate-optimal, achieving the parametric convergence rate.« less
    Free, publicly-accessible full text available January 1, 2023
  3. Statistical distances (SDs), which quantify the dissimilarity between probability distributions, are central to machine learning and statistics. A modern method for estimating such distances from data relies on parametrizing a variational form by a neural network (NN) and optimizing it. These estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. In particular, there seems to be a fundamental tradeoff between the two sources of error involved: approximation and estimation. While the former needs the NN class to be rich and expressive, the latter relies on controlling complexity. This paper explores this tradeoffmore »by means of non-asymptotic error bounds, focusing on three popular choices of SDs—Kullback-Leibler divergence, chi-squared divergence, and squared Hellinger distance. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory. Numerical results validating the theory are also provided.« less
    Free, publicly-accessible full text available January 1, 2023
  4. In many information-theoretic channel coding prob- lems, adding an input cost constraint to the operational setup amounts to restricting the optimization domain in the capacity formula. This paper shows that, in contrast to common belief, such a simple modification does not hold for the cost-constrained (CC) wiretap channel (WTC). The secrecy-capacity of the discrete memoryless (DM) WTC without cost constraints is described by a single auxiliary random variable. For the CC DM-WTC, however, we show that two auxiliaries are necessary to achieve capacity. Specifically, we first derive the secrecy-capacity formula, proving the direct part via superposition coding. Then, we providemore »an example of a CC DM-WTC whose secrecy-capacity cannot be achieved using a single auxiliary. This establishes the fundamental role of superposition coding over CC WTCs.« less
  5. Statistical distances (SDs), which quantify the dissimilarity between probability distri- butions, are central to machine learning and statistics. A modern method for esti- mating such distances from data relies on parametrizing a variational form by a neu- ral network (NN) and optimizing it. These estimators are abundantly used in prac- tice, but corresponding performance guar- antees are partial and call for further ex- ploration. In particular, there seems to be a fundamental tradeoff between the two sources of error involved: approximation and estimation. While the former needs the NN class to be rich and expressive, the latter relies on controllingmore »complexity. This paper explores this tradeoff by means of non-asymptotic error bounds, focusing on three popular choices of SDs—Kullback- Leibler divergence, chi-squared divergence, and squared Hellinger distance. Our analysis relies on non-asymptotic function approxima- tion theorems and tools from empirical pro- cess theory. Numerical results validating the theory are also provided.« less
  6. We consider the problem of soft-covering with constant composition superposition codes and characterize the optimal soft-covering exponent. A double-exponential concentration bound for deviation of the exponent from its mean is also established. We demonstrate an application of the result to achieving the secrecy-capacity region of a broadcast channel with confidential messages under a per-codeword cost constraint. This generalizes the recent characterization of the wiretap channel secrecy-capacity under an average cost constraint, highlighting the potential utility of the superposition soft-covering result to the analysis of coding problems.
  7. In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline would greatly expand where RL can be applied, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; (b) learning a near-optimal policy in this P-MDP. The learnedmore »P-MDP has the property that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the P-MDP. This enables it to serve as a good surrogate for purposes of policy evaluation and learning, and overcome common pitfalls of model-based RL like model exploitation. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Through experiments, we show that MOReL matches or exceeds state-of-the-art results in widely studied offline RL benchmarks. Moreover, the modular design of MOReL enables future advances in its components (e.g., in model learning, planning etc.) to directly translate into improvements for offline RL.« less
  8. Minimax optimal convergence rates for numerous classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, the behavior of SGD’s final iterate has received much less attention despite the widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? First, this work shows that even if the time horizon T (i.e. themore »number of iterations that SGD is run for) is known in advance, the behavior of SGD’s final iterate with any polynomially decaying learning rate scheme is highly sub-optimal compared to the statistical minimax rate (by a condition number factor in the strongly convex √ case and a factor of shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically) offer significant improvements over any polynomially decaying step size schedule. In particular, the behavior of the final iterate with step decay schedules is off from the statistical minimax rate by only log factors (in the condition number for the strongly convex case, and in T in the non-strongly convex case). Finally, in stark contrast to the known horizon case, this paper shows that the anytime (i.e. the limiting) behavior of SGD’s final iterate is poor (in that it queries iterates with highly sub-optimal function value infinitely often, i.e. in a limsup sense) irrespective of the stepsize scheme employed. These results demonstrate the subtlety in establishing optimal learning rate schedules (for the final iterate) for stochastic gradient procedures in fixed time horizon settings.« less
  9. The ability to perform offline A/B-testing and off-policy learning using logged contextual bandit feedback is highly desirable in a broad range of applications, including recommender systems, search engines, ad placement, and personalized health care. Both offline A/B-testing and offpolicy learning require a counterfactual estimator that evaluates how some new policy would have performed, if it had been used instead of the logging policy. In this paper, we present and analyze a family of counterfactual estimators which subsumes most estimators proposed to date. Most importantly, this analysis identifies a new estimator – called Continuous Adaptive Blending (CAB) – which enjoys manymore »advantageous theoretical and practical properties. In particular, it can be substantially less biased than clipped Inverse Propensity Score (IPS) weighting and the Direct Method, and it can have less variance than Doubly Robust and IPS estimators. In addition, it is subdifferentiable such that it can be used for learning, unlike the SWITCH estimator. Experimental results show that CAB provides excellent evaluation accuracy and outperforms other counterfactual estimators in terms of learning performance.« less
  10. Given a matrix A∈ℝn×d and a vector b∈ℝd, we show how to compute an ϵ-approximate solution to the regression problem minx∈ℝd12‖Ax−b‖22 in time Õ ((n+d⋅κsum‾‾‾‾‾‾‾√)⋅s⋅logϵ−1) where κsum=tr(A⊤A)/λmin(ATA) and s is the maximum number of non-zero entries in a row of A. Our algorithm improves upon the previous best running time of Õ ((n+n⋅κsum‾‾‾‾‾‾‾√)⋅s⋅logϵ−1). We achieve our result through a careful combination of leverage score sampling techniques, proximal point methods, and accelerated coordinate descent. Our method not only matches the performance of previous methods, but further improves whenever leverage scores of rows are small (up to polylogarithmic factors). We also providemore »a non-linear generalization of these results that improves the running time for solving a broader class of ERM problems.« less