Generalization error bounds are essential to understanding machine learning algorithms. This paper presents novel expected generalization error upper bounds based on the average joint distribution between the output hypothesis and each input training sample. Multiple generalization error upper bounds based on different information measures are provided, including Wasserstein distance, total variation distance, KL divergence, and Jensen-Shannon divergence. Due to the convexity of the information measures, the proposed bounds in terms of Wasserstein distance and total variation distance are shown to be tighter than their counterparts based on individual samples in the literature. An example is provided to demonstrate the tightness of the proposed generalization error bounds. (Free, publicly accessible full text available June 26, 2023.)
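As an illustrative aside (not code from the paper), three of the information measures named above have direct formulas for discrete distributions. The sketch below assumes finite pmfs given as aligned lists and omits the Wasserstein distance, which requires an optimal-transport computation:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence D(p || q), in nats, for discrete pmfs.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    # Total variation distance: half the L1 distance between the pmfs.
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def jensen_shannon(p, q):
    # Jensen-Shannon divergence: symmetrized, smoothed KL via the mixture m.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Note that total variation lies in [0, 1] and Jensen-Shannon divergence is bounded by log 2, while KL divergence is unbounded, which is one reason bounds based on different measures can differ in tightness.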

We provide an information-theoretic analysis of the generalization ability of Gibbs-based transfer learning algorithms by focusing on two popular empirical risk minimization (ERM) approaches for transfer learning, α-weighted-ERM and two-stage-ERM. Our key result is an exact characterization of the generalization behavior using the conditional symmetrized Kullback-Leibler (KL) information between the output hypothesis and the target training samples given the source training samples. Our results can also be applied to provide novel distribution-free generalization error upper bounds on these two aforementioned Gibbs algorithms. Our approach is versatile, as it also characterizes the generalization errors and excess risks of these two Gibbs algorithms in the asymptotic regime, where they converge to the α-weighted-ERM and two-stage-ERM, respectively. Based on our theoretical results, we show that the benefits of transfer learning can be viewed as a bias-variance trade-off, with the bias induced by the source distribution and the variance induced by the lack of target samples. We believe this viewpoint can guide the choice of transfer learning algorithms in practice. (Free, publicly accessible full text available March 1, 2023.)
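To make the α-weighted-ERM objective concrete, here is a minimal toy sketch (my own construction, not the paper's algorithm) that interpolates between the source and target empirical risks with a weight α and minimizes over a finite hypothesis class:

```python
def alpha_weighted_risk(alpha, source_losses, target_losses):
    # Convex combination of the average source and target empirical risks:
    # (1 - alpha) * L_source + alpha * L_target.
    src = sum(source_losses) / len(source_losses)
    tgt = sum(target_losses) / len(target_losses)
    return (1 - alpha) * src + alpha * tgt

def alpha_weighted_erm(alpha, loss, hypotheses, source_data, target_data):
    # Pick the hypothesis minimizing the alpha-weighted empirical risk
    # over a (toy) finite hypothesis class.
    return min(hypotheses, key=lambda h: alpha_weighted_risk(
        alpha,
        [loss(h, z) for z in source_data],
        [loss(h, z) for z in target_data]))
```

Setting α = 0 recovers source-only ERM and α = 1 recovers target-only ERM; intermediate α trades the bias of the source distribution against the variance from scarce target samples.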

Various approaches have been developed to upper bound the generalization error of a supervised learning algorithm. However, existing bounds are often loose and lack performance guarantees; as a result, they may fail to characterize the exact generalization ability of a learning algorithm. Our main contribution is an exact characterization of the expected generalization error of the well-known Gibbs algorithm (a.k.a. Gibbs posterior) using symmetrized KL information between the input training samples and the output hypothesis. Our result can be applied to tighten existing expected generalization error and PAC-Bayesian bounds. Our approach is versatile, as it also characterizes the generalization error of the Gibbs algorithm with a data-dependent regularizer and that of the Gibbs algorithm in the asymptotic regime, where it converges to the empirical risk minimization algorithm. Of particular relevance, our results highlight the role the symmetrized KL information plays in controlling the generalization error of the Gibbs algorithm.
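For intuition, the Gibbs posterior over a finite hypothesis class has a simple closed form, P(w | S) ∝ π(w) exp(−γ L_S(w)); a minimal illustrative sketch, with a hypothetical inverse temperature γ:

```python
import math

def gibbs_posterior(prior, empirical_risks, gamma):
    # Gibbs posterior over a finite hypothesis class:
    # P(w | S) is proportional to prior(w) * exp(-gamma * L_S(w)),
    # where L_S(w) is the empirical risk of hypothesis w on sample S.
    weights = [p * math.exp(-gamma * r)
               for p, r in zip(prior, empirical_risks)]
    z = sum(weights)  # partition function (normalizing constant)
    return [w / z for w in weights]
```

As γ grows, the posterior concentrates on empirical risk minimizers, which is the asymptotic regime where the Gibbs algorithm converges to ERM.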

Modeling the time evolution of discrete sets of items (e.g., genetic mutations) is a fundamental problem in many biomedical applications. We approach this problem through the lens of continuous-time Markov chains, and show that the resulting learning task is generally underspecified in the usual setting of cross-sectional data. We explore a perhaps surprising remedy: including a number of additional independent items can help determine time order, and hence resolve underspecification. This is in sharp contrast to the common practice of limiting the analysis to a small subset of relevant items, which is followed largely due to poor scaling of existing methods. To put our theoretical insight into practice, we develop an approximate likelihood maximization method for learning continuous-time Markov chains, which can scale to hundreds of items and is orders of magnitude faster than previous methods. We demonstrate the effectiveness of our approach on synthetic and real cancer data.
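As background (my own sketch, not the paper's method), a continuous-time Markov chain over discrete states can be simulated exactly with Gillespie-style sampling of exponential holding times; here the rate matrix holds non-negative off-diagonal jump rates:

```python
import random

def simulate_ctmc(rate_matrix, start, t_end, rng):
    # Gillespie-style simulation of a continuous-time Markov chain:
    # wait an exponential holding time with the total outgoing rate,
    # then jump to a neighbor with probability proportional to its rate.
    state, t = start, 0.0
    path = [(t, state)]
    while True:
        rates = rate_matrix[state]
        total = sum(r for j, r in enumerate(rates) if j != state)
        if total <= 0:
            break  # absorbing state: no outgoing rates
        t += rng.expovariate(total)
        if t >= t_end:
            break
        u = rng.random() * total
        acc = 0.0
        for j, r in enumerate(rates):
            if j == state:
                continue
            acc += r
            if u <= acc:
                state = j
                break
        path.append((t, state))
    return path
```

Cross-sectional data corresponds to observing only the state at a single time per individual, not the path, which is the source of the underspecification discussed above.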

We generalize the information bottleneck (IB) and privacy funnel (PF) problems by introducing the notion of a sensitive attribute, which arises in a growing number of applications. In this generalization, we seek to construct representations of observations that are maximally (or minimally) informative about a target variable, while also satisfying constraints with respect to a variable corresponding to the sensitive attribute. In the Gaussian and discrete settings, we show that by suitably approximating the Kullback-Leibler (KL) divergence defining traditional Shannon mutual information, the generalized IB and PF problems can be formulated as semidefinite programs (SDPs), and thus efficiently solved, which is important in applications of high-dimensional inference. We validate our algorithms on synthetic data and demonstrate their use in imposing fairness in machine learning on real data as an illustrative application.
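For reference, the Shannon mutual information that the IB and PF objectives are built from reduces, in the discrete setting, to the KL divergence between the joint pmf and the product of its marginals; a small self-contained sketch:

```python
import math

def mutual_information(joint):
    # Shannon mutual information I(X;Y) in nats from a discrete joint pmf
    # joint[x][y]: the KL divergence between the joint distribution and
    # the product of its marginals.
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )
```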

A modal decomposition is a useful tool that deconstructs the statistical dependence between two random variables by decomposing their joint distribution into orthogonal modes. Historically, modal decompositions have played important roles in statistics and information theory, e.g., in the study of maximal correlation. They are defined using the singular value decompositions of divergence transition matrices (DTMs) and conditional expectation operators corresponding to joint distributions. In this paper, we first characterize the set of all DTMs, and illustrate how the associated conditional expectation operators are the only weak contractions among a class of natural candidates. While modal decompositions have several modern machine learning applications, such as feature extraction from categorical data, the sample complexity of estimating them in such scenarios has not been analyzed. Hence, we also establish some non-asymptotic sample complexity results for the problem of estimating dominant modes of an unknown joint distribution from training data.
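To make the DTM concrete: for a joint pmf P(x, y), the DTM has entries B[x][y] = P(x, y) / sqrt(P(x) P(y)), and its largest singular value is always 1. In the special 2x2 case, the product of the two singular values equals |det B|, so the second singular value, i.e. the maximal correlation, is simply |det B|; the shortcut below avoids a full SVD (an illustrative sketch, not the paper's estimator):

```python
import math

def maximal_correlation_2x2(joint):
    # Divergence transition matrix B[x][y] = P(x, y) / sqrt(P(x) P(y)).
    # Its top singular value is 1, and for a 2x2 matrix the product of the
    # singular values equals |det B|, so the second singular value (the
    # Hirschfeld-Gebelein-Renyi maximal correlation) is |det B|.
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    B = [[joint[x][y] / math.sqrt(px[x] * py[y]) for y in range(2)]
         for x in range(2)]
    return abs(B[0][0] * B[1][1] - B[0][1] * B[1][0])
```

For larger alphabets, estimating the dominant modes requires an actual (truncated) SVD of the DTM, which is the setting whose sample complexity the paper analyzes.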

Submodular function minimization is well studied, and existing algorithms solve it exactly or up to arbitrary accuracy. However, in many applications, such as structured sparse learning or batch Bayesian optimization, the objective function is not exactly submodular, but close. In this case, no theoretical guarantees exist. Indeed, submodular minimization algorithms rely on intricate connections between submodularity and convexity. We show how these relations can be extended to obtain approximation guarantees for minimizing non-submodular functions, characterized by how close the function is to submodular. We also extend this result to noisy function evaluations. Our approximation results are the first for minimizing non-submodular functions, and are optimal, as established by our matching lower bound.
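The connection to convexity mentioned above runs through the Lovász extension, which extends a set function f (with f(∅) = 0) from {0,1}^n to [0,1]^n and is convex exactly when f is submodular; a minimal sketch of its evaluation:

```python
def lovasz_extension(f, x):
    # Lovász extension of a set function f at a point x in [0,1]^n:
    # sort coordinates in decreasing order and telescope f over the
    # resulting chain of sets. f takes a frozenset of indices.
    n = len(x)
    order = sorted(range(n), key=lambda i: -x[i])
    value = 0.0
    prev = f(frozenset())
    chain = set()
    for i in order:
        chain.add(i)
        cur = f(frozenset(chain))
        value += x[i] * (cur - prev)
        prev = cur
    return value
```

On indicator vectors of sets, the extension agrees with f itself, so minimizing f reduces to minimizing its (convex, when f is submodular) extension over the unit cube.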

Many direction-of-arrival (DOA) estimation algorithms require accurate measurements from all sensing elements on an antenna array. However, in various practical settings, it becomes imperative to perform DOA estimation even in the presence of faulty elements. In this work, we develop an algorithm that can jointly estimate the DOA of sources and the locations of the faulty elements. This is achieved by introducing weights that describe the degree of outlierness of each element. Further, for situations where only single snapshots are available, we propose a new snapshot diversity formulation for which our algorithm can still be applied. Simulation results over four different fault models demonstrate that the proposed algorithm robustly estimates DOAs and accurately identifies the faulty elements.
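As a non-robust baseline (a toy sketch of classical DOA estimation, not the proposed algorithm), a delay-and-sum beamformer scans steering vectors for a uniform linear array over a grid of angles; the half-wavelength spacing and the 1° grid are my own assumptions:

```python
import cmath
import math

def doa_delay_and_sum(snapshots, n_elements, spacing=0.5, grid=181):
    # Delay-and-sum beamformer for a uniform linear array (element
    # spacing in wavelengths). Scans angles from -90 to +90 degrees and
    # returns the angle with the largest average output power over the
    # given complex snapshots.
    best_angle, best_power = None, -1.0
    for g in range(grid):
        theta = math.radians(g - 90)
        steer = [cmath.exp(-2j * math.pi * spacing * m * math.sin(theta))
                 for m in range(n_elements)]
        power = 0.0
        for snap in snapshots:
            y = sum(s.conjugate() * v for s, v in zip(steer, snap))
            power += abs(y) ** 2
        if power > best_power:
            best_angle, best_power = theta, power
    return math.degrees(best_angle)
```

A faulty element corrupts every term of the beamformer sum equally, which is why an outlier-weighting scheme like the one described above is needed for robustness.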

Submodular functions have applications throughout machine learning, but in many settings, we do not have direct access to the underlying function f. We focus on stochastic functions that are given as an expectation of functions over a distribution P. In practice, we often have only a limited set of samples f_i from P. The standard approach indirectly optimizes f by maximizing the sum of the f_i. However, this ignores generalization to the true (unknown) distribution. In this paper, we achieve better performance on the actual underlying function f by directly optimizing a combination of bias and variance. Algorithmically, we accomplish this by showing how to carry out distributionally robust optimization (DRO) for submodular functions, providing efficient algorithms backed by theoretical guarantees which leverage several novel contributions to the general theory of DRO. We also show compelling empirical evidence that DRO improves generalization to the unknown stochastic submodular function.
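The standard approach criticized above, maximizing the empirical average of the sampled functions, is typically carried out with the classic greedy algorithm for monotone submodular maximization; a minimal coverage-function sketch (an illustrative baseline, not the paper's DRO algorithm):

```python
def greedy_max_coverage(sets, k):
    # Classic greedy for monotone submodular maximization under a
    # cardinality constraint, shown here for the coverage function
    # f(S) = |union of chosen sets|; achieves a (1 - 1/e) approximation.
    chosen, covered = [], set()
    for _ in range(k):
        best = max(range(len(sets)),
                   key=lambda i: len(sets[i] - covered))
        if len(sets[best] - covered) == 0:
            break  # no remaining marginal gain
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

A DRO variant would instead optimize against a worst-case reweighting of the sampled functions, trading some average-case value for robustness to sampling error.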