Abstract The $p$tensor Ising model is a oneparameter discrete exponential family for modeling dependent binary data, where the sufficient statistic is a multilinear form of degree $p \geqslant 2$. This is a natural generalization of the matrix Ising model that provides a convenient mathematical framework for capturing, not just pairwise, but higherorder dependencies in complex relational data. In this paper, we consider the problem of estimating the natural parameter of the $p$tensor Ising model given a single sample from the distribution on $N$ nodes. Our estimate is based on the maximum pseudolikelihood (MPL) method, which provides a computationally efficient algorithm for estimating the parameter that avoids computing the intractable partition function. We derive general conditions under which the MPL estimate is $\sqrt N$consistent, that is, it converges to the true parameter at rate $1/\sqrt N$. Our conditions are robust enough to handle a variety of commonly used tensor Ising models, including spin glass models with random interactions and models where the rate of estimation undergoes a phase transition. In particular, this includes results on $\sqrt N$consistency of the MPL estimate in the wellknown $p$spin Sherrington–Kirkpatrick model, spin systems on general $p$uniform hypergraphs and Ising models on the hypergraph stochastic block model (HSBM). In fact, for the HSBM we pin down the exact location of the phase transition threshold, which is determined by the positivity of a certain meanfield variational problem, such that above this threshold the MPL estimate is $\sqrt N$consistent, whereas below the threshold no estimator is consistent. Finally, we derive the precise fluctuations of the MPL estimate in the special case of the $p$tensor Curie–Weiss model, which is the Ising model on the complete $p$uniform hypergraph. An interesting consequence of our results is that the MPL estimate in the Curie–Weiss model saturates the Cramer–Rao lower bound at all points above the estimation threshold, that is, the MPL estimate incurs no loss in asymptotic statistical efficiency in the estimability regime, even though it is obtained by minimizing only an approximation of the true likelihood function for computational tractability.
more »
« less
Distributionally Robust Parametric Maximum Likelihood Estimation
We consider the parameter estimation problem of a probabilistic generative model prescribed using a natural exponential family of distributions. For this problem, the typical maximum likelihood estimator usually overfits under limited training sample size, is sensitive to noise and may perform poorly on downstream predictive tasks. To mitigate these issues, we propose a distributionally robust maximum likelihood estimator that minimizes the worstcase expected logloss uniformly over a parametric KullbackLeibler ball around a parametric nominal distribution. Leveraging the analytical expression of the KullbackLeibler divergence between two distributions in the same natural exponential family, we show that the minmax estimation problem is tractable in a broad setting, including the robust training of generalized linear models. Our novel robust estimator also enjoys statistical consistency and delivers promising empirical results in both regression and classification tasks.
more »
« less
 Award ID(s):
 1915967
 NSFPAR ID:
 10285215
 Date Published:
 Journal Name:
 Advances in neural information processing systems
 Volume:
 33
 ISSN:
 10495258
 Page Range / eLocation ID:
 79227932
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. We establish nonasymptotic absolute error bounds for a neural estimator realized by a shallow NN, focusing on four popular 𝖿divergencesKullbackLeibler, chisquared, squared Hellinger, and total variation. Our analysis relies on nonasymptotic function approximation theorems and tools from empirical process theory to bound the two sources of error involved: function approximation and empirical estimation. The bounds characterize the effective error in terms of NN size and the number of samples, and reveal scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growthrate are minimax rateoptimal, achieving the parametric convergence rate.more » « less

null (Ed.)Transformationbased methods have been an attractive approach in nonparametric inference for problems such as unconditional and conditional density estimation due to their unique hierarchical structure that models the data as flexible transformation of a set of common latent variables. More recently, transformationbased models have been used in variational inference (VI) to construct flexible implicit families of variational distributions. However, their use in both nonparametric inference and variational inference lacks theoretical justification. We provide theoretical justification for the use of nonlinear latent variable models (NLLVMs) in nonparametric inference by showing that the support of the transformation induced prior in the space of densities is sufficiently large in the L1 sense. We also show that, when a Gaussian process (GP) prior is placed on the transformation function, the posterior concentrates at the optimal rate up to a logarithmic factor. Adopting the flexibility demonstrated in the nonparametric setting, we use the NLLVM to construct an implicit family of variational distributions, deemed GPIVI. We delineate sufficient conditions under which GPIVI achieves optimal risk bounds and approximates the true posterior in the sense of the Kullback–Leibler divergence. To the best of our knowledge, this is the first work on providing theoretical guarantees for implicit variational inference.more » « less

We consider the question of learning the natural parameters of a k parameter \textit{minimal} exponential family from i.i.d. samples in a computationally and statistically efficient manner. We focus on the setting where the support as well as the natural parameters are appropriately bounded. While the traditional maximum likelihood estimator for this class of exponential family is consistent, asymptotically normal, and asymptotically efficient, evaluating it is computationally hard. In this work, we propose a computationally efficient estimator that is consistent as well as asymptotically normal under mild conditions. We provide finite sample guarantees to achieve an l2 error of α in the parameter estimation with sample complexity O(poly(k/α)) and computational complexity O(poly(k/α)). To establish these results, we show that, at the population level, our method can be viewed as the maximum likelihood estimation of a reparameterized distribution belonging to the same class of exponential family.more » « less

Meila, Marina and (Ed.)Least squares estimators, when trained on a few target domain samples, may predict poorly. Supervised domain adaptation aims to improve the predictive accuracy by exploiting additional labeled training samples from a source distribution that is close to the target distribution. Given available data, we investigate novel strategies to synthesize a family of least squares estimator experts that are robust with regard to moment conditions. When these moment conditions are specified using KullbackLeibler or Wassersteintype divergences, we can find the robust estimators efficiently using convex optimization. We use the Bernstein online aggregation algorithm on the proposed family of robust experts to generate predictions for the sequential stream of target test samples. Numerical experiments on real data show that the robust strategies systematically outperform nonrobust interpolations of the empirical least squares estimators.more » « less