Abstract The $p$tensor Ising model is a oneparameter discrete exponential family for modeling dependent binary data, where the sufficient statistic is a multilinear form of degree $p \geqslant 2$. This is a natural generalization of the matrix Ising model that provides a convenient mathematical framework for capturing, not just pairwise, but higherorder dependencies in complex relational data. In this paper, we consider the problem of estimating the natural parameter of the $p$tensor Ising model given a single sample from the distribution on $N$ nodes. Our estimate is based on the maximum pseudolikelihood (MPL) method, which provides a computationally efficient algorithm for estimating the parameter that avoids computing the intractable partition function. We derive general conditions under which the MPL estimate is $\sqrt N$consistent, that is, it converges to the true parameter at rate $1/\sqrt N$. Our conditions are robust enough to handle a variety of commonly used tensor Ising models, including spin glass models with random interactions and models where the rate of estimation undergoes a phase transition. In particular, this includes results on $\sqrt N$consistency of the MPL estimate in the wellknown $p$spin Sherrington–Kirkpatrick model, spin systems on general $p$uniform hypergraphs and Ising models on the hypergraph stochastic block model (HSBM). In fact, for the HSBM we pin down the exact location of the phase transition threshold, which is determined by the positivity of a certain meanfield variational problem, such that above this threshold the MPL estimate is $\sqrt N$consistent, whereas below the threshold no estimator is consistent. Finally, we derive the precise fluctuations of the MPL estimate in the special case of the $p$tensor Curie–Weiss model, which is the Ising model on the complete $p$uniform hypergraph. An interesting consequence of our results is that the MPL estimate in the Curie–Weiss model saturates the Cramer–Rao lower bound at all points above the estimation threshold, that is, the MPL estimate incurs no loss in asymptotic statistical efficiency in the estimability regime, even though it is obtained by minimizing only an approximation of the true likelihood function for computational tractability.
more »
« less
A Computationally Efficient Method for Learning Exponential Family Distributions
We consider the question of learning the natural parameters of a k
parameter \textit{minimal} exponential family from i.i.d. samples in a computationally and statistically efficient manner. We focus on the setting where the support as well as the natural parameters are appropriately bounded. While the traditional maximum likelihood estimator for this class of exponential family is consistent, asymptotically normal, and asymptotically efficient, evaluating it is computationally hard. In this work, we propose a computationally efficient estimator that is consistent as well as asymptotically normal under mild conditions. We provide finite sample guarantees to achieve an l2 error of α in the parameter estimation with sample complexity
O(poly(k/α)) and computational complexity O(poly(k/α)). To establish these results, we show that, at the population level, our method can be viewed as the maximum likelihood estimation of a reparameterized distribution belonging to the same class of exponential family.
more »
« less
 Award ID(s):
 1816209
 NSFPAR ID:
 10374104
 Date Published:
 Journal Name:
 Advances in neural information processing systems
 ISSN:
 10495258
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We provide a computationally and statistically efficient estimator for the classical problem of truncated linear regression, where the dependent variabley=wTx+εand its corresponding vector ofcovariatesx∈Rkare only revealed if the dependent variable falls in some subsetS⊆R; otherwisethe existence of the pair(x,y)is hidden. This problem has remained a challenge since the earlyworks of Tobin (1958); Amemiya (1973); Hausman and Wise (1977); Breen et al. (1996), its applications are abundant, and its history dates back even further to the work of Galton, Pearson, Lee,and Fisher Galton (1897); Pearson and Lee (1908); Lee (1914); Fisher (1931). While consistent estimators of the regression coefficients have been identified, the error rates are not wellunderstood,especially in highdimensional settings.Under a “thickness assumption” about the covariance matrix of the covariates in the revealed sample, we provide a computationally efficient estimator for the coefficient vectorwfromnrevealed samples that attains`2errorO(√k/n), recovering the guarantees of least squares in thestandard (untruncated) linear regression setting. Our estimator uses Projected Stochastic Gradient Descent (PSGD) on the negative loglikelihood of the truncated sample, and only needs oracle access to the setS, which may otherwise be arbitrary, and in particular may be nonconvex.PSGD must be restricted to an appropriately defined convex cone to guarantee that the negativeloglikelihood is strongly convex, which in turn is established using concentration of matrices onvariables with subexponential tails. We perform experiments on simulated data to illustrate the accuracy of our estimator.As a corollary of our work, we show that SGD provably learns the parameters of singlelayerneural networks with noisy Relu activation functions Nair and Hinton (2010); Bengio et al. (2013);Gulcehre et al. (2016), given linearly many, in the number of network parameters, inputoutputpairs in the realizable setting.more » « less

null (Ed.)We consider the parameter estimation problem of a probabilistic generative model prescribed using a natural exponential family of distributions. For this problem, the typical maximum likelihood estimator usually overfits under limited training sample size, is sensitive to noise and may perform poorly on downstream predictive tasks. To mitigate these issues, we propose a distributionally robust maximum likelihood estimator that minimizes the worstcase expected logloss uniformly over a parametric KullbackLeibler ball around a parametric nominal distribution. Leveraging the analytical expression of the KullbackLeibler divergence between two distributions in the same natural exponential family, we show that the minmax estimation problem is tractable in a broad setting, including the robust training of generalized linear models. Our novel robust estimator also enjoys statistical consistency and delivers promising empirical results in both regression and classification tasks.more » « less

We provide a computationally and statistically efficient estimator for the classical problem of truncated linear regression, where the dependent variabley=wTx+εand its corresponding vector ofcovariatesx∈Rkare only revealed if the dependent variable falls in some subsetS⊆R; otherwisethe existence of the pair(x,y)is hidden. This problem has remained a challenge since the earlyworks of Tobin (1958); Amemiya (1973); Hausman and Wise (1977); Breen et al. (1996), its applications are abundant, and its history dates back even further to the work of Galton, Pearson, Lee,and Fisher Galton (1897); Pearson and Lee (1908); Lee (1914); Fisher (1931). While consistent estimators of the regression coefficients have been identified, the error rates are not wellunderstood,especially in highdimensional settings.Under a “thickness assumption” about the covariance matrix of the covariates in the revealedsample, we provide a computationally efficient estimator for the coefficient vectorwfromnrevealed samples that attains`2errorO(√k/n), recovering the guarantees of least squares in thestandard (untruncated) linear regression setting. Our estimator uses Projected Stochastic Gradient Descent (PSGD) on the negative loglikelihood of the truncated sample, and only needs oracle access to the setS, which may otherwise be arbitrary, and in particular may be nonconvex.PSGD must be restricted to an appropriately defined convex cone to guarantee that the negativeloglikelihood is strongly convex, which in turn is established using concentration of matrices onvariables with subexponential tails. We perform experiments on simulated data to illustrate theaccuracy of our estimator.As a corollary of our work, we show that SGD provably learns the parameters of singlelayerneural networks with noisy Relu activation functions Nair and Hinton (2010); Bengio et al. (2013);Gulcehre et al. (2016), given linearly many, in the number of network parameters, inputoutputpairs in the realizable setting.more » « less

Deep generative models parametrized up to a normalizing constant (e.g. energybased models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method, whereby instead of fitting the likelihood for the training data, we instead fit the score function, obviating the need to evaluate the partition function. Though this estimator is known to be consistent, it's unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihoodwhich is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper and show a tight connection between statistical efficiency of score matching and the isoperimetric properties of the distribution being estimatedi.e. the Poincar\'e, logSobolev and isoperimetric constantquantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to the maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constanteven for simple families of distributions like exponential families with rich enough sufficient statisticsscore matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite sample regime, and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.more » « less