This content will become publicly available on December 10, 2024
 Award ID(s):
 2238523
 NSFPAR ID:
 10489625
 Publisher / Repository:
 Conference on Neural Information Processing Systems (NeurIPS), 2023
 Date Published:
 Subject(s) / Keyword(s):
 ["theory","score matching","exponential families","sample complexity","computational hardness"]
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

Deep generative models parametrized up to a normalizing constant (e.g. energybased models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method, whereby instead of fitting the likelihood for the training data, we instead fit the score function, obviating the need to evaluate the partition function. Though this estimator is known to be consistent, it's unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihoodwhich is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper and show a tight connection between statistical efficiency of score matching and the isoperimetric properties of the distribution being estimatedi.e. the Poincar\'e, logSobolev and isoperimetric constantquantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to the maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constanteven for simple families of distributions like exponential families with rich enough sufficient statisticsscore matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite sample regime, and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.more » « less

Abstract The $p$tensor Ising model is a oneparameter discrete exponential family for modeling dependent binary data, where the sufficient statistic is a multilinear form of degree $p \geqslant 2$. This is a natural generalization of the matrix Ising model that provides a convenient mathematical framework for capturing, not just pairwise, but higherorder dependencies in complex relational data. In this paper, we consider the problem of estimating the natural parameter of the $p$tensor Ising model given a single sample from the distribution on $N$ nodes. Our estimate is based on the maximum pseudolikelihood (MPL) method, which provides a computationally efficient algorithm for estimating the parameter that avoids computing the intractable partition function. We derive general conditions under which the MPL estimate is $\sqrt N$consistent, that is, it converges to the true parameter at rate $1/\sqrt N$. Our conditions are robust enough to handle a variety of commonly used tensor Ising models, including spin glass models with random interactions and models where the rate of estimation undergoes a phase transition. In particular, this includes results on $\sqrt N$consistency of the MPL estimate in the wellknown $p$spin Sherrington–Kirkpatrick model, spin systems on general $p$uniform hypergraphs and Ising models on the hypergraph stochastic block model (HSBM). In fact, for the HSBM we pin down the exact location of the phase transition threshold, which is determined by the positivity of a certain meanfield variational problem, such that above this threshold the MPL estimate is $\sqrt N$consistent, whereas below the threshold no estimator is consistent. Finally, we derive the precise fluctuations of the MPL estimate in the special case of the $p$tensor Curie–Weiss model, which is the Ising model on the complete $p$uniform hypergraph. An interesting consequence of our results is that the MPL estimate in the Curie–Weiss model saturates the Cramer–Rao lower bound at all points above the estimation threshold, that is, the MPL estimate incurs no loss in asymptotic statistical efficiency in the estimability regime, even though it is obtained by minimizing only an approximation of the true likelihood function for computational tractability.more » « less

Noise Contrastive Estimation (NCE) is a popular approach for learning probability density functions parameterized up to a constant of proportionality. The main idea is to design a classification problem for distinguishing training data from samples from an easytosample noise distribution q, in a manner that avoids having to calculate a partition function. It is wellknown that the choice of q can severely impact the computational and statistical efficiency of NCE. In practice, a common choice for q is a Gaussian which matches the mean and covariance of the data. In this paper, we show that such a choice can result in an exponentially bad (in the ambient dimension) conditioning of the Hessian of the loss, even for very simple data distributions. As a consequence, both the statistical and algorithmic complexity for such a choice of q will be problematic in practice, suggesting that more complex and tailored noise distributions are essential to the success of NCE.more » « less

Finding an approximate secondorder stationary point (SOSP) is a wellstudied and fundamental problem in stochastic nonconvex optimization with many applications in machine learning. However, this problem is poorly understood in the presence of outliers, limiting the use of existing nonconvex algorithms in adversarial settings. In this paper, we study the problem of finding SOSPs in the strong contamination model, where a constant fraction of datapoints are arbitrarily corrupted. We introduce a general framework for efficiently finding an approximate SOSP with dimensionindependent accuracy guarantees, using $\widetilde{O}({D^2}/{\epsilon})$ samples where $D$ is the ambient dimension and $\epsilon$ is the fraction of corrupted datapoints. As a concrete application of our framework, we apply it to the problem of low rank matrix sensing, developing efficient and provably robust algorithms that can tolerate corruptions in both the sensing matrices and the measurements. In addition, we establish a Statistical Query lower bound providing evidence that the quadratic dependence on $D$ in the sample complexity is necessary for computationally efficient algorithms.more » « less

Abstract Meta‐analysis is a statistical methodology for combining information from diverse sources so that a more reliable and efficient conclusion can be reached. It can be conducted by either synthesizing study‐level summary statistics or drawing inference from an overarching model for individual participant data (IPD) if available. The latter is often viewed as the “gold standard.” For random‐effects models, however, it remains not fully understood whether the use of IPD indeed gains efficiency over summary statistics. In this paper, we examine the relative efficiency of the two methods under a general likelihood inference setting. We show theoretically and numerically that summary‐statistics‐based analysis is
at most as efficient as IPD analysis, provided that the random effects follow the Gaussian distribution, and maximum likelihood estimation is used to obtain summary statistics. More specifically, (i) the two methods are equivalent in an asymptotic sense; and (ii) summary‐statistics‐based inference can incur an appreciable loss of efficiency if the sample sizes are not sufficiently large. Our results are established under the assumption that the between‐study heterogeneity parameter remains constant regardless of the sample sizes, which is different from a previous study. Our findings are confirmed by the analyses of simulated data sets and a real‐world study of alcohol interventions.