skip to main content

Title: GLAD: Learning sparse graph recovery
Recovering sparse conditional independence graphs from data is a fundamental problem in machine learning with wide applications. A popular formulation of the problem is an L1 regularized maximum likelihood estimation. Many convex optimization algorithms have been designed to solve this formulation to recover the graph structure. Recently, there is a surge of interest to learn algorithms directly based on data, and in this case, learn to map empirical covariance to the sparse precision matrix. However, it is a challenging task in this case, since the symmetric positive definiteness (SPD) and sparsity of the matrix are not easy to enforce in learned algorithms, and a direct mapping from data to precision matrix may contain many parameters. We propose a deep learning architecture, GLAD, which uses an Alternating Minimization (AM) algorithm as our model inductive bias, and learns the model parameters via supervised learning. We show that GLAD learns a very compact and effective model for recovering sparse graphs from data.
; ; ; ; ; ;
Award ID(s):
Publication Date:
Journal Name:
International Conference on Learning Representations (ICLR)
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the problem of accurately recovering a matrix B of size M by M, which represents a probability distribution over M^2 outcomes, given access to an observed matrix of "counts" generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information theoretically optimal reconstruction algorithms? When can accurate reconstruction be accomplished in the sparse data regime? This basic problem lies at the core of a number of questions that are currently being considered by different communities, including building recommendation systems and collaborative filtering in the sparse data regime, community detection in sparse random graphs, learning structured models such as topic models or hidden Markov models, and the efforts from the natural language processing community to compute "word embeddings". Many aspects of this problem---both in terms of learning and property testing/estimation and on both the algorithmic and information theoretic sides---remain open. Our results apply to the setting where B has a low rank structure. For this setting, we propose an efficient (and practically viable) algorithm that accurately recovers the underlying M by M matrix using O(M) samples} (where we assume the rank is a constant).more »This linear sample complexity is optimal, up to constant factors, in an extremely strong sense: even testing basic properties of the underlying matrix (such as whether it has rank 1 or 2) requires Omega(M) samples. Additionally, we provide an even stronger lower bound showing that distinguishing whether a sequence of observations were drawn from the uniform distribution over M observations versus being generated by a well-conditioned Hidden Markov Model with two hidden states requires Omega(M) observations, while our positive results for recovering B immediately imply that Omega(M) observations suffice to learn such an HMM. This lower bound precludes sublinear-sample hypothesis tests for basic properties, such as identity or uniformity, as well as sublinear sample estimators for quantities such as the entropy rate of HMMs.« less
  2. Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions. All of these transforms can be represented by dense matrix-vector multiplication, yet each has a specialized and highly efficient (subquadratic) algorithm. We ask to what extent hand-crafting these algorithms and implementations is necessary, what structural prior they encode, and how much knowledge is required to automatically learn a fast algorithm for a provided structured transform. Motivated by a characterization of fast matrix-vector multiplication as products of sparse matrices, we introduce a parameterization of divide-and-conquer methods that is capable of representing a large class of transforms. This generic formulation can automatically learn an efficient algorithm for many important transforms; for example, it recovers the O(N logN) Cooley-Tukey FFT algorithm to machine precision, for dimensions N up to 1024. Furthermore, our method can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations. On a standard task of compressing a single hidden-layer network, our method exceeds the classification accuracy of unconstrained matrices on CIFAR-10 by 3.9 points—the first time a structured approach has done so—with 4X faster inference speed andmore »40X fewer parameters.« less
  3. Many program-analysis problems can be formulated as graph-reachability problems. Interleaved Dyck language reachability ( InterDyck -reachability) is a fundamental framework to express a wide variety of program-analysis problems over edge-labeled graphs. The InterDyck language represents an intersection of multiple matched-parenthesis languages (i.e., Dyck languages). In practice, program analyses typically leverage one Dyck language to achieve context-sensitivity, and other Dyck languages to model data dependencies, such as field-sensitivity and pointer references/dereferences. In the ideal case, an InterDyck -reachability framework should model multiple Dyck languages simultaneously . Unfortunately, precise InterDyck -reachability is undecidable. Any practical solution must over-approximate the exact answer. In the literature, a lot of work has been proposed to over-approximate the InterDyck -reachability formulation. This article offers a new perspective on improving both the precision and the scalability of InterDyck -reachability: we aim at simplifying the underlying input graph G . Our key insight is based on the observation that if an edge is not contributing to any InterDyck -paths, we can safely eliminate it from G . Our technique is orthogonal to the InterDyck -reachability formulation and can serve as a pre-processing step with any over-approximating approach for InterDyck -reachability. We have applied our graph simplification algorithm tomore »pre-processing the graphs from a recent InterDyck -reachability-based taint analysis for Android. Our evaluation of three popular InterDyck -reachability algorithms yields promising results. In particular, our graph-simplification method improves both the scalability and precision of all three InterDyck -reachability algorithms, sometimes dramatically.« less
  4. Many real-world applications involve longitudinal data, consisting of observations of several variables, where different subsets of variables are sampled at irregularly spaced time points. We introduce the Longitudinal Gaussian Process Latent Variable Model (L-GPLVM), a variant of the Gaussian Process Latent Variable Model, for learning compact representations of such data. L-GPLVM overcomes a key limitation of the Dynamic Gaussian Process Latent Variable Model and its variants, which rely on the assumption that the data are fully observed over all of the sampled time points. We describe an effective approach to learning the parameters of L-GPLVM from sparse observations, by coupling the dynamical model with a Multitask Gaussian Process model for sampling of the missing observations at each step of the gradient-based optimization of the variational lower bound. We further show the advantage of the Sparse Process Convolution framework to learn the latent representation of sparsely and irregularly sampled longitudinal data with minimal computational overhead relative to a standard Latent Variable Model. We demonstrated experiments with synthetic data as well as variants of MOCAP data with varying degrees of sparsity of observations that show that L-GPLVM substantially and consistently outperforms the state-of-the-art alternatives in recovering the missing observations even when themore »available data exhibits a high degree of sparsity. The compact representations of irregularly sampled and sparse longitudinal data can be used to perform a variety of machine learning tasks, including clustering, classification, and regression.« less
  5. Mixture-of-Experts (MoE) is a widely popular model for ensemble learning and is a basic building block of highly successful modern neural networks as well as a component in Gated Recurrent Units (GRU) and Attention networks. However, present algorithms for learning MoE, including the EM algorithm and gradient descent, are known to get stuck in local optima. From a theoretical viewpoint, finding an efficient and provably consistent algorithm to learn the parameters remains a long standing open problem for more than two decades. In this paper, we introduce the first algorithm that learns the true parameters of a MoE model for a wide class of non-linearities with global consistency guarantees. While existing algorithms jointly or iteratively estimate the expert parameters and the gating parameters in the MoE, we propose a novel algorithm that breaks the deadlock and can directly estimate the expert parameters by sensing its echo in a carefully designed cross-moment tensor between the inputs and the output. Once the experts are known, the recovery of gating parameters still requires an EM algorithm; however, we show that the EM algorithm for this simplified problem, unlike the joint EM algorithm, converges to the true parameters. We empirically validate our algorithm onmore »both the synthetic and real data sets in a variety of settings, and show superior performance to standard baselines.« less