Using a subset of observed network links, high-order link prediction (HOLP) infers missing hyperedges, that is, links connecting three or more nodes. HOLP emerges in several applications, but existing approaches have not dealt with the associated predictor’s performance. To overcome this limitation, the present contribution develops a Bayesian approach and the relevant predictive distributions that quantify model uncertainty. Gaussian processes model the dependence of each node on the remaining nodes. These nonparametric models yield predictive distributions, which are fused across nodes by means of a pseudo-likelihood-based criterion. Performance is quantified by proper measures of dispersion, which are associated with the predictive distributions. Tests on benchmark datasets demonstrate the benefits of the novel approach.
Higher-Order Link Prediction Via Learnable Maximum Mean Discrepancy
Higher-order link prediction (HOLP) seeks missing links capturing dependencies among three or more network nodes. Predicting high-order links (HOLs) can, for instance, reveal hyperlinks in the structure of drug substance and metabolic networks. Existing methods either make restrictive assumptions regarding the emergence of HOLs or rely on reduced-dimensionality models of limited expressiveness. To overcome these limitations, the HOLP approach developed here leverages distribution similarities across embeddings as captured by a learnable probability metric. The intuition underpinning the novel approach is that sets of nodes whose embeddings are less similar in distribution are less likely to be connected by a HOL. Specifically, nonlinear dimensionality reduction is effected through a Gaussian process latent variable model that yields nodal embeddings and also learns a data-driven similarity function (kernel). This kernel forms the core of a maximum mean discrepancy probability metric. Tests on benchmark datasets illustrate the potential of the proposed approach.
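To make the intuition above concrete, the following is a minimal sketch of a maximum mean discrepancy (MMD) estimate between sets of embedding samples. It is not the authors' method: a fixed Gaussian (RBF) kernel stands in for the kernel learned by the Gaussian process latent variable model, each node is assumed to be represented by a small sample of embedding vectors, and the function `hol_score` (the negative average pairwise MMD over a candidate node set) is a hypothetical scoring rule introduced only for illustration.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def mmd2(x, y, lengthscale=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    kxx = rbf_kernel(x, x, lengthscale).mean()
    kyy = rbf_kernel(y, y, lengthscale).mean()
    kxy = rbf_kernel(x, y, lengthscale).mean()
    return kxx + kyy - 2.0 * kxy

def hol_score(node_samples, lengthscale=1.0):
    """Hypothetical score for a candidate node set: negative mean pairwise MMD^2.

    node_samples is a list with one (n_samples, dim) array per node.
    A higher score means the embeddings are more similar in distribution.
    """
    n = len(node_samples)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return -np.mean([mmd2(node_samples[i], node_samples[j], lengthscale)
                     for i, j in pairs])

# Synthetic data (an assumption, not from the paper): three nodes per candidate set.
rng = np.random.default_rng(0)
similar = [rng.normal(loc=0.0, size=(50, 2)) for _ in range(3)]
dissimilar = [rng.normal(loc=3.0 * m, size=(50, 2)) for m in range(3)]
score_similar = hol_score(similar)
score_dissimilar = hol_score(dissimilar)
```

Under this sketch, a candidate set whose nodes have similarly distributed embeddings receives a higher score than one whose embeddings are far apart, which mirrors the stated intuition without reproducing the learned metric.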
- PAR ID: 10424903
- Date Published:
- Journal Name: IEEE International Conference on Acoustics, Speech, and Signal Processing
- Page Range / eLocation ID: 1 to 5
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Finding the mode of a high-dimensional probability distribution $$\mathcal{D}$$ is a fundamental algorithmic problem in statistics and data analysis. There has been particular interest in efficient methods for solving the problem when $$\mathcal{D}$$ is represented as a mixture model or kernel density estimate, although few algorithmic results with worst-case approximation and runtime guarantees are known. In this work, we significantly generalize a result of Lee, Li, and Musco (2021) on mode approximation for Gaussian mixture models. We develop randomized dimensionality reduction methods for mixtures involving a broader class of kernels, including the popular logistic, sigmoid, and generalized Gaussian kernels. As in Lee et al.’s work, our dimensionality reduction results yield quasi-polynomial algorithms for mode finding with multiplicative accuracy $$(1-\epsilon)$$ for any $$\epsilon > 0$$. Moreover, when combined with gradient descent, they yield efficient practical heuristics for the problem. In addition to our positive results, we prove a hardness result for box kernels, showing that there is no polynomial-time algorithm for finding the mode of a kernel density estimate unless $$\mathit{P} = \mathit{NP}$$. Obtaining similar hardness results for kernels used in practice (like Gaussian or logistic kernels) is an interesting future direction.
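As a rough illustration of the gradient-based heuristic mentioned above, the sketch below runs the classical mean-shift iteration on a Gaussian kernel density estimate. Mean shift is a fixed-point form of gradient ascent on the density and is swapped in here for simplicity; the randomized dimensionality reduction step of the paper is not shown, and the function name, bandwidth, and synthetic data are assumptions made for this example.

```python
import numpy as np

def mean_shift_mode(data, start, bandwidth=1.0, iters=200, tol=1e-8):
    """Locally maximize a Gaussian KDE over `data` by mean-shift iterations.

    Each step moves the current point to the kernel-weighted average of the
    data, which never decreases the Gaussian KDE value. The result is a local
    mode, with no worst-case approximation guarantee.
    """
    data = np.asarray(data, dtype=float)
    x = np.asarray(start, dtype=float)
    for _ in range(iters):
        weights = np.exp(-np.sum((data - x) ** 2, axis=1) / (2.0 * bandwidth**2))
        new_x = (weights[:, None] * data).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_x - x) < tol:
            return new_x
        x = new_x
    return x

# Synthetic example (an assumption): a tight cluster around (2, 2).
rng = np.random.default_rng(1)
points = rng.normal(loc=2.0, scale=0.1, size=(100, 2))
# Starting from a data point keeps the kernel weights away from zero.
mode_estimate = mean_shift_mode(points, start=points[0])
```

In practice such a heuristic would typically be restarted from several data points, with the highest-density result kept.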
-
Metric embeddings traditionally study how to map n items to a target metric space such that distance lengths are not heavily distorted. However, what if we are only interested in preserving the relative order of the distances, rather than their exact lengths? In this paper, we explore the fundamental question: given triplet comparisons of the form “item i is closer to item j than to item k,” can we find low-dimensional Euclidean representations for the n items that respect those distance comparisons? Such order-preserving embeddings naturally arise in important applications (such as recommendations, ranking, crowdsourcing, psychometrics, and nearest-neighbor search) and have been studied since the 1950s under the name of ordinal or non-metric embeddings. Our main results include the following. Nearly-Tight Bounds on Triplet Dimension: We introduce the concept of triplet dimension of a dataset and show, surprisingly, that in order for an ordinal embedding to be triplet-preserving, its dimension needs to grow as n/2 in the worst case. This is nearly optimal, as n−1 dimensions always suffice. Tradeoffs for Dimension vs (Ordinal) Relaxation: We relax the requirement that every triplet must be exactly preserved and present almost tight lower bounds for the maximum ratio between distances whose relative order was inverted by the embedding. This ratio is known as (ordinal) relaxation in the literature and serves as a counterpart to (metric) distortion. New Bounds on Terminal and Top-k-NNs Embeddings: Moving beyond triplets, we study two well-motivated scenarios where we care about preserving specific sets of distances (not necessarily triplets). The first scenario is Terminal Ordinal Embeddings, where we aim to preserve relative distance orders to k given items (the “terminals”), and for that, we present matching upper and lower bounds. The second scenario is Top-k-NNs Ordinal Embeddings, where for each item we aim to preserve the relative order of its k nearest neighbors, for which we present lower bounds. To the best of our knowledge, these are some of the first tradeoffs on triplet-preserving ordinal embeddings and the first study of Terminal and Top-k-NNs Ordinal Embeddings.
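The notion of a triplet-preserving embedding can be checked directly. The sketch below counts how many comparisons of the form "item i is closer to item j than to item k" a given Euclidean embedding violates. The function name and the toy one-dimensional embedding are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def triplet_violations(points, triplets):
    """Count triplets (i, j, k), read as 'i is closer to j than to k',
    that are NOT satisfied by the Euclidean embedding `points`."""
    points = np.asarray(points, dtype=float)
    violated = 0
    for i, j, k in triplets:
        d_ij = np.linalg.norm(points[i] - points[j])
        d_ik = np.linalg.norm(points[i] - points[k])
        if not d_ij < d_ik:
            violated += 1
    return violated

# Toy example: three items placed on a line at positions 0, 1, and 3.
embedding = [[0.0], [1.0], [3.0]]
constraints = [(0, 1, 2), (2, 1, 0), (1, 2, 0)]
n_violated = triplet_violations(embedding, constraints)
```

Here the first two constraints hold, while the third does not, because item 1 lies at distance 2 from item 2 but only at distance 1 from item 0. An embedding is triplet-preserving for a constraint set exactly when this count is zero.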
-
Heterogeneous information networks (HINs), where nodes and their attributes denote real-world entities and links encode relationships between entities, are ubiquitous in many applications. The presence of multiple types of nodes and links poses significant challenges to state-of-the-art methods for learning node embeddings from heterogeneous graphs. To address these challenges, we consider three variants of graph variational autoencoder models for heterogeneous networks that avoid the computationally expensive sampling of meta-paths. The proposed methods also maintain uncertainty estimates of node embeddings that help improve generalization performance. We report the results of link prediction experiments on three real-world heterogeneous network benchmark data sets, which show that the proposed methods significantly outperform state-of-the-art baselines.
-
Measuring conditional dependence is an important task in statistical inference and is fundamental in causal discovery, feature selection, dimensionality reduction, Bayesian network learning, and other areas. In this work, we explore the connection between conditional dependence measures induced by distances on a metric space and reproducing kernels associated with a reproducing kernel Hilbert space (RKHS). For certain distance and kernel pairs, we show the distance-based conditional dependence measures to be equivalent to kernel-based measures. On the other hand, we also show that some popular kernel conditional dependence measures, based on the Hilbert-Schmidt norm of a certain cross-conditional covariance operator, do not have a simple distance representation, except in some limiting cases.