Ranking on largescale graphs plays a fundamental role in many highimpact application domains, ranging from information retrieval, recommender systems, sports team management, biology to neuroscience and many more. PageRank, together with many of its random walk based variants, has become one of the most wellknown and widely used algorithms, due to its mathematical elegance and the superior performance across a variety of application domains. Important as it might be, stateoftheart lacks an intuitive way to explain the ranking results by PageRank (or its variants), e.g., why it thinks the returned topk webpages are the most important ones in the entire graph; why it gives a higher rank to actor John than actor Smith in terms of their relevance w.r.t. a particular movie? In order to answer these questions, this paper proposes a paradigm shift for PageRank, from identifying which nodes are most important to understanding why the ranking algorithm gives a particular ranking result. We formally define the PageRank auditing problem, whose central idea is to identify a set of key graph elements (e.g., edges, nodes, subgraphs) with the highest influence on the ranking results. We formulate it as an optimization problem and propose a family of effective and scalable algorithms (Aurora) to solve it. Our algorithms measure the influence of graph elements and incrementally select influential elements w.r.t. their gradients over the ranking results. We perform extensive empirical evaluations on realworld datasets, which demonstrate that the proposed methods (Aurora) provide intuitive explanations with a linear scalability.
more »
« less
Fast and Scalable Complex Network Descriptor Using PageRank and Persistent Homology
The PageRank of a graph is a scalar function defined on the node set of the graph which encodes nodes centrality information of the graph. In this article we use the PageRank function along with persistent homology to obtain a scalable graph descriptor and utilize it to compare the similarities between graphs. For a given graph G(V, E), our descriptor can be computed in O(Eα(V)), where a is the inverse Ackermann function which makes it scalable and computable on massive graphs. We show the effectiveness of our method by utilizing it on multiple shape mesh datasets.
more »
« less
 Award ID(s):
 1845204
 NSFPAR ID:
 10250296
 Date Published:
 Journal Name:
 2020 International Conference on Intelligent Data Science Technologies and Applications (IDSTA)
 Page Range / eLocation ID:
 110 to 114
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Over the last two decades, frameworks for distributedmemory parallel computation, such as MapReduce, Hadoop, Spark and Dryad, have gained significant popularity with the growing prevalence of large network datasets. The Massively Parallel Computation (MPC) model is the defacto standard for studying graph algorithms in these frameworks theoretically. Subgraph counting is one such fundamental problem in analyzing massive graphs, with the main algorithmic challenges centering on designing methods which are both scalable and accurate. Given a graph G = (V, E) with n vertices, m edges and T triangles, our first result is an algorithm that outputs a (1+ε)approximation to T, with asymptotically optimal round and total space complexity provided any S ≥ max{(√ m, n²/m)} space per machine and assuming T = Ω(√{m/n}). Our result gives a quadratic improvement on the bound on T over previous works. We also provide a simple extension of our result to counting any subgraph of k size for constant k ≥ 1. Our second result is an O_δ(log log n)round algorithm for exactly counting the number of triangles, whose total space usage is parametrized by the arboricity α of the input graph. We extend this result to exactly counting kcliques for any constant k. Finally, we prove that a recent result of Bera, Pashanasangi and Seshadhri (ITCS 2020) for exactly counting all subgraphs of size at most 5 can be implemented in the MPC model in Õ_δ(√{log n}) rounds, O(n^δ) space per machine and O(mα³) total space. In addition to our theoretical results, we simulate our triangle counting algorithms in realworld graphs obtained from the Stanford Network Analysis Project (SNAP) database. Our results show that both our approximate and exact counting algorithms exhibit improvements in terms of round complexity and approximation ratio, respectively, compared to two previous widely used algorithms for these problems.more » « less

Topological Data Analysis (TDA) studies the “shape” of data. A common topological descriptor is the persistence diagram, which encodes topological features in a topological space at different scales. Turner, Mukherjee, and Boyer showed that one can reconstruct a simplicial complex embedded in R^3 using persistence diagrams generated from all possible height filtrations (an uncountably infinite number of directions). In this paper, we present an algorithm for reconstructing plane graphs K = (V, E) in R^2, i.e., a planar graph with vertices in general position and a straightline embedding, from a quadratic number height filtrations and their respective persistence diagrams.more » « less

Topological Data Analysis (TDA) studies the shape of data. A common topological descriptor is the persistence diagram, which encodes topological features in a topological space at different scales. Turner, Mukeherjee, and Boyer showed that one can reconstruct a simplicial complex embedded in R^3 using persistence diagrams generated from all possible height filtrations (an uncountably infinite number of directions). In this paper, we present an algorithm for reconstructing plane graphs K=(V,E) in R^2 , i.e., a planar graph with vertices in general position and a straightline embedding, from a quadratic number height filtrations and their respective persistence diagrams.more » « less

null (Ed.)Motivated by the increasing need to understand the distributed algorithmic foundations of largescale graph computations, we study some fundamental graph problems in a messagepassing model for distributed computing where k ≥ 2 machines jointly perform computations on graphs with n nodes (typically, n >> k). The input graph is assumed to be initially randomly partitioned among the k machines, a common implementation in many realworld systems. Communication is pointtopoint, and the goal is to minimize the number of communication rounds of the computation. Our main contribution is the General Lower Bound Theorem , a theorem that can be used to show nontrivial lower bounds on the round complexity of distributed largescale data computations. This result is established via an informationtheoretic approach that relates the round complexity to the minimal amount of information required by machines to solve the problem. Our approach is generic, and this theorem can be used in a “cookbook” fashion to show distributed lower bounds for several problems, including nongraph problems. We present two applications by showing (almost) tight lower bounds on the round complexity of two fundamental graph problems, namely, PageRank computation and triangle enumeration . These applications show that our approach can yield lower bounds for problems where the application of communication complexity techniques seems not obvious or gives weak bounds, including and especially under a stochastic partition of the input. We then present distributed algorithms for PageRank and triangle enumeration with a round complexity that (almost) matches the respective lower bounds; these algorithms exhibit a round complexity that scales superlinearly in k , improving significantly over previous results [Klauck et al., SODA 2015]. Specifically, we show the following results: PageRank: We show a lower bound of Ὼ(n/k 2 ) rounds and present a distributed algorithm that computes an approximation of the PageRank of all the nodes of a graph in Õ(n/k 2 ) rounds. Triangle enumeration: We show that there exist graphs with m edges where any distributed algorithm requires Ὼ(m/k 5/3 ) rounds. This result also implies the first nontrivial lower bound of Ὼ(n 1/3 ) rounds for the congested clique model, which is tight up to logarithmic factors. We then present a distributed algorithm that enumerates all the triangles of a graph in Õ(m/k 5/3 + n/k 4/3 ) rounds.more » « less