<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>V-Combiner: speeding-up iterative graph processing on a shared-memory platform with vertex merging</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>06/29/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10206306</idno>
					<idno type="doi">10.1145/3392717.3392739</idno>
					<title level='j'>34th International Conference on Supercomputing 2020</title>
					<author>Azin Heidarshenas</author><author>Serif Yesil</author><author>Dimitrios Skarlatos</author><author>Sasa Misailovic</author><author>Adam Morrison</author><author>Josep Torrellas</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[An iterative graph algorithm applies a vertex update operation to all vertices in a graph in every iteration. For large graphs, this computation is costly. However, in practice, not all the updates contribute equally to the end result and, in fact, an exact result may not be needed. In this work, we leverage these insights to speed up iterative graph algorithms. We propose a mechanism to identify the less important vertices and omit computations for them. Our scheme, called V-Combiner, is a deterministic, fast, and application-transparent technique to construct an approximate graph to enable faster execution. The main idea behind V-Combiner is to merge certain vertices into hubs, which are vertices that have many connections and contribute heavily to the end result of the algorithm. We also propose an inexpensive correction step to recover the contribution of the merged vertices to get higher accuracy. We evaluate V-Combiner on 4 different applications and 5 datasets. For 44-threaded runs, V-Combiner achieves an average end-to-end speedup of 1.25× over the conventional system, with an accuracy of 91.8%. It also shows a better performance-accuracy trade-off than the existing sparsification and k-core techniques.
]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Many graph processing applications used in the machine learning, social network, computational biology, and financial system domains are inherently iterative. Examples include Belief Propagation <ref type="bibr">[18,</ref><ref type="bibr">19]</ref>, Community Detection <ref type="bibr">[45]</ref>, Page Rank <ref type="bibr">[34]</ref>, and Hyperlink-Induced Topic Search <ref type="bibr">[22]</ref>. They typically have a property that gets propagated across vertices, and a metric for convergence to a solution, which determines their time complexity.</p><p>Iterative algorithms are inherently costly for large graphs. For example, running Page Rank on billion-vertex graphs can take minutes in a 24-core shared-memory system <ref type="bibr">[25]</ref>. In many environments, this latency is unacceptable. On the other hand, many applications can trade a small amount of accuracy for improved performance <ref type="bibr">[15,</ref><ref type="bibr">32]</ref>. For example, in Page Rank, the user is often interested in a very small subset of the computation results <ref type="bibr">[10,</ref><ref type="bibr">43]</ref> -almost all queries only require knowing the top-ranked pages. Further, vertex classification algorithms such as Community Detection and Belief Propagation can tolerate small errors as long as the final inferred vertex labels remain correct.</p><p>Providing high accuracy in approximate iterative graph computations is challenging, as the error introduced in one vertex can propagate to its neighbors (and eventually to all other reachable vertices), and accumulate across iterations. 
For this reason, approximations used in non-iterative graph algorithms such as Single-Source Shortest Path (SSSP) <ref type="bibr">[41]</ref> or Triangle Counting <ref type="bibr">[14]</ref> are not suitable for iterative graph algorithms.</p><p>Previous works for iterative graph algorithms <ref type="bibr">[12,</ref><ref type="bibr">25,</ref><ref type="bibr">33,</ref><ref type="bibr">38]</ref> apply program approximation techniques such as loop perforation <ref type="bibr">[39]</ref> and task skipping <ref type="bibr">[36]</ref> to the input graph <ref type="bibr">[25,</ref><ref type="bibr">33]</ref> or to the program <ref type="bibr">[12,</ref><ref type="bibr">38]</ref>. These techniques speed up the program. However, by dropping random vertices or connections in the graph, or skipping their computation as the program runs, these techniques introduce non-determinism, which makes debugging difficult and may cause variability in the outputs of the application.</p><p>Graph summarization techniques <ref type="bibr">[28]</ref> generate a simpler form of the graph for faster processing and/or more efficient storage. The most popular of these techniques include sketching <ref type="bibr">[1,</ref><ref type="bibr">7,</ref><ref type="bibr">30,</ref><ref type="bibr">37]</ref>, sparsification <ref type="bibr">[3,</ref><ref type="bibr">15]</ref>, and k-core decomposition <ref type="bibr">[13,</ref><ref type="bibr">21,</ref><ref type="bibr">35]</ref>. Sketching techniques summarize the graph information into a data structure that can be queried in the future to provide a fast approximate answer after a few simple computations. However, sketching techniques have a large overhead and, therefore, cannot be used for performance-intensive tasks. The sparsification and k-core decomposition techniques have a much smaller overhead. However, the resulting graphs have performance and accuracy limitations for certain applications. 
As we will see, these two techniques are unable to achieve both high speedup and high accuracy. Our Work. In this paper, we present V-Combiner, a general, deterministic technique for speeding up iterative graph-processing applications, while maintaining high-accuracy results. V-Combiner is motivated by the observation that not all vertices contribute equally in iterative graph processing algorithms. Therefore, the key technical challenge is identifying and removing the vertices with a small contribution, while still maintaining the graph properties that dictate the accuracy of the graph algorithm.</p><p>During pre-processing, V-Combiner merges some vertices into their neighboring hubs, which are vertices with many connections. It then executes the unmodified original algorithm on the reduced graph, speeding up execution. Finally, it corrects the final result of the program to account for the effect of the merged vertices.</p><p>V-Combiner has the following characteristics: &#8226; Application Transparency. Unlike previous work that requires changing the implementation of the graph algorithm <ref type="bibr">[12,</ref><ref type="bibr">38]</ref>, in V-Combiner the application is completely unaware of the merging step applied to the input graph. The application treats the approximate graph in the same way as the original graph. This allows the programmer to develop graph algorithms as before. &#8226; Low Overhead, High Performance, and High Accuracy. V-Combiner has simpler pre-processing and lower overhead than previous work such as k-core <ref type="bibr">[13,</ref><ref type="bibr">21,</ref><ref type="bibr">35]</ref>, GraphTuner <ref type="bibr">[33]</ref> and Input Reduction <ref type="bibr">[25]</ref>. V-Combiner speeds up the execution by creating a smaller approximate graph that needs fewer updates. It retains high accuracy by maintaining important graph properties and using a final correction step. &#8226; Deterministic Behavior. 
V-Combiner uses a deterministic merging mechanism that produces identical approximate graphs every time. All previous algorithms except k-core are non-deterministic <ref type="bibr">[12,</ref><ref type="bibr">15,</ref><ref type="bibr">25,</ref><ref type="bibr">33,</ref><ref type="bibr">38]</ref>. Compared to k-core, the V-Combiner algorithm maximizes the connectivity of the reduced graph.</p><p>Results. We evaluate V-Combiner on a 44-core shared-memory machine running four iterative applications: Belief Propagation, Community Detection, Hyperlink-Induced Topic Search, and Page Rank. We execute each application with 5 real-world datasets. Our main results are:</p><p>&#8226; Considering the algorithm time only, V-Combiner speeds up the computation by an average of 1.55&#215; over the exact baseline algorithm at an accuracy level above 90%, compared to average speedups of 1.46&#215; with sparsification and 1.36&#215; with k-core. &#8226; Considering the end-to-end execution time, which includes one-time overheads, the average speedup of V-Combiner over the exact baseline algorithm is 1.25&#215;, with an accuracy of 91.8%. The performance of V-Combiner is equal to that of sparsification and significantly better than that of k-core. Unlike sparsification and k-core, it can successfully meet the accuracy bound on all the benchmarks. &#8226; A trade-off exploration shows that V-Combiner can produce a better set of points in the performance-accuracy trade-off space than the other algorithms in the region above 90% accuracy. &#8226; V-Combiner obtains better load balancing, preserves the average length of the paths, produces deterministic results (unlike sparsification), and performs less work at high accuracy due to high connectivity (unlike k-core).</p><p>Contributions. The main contributions of this paper are:</p><p>&#8226; We analyze and compare existing techniques for approximating iterative graph algorithms. 
&#8226; We develop V-Combiner, a novel general approximation technique to improve the performance of iterative graph algorithms with acceptable accuracy losses. &#8226; We evaluate V-Combiner on a shared-memory machine and compare it to state-of-the-art approaches for approximate graph processing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">BACKGROUND</head><p>A graph G consists of a set of vertices (V) and a set of edges (E).</p><p>A graph can be directed or undirected. In directed graphs, an edge can be incoming or outgoing. In undirected graphs, edges do not have a direction. Therefore, each edge can be treated as two logical edges: an incoming and an outgoing one. We refer to the incoming edges of a vertex as in-edges, and to outgoing edges as out-edges.</p><p>The vertices at the other end of in-edges are the in-neighbors of a vertex. Similarly, the endpoints of out-edges are the out-neighbors.</p><p>The number of in-edges of a vertex is its in-degree, and the number of out-edges is its out-degree. For undirected graphs, the degree of a vertex is equal to the number of physical edges of the vertex, and is equal to the vertex's in-degree and to the vertex's out-degree. A path in the graph is a sequence of edges that connects a series of distinct vertices.</p></div>
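These definitions can be made concrete with a small sketch (a hypothetical edge-list representation in Python; the variable names are illustrative, not from the paper):

```python
from collections import defaultdict

# Hypothetical directed graph as a list of (src, dst) edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]

in_deg = defaultdict(int)        # in-degree per vertex
out_deg = defaultdict(int)       # out-degree per vertex
in_neighbors = defaultdict(set)  # in-neighbors per vertex

for src, dst in edges:
    out_deg[src] += 1
    in_deg[dst] += 1
    in_neighbors[dst].add(src)

# Vertex 2 has in-edges from vertices 0, 1, and 3 and one out-edge
# to vertex 0, so its in-degree is 3 and its out-degree is 1.
```

For an undirected graph, the same loop would simply count each physical edge at both endpoints, matching the definition above.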
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Iterative Graph Algorithms</head><p>Algorithm 1 shows the template of an iterative graph algorithm. The algorithm is composed of multiple iterations. In an iteration, the algorithm goes over all of the vertices in the graph (Line 3). For each vertex, it applies an update function, which gathers information from the neighbors of the vertex (Lines 5-7) -and potentially from the edges. The value of each neighbor vertex is first multiplied by W, which is a vertex-specific constant. For example, in Page Rank, W for neighbor vertex v is equal to 1/outDegree(v). The result is then aggregated, using an operator that varies across applications (e.g., addition or multiplication) (Line 7). After that, the update function refines the result stored for the vertex using C1 and C2 (Line 8). For Page Rank, C1 and C2 are commonly set to 0.85 and 0.15. In every iteration, a change variable (Line 9) accumulates the difference in the values of each vertex before and after the update function. The iterations stop when change is less than a threshold -i.e., the convergence criterion is met (Lines 10-11). The update function creates a notion of information propagation, which can be weights or probabilities. Information propagation generally follows the edges. In undirected graphs, the information propagates in both directions of the edges, whereas in directed graphs, it can propagate in the forward or reverse direction of the edges. Overall, different algorithms differ in the definition of the update function (which is also associated with the direction of edges) and the convergence criterion. In this work, we select four well-known algorithms with different characteristics. Belief Propagation (BP): BP performs inference on a graphical model <ref type="bibr">[18]</ref>. BP calculates a random variable (i.e., the belief) for each vertex, which indicates its most probable state. 
In BP, the update operation works in both directions, and so information propagates both ways. Community Detection (CD): CD finds the community structure of a network <ref type="bibr">[23,</ref><ref type="bibr">26,</ref><ref type="bibr">42,</ref><ref type="bibr">45]</ref>. It uses the label propagation method. It takes a directed graph with a set of labeled vertices as input, and propagates the known community IDs to the rest of the vertices in the graph. The update operations happen only in the direction of edges. Hyperlink-Induced Topic Search (HITS): HITS takes as input a directed graph (e.g., a social or web network) and identifies the most relevant (aka authoritative) users/pages given a specific topic <ref type="bibr">[2,</ref><ref type="bibr">22]</ref>. It uses the graph structure to obtain two scores for each vertex, namely the authority score and the hub score. The authority score of a vertex is computed using the hub scores of its in-neighbors, while its hub score is computed using the authority scores of its out-neighbors. The update operations happen both in the direction of edges to calculate authority scores, and in the reverse direction to compute hub scores. Page Rank (PR): PR <ref type="bibr">[34]</ref> computes the ranks of all the vertices based on the graph structure. The input can be any directed graph, such as a social or web graph. At every iteration, the Page Rank score of a vertex is computed by summing up the Page Ranks of its in-neighbors divided by their out-degrees. The update operations happen only in the direction of edges. Other Graph Algorithms. Our focus is on iterative graph algorithms whose time complexity is determined by their convergence or number of iterations. A number of machine learning algorithms on graphs also belong to this category. 
However, our observations and techniques do not apply to Triangle Counting, Single Source Shortest Path (SSSP), Betweenness Centrality, and similar problems because their time complexity is independent of the number of iterations. Moreover, it is more challenging to provide high accuracy for iterative algorithms, as the error in one vertex propagates to all the other reachable vertices and accumulates across iterations. Hence, approximations proposed for algorithms such as SSSP <ref type="bibr">[41]</ref> and Triangle Counting <ref type="bibr">[14]</ref> are not suitable for iterative graph algorithms. Finally, the running times of non-iterative algorithms are relatively low compared to the iterative ones. Hence, there is less opportunity to accelerate them using approximations. A graph application has two main parts: pre-processing and computation. Existing work often ignores the pre-processing time and only focuses on optimizing the computation. However, there exists a trade-off between pre-processing time and algorithm execution time <ref type="bibr">[29]</ref>. Thus, it is important to account for the pre-processing time when optimizing computation time. Figure <ref type="figure">1</ref> shows that the pre-processing time is a significant fraction of the total execution time for some of our applications. These results are measured on a machine with 44 threads (Section 7.1).</p></div>
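The template of Algorithm 1 can be sketched in Python, instantiated for Page Rank under the constants quoted above (W for neighbor v is 1/outDegree(v), the aggregation operator is addition, and C1 = 0.85, C2 = 0.15; the function and variable names are illustrative, not the paper's code):

```python
def page_rank(n, in_neighbors, out_deg, c1=0.85, c2=0.15, threshold=1e-9):
    # Iterative template of Algorithm 1, instantiated for Page Rank.
    rank = [1.0 / n] * n
    while True:
        change = 0.0                      # per-iteration accumulated change
        new_rank = [0.0] * n
        for u in range(n):                # update every vertex
            # Gather: each in-neighbor's value, weighted by W = 1/out-degree.
            acc = sum(rank[v] / out_deg[v] for v in in_neighbors[u])
            new_rank[u] = c1 * acc + c2   # refine with C1 and C2
            change += abs(new_rank[u] - rank[u])
        rank = new_rank
        if change < threshold:            # convergence criterion
            return rank

# Tiny example: edges 0->1, 1->2, 2->0, 2->1.
in_nbrs = {0: [2], 1: [0, 2], 2: [1]}
out_deg = {0: 1, 1: 1, 2: 2}
ranks = page_rank(3, in_nbrs, out_deg)
```

Vertex 1 ends up with the highest rank, since it has the most in-neighbors; the loop stops once the accumulated change falls below the threshold, mirroring Lines 9-11 of the template.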
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Graph Pre-processing</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">OBSERVATIONS</head><p>V-Combiner is based on several observations we make on graphs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Not All Vertices Contribute Equally</head><p>In Algorithm 1, the number of updates generated at a vertex is equal to the number of neighbors it has. For directed graphs, it is equal to the in-degree if updates happen in the direction of edges, or to the out-degree if they happen in the reverse direction. Therefore, vertices with higher degrees have a higher impact on how information is propagated throughout the graph. Typically, in real-world social and web graphs, a significant portion of the vertices have small degrees. For example, for the large Twitter <ref type="bibr">[27]</ref> and PLD <ref type="bibr">[31]</ref> graph datasets, we find that, on average, 97.4% of the vertices are connected to at most 100 other vertices. In contrast, 0.00005% of the vertices are connected to at least 100,000 vertices. They are typically identified as hubs, and substantially impact how information is propagated. Figure <ref type="figure">2</ref> shows an example subgraph. It has three vertices that are highly connected (vertices 1, 3, and 4), and one that is not (vertex 2). The former have more impact on information flow.</p></div>
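A minimal sketch of this observation (the degree values below are toy numbers, not the paper's measurements; the 100-neighbor cutoff follows the text):

```python
# Hypothetical per-vertex degrees; real datasets such as Twitter and
# PLD have millions of vertices with a heavily skewed distribution.
degrees = [1, 2, 2, 3, 1, 150, 4, 2, 1, 200]

low = sum(1 for d in degrees if d <= 100)             # weakly connected
hubs = [v for v, d in enumerate(degrees) if d > 100]  # highly connected

frac_low = low / len(degrees)  # most vertices fall in the low bucket
```

Even in this toy list, 80% of the vertices are low-degree and only two qualify as hub candidates, mirroring (at small scale) the skew the paper reports.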
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">A Subset of Results Is All That Is Needed</head><p>In many real-life applications of graph processing, the application computes over many vertices, but the user is only interested in a very small subset of the results. The subset of most popular vertices is enough in many scenarios. For example, a company may want to identify its most influential customers, or a researcher may want to find the top authoritative researchers in some domain in the DBLP network <ref type="bibr">[43]</ref> or the top authoritative pages in a hyperlink environment <ref type="bibr">[22]</ref>. In fact, the Page Rank algorithm in <ref type="bibr">[10]</ref> returns only the top-ranked vertices. Therefore, it should often be acceptable for an approximate implementation of the algorithm to return the same top-ranked vertices even though the ranking of the less popular vertices may be different.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Applications Can Tolerate Small Errors</head><p>In applications such as Belief Propagation and Community Detection, the goal is to infer the broad category of each vertex, rather than to compute an exact value for each vertex. Each vertex has a vector of values, where each element of the vector corresponds to the probability of the vertex belonging to a specific category. The highest probability indicates the category of the vertex. As such, small errors in the probability values can be tolerated as long as the inferred vertex category remains the same.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Removing Some Vertices Can Be Effective</head><p>Removing some vertices or edges reduces the number and cost of update operations. If the high-level structure of the graph is preserved, the computation on the reduced graph could be a close approximation of that on the original graph. Our intuition is that, since hub vertices shape most of the information flow in the graph, vertices from their neighborhood that have a small degree can be merged into the hubs, with only a marginal effect on the overall flow of information in the graph. Merging a vertex into a hub involves removing the vertex and adding all (or a subset) of its edges to the hub vertex.</p></div>
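The merging intuition can be sketched with a hypothetical helper (the specific edge-handling choice here, keeping in-edges and dropping out-edges, is one possible subset; Section 5.2 gives the paper's actual rules):

```python
def merge_into_hub(edges, v, hub):
    # Remove vertex v: redirect its in-edges to the hub and drop its
    # out-edges (one possible "subset of its edges", per the text above).
    merged = set()
    for src, dst in edges:
        if src == v:
            continue            # drop v's out-edges
        if dst == v:
            dst = hub           # redirect v's in-edges to the hub
        if src != dst:          # skip self-loops created by redirection
            merged.add((src, dst))
    return merged

# Vertices 0 and 1 pointed at vertex 2; after merging vertex 2 into
# hub 3, they point at the hub instead, and 2's own out-edge is gone.
new_edges = merge_into_hub({(0, 2), (1, 2), (2, 3), (4, 3)}, v=2, hub=3)
```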
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">LIMITATIONS OF CURRENT TECHNIQUES</head><p>For the types of algorithms that we consider, there are two main existing techniques that prune graphs. They are Sparsification <ref type="bibr">[3,</ref><ref type="bibr">15]</ref> and K-core <ref type="bibr">[13,</ref><ref type="bibr">21,</ref><ref type="bibr">35]</ref>. These techniques have a few limitations, which act as a motivation for our work. In this section, we discuss these limitations. Later, in Section 7.1, we improve the two techniques and, in Section 8, evaluate them against V-Combiner. Sparsification selects some edges in the graph using a probabilistic method and prunes them from the graph. It does not prune any vertices. The pruned edges are chosen based on a sparsification parameter, the shape of the graph, and some properties of the edges. The pruning can be made more or less aggressive. Details are provided in Section 7.1.</p><p>Since only edges are removed and not vertices, the algorithm still has to loop over all the vertices. In addition, by removing edges, it typically increases the length of the paths between vertices. As a result, the number of iterations that a graph algorithm requires to converge typically increases. For these reasons, it may be difficult to obtain a high speedup with a high algorithm accuracy.</p><p>K-core takes a graph and removes as many vertices as needed so that the remaining vertices have a degree equal to or greater than k. The result is the k-core graph. This technique prunes both vertices and edges. Higher values of k imply that more vertices and edges are dropped.</p><p>Since both edges and vertices are removed, this technique can deliver higher speedups. It reduces the work without increasing the number of iterations until convergence. However, the downside is that, as k increases, more and more of the vertices remaining in the graph become disconnected. 
As the graph loses connectivity, the accuracy of the algorithm suffers.</p><p>As an illustration, Figures <ref type="figure">3a</ref> and <ref type="figure">3b</ref> show representative scenarios. They correspond to runs of Page Rank (PR) using sparsification (Figure <ref type="figure">3a</ref>) and Community Detection (CD) using k-core (Figure <ref type="figure">3b</ref>). Section 7 describes the input graph used (PLD), the parameters used, and the 44-core machine running the experiments. The figures show, for different levels of pruning (X axis, where the level of pruning increases toward the right side), the accuracy as defined in Section 7 (left Y axis) and the speedup relative to the execution using the full graph (right Y axis). We observe that, as the extent of pruning increases, the speedup increases, but the accuracy decreases substantially. For these applications and input graph, we see that even with a small degree of pruning (left side of X axis), the accuracy is already 90% or lower and, for k-core, the speedup is minuscule.</p></div>
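For reference, the k-core graph discussed above can be computed by iterative peeling; this is a minimal, unoptimized sketch for undirected edge lists (names are illustrative):

```python
def k_core(edges, k):
    # Build an undirected adjacency map, then repeatedly peel vertices
    # whose degree has dropped below k, until every remaining vertex
    # has degree >= k.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            if u in adj and len(adj[u]) < k:
                for v in adj.pop(u):
                    adj[v].discard(u)   # neighbors lose an edge
                changed = True
    return set(adj)

# The pendant vertex 3 is peeled; the triangle 0-1-2 survives for k=2.
core = k_core([(0, 1), (1, 2), (2, 0), (0, 3)], k=2)
```

Raising k quickly empties this toy graph (k=3 leaves nothing), which illustrates the disconnection problem described above.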
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">VERTEX MERGING WITH V-COMBINER</head><p>Based on the previous discussion, an effective approach to reduce the graph size needs to involve removing both edges and vertices. In addition, as we remove them, we have to be careful to eliminate (or at least reduce to a minimum) the possibility of disconnecting parts of the graph. Consequently, our proposal is to merge some vertices into nearby hubs, adding some of the edges of the removed vertices to the hub. The result is both high accuracy and high speedup.</p><p>Our technique is called V-Combiner. V-Combiner proceeds in two steps. First, it runs a vertex merging algorithm during graph pre-processing time, to identify vertices that have few connections and can be merged into their neighboring hubs. Then, after the application completes execution, it runs a correction or recovery algorithm to quickly obtain acceptable values for the merged vertices.</p><p>Figure <ref type="figure">4</ref> shows the end-to-end execution timeline for both the conventional exact approach and V-Combiner. The exact approach includes pre-processing time (i.e., when the graph is built) and compute time (i.e., when the graph algorithm runs). The V-Combiner approach contains pre-processing time (which includes vertex merging and building the graph), compute time (i.e., when the graph algorithm runs), and post-processing time (i.e., when the recovery takes place). The shaded boxes are V-Combiner-specific steps. The performance benefits of V-Combiner come from the reduced algorithm time due to the reduced number of vertices and edges. More specifically, the graph algorithm can converge with fewer iterations (because of the reduced vertices) and fewer update operations in each iteration (because of the reduced edges). Ideally, the reduction in algorithm time should more than offset the additional overheads of the pre- and post-processing steps. Support for Different Graphs and Algorithms. 
V-Combiner can be applied to both directed and undirected graphs with minimal changes. Moreover, it can be used for a variety of graph algorithms with different types of update propagation. For example, in Page Rank, updates are always propagated from in-neighbors, whereas in HITS <ref type="bibr">[22]</ref>, updates are propagated both from in- and out-neighbors. Based on the update propagation, we classify algorithms into one-way propagation and two-way propagation. For undirected graphs, the propagation is naturally two-way. When merging, V-Combiner takes into account the direction of update propagation used by the algorithm, and tries to maintain the connectivity of the graph. The goal is for the updates to continue to propagate through the edges even after certain vertices are removed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Merging Approach</head><p>During the merging step, V-Combiner proceeds to identify a subset of the high-degree vertices that we call Supernodes. Then, for each supernode, it considers the vertices in its input neighborhood (or in its output neighborhood, if the updates are propagated from out-neighbors). If these vertices have low in- and out-degrees, we say that they are Subnodes of the supernode. The assumption is that the contribution of such vertices to the final result of the algorithm is likely to be small. Hence, the subnodes are merged into the supernode, effectively being pruned away. The supernode takes some of the edges of the merged subnodes, as explained in Section 5.2.</p><p>The resulting graph has a set of supernodes with now higher degrees, connected with what we call Regular vertices. The resulting graph has a similar structure to the original one, but with many fewer vertices and edges.</p><p>We now give the precise definitions of supernodes, subnodes, and regular vertices in a given graph G. In the definitions, we use three thresholds called &#945;, &#946;, and &#960;. The &#945; and &#946; thresholds are the lower and upper thresholds, respectively, for the degree of a vertex that qualifies as a supernode. The &#960; threshold is the upper threshold for the degree of a vertex that qualifies as a subnode. We describe how to obtain such thresholds in Section 5.6. Definition 1: Supernodes (V_sup) are vertices with an in-degree higher than &#945; and lower than &#946;.</p><p>Vertices with in-degree higher than or equal to &#946; are not considered supernodes. This is because we do not want them to increase their degree further by taking in edges from merged subnodes. Vertices with very high degree can cause load imbalance and slow down the execution. 
Hence, these vertices remain unchanged in the graph.</p><p>For two-way algorithms in directed graphs, supernodes are vertices where the sum of in-degree and out-degree is higher than &#945; and lower than &#946;. Further, a supernode is an in-supernode if its in-degree is higher than or equal to its out-degree, and an out-supernode if the opposite is true.</p><p>For undirected graphs, supernodes are vertices with a degree between &#945; and &#946;. Definition 2: Subnodes (V_sub) are vertices that have an in-degree lower than &#960; (where &#960; &lt; &#945;), an out-degree lower than &#960;, and at least one output that connects them to a supernode.</p><p>A vertex with an in-degree or an out-degree higher than &#960; is not a subnode because it is too important to be merged into a nearby supernode. Also, it is possible that a subnode is connected to two supernodes; in this case, it can be merged into either with a deterministic algorithm.</p><p>For two-way algorithms in directed graphs, subnodes are vertices where the sum of in-degree and out-degree is less than &#960;, and have at least one edge that connects them to a supernode. Further, a subnode is an in-subnode if it has at least one output that connects it to an in-supernode, and is an out-subnode if it has at least one input that connects it to an out-supernode. A subnode can be both an in-subnode and an out-subnode.</p><p>For undirected graphs, subnodes are vertices with a degree lower than &#960; and at least one edge that connects them to a supernode. Definition 3: Regular vertices (V_reg) are the remaining vertices.</p><p>V-Combiner leaves those vertices intact.</p></div>
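Under these definitions, the classification for one-way algorithms on directed graphs can be sketched as follows (threshold values, graph, and names are illustrative):

```python
def classify(in_deg, out_deg, edges, alpha, beta, pi):
    # Supernodes: alpha < in-degree < beta.
    # Subnodes: in- and out-degree below pi, with at least one
    # out-edge to a supernode. Everything else is regular.
    n = len(in_deg)
    supers = {v for v in range(n) if alpha < in_deg[v] < beta}
    subs = {src for src, dst in edges
            if dst in supers and in_deg[src] < pi and out_deg[src] < pi}
    regular = set(range(n)) - supers - subs
    return supers, subs, regular

# Toy graph: vertices 1, 2, 3 all point at vertex 0.
in_deg = [3, 0, 0, 0, 1, 1]
out_deg = [1, 1, 1, 1, 1, 0]
edges = [(1, 0), (2, 0), (3, 0), (0, 4), (4, 5)]
supers, subs, regular = classify(in_deg, out_deg, edges,
                                 alpha=2, beta=10, pi=2)
```

Here vertex 0 is the only supernode; its low-degree in-neighbors 1, 2, and 3 become subnodes, and vertices 4 and 5 stay regular. Note that since π &lt; α, a vertex can never be both a supernode and a subnode.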
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Vertex Merging Algorithm</head><p>V-Combiner performs vertex merging to generate the approximate graph. The algorithm involves merging each subnode into a neighboring supernode. In the following, without loss of generality, we describe the vertex merging algorithm assuming one-way graph algorithms in directed graphs where updates are propagated from in-neighbors. In Section 5.5, we describe scenarios with other types of directed and undirected graphs.</p><p>Merging a subnode into a supernode involves removing the subnode and altering the edges as follows. First, the input edges of the subnode now become input edges of the supernode. This operation may create duplicated edges -i.e., multiple edges connecting the same input vertex to the same output vertex; such duplication will be later eliminated when the graph is built. Second, the output edges of the subnode are dropped. V-Combiner chooses this design over a possible alternative where the output edges of the subnode become output edges of the supernode. Such an alternative is undesirable because it could create a new, previously-nonexistent path.</p><p>Figure <ref type="figure">5</ref> shows an example of vertex merging. Figure <ref type="figure">5</ref>(a) shows an original subgraph, where V-Combiner wants to merge subnode 2 into supernode 3. As V-Combiner merges subnode 2, it records the mapping in an array called superNodes. This array has as many entries as vertices in the original graph. If a vertex has become a subnode, its entry in superNodes has the ID of its supernode; if a vertex has become a supernode, its entry in superNodes has its own ID; for the remaining vertices, the entry in superNodes holds -1. The output of the algorithm will be later used in the step that builds the graph. The merging algorithm consists of three sections. 
Each section can be executed in parallel:</p><p>Algorithm 2 (Merging algorithm in V-Combiner) takes as inputs N (the number of vertices), edges (the list of edges), inDegrees, outDegrees, and the thresholds &#945;, &#946;, and &#960;; it outputs newInDegrees, newOutDegrees, and superNodes.</p><p>&#8226; Identify supernodes: First, all vertices are scanned in parallel to identify any vertex that has an in-degree higher than &#945; and lower than &#946; (Lines 1-3). &#8226; Identify subnodes: The second parallel section (Lines 4-10) consists of processing the list of edges in parallel. For every edge e such that e.dst is a supernode, we check if e.src qualifies to become a subnode. If so, the superNodes entry of e.src is set to e.dst. Also, we prune the in- and out-edges of e.src. Therefore, the new in- and out-degrees of e.src are set to 0. &#8226; Merge Vertices and Update Degrees: Finally, in Lines 11-15, the algorithm computes the new in- and out-degrees of all affected vertices. Recall that merging affects both the in-neighbors and the out-neighbors of the subnode. Each in-neighbor has to be connected to the supernode. Hence, we increase the in-degree of the supernode in Lines 12-13. Moreover, the out-edges of the subnode have to be eliminated, and the out-neighbors have to update their in-degrees (Lines 14-15).</p><p>Delta Graph Construction. After merging, when the graph is built, V-Combiner also constructs a small graph named the Delta graph. The Delta graph contains only the subnodes and their immediate in-neighbors. The Delta graph will be used to generate good final values for the subnodes in the recovery step.</p></div>
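The three sections above can be sketched sequentially in Python (the paper runs each section in parallel; the superNodes array and degree outputs follow the text, while the graph and threshold values are illustrative):

```python
def merge(n, edges, in_deg, out_deg, alpha, beta, pi):
    super_nodes = [-1] * n       # -1: regular; own ID: supernode;
    new_in = list(in_deg)        # other ID: merged into that supernode
    new_out = list(out_deg)

    # Section 1 (Lines 1-3): identify supernodes.
    for v in range(n):
        if alpha < in_deg[v] < beta:
            super_nodes[v] = v

    # Section 2 (Lines 4-10): identify subnodes; prune their edges.
    for src, dst in edges:
        if (super_nodes[dst] == dst and super_nodes[src] == -1
                and in_deg[src] < pi and out_deg[src] < pi):
            super_nodes[src] = dst
            new_in[src] = new_out[src] = 0

    # Section 3 (Lines 11-15): update degrees of affected vertices.
    for src, dst in edges:
        src_is_sub = super_nodes[src] not in (-1, src)
        dst_is_sub = super_nodes[dst] not in (-1, dst)
        if dst_is_sub and not src_is_sub:
            new_in[super_nodes[dst]] += 1   # in-edge moves to supernode
        if src_is_sub and not dst_is_sub:
            new_in[dst] -= 1                # subnode's out-edge dropped
    return super_nodes, new_in, new_out

# Vertex 0 is a supernode; low-degree vertices 1 and 3 merge into it.
super_nodes, new_in, new_out = merge(
    6, [(1, 0), (2, 0), (3, 0), (0, 4), (2, 5), (5, 1)],
    [3, 1, 0, 0, 1, 1], [1, 1, 2, 1, 0, 1],
    alpha=2, beta=10, pi=2)
```

In the example, subnode 1's in-edge from vertex 5 is redirected to supernode 0, while the subnodes' own out-edges to the supernode disappear; duplicate-edge elimination is deferred to the build step, as in the paper.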
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Recovery Step</head><p>Recall that the subnodes do not have any values after the graph algorithm completes. The goal of V-Combiner's recovery step is to assign final values to these subnodes.</p><p>To compute the values of the subnodes, the recovery algorithm takes the Delta graph and proceeds as follows. The in-neighbors of subnodes in the Delta graph are given the values computed on the approximate graph. Such values are close to their exact values. Then, we run the recovery algorithm, which is specified by the developer. It simply uses the basic operator of the corresponding graph algorithm to generate the values of the subnodes from the values of their in-neighbors in the Delta graph.</p></div>
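A minimal sketch of this step, assuming a simplified PageRank-like operator: the operator, the edge-list representation of the Delta graph, and all names here are illustrative, not the paper's code.

```python
def recover(delta_edges, values, update_op):
    # Group the Delta graph's in-neighbors by subnode; every edge
    # (src, dst) in the Delta graph goes from an in-neighbor to a subnode.
    in_neighbors = {}
    for src, dst in delta_edges:
        in_neighbors.setdefault(dst, []).append(src)
    # One application of the algorithm's basic operator per subnode, fed
    # with the in-neighbor values computed on the approximate graph.
    for sub, nbrs in in_neighbors.items():
        values[sub] = update_op([values[v] for v in nbrs])
    return values

# Example operator: a simplified PageRank update (damping 0.85; for
# brevity we omit the division by each in-neighbor's out-degree).
pr_op = lambda nbr_values: 0.15 + 0.85 * sum(nbr_values)
```

Because recovery is a single pass over the small Delta graph, its cost is minor compared to the iterative computation itself.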
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Overall V-Combiner Algorithm</head><p>Algorithm 3 shows the overall V-Combiner algorithm. First, the subnode vertices are merged using Algorithm 2 (Line 1). Then, the output of the Merging algorithm, including the superNodes array, is passed to a Build step. This step constructs both the Delta and the approximate graphs (Line 2). After that, the graph algorithm outlined in Algorithm 1 executes on the approximate graph using the user-specified algorithm parameters (Line 3), and returns its results. Finally, in Line 4, the recovery algorithm receives these results and recovers the values for the subnode vertices using the Delta graph.</p><p>Algorithm 3: Overall V-Combiner algorithm. input: merging_arguments, algo_params. output: final_results.</p></div>
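The four lines of Algorithm 3 compose as a simple pipeline; a Python sketch with the four stages passed in as functions (the plumbing and signatures are our illustrative rendering):

```python
def v_combiner(merging_arguments, algo_params,
               Merging, Build, GraphAlgo, Recovery):
    # Line 1: merge subnodes into supernodes (Algorithm 2).
    merging_output = Merging(*merging_arguments)
    # Line 2: build the approximate graph and the Delta graph.
    approx_graph, delta_graph = Build(merging_output)
    # Line 3: run the unmodified graph algorithm on the approximate graph.
    results = GraphAlgo(approx_graph, algo_params)
    # Line 4: recover the values of the merged subnodes from the Delta graph.
    return Recovery(results, delta_graph, algo_params)
```

The graph algorithm itself is unchanged, which is what makes V-Combiner application-transparent: only the graph it runs on and the final recovery pass differ from the exact pipeline.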
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Other Scenarios of the Merging Algorithm</head><p>Section 5.2 only described how to perform merging for the most common scenario -i.e., directed graphs and one-way algorithms.</p><p>In the example algorithm, V-Combiner picks subnodes from the inputs of the supernode, and merges subnodes into supernodes in the forward direction of the edges. In the case where information flows only in the reverse direction of the edges in a directed graph, V-Combiner would pick subnodes from the outputs of the supernode. However, we are unaware of an example of this scenario.</p><p>Table <ref type="table">1</ref> shows a taxonomy with two other possible scenarios that V-Combiner can support: (i) directed edges where the information flows in both directions; (ii) undirected edges. V-Combiner supports both scenarios using the same merging techniques. Directed graphs and two-way algorithms. Since the goal of V-Combiner is to preserve connectivity in the direction of the update propagation, in this case, where updates propagate in both directions, V-Combiner merges subnodes into supernodes in both directions. Recall from Section 5.1 that supernodes are vertices where the sum of in-degree and out-degree is higher than &#945; and lower than &#946;, and that subnodes are vertices where the sum of in-degree and out-degree is less than &#960;, and that have at least one edge that connects them to a supernode. We further classified supernodes into in-supernodes and out-supernodes, and subnodes into in-subnodes and out-subnodes. Intuitively, in-supernodes are better connected through their inputs (compared to their outputs), and out-supernodes are better connected through their outputs (compared to their inputs). In these graphs and algorithms, V-Combiner merges in-subnodes into in-supernodes, and out-subnodes into out-supernodes. We call the first kind of merging forward merging and the second kind reverse merging. 
Figure <ref type="figure">6</ref> shows an example of reverse merging, where an out-subnode is merged into an out-supernode.</p><p>Intuitively, merging in both directions provides better connectivity than merging only in one direction. We perform forward merging of subnodes into supernodes that have better connectivity through their inputs, and reverse merging of subnodes into supernodes that have better connectivity through their outputs. Undirected graphs. Recall from Section 5.1 that supernodes are vertices whose degree is higher than &#945; and lower than &#946;, and that subnodes are vertices whose degree is less than &#960;, and have at least one edge that connects them to a supernode. In this algorithm, when a subnode is merged into a supernode, all the edges of the subnode are added to the supernode. V-Combiner does not drop any edges because paths do not have a specific direction. Duplicate edges are removed. This operation is also known as vertex-contraction <ref type="bibr">[20]</ref>.</p><p>Figure <ref type="figure">7</ref> shows how subnode 2 is merged into supernode 3: all the edges of the former are added to the latter.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6">Choosing the Merging Parameters</head><p>To effectively reduce a graph, V-Combiner needs to choose a good set of supernodes. The number of supernodes is directly controlled by the thresholds &#945; and &#946;. To find good values of &#945; and &#946;, we perform the following experiment. We first rank the vertices based on their in-degree, from lowest to highest. We then accumulate the number of edges of the vertices, in order. Figure <ref type="figure">8</ref> shows the resulting Cumulative Distribution Function (CDF) of edges as a function of the in-degrees of vertices in the Friendster graph <ref type="bibr">[24]</ref>. We divide the plot into three regions. In the leftmost region, the curve has a sharp slope. This part accounts for the majority of the edges. These edges connect many vertices with small in-degrees. In the second region, the curve goes through the knee. Finally, in the third region, the curve flattens out. We argue that supernodes should be chosen from the knee region. This region has many vertices with substantial, but not excessive, in-degrees. Supernodes should not be chosen from the third region. In this region, vertices have very large in-degrees. Increasing their in-degrees further is likely to cause load imbalance.</p><p>In Section 8.5, we study different ranges for the knee region, to pick the most profitable one. The boundaries of such a region are the values of &#945; and &#946;.</p><p>Once we have picked the values of &#945; and &#946;, we need to pick a value of &#960;. Higher values of &#960; mean that more vertices qualify as subnodes. More subnodes typically translate into lower accuracy, because the elimination of subnodes with large in- and out-degrees affects many neighboring vertices. However, more subnodes also translate into more time savings in processing the graph. 
Therefore, given &#945; and &#946; values, &#960; serves as a knob that directly controls the trade-off between accuracy and speedup.</p></div>
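The experiment above (rank vertices by in-degree, accumulate edges, read off the thresholds at the chosen CDF percentages) can be sketched as follows; the function and its integer-arithmetic thresholding are our illustrative choices.

```python
def edge_cdf_thresholds(in_degrees, lo_pct, hi_pct):
    """Return (alpha, beta): the in-degrees at which the cumulative
    share of edges, over vertices sorted by increasing in-degree,
    first reaches lo_pct% and hi_pct% respectively."""
    total = sum(in_degrees)
    cum, alpha, beta = 0, None, None
    for d in sorted(in_degrees):
        cum += d
        # Integer cross-multiplication avoids floating-point thresholds.
        if alpha is None and 100 * cum >= lo_pct * total:
            alpha = d
        if beta is None and 100 * cum >= hi_pct * total:
            beta = d
    return alpha, beta
```

For a hypothetical graph with 50 vertices of in-degree 1, six of in-degree 5, and one of in-degree 20, the (65%, 95%) knee region maps to the in-degree thresholds (5, 20).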
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">COMPARISON OF TECHNIQUES</head><p>Table <ref type="table">2</ref> qualitatively compares sparsification, k-core, and V-Combiner in terms of their impact on connectivity, length of paths, load balancing, pre- and post-processing overhead, and overall speedup.</p><p>Connectivity. To attain high accuracy, the reduced graph should strive to maintain connectivity between the remaining vertices. As discussed in Section 4, aggressive k-core easily ends up causing graph disconnectivity. Sparsification and V-Combiner do not. V-Combiner minimizes disconnectivity by merging a subnode into a hub, and making the in-neighbors of the subnode become in-neighbors of the hub. Average Length of Paths. Retaining the length of the paths between vertices in the reduced graph is sometimes important for performance and accuracy. Specifically, if paths are longer, graph algorithms typically take more iterations to converge. Furthermore, in some algorithms, changing the length of the paths causes distortions that result in low accuracy. Generally, all three techniques change the average length of the paths. As discussed in Section 4, since sparsification removes edges, it typically increases the length of the paths between vertices. Since k-core and V-Combiner remove vertices, they tend to reduce the average length of the paths. We observe, however, that V-Combiner changes the average length of the paths the least. Intuitively, this is because it combines two opposite effects: it reduces the length of paths by removing some vertices, but increases the length of paths by dropping some edges. Load Balancing. To attain high speedups, graph algorithms should keep a good balance of the load between threads. As shown in Algorithm 1, a graph algorithm can be parallelized by assigning a subset of the vertices to each thread to process. The amount of work performed per vertex is proportional to its degree. 
It is well known that many graphs present a very skewed degree distribution: there are many small-degree vertices and few high-degree ones. We observe that both k-core and V-Combiner improve load balancing by making the distribution less skewed. Specifically, they remove many of the small-degree vertices, either by pruning (in k-core) or by merging (in V-Combiner). Sparsification largely keeps the same distribution.</p><p>Pre- and Post-Processing Overhead. To attain good speedups, it is important that the graph reduction techniques have minimal impact on the pre- and post-processing steps, including the building of the graph. Sparsification has the least pre-processing overhead. k-core has the largest pre-processing overhead: even though we use the state-of-the-art k-core implementation <ref type="bibr">[6]</ref>, pruning until the remaining vertices have a degree of at least k has substantial overhead. The overhead of the pre- and post-processing in V-Combiner is higher than in sparsification. We show the numbers in Section 8. Overall Speedup. Based on all the factors considered, the last column of Table <ref type="table">2</ref> outlines the expected relative performance of the techniques.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">EXPERIMENTAL SETUP</head><p>We implement V-Combiner in C++ using OpenMP. We use the Page Rank code from GAP <ref type="bibr">[4]</ref>, and implement Community Detection <ref type="bibr">[26,</ref><ref type="bibr">45]</ref>, Belief Propagation <ref type="bibr">[18]</ref>, and Hyperlink-Induced Topic Search <ref type="bibr">[22]</ref> ourselves.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Methodology</head><p>Machine Specification and Graph Datasets. To evaluate the effectiveness of V-Combiner and the other techniques, we perform experiments on a 2-socket shared-memory system with 44 Intel Xeon Gold 6152 cores and 192GB of memory. We disable the Dynamic Voltage and Frequency Scaling (DVFS) mechanism to minimize run-time variation, and use the numactl library with the interleave all option to average out the NUMA effects on the programs. Table <ref type="table">3</ref> shows our graph datasets, which are chosen from several different domains. The Belief Propagation (BP) benchmark uses undirected graphs. Unfortunately, we could not run BP on our two largest graphs, namely FS and TW, as their memory requirements exceed the machine's memory capacity. Table <ref type="table">3</ref>: Graph datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Graph dataset Vertices Directed Edges Undirected Edges</head><p>Friendster (FS) <ref type="bibr">[24]</ref>: 65.6M vertices, 1715.7M directed edges.</p><p>Twitter (TW) <ref type="bibr">[27]</ref>: 61.5M vertices, 1456.1M directed edges.</p><p>Page-Level Domain (PLD) <ref type="bibr">[31]</ref>: 42.9M vertices, 623.1M directed edges, 582.6M undirected edges.</p><p>Arabic-2005 (AR) <ref type="bibr">[5]</ref>: 22.7M vertices, 631.2M directed edges, 553.9M undirected edges.</p><p>Dbpedia (DB) <ref type="bibr">[24]</ref>: 18.3M vertices, 136.5M directed edges, 126.9M undirected edges.</p><p>Evaluation Configurations. We perform end-to-end execution time analysis of our benchmarks. We run each benchmark using both exact and approximate graphs in several different configurations. For each configuration, we measure the time spent in each step of execution by averaging the results of five runs. The following steps are measured:</p><p>&#8226; Prune/Merge. Identify the vertices/edges to prune or merge.</p><p>&#8226; Build. Construct the approximate (and Delta) graph.</p><p>&#8226; Algorithm. Execute the graph algorithm.</p><p>&#8226; Recovery. Generate the values of the eliminated vertices (V-Combiner and k-core only).</p><p>We evaluate the following configurations:</p><p>&#8226; Exact. This is the baseline execution, which runs on the unmodified (original) graph dataset. All the approximate executions are compared to this baseline. The output of this execution is also used as the ground truth for evaluating accuracy.</p><p>&#8226; Sparsification. Sparsification prunes a fraction of the edges of vertices with large degrees. For each edge (u, v), it calculates the probability of keeping the edge as <formula>p(u, v) = min(1, S &#183; d_avg / min(d_u^o, d_v^i))</formula>. Here, S is the Sparsification parameter, which is suggested to be in the interval [0.1, 0.9] <ref type="bibr">[15]</ref>. Also, d_avg is the average degree of the graph, defined as the ratio of the number of edges over the number of vertices. d_u^o and d_v^i are the out-degree and in-degree of the vertices at the two ends of the edge, respectively. The denominator is the minimum of these two degrees. 
The equation applies to undirected graphs too. Unlike V-Combiner, sparsification does not require a recovery step, since the values for all the vertices are computed on the approximate graph. Optimization: Figure <ref type="figure">3a</ref> shows that the accuracy achieved by sparsification with S = 0.9 is 83%. This low accuracy is because a significant number of edges were dropped. To fix this problem, in our evaluation in Section 8, we constrain sparsification using a second parameter (Percent_E_Remain), which enforces that a certain fraction of edges remain in the approximate graph. Sweeping parameters: We use three sparsification parameter S values (0.9, 0.7, and 0.5), each with 5 different Percent_E_Remain parameter values (90%, 80%, 70%, 60%, and 50%).</p><p>&#8226; K-core. We follow the state-of-the-art implementation of k-core <ref type="bibr">[6]</ref> to identify all the vertices that have to be dropped given a certain k. When a vertex is dropped, all of its in- and out-edges are dropped too.</p><p>Optimization: In our evaluation in Section 8, we add a recovery step to assign values to the removed vertices. This is similar to the algorithm described in Section 5.3 for V-Combiner.</p></div>
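The sparsification edge-keeping rule described above can be sketched as follows. The min(1, &#183;) clamp and all helper names are our assumptions; seeding the random generator makes runs repeatable.

```python
import random

def keep_probability(u, v, s, out_deg, in_deg, d_avg):
    # Probability of keeping edge (u, v): S * d_avg over the minimum of
    # u's out-degree and v's in-degree, clamped to 1 (our assumption,
    # since a probability cannot exceed 1).
    return min(1.0, s * d_avg / min(out_deg[u], in_deg[v]))

def sparsify(edges, out_deg, in_deg, s, seed=0):
    """Keep each edge independently with its keep-probability."""
    rng = random.Random(seed)
    d_avg = len(edges) / len(out_deg)  # average degree: edges / vertices
    return [(u, v) for u, v in edges
            if rng.random() < keep_probability(u, v, s, out_deg, in_deg, d_avg)]
```

Note how edges incident to small-degree endpoints get probability 1 (they are always kept), while edges between two large-degree vertices are the ones most likely to be pruned.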
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Sweeping parameters:</head><p>We try k values equal to 2, 5, 10, 15, 20, 25, 30, 40, and 50. We only consider k-core configurations that have a percentage of remaining edges over 50%. &#8226; V-Combiner. V-Combiner uses a combination of pruning and merging, unlike sparsification and k-core, which are pruning-only.</p><p>Sweeping parameters: The (&#945;, &#946;) interval is selected to cover a subrange of the edge CDF curve of Figure <ref type="figure">8</ref> within the (65%, 95%) interval. This corresponds to the knee region of the curve. For all benchmarks except HITS, we consider intervals of length 10% each, i.e., (65%, 75%), (75%, 85%), and (85%, 95%). HITS is a two-way algorithm with directed edges and, therefore, there is more merging and edge dropping. For this reason, we experiment with half-sized intervals for HITS, i.e., (65%, 70%), ..., (90%, 95%).</p><p>Given an interval, we sweep the value of &#960; so that the CDF of edges ranges from 5% to CDF(&#945;) - 5%. Finally, we only consider V-Combiner configurations that have a percentage of remaining edges over 50%.</p></div>
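For reference, the k-core pruning that these swept k values drive can be sketched as a simple sequential "peeling" loop on total degree; the state-of-the-art implementation cited above [6] is parallel and far more sophisticated.

```python
from collections import deque

def k_core(n, edges, k):
    """Repeatedly drop vertices of total degree < k, together with all
    their edges, until every remaining vertex has degree >= k."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    deg = [len(a) for a in adj]
    removed = [False] * n
    queue = deque(v for v in range(n) if deg[v] < k)
    while queue:
        v = queue.popleft()
        if removed[v]:
            continue
        removed[v] = True
        for w in adj[v]:            # dropping v removes its edges,
            if not removed[w]:      # which may push neighbors below k
                deg[w] -= 1
                if deg[w] < k:
                    queue.append(w)
    return [v for v in range(n) if not removed[v]]
```

The cascade in the loop is what makes aggressive k values dangerous: removing one vertex can push its neighbors below k, unraveling whole regions of the graph.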
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Accuracy Metrics</head><p>We target an accuracy threshold of 90% for all the benchmarks, which is a common threshold used by past work <ref type="bibr">[8,</ref><ref type="bibr">39]</ref>. CD. In CD, each vertex is assigned a label, indicating the probability that the vertex belongs to a specific community. To measure accuracy, we compute the fraction of initially-unlabeled vertices that end up being identified as belonging to the correct community.</p><p>We determine the correct community by the exact computation on the original graph. HITS and PR. We measure the top-k accuracy <ref type="bibr">[32]</ref>, which is the fraction of the vertices in the top k ranks of the exact output that are also in the top k ranks of the approximate output. k is application-dependent, and is given relative to the number of the graph's vertices. BP. Based on past work <ref type="bibr">[11,</ref><ref type="bibr">16,</ref><ref type="bibr">17,</ref><ref type="bibr">19,</ref><ref type="bibr">44]</ref>, we divide BP use cases into two main categories: (i) classification of vertices based on the class they belong to, and (ii) ranking of the vertices based on information such as user trustworthiness and influence. For the classification scenario, since we are only able to run the algorithm for belief vectors of size 2 (due to insufficient memory), we observe high accuracy in most cases. Therefore, we choose the ranking scenario, where we can capture the accuracy better. We use the top-k accuracy metric.</p></div>
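The top-k accuracy metric used for HITS, PR, and BP can be sketched as follows (our illustrative code; ties are broken by vertex ID, a detail the paper does not specify):

```python
def top_k_accuracy(exact_scores, approx_scores, k_frac):
    """Fraction of the top-k vertices of the exact run that also appear
    in the top-k of the approximate run; k is given as a fraction of
    the number of the graph's vertices."""
    n = len(exact_scores)
    k = max(1, int(k_frac * n))
    def top(scores):
        # Sort vertices by descending score (vertex ID breaks ties).
        return set(sorted(range(n), key=lambda v: (-scores[v], v))[:k])
    return len(top(exact_scores) & top(approx_scores)) / k
```

For example, if the approximate run swaps the ranks of two vertices inside the top-k set, the metric is unaffected; only vertices that fall out of the top-k set cost accuracy.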
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">EVALUATION</head><p>We first compare V-Combiner's performance-accuracy trade-offs to sparsification's and k-core's, both for algorithm-time (Section 8.1) and for end-to-end time (Section 8.2). We then present statistics on the approximate graphs (Sections 8.3 and 8.4), and an analysis of the best pruning or merging parameters (Section 8.5).</p><p>To compare V-Combiner, sparsification, and k-core, we pick their best parameter configurations from Section 7.1 that, for each individual benchmark and input graph, deliver the highest end-toend speedup over the baseline. The accuracy is required to be equal to or higher than the 90% threshold. We call these configurations the best end-to-end configurations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1">Algorithm Performance and Accuracy</head><p>Best Design Points. In this section, to understand the trade-offs between the different algorithms, we take the best end-to-end configurations but recompute the speedups without considering the pre-processing time. The "pruning" or "merging" part of the pre-processing time is highly variable across different techniques and overshadows the savings obtained by reducing algorithm time. However, we do consider the post-processing time, since it contributes to the final accuracy in both V-Combiner and k-core. In Section 8.2, we will analyze the speedups of the best end-to-end configurations with all the times included.</p><p>Table <ref type="table">4</ref> shows the resulting algorithm execution time (time), and the speedup (sp), accuracy (acc), and percentage of work done (work) relative to the baseline for the three techniques. The amount of work done is proportional to the number of edges multiplied by the number of iterations. We also report the "expected" speedup next to each speedup number, which is computed as 100/work(%). It largely indicates the speedup that would be achieved without considering any load imbalance effects in the approximate graph.</p><p>In the table, each row corresponds to one benchmark and graph. In each row, the technique with the highest speedup is in bold. An empty row means that the technique could not run the benchmark and graph with an accuracy equal to or higher than the threshold.</p><p>Comparing V-Combiner to sparsification. V-Combiner attains an average speedup of 1.55&#215;, while sparsification attains 1.46&#215;. Surprisingly, this happens despite the fact that sparsification performs less work on average -i.e., 64.9% compared to 67.6%. The reason is the better load balancing of the work in V-Combiner due to the merging of small vertices. 
We observe that, on average in V-Combiner, the speedup is 4.7% higher than the expected speedup, while it is 5.2% lower than the expected speedup in sparsification. Additionally, V-Combiner satisfies the accuracy threshold across all the benchmarks, while sparsification fails to meet the accuracy threshold in HITS-PLD.</p><p>V-Combiner ensures high speedups by dropping both vertices and edges, as compared to sparsification, which only drops edges. We can observe the higher speedups in all of the PR and HITS experiments (except PR-AR) and some of the BP and CD experiments, i.e., BP-AR, CD-TW, and CD-DB.</p><p>Comparing V-Combiner to k-core. k-core obtains a significantly lower average speedup than V-Combiner, i.e., 1.36&#215;. This is because k-core performs more work than V-Combiner (81.3%) to reach an accuracy above the 90% threshold. Furthermore, k-core fails to satisfy the accuracy threshold in four experiments. Another interesting observation is that both V-Combiner and k-core achieve higher average speedups than their expected speedups. This is because both techniques perform small-degree vertex pruning/merging, which helps balance the load better than in the original graph.</p><p>Compared to k-core, V-Combiner relies on the high connectivity of its approximate graph and its recovery algorithm to achieve higher speedup-accuracy operating points. The recovery algorithm is more effective in V-Combiner than in k-core. k-core's recovery is performed using other removed vertices with their initial values, while V-Combiner's recovery is performed using vertices that have already been involved in the computation. Except in HITS-FS, HITS-PLD, HITS-AR, and HITS-DB, V-Combiner achieves comparable or higher speedups than k-core. For these graphs, k-core performs less work with higher accuracy.</p><p>Overall, V-Combiner attains the highest average speedup in 8 out of 18 benchmarks. 
The reasons are better load balancing and the preservation of the average length of paths (relative to sparsification), and performing less work at high accuracy due to better connectivity (relative to k-core).</p><p>Speedup-Accuracy Trends. To further compare V-Combiner, sparsification, and k-core, Figure <ref type="figure">9</ref> shows pairs of (speedup, accuracy) for the techniques. In the charts, the X-axis shows accuracy from 0.85 (low) to 1.00 (high), and the Y-axis shows speedup over the baseline algorithm. We obtain these points by sweeping the parameters of each technique according to Section 7.1. As before, we include the post-processing time but not the pre-processing time, since we focus on the algorithm speedup only. Note that some of the data points in Figure <ref type="figure">9</ref> show higher speedups than in Table <ref type="table">4</ref>. This is because Table <ref type="table">4</ref> only shows the best end-to-end configurations.</p><p>Comparing V-Combiner to sparsification. In the figure, the closer the (speedup, accuracy) pairs are to the top right of each plot, the better the trade-off is. For all HITS and PR experiments except PR-AR, V-Combiner achieves better results than sparsification. Typically in PR, sparsification is more likely to achieve better trade-offs on more "skewed" graphs -i.e., graphs that have many small-degree vertices and few extraordinarily large-degree vertices. In such input graphs, there is not much room to merge subnodes without creating disconnectivity, since those large vertices are not picked as supernodes (according to Algorithm 2). However, in practice, most real-world graphs are less skewed and thus more suitable for V-Combiner. For BP, both schemes have similar results. Finally, for CD, sparsification generally gets better results, especially for highly dense graphs such as CD-FS. In contrast, V-Combiner achieves better results in sparser graphs such as CD-TW.</p><p>Comparing V-Combiner to k-core. 
Figure <ref type="figure">9</ref> shows that in most experiments, V-Combiner exhibits better performance-accuracy trade-offs than k-core. The exceptions are HITS-FS, HITS-PLD, HITS-AR, and HITS-DB, for the reason indicated before. However, we will see in Section 8.2 that the end-to-end time of k-core is significantly affected by the pre-processing overhead.</p><p>Figure <ref type="figure">10</ref> breaks down the end-to-end execution time into the steps of Section 7.1: (i) pruning or merging, (ii) building the graph, (iii) executing the algorithm, and (iv) post-processing or recovering. From the figure, we see that the prune/merge step is cheap in V-Combiner and sparsification. However, it is very expensive in most of the k-core experiments: even though we use the state-of-the-art k-core implementation <ref type="bibr">[6]</ref>, pruning until the remaining vertices have a degree of at least k has substantial overhead. The Build step takes, on average, 20%, 22.6%, 16.4%, and (not shown) 19.7% of the total Exact time, in Exact, V-Combiner, sparsification, and k-core, respectively. The Build time in V-Combiner is higher than in sparsification because, in this step, V-Combiner performs the actual vertex merging and generates the Delta graph. (Recall that, in the merge step, V-Combiner executes Algorithm 2, which computes the new vertex degrees.) However, the algorithm time is often shorter in V-Combiner than in sparsification, as we saw in Table <ref type="table">4</ref>. The recovery overheads are negligible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.2">End-to-End Analysis</head><p>Overall, we see that V-Combiner obtains an average end-to-end speedup of 1.25&#215; over the baseline across the 18 benchmarks. It does so with an average accuracy of 91.8% (Table <ref type="table">4</ref>). Sparsification typically has a slower algorithm but a shorter pre-processing time. The resulting average end-to-end speedup of sparsification over the baseline is similar to V-Combiner's in Figure <ref type="figure">10</ref>. However, one of the benchmark-graph pairs (HITS-PLD) does not reach the required accuracy in sparsification. Specifically, it can be shown that HITS-PLD only reaches 71% accuracy with sparsification. This experiment is not included in the average sparsification bar of Figure <ref type="figure">10</ref>. As a result, V-Combiner is preferable over sparsification. Finally, k-core exhibits a substantial slowdown over the baseline on average due to its large pruning overhead.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.3">Analysis of the Connectivity</head><p>Table <ref type="table">5</ref> shows pruning or merging statistics for the different techniques with the configurations of Table <ref type="table">4</ref>. For V-Combiner, we show the breakdown of the vertices in the original graph into supernodes, subnodes, and regular vertices. On average, the fraction of supernodes, subnodes, and regular vertices is 0.1%, 18.5%, and 81.4%, respectively. The percentage of vertices that remain in the approximate graph is equal to the percentage of supernodes plus regular vertices. On average, it is 81.5%. The percentage of edges remaining in the V-Combiner approximate graph is 71.4% on average.</p><p>For sparsification, the table shows the percentage of remaining edges (since no vertex is removed). On average, this number is 66.0%. Finally, for k-core, the table shows the percentages of remaining vertices and edges. On average, they are 54.4% and 87.6%, respectively. V-Combiner keeps more vertices in the approximate graph than k-core. This explains why V-Combiner provides better connectivity than k-core. We also see that, on average, the techniques generate approximate graphs with 65%-90% of the edges, which provides the most profitable performance-accuracy trade-off. Finally, sparsification has the lowest percentage of edges remaining. Because it does not remove vertices, it can tolerate dropping so many edges while still preserving good connectivity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.4">Analysis of the Average Length of Paths</head><p>Table <ref type="table">6</ref> compares the average length of the paths in each benchmark-graph pair in the original and approximate graphs. To compute the average path length in each original graph, we select and run 10,000 shortest path queries and average the resulting shortest path distances. Next, we use the same queries to measure the average length of the paths in the approximate graphs that the different techniques generated using the configurations of Table <ref type="table">4</ref>. Finally, we compute the error as the difference between the values in the original and approximate graphs, as a percentage.</p><p>V-Combiner preserves the average length of paths much better than sparsification or k-core. On average, the error in average path length in V-Combiner is only 3.8%, while it is 30.1% and 37.6% in sparsification and k-core, respectively. Generally, k-core reduces the average length of the paths, while sparsification increases it.</p><p>The reason is that k-core prunes vertices and therefore reduces the number of hops to go from one vertex to another. Moreover, k-core often disconnects parts of the graph, causing longer paths to disappear and the average path length to decrease. In contrast, sparsification increases the number of hops between vertices because it reduces the connections between the vertices by removing edges.</p></div>
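This measurement can be approximated with sampled BFS traversals on an unweighted graph. The sketch below is our source-sampling variant (it averages the distances to all vertices reached from each sampled source), whereas the paper runs 10,000 shortest path queries.

```python
from collections import deque
import random

def avg_path_length(n, adj, num_sources, seed=0):
    """Estimate the average shortest-path length by running BFS from
    sampled sources and averaging the distances of reachable vertices."""
    rng = random.Random(seed)
    total = count = 0
    for _ in range(num_sources):
        src = rng.randrange(n)
        dist = {src: 0}
        frontier = deque([src])
        while frontier:           # standard BFS on an unweighted graph
            v = frontier.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    frontier.append(w)
        total += sum(dist.values())  # src itself contributes distance 0
        count += len(dist) - 1       # exclude src from the average
    return total / count if count else 0.0
```

Because unreachable vertices are simply absent from `dist`, disconnecting parts of the graph makes long paths vanish from the average, which is exactly the k-core distortion described above.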
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.5">Analysis of Pruning/Merging Parameters</head><p>Table <ref type="table">7</ref> shows the pruning or merging parameters that we use to generate the best end-to-end configurations for each benchmark-graph pair and technique. In V-Combiner, the parameters are the supernode thresholds &#945; and &#946;, and the subnode threshold &#960;. The table shows these parameters as the corresponding values in the CDF of the number of edges (Figure <ref type="figure">8</ref>). We observe that, for CD and BP, the best (&#945;, &#946;) values are mostly (85%, 95%). For HITS, the best values are mostly (65%, 70%), and for PR, they are mostly (75%, 85%). The &#960; parameter shows more variation, as it generally ranges between 20% and 60%. In contrast, for sparsification and k-core, we do not see a clear trend on how to set the S and k parameters; hence, a full sweep of parameters is required.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">RELATED WORK</head><p>Previous research focused on algorithm-specific approximations such as FrogWild <ref type="bibr">[32]</ref> for Page Rank and approximate K-level asynchronous Breadth-First Search <ref type="bibr">[9]</ref>. While these approximations are effective for one algorithm, they often lack the generality to be applied to a broad range of algorithms. More general approaches include graph summarization techniques <ref type="bibr">[1,</ref><ref type="bibr">3,</ref><ref type="bibr">7,</ref><ref type="bibr">21,</ref><ref type="bibr">28,</ref><ref type="bibr">30,</ref><ref type="bibr">37]</ref>, other approximate graph processing frameworks <ref type="bibr">[25,</ref><ref type="bibr">33,</ref><ref type="bibr">38]</ref>, and general-purpose frameworks that can be applied to graph processing domains too <ref type="bibr">[12,</ref><ref type="bibr">40]</ref>. Compared to the general-purpose frameworks, V-Combiner is deterministic and provides application transparency.</p><p>Graph summarization techniques vary widely, from algorithm-specific ones such as sketching algorithms <ref type="bibr">[1,</ref><ref type="bibr">7,</ref><ref type="bibr">30,</ref><ref type="bibr">37]</ref> to more general-purpose ones such as sparsification <ref type="bibr">[3,</ref><ref type="bibr">15]</ref> and k-core <ref type="bibr">[21]</ref>. The original proposal of sparsification <ref type="bibr">[3]</ref> provides bounds for the Page Rank algorithm. However, it is not applicable in practice. First, it requires the input graph to be undirected, while most real-world graphs are directed. Second, it requires knowledge of the graph eigenvalues, which take much longer to compute than the actual Page Rank values. 
A practical implementation of sparsification <ref type="bibr">[15]</ref> replaced the eigenvalues with a tunable sparsification parameter and the average degree of the graph, which provides end-to-end performance gains but no accuracy guarantees.</p><p>The idea behind sketching algorithms is to summarize the graph information, tailored to a specific application, into a data structure that can be queried later to return a fast approximate answer after simple computations. Unfortunately, while sketching techniques provide high accuracy, their end-to-end performance is much worse than the exact baseline due to their high pre-processing overheads. V-Combiner has lower overhead and hence provides higher end-to-end performance at high accuracy. In contrast to some popular sketching algorithms that are tailored to specific graph algorithms, V-Combiner can be applied to a wide variety of graph algorithms, without modifying their code.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10">CONCLUSION</head><p>Fast graph processing has an important role in many applications where a small level of inaccuracy is acceptable. To speed-up iterative graph processing algorithms, we propose to merge certain graph vertices into hub vertices next to them -i.e. vertices with many connections. Our novel scheme, V-Combiner, systematically and deterministically constructs an approximate graph using pruning and merging. It also includes an inexpensive correction step after the graph algorithm executes, to recover the contribution of the pruned vertices. The result is faster execution at acceptable accuracy levels.</p><p>We evaluated V-Combiner on a 44-core shared-memory platform. On average across 4 applications and 5 graph datasets, V-Combiner attained an end-to-end speed-up of 1.25&#215; over the exact baseline system, with an accuracy of 91.8%. We also showed that V-Combiner provides an overall better performance-accuracy trade-off than the sparsification and k-core techniques.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>merging_output = Merging(merging_arguments);</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>approx_graph, delta_graph = Build(merging_output);</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>results = GraphAlgo(approx_graph, algo_params);</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>final_results = Recovery(results, delta_graph, algo_params);</p></note>
		</body>
		</text>
</TEI>
