We present a weighted approach to compute a maximum cardinality matching in an arbitrary bipartite graph. Our main result is a new algorithm that takes as input a weighted bipartite graph G(A cup B,E) with edge weights of 0 or 1. Let w <= n be an upper bound on the weight of any matching in G. Consider the subgraph induced by all the edges of G with a weight 0. Suppose every connected component in this subgraph has O(r) vertices and O(mr/n) edges. We present an algorithm to compute a maximum cardinality matching in G in O~(m(sqrt{w} + sqrt{r} + wr/n)) time. When all the edge weights are 1 (symmetrically when all weights are 0), our algorithm will be identical to the wellknown HopcroftKarp (HK) algorithm, which runs in O(m sqrt{n}) time. However, if we can carefully assign weights of 0 and 1 on its edges such that both w and r are sublinear in n and wr=O(n^{gamma}) for gamma < 3/2, then we can compute maximum cardinality matching in G in o(m sqrt{n}) time. Using our algorithm, we obtain a new O~(n^{4/3}/epsilon^4) time algorithm to compute an epsilonapproximate bottleneck matching of A,B subsetR^2 and an 1/(epsilon^{O(d)}}n^{1+(d1)/(2d1)}) poly log n time algorithm for computing epsilonapproximate bottleneck matching in ddimensions. All previous algorithms take Omega(n^{3/2}) time. Given any graph G(A cup B,E) that has an easily computable balanced vertex separator for every subgraph G'(V',E') of size V'^{delta}, for delta in [1/2,1), we can apply our algorithm to compute a maximum matching in O~(mn^{delta/1+delta}) time improving upon the O(m sqrt{n}) time taken by the HKAlgorithm.
more »
« less
Designing Scalable bMATCHING Algorithms on Distributed Memory Multiprocessors by Approximation
A bmatching is a subset of edges M such that at most b(v) edges in M are
incident on each vertex v, where b(v) is specified. We present a distributedmemory
parallel algorithm, \bsuitor, that computes a bmatching with more than
half the maximum weight in a graph with weights on the edges.
The approximation algorithm is designed to have high concurrency and low
time complexity. We organize the implementation of the algorithm
in terms of asynchronous supersteps that combine computation and
communication, and balance the computational work and frequency
of communication to obtain high performance.
Since the performance of the bsuitor algorithm is strongly influenced
by communication, we present several strategies to reduce the
communication volume. We implement the algorithm using a hybrid
strategy where internode communication uses MPI and intranode
computation is done with OpenMP threads.
We demonstrate strong and weak scaling of bsuitor up to
16,000 processors on two supercomputers at NERSC.
We compute a bmatching in a graph with 2 billion edges in
under 4 seconds using 16,000 processors.
more »
« less
 Award ID(s):
 1637534
 NSFPAR ID:
 10039359
 Date Published:
 Journal Name:
 Proceedings of ACM/IEEE Supercomputing Conference (SC16)
 Page Range / eLocation ID:
 773 to 783
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We describe a 3/2approximation algorithm, \lse, for computing a bedgecover of minimum weight in a graph with weights on the edges. The bedgecover problem is a generalization of the betterknown Edge Cover problem in graphs, where the objective is to choose a subset C of edges in the graph such that at least a specified number b(v) of edges in C are incident on each vertex v. In the weighted bedgecover problem, we minimize the sum of the weights of the edges in C. We prove that the Locally Subdominant edge (LSE) algorithm computes the same bedge cover as the one obtained by the Greedy algorithm for the problem. However, the Greedy algorithm requires edges to be sorted by their effective weights, and these weights need to be updated after each iteration. These requirements make the Greedy algorithm sequential and impractical for massive graphs. The LSE algorithm avoids the sorting step, and is amenable for parallelization. We implement the algorithm on a serial machine and compare its performance against a collection of approximation algorithms for the bedge cover problem. Our results show that the algorithm is 3 to 5 times faster than the Greedy algorithm on a serial processor. The approximate edge covers obtained by the LSE algorithm have weights greater by at most 17% of the optimal weight for problems where we could compute the latter. We also investigate the relationship between the bedge cover and the bmatching problems, show that the latter has a faster implementation since edge weights are static in this algorithm, and obtain a heuristic solution for the former from the latter.more » « less

Bender, M. ; Gilbert, J. ; Hendrickson, B. ; Sullivan, B. (Ed.)We design new serial and parallel approximation algorithms for computing a maximum weight bmatching in an edgeweighted graph with a submodular objective function. This problem is NPhard; the new algorithms have approximation ratio 1/3, and are relaxations of the Greedy algorithm that rely only on local information in the graph, making them parallelizable. We have designed and implemented Local Lazy Greedy algorithms for both serial and parallel computers. We have applied the approximate submodular bmatching algorithm to assign tasks to processors in the computation of Fock matrices in quantum chemistry on parallel computers. The assignment seeks to reduce the run time by balancing the computational load on the processors and bounding the number of messages that each processor sends. We show that the new assignment of tasks to processors provides a four fold speedup over the currently used assignment in the NWChemEx software on 8000 processors on the Summit supercomputer at Oak Ridge National Lab.more » « less

Stateoftheart synchronous graph processing frameworks face both inefficiency and imbalance issues that cause their performance to be suboptimal. These issues include the inefficiency of communication and the imbalanced graph computation/communication costs in an iteration. We propose to replace their conventional twosided communication model with the onesided counterpart. Accordingly, we design SHMEMGraph, an efficient and balanced graph processing framework that is formulated across a global memory space and takes advantage of the flexibility and efficiency of onesided communication for graph processing. Through an efficient onesided communication channel, SHMEMGraph utilizes the highperformance operations with RDMA while minimizing the resource contention within a computer node. In addition, SHMEMGraph synthesizes a number of optimizations to address both computation imbalance and communication imbalance. By using a graph of 1 billion edges, our evaluation shows that compared to the stateoftheart Gemini framework, SHMEMGraph achieves an average improvement of 35.5% in terms of job completion time for five representative graph algorithms.more » « less

We describe a paradigm for designing parallel algorithms via approximation, and illustrate it on the bedgecover problem. A bedgecover of minimum weight in a graph is a subset $C$ of its edges such that at least a specified number $b(v)$ of edges in $C$ is incident on each vertex $v$, and the sum of the edge weights in $C$ is minimum. The Greedy algorithm and a variant, the LSE algorithm, provide $3/2$approximation guarantees in the worstcase for this problem, but these algorithms have limited parallelism. Hence we design two new $2$approximation algorithms with greater concurrency. The MCE algorithm reduces the computation of a bedgecover to that of finding a b'matching, by exploiting the relationship between these subgraphs in an approximation context. The LSENW is derived from the LSEalgorithm using static edge weights rather than dynamically computing effective edge weights. This relaxation gives LSE a worse approximation guarantee but makes it more amenable to parallelization. We prove that both the MCE and LSENW algorithms compute the same bedgecover with at most twice the weight of the minimum weight edge cover. In practice, the $2$approximation and $3/2$approximation algorithms compute edge covers of weight within $10\%$ the optimal. We implement three of the approximation algorithms, MCE, LSE, and LSENW on shared memory multicore machines, including an Intel Xeon and an IBM Power8 machine with 8 TB memory. The MCE algorithm is the fastest of these by an order of magnitude or more. It computes an edge cover in a graph with billions of edges in $20$ seconds using two hundred threads on the IBM Power8. We also show that the parallel depth and work can be bounded for the Suitor and bSuitor algorithms when edge weights are random.more » « less