<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Combinatorial Algorithm for Approximating the Optimal Transport in the Parallel and MPC Settings</title></titleStmt>
			<publicationStmt>
				<publisher>NeurIPS</publisher>
				<date>12/10/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10513313</idno>
					<idno type="doi"></idno>
					<title level='j'>Proc. 37th Neural Information Processing Systems</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>N Lahn</author><author>S Raghvendra</author><author>K Zhang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Optimal Transport is a popular distance metric for measuring similarity between distributions. Exact and approximate combinatorial algorithms for computing the optimal transport distance are hard to parallelize. This has motivated the development of numerical solvers (e.g. Sinkhorn method) that can exploit GPU parallelism and produce approximate solutions. We introduce the first parallel combinatorial algorithm to find an additive "approximation of the OT distance. The parallel complexity of our algorithm is O(log(n)/" 2 ) where n is the total support size for the input distributions. In Massive Parallel Computation (MPC) frameworks such as Hadoop and MapReduce, our algorithm computes an "-approximate transport plan in O(log(log(n/"))/" 2 ) rounds with O(n/") space per machine; all prior algorithms in the MPC framework take ⌦(log n) rounds. We also provide a GPU-friendly matrix-based interpretation of our algorithm where each step of the algorithm is row or column manipulation of the matrix. Experiments suggest that our combinatorial algorithm is faster than the state-of-the-art approximate solvers in the GPU, especially for higher values of n.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Optimal transport (OT) is a useful metric for measuring similarity between distributions and has numerous applications <ref type="bibr">[2,</ref><ref type="bibr">4,</ref><ref type="bibr">5,</ref><ref type="bibr">9,</ref><ref type="bibr">12,</ref><ref type="bibr">29]</ref>, including image retrieval <ref type="bibr">[28]</ref>, GAN training <ref type="bibr">[23]</ref>, and interpolation between distributions <ref type="bibr">[7]</ref>. Given two distributions &#181; and &#9003;, this metric captures the minimum-cost plan for transporting mass from &#181; to &#9003;.</p><p>More formally, in the optimal transport problem, we are given two discrete distributions &#181; and &#9003; whose supports are the point sets A and B, respectively. For each point a 2 A (resp. b 2 B), we associate a probability of &#181; a (resp. &#9003; b ) with it such that P a2A &#181; a = P b2B &#9003; b = 1. We refer to each point of A as a demand point and each point in B as a supply point. For any edge (a, b) 2 A &#8677; B, we are given a cost c(a, b); we assume that the costs are scaled so that the largest cost edge is 1. Let c(a, b) be the cost of transporting a supply amount of from b to a. A transport plan is a function : A &#8677; B ! R 0 that assigns a non-negative value to each edge of G, indicating the amount of supply transported along the edge. The transport plan is such that the total supplies transported into (resp. from) any demand (resp. supply) node a 2 A (resp. b 2 B) is bounded by the demand (resp. supply) at a (resp. b). The cost of the transport plan, denoted by c( ), is given by P (a,b)2A&#8677;B (a, b)c(a, b). In this optimal transport problem, we are interested in finding a minimum-cost transport plan that transports all of the supply, denoted by &#8676; . We also define an "-approximate transport plan to be any transport plan with a cost c( ) &#63743; c( &#8676; ) + " that transports all of the supply.</p><p>The special case where A and B each contain n points each and where every point in A (resp. B) has a demand of 1/n (resp. supply of 1/n) is called the assignment problem. In this special case, there is an optimal transport plan with a special structure; specifically, when there are n vertex-disjoint edges (a, b) with (a, b) = 1/n, they form a perfect matching. Let the cost of any matching M , denoted by c(M ) be the total cost of all of its edges, i. Given a perfect matching M , the cost of the corresponding transport plan is simply</p><p>For simplicity in exposition, in the context of the assignment problem, we will uniformly scale all the demands and supplies from 1/n to 1. This does not change the optimal transport plan. It, however, increases the cost of the optimal transport plan to c(M ), an increase by a factor of n. Thus, for the assignment problem, finding the optimal transport plan is equivalent to finding a minimum-cost perfect matching M &#8676; .</p><p>Similarly, for an " &gt; 0, after scaling the demands and supplies from 1/n to 1, an "-approximate transport plan corresponds to a perfect matching M with cost c(M ) &#63743; c(M &#8676; ) + "n. We refer to such a perfect matching as "-approximate matching. Thus, for the assignment problem, finding an "-approximate transport plan corresponds to finding an "-approximate matching.</p><p>Related Work: For discrete distributions, the optimal transport problem can be formulated as a minimum-cost flow problem and solved using any LP-solver. The best-known exact and approximate solver for optimal transport, as well as the assignment problem, are graph-based combinatorial algorithms <ref type="bibr">[25,</ref><ref type="bibr">19,</ref><ref type="bibr">17,</ref><ref type="bibr">27]</ref>. These solvers, however, are known to be difficult to parallelize. For instance, the best "-approximate OT solver, in terms of sequential running time, was given by Lahn et al. <ref type="bibr">[19]</ref>. This combinatorial algorithm runs in O(n 2  /" + n/" 2 ) time, and is a non-trivial adaptation of the classical combinatorial exact algorithm by Gabow and Tarjan (GT-algorithm) for the transportation problem <ref type="bibr">[13]</ref>. The Lahn et al. algorithm runs no more than b2/"c + 1 iterations, where each iteration executes a Dijkstra's shortest path search to find and augment along a set of "augmenting paths". This algorithm is the state-of-the-art in the sequential setting; see <ref type="bibr">[22]</ref> for a discussion on the various algorithms. Unfortunately, however, the flow augmentations have to be done in a sequential manner, making this algorithm hard to parallelize.</p><p>Motivated by the need for efficient scalable solutions, machine learning researchers have designed highly parallelizable approximate solvers that generate an "-approximate transport plan in &#213;(n 2 /" O(1) ) sequential time and &#213;(1/" O(1) ) parallel time. Perhaps the most successful among these is an entropy regularized version of the optimal transport, which can be solved using the Sinkhorn-Knopp method <ref type="bibr">[1,</ref><ref type="bibr">8]</ref> and produces an "-approximation to the optimal transport in &#213;(n 2 /" 2 ) sequential time and &#213;(1/" 2 ) parallel time <ref type="bibr">[10]</ref>. The simplicity of this algorithm has lead to an efficient GPU implementation. From a theoretical standpoint, Jambulapati et al. <ref type="bibr">[16]</ref> designed a dual extrapolation algorithm using an area convex mapping <ref type="bibr">[31]</ref> to achieve an improved parallel complexity of O((log(n) log log n)/"). However, as noted by Lin et al. <ref type="bibr">[22]</ref>, despite its sound theoretical guarantees, the lack of simplicity and the difficulties of implementation make this algorithm by Jambulapati et al. less competitive, and the Sinkhorn algorithm remains the state-of-the-art for approximating the Optimal Transport on GPUs.</p><p>Despite being the state-of-the-art in sequential settings, combinatorial algorithms have remained difficult to parallelize and all known exact or approximate combinatorial algorithms run in only slightly sub-linear parallel time <ref type="bibr">[14]</ref>. In this paper, we design the first parallel combinatorial algorithm that takes only O(log(n)/" 2 ) parallel time and finds an "-approximation transport plan.</p><p>Our algorithm also improves upon existing algorithms in the massive parallel computation frameworks such as Hadoop, MapReduce, Dryad and Spark. In the Massively Parallel Computing (MPC) model, we are given a set of machines, where each machine has a bounded amount of memory. For this model, the base assumption is that communication between machines is the performance bottleneck, and the goal is to minimize the number of synchronized communication rounds, where each round consists of a period of local computation on each machine, followed by a period of message communication between machines. It is well known that any standard parallel algorithm that takes O(f (n)) time can be directly translated to an algorithm under the MPC model that runs in O(f (n)) communication rounds. However, algorithms that are more specialized for the MPC model can acheive drastically faster computation times, often having a sub logarithmic number of rounds. For example, it has long been known how to compute a maximal matching in O(log n) parallel time <ref type="bibr">[15]</ref>, but only recently was a breakthrough made that shows how to compute a maximal matching in O(log log n) rounds under the MPC model <ref type="bibr">[3]</ref>. Maximal matching is a substantially simpler problem than both the assignment and OT problems. For these OT problem, known parallel "-approximation algorithms immediately yield an &#213;(log(n)/") round algorithm for the MPC model <ref type="bibr">[16]</ref>. To our knowledge, no specialized MPC algorithms are known for the either problem. Thus, we provide the first sub-logarithmic round "-approximation algorithm for both the assignment and OT problems. We obtain this bound by leveraging the recent breakthrough MPC algorithm for maximal matching by <ref type="bibr">[3]</ref>.</p><p>Our Results: In this paper, we present a very simple combinatorial algorithm to compute an "-approximate transport plan in O(n 2 /" 2 ) sequential time and O(log(n)/" 2 ) parallel time. For the special case of the assignment problem, the sequential execution time of our algorithm improves to O(n 2 /"). We also provide a GPU implementation of our algorithm that outperforms the implementation of the Sinkhorn algorithm provided by Python Optimal Transport library <ref type="bibr">[11]</ref>.</p><p>Our algorithm also extends to the well-known Massive Parallel Computation (MPC) frameworks such as MapReduce, Hadoop, Dryad and Spark. In the MPC model, our algorithm computes a "-approximate transport plan in O(log(log n)/" 2 ) rounds with O(n) memory per machine. Our algorithm is based on the popular push-relabel framework <ref type="bibr">[14]</ref> for computing minimum-cost flow.</p><p>Theorem 1.1. Given an " &gt; 0, there is an algorithm that computes an "-approximate matching in O(n 2  /") time. Furthermore, one can execute this algorithm in expected O(log(n)/" 2 ) parallel time or in expected O(log(log n)/" 2 ) rounds, with O(n) memory per machine, in the MPC model.</p><p>Extension to the Optimal Transport problem: For any " &gt; 0, Lahn et al. <ref type="bibr">[19]</ref> showed that computing an "-approximation of the optimal transport between two discrete distributions containing n points in their support reduces to an instance of an unbalanced assignment problem with n/" points. We apply this reduction and slightly adapt our algorithm from Theorem 1.1 to obtain our result for the optimal transport problem (Theorem 1.2). In this paper, we present details of our algorithm for the assignment problem. Details of the adaptation of our algorithm to the optimal transport by using the reduction of <ref type="bibr">[19]</ref> is presented in the Section B of the appendix. Theorem 1.2. Given an " &gt; 0, there is an algorithm that computes an "-approximate transport plan in O(n 2 /" 2 ) time. Furthermore, one can execute this algorithm in expected O(log(n)/" 2 ) parallel time or in O(log(log(n/"))/" 2 ) rounds with O(n/") memory per machine in the MPC model.</p><p>From a theoretical stand-point, we provide the first parallel combinatorial algorithm for approximating the optimal transport with a expected parallel execution time of O(log(n)/" 2 ). The sequential execution time of O(n 2 /") for our algorithm for the assignment problem matches with the current state-of-the-art for the problem <ref type="bibr">[16,</ref><ref type="bibr">19]</ref>. We also provide the first sub-logarithmic round algorithm that approximates the optimal transport plan in the MPC model.</p><p>From a practical stand-point, for both the assignment problem and the OT problem, we provide an implementation that exploits GPU parallelism. Experiments suggest that both of our GPU implementations outperform the GPU implementation of the state-of-the-art Sinkhorn algorithm provided by the Python Optimal Transport library <ref type="bibr">[11]</ref> in terms of running time, while achieving the same level of accuracy.</p><p>Our Approach: Our algorithmic approach is based on the popular push-relabel framework for computing network flows. For the assignment problem, our algorithm maintains a matching M and a set of dual weights y(&#8226;) on vertices of A [ B. The algorithm runs in O(1/" 2 ) iterations and in each iteration, it executes three steps. First, it greedily computes a maximal matching M 0 . In the second step, it uses M 0 to update the matching M (the push step). Finally, it updates the dual weights (relabel step). Our proof of correctness is based on the standard dual feasibility conditions used to compute minimum-cost maximum cardinality matchings, with some modifications made to better accommodate our additive-approximate setting. Our main technical difference from standard push-relabel techniques is the novel running time analysis for the additive approximate setting. In particular, we show that the number of iterations required by our algorithm is just O(1/" 2 ). Within each iteration, the push and relabel steps take only O(n) sequential time and O(1) parallel time. The only non-trivial step, therefore, is the computation of a maximal matching which can be done in O(n 2 ) sequential time and O(log n) parallel time <ref type="bibr">[15]</ref>. Maximal matchings can also be computed in O(log log n) rounds in the massively parallel computation (MPC) model <ref type="bibr">[3]</ref>. As a result, our algorithm can also be executed in O(log(log n)/" 2 ) rounds in the MPC model, for the assignment problem. We extend our algorithm to also approximate the optimal transport plan by using the reduction of Lahn et al. <ref type="bibr">[19]</ref> (see Section B of the appendix for details).</p><p>Organization: In Section 2.1, we present the definitions required to describe our algorithm. In Section 2.2, we present our algorithm for the assignment problem. For simplicity of exposition, we present an algorithm that computes a 3"-approximation of the optimal solution to the assignment problem. To obtain an "-approximation, one can simply choose the error factor in the algorithm to be "/3. In Section 3, we prove the sequential complexity of our algorithm for the assignment problem. In Section 4, we analyze the complexity of our algorithm in the parallel and MPC settings and also describe a GPU-friendly implementation of our matching algorithm. Finally, we present the experimental results in Section 5. In the appendix, Section B, we extend our algorithm to the optimal transport problem. All missing proofs can also be found in the appendix, Section A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Algorithm</head><p>In this section, given an input to the assignment problem and a value 0 &lt; " &lt; 1, we present an algorithm that computes a 3"-approximate matching.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Preliminaries</head><p>We begin by introducing the terminologies required to understand our algorithm for the assignment problem. For any matching M , we say that any vertex v 2 A [ B is free if v is not matched in M and matched otherwise. Our algorithm critically uses the notion of a maximal matching which we introduce next. For any bipartite graph that is not necessarily complete, any matching M is maximal if and only if at least one end point of every edge in the graph is matched in M . Thus, if a matching is not maximal, there is at least one edge between two free vertices. One can, therefore, compute a maximal matching in a greedy fashion by iteratively picking such an edge and adding it to the matching.</p><p>For every edge (u, v) 2 A &#8677; B, we transform its cost so that it becomes an integer multiple of " as follows:</p><p>The rounding of edge costs may introduce an error that is bounded by " for each edge and by at most "n for any matching. Our algorithm assigns a dual weight y(v) for every v 2 A [ B such that a set of relaxed dual feasibility conditions are satisfied. A matching M along with dual weights y(&#8226;) is</p><p>In Lemma 3.1, we show that any "-feasible matching produced by our algorithm has a cost within an additive error of " from the optimal solution with respect to the costs c(&#8226;, &#8226;). For any edge (u, v), we define its slack s(u, v) to be 0 if (u, v) 2 M . Otherwise, if (u, v) 6 2 M , we set its slack to be s(u, v) = c(u, v) + " y(u) y(v). We say that (u, v) is admissible if the slack on the edge is 0.</p><p>We observe that any matching M whose cardinality is at least (1 ")n can be converted into a perfect matching simply by arbitrarily matching the remaining "n free vertices. The cost of any edge is at most 1, and so, this increases the cost of the matching M by at most "n. In addition to this, the rounding of costs from c(&#8226;, &#8226;) to c(&#8226;, &#8226;) also introduces an increase of cost by "n. Finally, the "-feasibility conditions introduced an additional additive error of "n, for a total error of 3"n, as desired. Thus, in the rest of this section, we present an algorithm that computes an "-feasible matching of cardinality at least (1 ")n, which has a cost no more than "n above the optimal matching's cost with respect to c(&#8226;, &#8226;).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Algorithm Details</head><p>Initially, we set the dual weight of every vertex b 2 B to be " and every vertex a 2 A to be 0. We initialize M to ;. Our initial choice of M and the dual weights satisfies (2) and (3). Our algorithm executes iterations, which we will call phases. Within each phase, the algorithm constructs the set B 0 , which consists of all free vertices of B. If |B 0 | &#63743; "n, then M is an "-feasible matching of cardinality at least (1 ")n, and the algorithm will arbitrarily match the remaining free vertices and return the resulting matching. Otherwise, the algorithm computes the subset E 0 &#10003; E of admissible edges with at least one end point in B 0 . Let A 0 = {a | a 2 A and (a, b) 2 E 0 }, i.e., the set of points of A that participate in at least one edge in E 0 . For each phase, the algorithm executes the following steps:</p><p>(I) Greedy step: Computes a maximal (i.e., greedy) matching M 0 in the graph G 0 (A 0 [ B 0 , E 0 ). (II) Matching Update: Let A 00 be the set of points of A 0 that are matched in both M and M 0 and let M 00 be the edges of M that are incident on some vertex of A 00 . The algorithm adds the edges of M 0 to M and deletes the edges of M 00 from M . (III) Dual Update:</p><p>a. For every edge (a, b) 2 M 0 , the algorithm sets y(a) y(a) ", and b. For every vertex b 2 B 0 that is free with respect to M 0 , the algorithm sets y(b) y(b) + ".</p><p>In each phase, the matching update step will add edges of M 0 to M and remove edges of M 00 from M . By construction, the updated set M is a matching. Furthermore, every vertex of A that was matched prior to the update continues to be matched after the update.</p><p>Lemma 2.1. The new set M of edges obtained after Step (II) is a matching. Furthermore, any vertex of A that was matched prior to Step (II) will continue to be matched after the execution of Step (II).</p><p>The dual update step increases or reduces dual weights by ". Therefore, the dual weights always remain an integer multiple of ".</p><p>The algorithm maintains the following invariants:</p><p>(I1) The dual weight of every vertex in B (resp. A) is non-negative (resp. non-positive). Furthermore, every free vertex of A has a dual weight of 0. (I2) The matching M and a set of dual weights y(&#8226;) is "-feasible.</p><p>Proofs of these invariants can be found in the appendix Section A.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Analysis</head><p>Next, in Section 3.1, we use invariants (I1) and (I2) to show that the algorithm produces a matching with the desired accuracy. In Section 3.2, we use the invariants to bound the sequential and parallel execution times of our algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Accuracy</head><p>As stated in Section 2.1, the rounding of costs from c(&#8226;, &#8226;) to c(&#8226;, &#8226;) introduces an error of "n. Furthermore, after obtaining a matching of size at least (1 ")n, the cost of arbitrarily matching the last "n vertices is no more than "n. From the following lemma, we can conclude that the total error in the matching computed by our algorithm is no more than +3"n. Proof of this Lemma can be found in the appendix Section A.2 Lemma 3.1. The "-feasible matching of size at least (1 ")n that is produced by the main routine of our algorithm is within an additive error of "n from the optimal matching with respect to the rounded costs c(&#8226;, &#8226;)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Efficiency</head><p>Suppose there t phases executed by the algorithm. We use n i to denote the size of B 0 in phase i. By the termination condition, each phase is executed only if B 0 has more than "n vertices, i.e., n i &gt; "n. First, in Lemma 3.2 (appendix A.3), we show that the magnitude of the dual weight of any vertex cannot exceed (1 + 2"). This means the total dual weight magnitude over all vertices is upper bounded by n(1 + 2"). Furthermore, in Lemma 3.3 (appendix A.4) we show that, during phase i, the total dual weight magnitude increases by at least "n i . From this, we can conclude that</p><p>Note that, since each n i "n, we immediately get t"n &#63743; n(1+2")/", or t &#63743; (1+2")/" 2 = O(1/" 2 ).</p><p>In order to get the total sequential execution time, we show, in Lemma 3.4 (appendix A.5) that each phase can be efficiently executed in O(n &#8677; n i ) time. Combining this with equation ( <ref type="formula">4</ref>) gives an overall sequential execution time of O(n(</p><p>/"). Lemma 3.2. For any vertex v 2 A [ B, the magnitude of its dual weight cannot exceed 1 + 2", i.e., |y(v)| &#63743; (1 + 2").</p><p>Lemma 3.3. The sum of the magnitude of the dual weights increases by at least "n i in each iteration. Lemma 3.4. The execution time of each phase is O(n &#8677; n i ) time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Analysis for the Unbalanced Case</head><p>In this section, we describe how the analysis of our matching algorithm can be extended to work for the unbalanced case, where |A| 6 = |B|. This analysis is critical for proving the correctness of our optimal transport version of the algorithm. Without loss of generality, assume |B| &#63743; |A| = n. The overall description of the algorithm remains the same, except for the main routine of our algorithm produces an "-feasible matching of size at least (1 ")|B|. The asymptotic running time of both the parallel and sequential algorithms remains unchanged. In the following lemma, we bound the additive error of our algorithm for the unbalanced case; the argument is very similar to Lemma 3.1. Lemma 3.5. Given an unbalanced input to the assignment problem with |B| &#63743; |A|, the "-feasible matching of cardinality at least (1 ")|B| that is returned by our algorithm is within an additive error of "|B| from the optimal matching with respect to the cost function c(&#8226;, &#8226;)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Parallel Algorithm</head><p>In this section, we describe how to parallelize the matching algorithm of Section 2.2 leading to the result of Theorem 1.1. Recall that each phase of this algorithm has three steps : (I) Greedy step, (II) Matching update, and (III) Dual update. Steps (II) and (III) are easily parallelizable using O(1) time. However, step (I) is nontrivial to parallelize. Fortunately, Isreali and Itai gave a O(log n) randomized parallel algorithm for computing a maximal matching on an arbitrary graph <ref type="bibr">[15]</ref>. Therefore, we can complete step (I) by applying their algorithm as a black box. However their algorithm is very generic, applying even for non-bipartite graphs. In Section 4.1, we use the Israeli Itai algorithm to bound the parallel complexity of our algorithm. In Section 4.2, we further parallelize the phases of our algorithm leading to a simplified variation of the Israeli Itai algorithm that is more suited for a practical implementation. Finally, in Section 4.3, we provide a matrix-based interpretation of our simplified algorithm, allowing the algorithm to be easily implemented for GPUs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Analysis of our Algorithm for Parallel and MPC Models</head><p>The Israeli Itai algorithm is designed to work on an arbitrary graph G(V, E), which may not be bipartite. It computes a maximal matching on G in O(log n) iterations. By directly using their O(log n) algorithm for step (I) of our algorithm, we obtain an O(log n) parallel running time for each phase. Since our algorithm executes O(1/" 2 ) phases, we obtain a worst-case theoretical bound of O(log n/" 2 ) for our algorithm.</p><p>We would also like to note that specialized algorithms for maximal matching exist for the MPC model as well. For example, it is possible to compute a maximal matching under the MPC model using just O(log(log( ))) rounds and O(n) space per machine, where is the maximum degree of the graph. <ref type="bibr">[3]</ref>). As a result, we are able to achieve an algorithm in the MPC model that requires only O(log(log n)/" 2 ) rounds with O(n) space per machine.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Simplifying the Parallel Implementation of our Algorithm</head><p>Instead of using Israeli Itai algorithm (which works for any arbitrary graph) as a black-box, we use a simpler adaptation of their algorithm for bipartite graphs. The Israeli Itai algorithm executes O(log n) iterations, where each iteration executes the following steps, using O(1) parallel time:</p><p>(i) Each vertex u 2 V selects an incident edge (u, v) at random, and directs it from u to v, yielding a directed subgraph R &#10003; E. (ii) Each vertex u 2 V selects, at random, one incoming edge from R. Let S be the set of edges chosen by this step, with all directions removed. The graph S has a maximum vertex degree of 2. (iii) Each vertex selects an edge of S at random. Any edge that is selected by both of its endpoints is added to M , and any vertex matched by this step is removed for the next phase.</p><p>The Israeli Itai algorithm is designed to work for any graph that is not necessarily bipartite. In this situation, there is no way to partition the vertices into sets A and B, and so it is necessary to consider edges as having two directions; a vertex u could 'propose' an outgoing edge (u, v) (see step (i)) and also 'receive' multiple proposals as incoming edges (see step (ii)). In the non-bipartite case, it is necessary for every vertex to both send and receive proposals; all vertices must be handled using a symmetric process. This results in a subgraph S, where each vertex could have a degree of 2, and an additional step is required to eliminate some of these edges in order to form a matching (see step (iii)).</p><p>However, in our situation, we are solely working with bipartite graphs. In this situation, the vertices are divided into two sets A and B, and, as a result, we do not need to process each vertex in a symmetric fashion. Instead, we can allow one side to make proposals and the other side to receive proposals. As a result, for each iteration of the maximal matching algorithm, we can execute the following steps, which correspond to steps (i) and (ii) in the Israeli Itai algorithm.</p><p>(a) Each vertex of B selects, at random, an incident edge, yielding a subgraph S. (b) Each vertex of A, with degree at least one in S, arbitrarily selects an edge from S and adds it to the matching.</p><p>Note that, after step (b) in our approach, each vertex has at most one incident edge selected. This alleviates the need for step (iii), since steps (a) and (b) alone immediately result in a matching.</p><p>In addition to our simplified approach to each iteration of the Israeli Itai algorithm, we make a second optimization: Within our algorithm, instead of waiting for the Israeli Itai algorithm to complete, which could require O(log n) iterations, our implementation immediately updates the matching and dual weights after each iteration before moving to the next iteration of the Israeli Itai algorithm. While this increases the number of phases, each phase becomes very simple, taking only O(1) time, resulting in practical improvements. We believe that this additional source of parallelization could lead to asymptotic improvements to the parallel complexity of our algorithm. However, the proofs of Israeli Itai do not readily extend to this modified algorithm. Obtaining a tight bound on the parallel complexity of our modified algorithm is an important open question.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">A GPU-Friendly Implementation</head><p>Thus far, we have described our algorithm using graph theoretic notations. However, in practice, it is important for our algorithm to have an efficient GPU-based implementation. In this section, we provide a matrix-based implementation of our algorithm, using the simplified maximal matching approach described in Section 4.2.</p><p>In Algorithm 1, we provide a pseudocode of our simplified algorithm. The algorithm resembles the one described in Section 2.2, except for the differences described in Section 4.2. It assumes that the costs were already rounded to each be an even multiple of ". The matching returned by the algorithm has cardinality at least (1 ")n and a cost at most "n above the optimal cost. The algorithm, as written, does not maintain dual weights explicitly, but if dual weights are required as part of the output, then, as discussed below, the algorithm can be modified to keep track of them. Note that all operations in this psuedocode are based on relatively simple matrix operations, and can be implemented easily on a GPU. for all columns b with all zero entries in M do 6:</p><p>Randomly select a row a such that S a,b = 0</p><p>an assignment problem between randomly generated synthetic point sets, an OT problem between randomly generated point sets, each having randomly assigned demands and supplies, an assignment problem formed from two sets of MNIST images, and an OT problem between two text word embeddings. For each setup, we generated input data and computed the assignment or OT cost using our algorithm with different values of ". Then, we determined the appropriate regularization parameter of the Sinkhorn algorithm, ensuring the Sinkhorn distance is close but no lower than the cost of the solution generated by our algorithm. We recorded the running time and the number of parallel rounds for both Sinkhorn and our algorithm. We also repeated each experiment using a reversed process by fixing the regularization parameter of Sinkhorn and searching for the " value for our algorithm, which guarantees the our cost is similar to, but no more than, the Sinkhorn distance; the results for these reversed experimental setups can be found in the technical appendix . Note that, we also recorded the execution time of solving for the exact solution using POT's EMD function, which runs on the CPU. We only present for values of " that are large enough such that either our algorithm or Sinkhorn algorithms run faster than the exact algorithm. We also record the additive error, relative to the optimal solution, of both Sinkhorn and our algorithm in appendix section C.1. Additionally, we conducted experiments comparing the sequential performances of our algorithm and Sinkhorn on the CPU, which can be found in the appendix section C.4.</p><p>Synthetic Data: For synthetic data generation, for both the assignment and OT experiments, we randomly sampled the location of two groups of n = 10, 000 vertices, A and B, in a 2-dimensional unit square. For the assignment problem, the demand or supply of every vertex is 1/n. For the OT problem, the capacity of each vertex is initially chosen by selecting, uniformly at random from the interval [0, 1]. Then, the capacities are normalized such that the total supply and demand are both equal to 1. For any pair of points (a, b) 2 A &#8677; B, the cost c(a, b) was set as the squared Euclidean distance between them. For each value of ", we executed 10 runs. For each combination of ", and algorithm choice, we averaged both the running times as well as the number of parallel rounds over all 10 runs and recorded the results. The results of running times and parallel rounds can be seen in Figure <ref type="figure">1</ref>  MNIST: Next, we ran a similar experiment using real-world image data. We generated our inputs using the MNIST dataset of hand-written digit images <ref type="bibr">[21]</ref>. Each image consists of a 28 &#8677; 28 pixel gray-scale image. The sets A and B each consist of n = 10, 000 images from the MNIST dataset, selected at random. The cost c(a, b) between two images a 2 A and b 2 B is computed as follows: Let a(i, j) (resp. b(i, j)) be the value of the pixel in row i and column j of image a (resp. b). First, the two images are normalized so that the sum of all pixel values is equal to 1 for each image, i.e., P i,j2 <ref type="bibr">[1,</ref><ref type="bibr">28]</ref> a(i, j) = 1 and P i,j2 <ref type="bibr">[1,</ref><ref type="bibr">28]</ref> b(i, j) = 1. Then, the cost c(a, b) is given by the L 1 distance between the resulting normalized images: c(a, b) = P i,j2 <ref type="bibr">[1,</ref><ref type="bibr">28]</ref> |a(i, j) b(i, j)|. Note that an upper bound on the largest cost is 2. For each algorithm and for each value of ", we averaged both the running times as well as the number of parallel rounds over 10 runs. The results for these experiments can be found in the plot in Figure <ref type="figure">1</ref>(c) and the plot in Figure <ref type="figure">2</ref>(c). NLP: In our final experiment, we considered OT with natural language data. We calculate the document distances based on OT following the procedure of previous work <ref type="bibr">[18]</ref>. Each OT instance was generated by selecting two discrete sections of text with fixed lengths. Each unique word in the first (resp. second) section of text corresponds to a vertex of A (resp. B) in the OT problem. To acquire the supply and demand for each vertex, we tokenized each text section with NLTK <ref type="bibr">[6]</ref>, and counted the number of appearances of each unique token. Then the counts were normalized so that the total supply and demand were each equal to 1. Next, to generate the costs between vertices, we represent each unique token using a 100-dimensional GloVe word embedding <ref type="bibr">[26]</ref>. The cost of any edge (a, b) 2 A &#8677; B is then given by the Euclidean distance between the corresponding points in this embedding.</p><p>In the last three plots in Figure <ref type="figure">1</ref> and Figure <ref type="figure">2</ref>, we show the results of applying this experimental setup, using sections of the text from The Count of Monte Cristo, 20News, IMDB. For each dataset, five different OT instances are created, using different sections of the text, and the results are averaged over all 5 runs. These experiments can be found in Figure <ref type="figure">1</ref> In all our experiments, our new parallel combinatorial algorithm almost always runs significantly faster than the Sinkhorn algorithm for the OT problem often with significantly fewer parallel rounds. Unlike the POT's highly optimized implementation of the Sinkhorn method, the implementation of our algorithm is new and may benefit significantly from further optimizations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this work, we provided a fast, highly parallelizable combinatorial algorithm for computing an "-approximate solution to the assignment problem and the OT problem. We also provided a practical implementation of a slight variation of our algorithm, which outperforms the Sinkhorn algorithm, in terms of running times, in our experimental comparison. In light of this work, we would like to propose the following open question : Is it possible to improve the O(log(n)/" 2 ) parallel running time of our algorithm, possibly by introducing an appropriate modification? In particular, our simplified practical algorithm presented in Section 4.2 seems to execute fewer parallel rounds than the our worst-case theoretical analysis might suggest. Can the simplifications used in our practical implementation be used to improve our worst-case theoretical running times?</p></div></body>
		</text>
</TEI>
