skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Friday, May 17 until 8:00 AM ET on Saturday, May 18 due to maintenance. We apologize for the inconvenience.

Title: Exponential Convergence of Sinkhorn Under Regularization Scheduling
n 2013, Cuturi [9] introduced the SINKHORN algorithm for matrix scaling as a method to compute solutions to regularized optimal transport problems. In this paper, aiming at a better convergence rate for a high accuracy solution, we work on understanding the SINKHORN algorithm under regularization scheduling, and thus modify it with a mechanism that adaptively doubles the regularization parameter η periodically. We prove that such modified version of SINKHORN has an exponential convergence rate as iteration complexity depending on log(l/ɛ) instead of ɛ-o(1) from previous analyses [1, 9] in the optimal transport problems with integral supply and demand. Furthermore, with cost and capacity scaling procedures, the general optimal transport problem can be solved with a logarithmic dependence on 1/ɛ as well.  more » « less
Award ID(s):
Author(s) / Creator(s):
Jonathan Berry, David Shmoys
Date Published:
Journal Name:
SIAM Conference on Applied and Computational Discrete Algorithms (ACDA23)
Page Range / eLocation ID:
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Wasserstein distance plays increasingly important roles in machine learning, stochastic programming and image processing. Major efforts have been under way to address its high computational complexity, some leading to approximate or regularized variations such as Sinkhorn distance. However, as we will demonstrate, regularized variations with large regularization parameter will degradate the performance in several important machine learning applications, and small regularization parameter will fail due to numerical stability issues with existing algorithms. We address this challenge by developing an Inexact Proximal point method for exact Optimal Transport problem (IPOT) with the proximal operator approximately evaluated at each iteration using projections to the probability simplex. The algorithm (a) converges to exact Wasserstein distance with theoretical guarantee and robust regularization parameter selection, (b) alleviates numerical stability issue, (c) has similar computational complexity to Sinkhorn, and (d) avoids the shrinking problem when apply to generative models. Furthermore, a new algorithm is proposed based on IPOT to obtain sharper Wasserstein barycenter. 
    more » « less
  2. In this work, we develop a new framework for dynamic network flow problems based on optimal transport theory. We show that the dynamic multicommodity minimum-cost network flow problem can be formulated as a multimarginal optimal transport problem, where the cost function and the constraints on the marginals are associated with a graph structure. By exploiting these structures and building on recent advances in optimal transport theory, we develop an efficient method for such entropy-regularized optimal transport problems. In particular, the graph structure is utilized to efficiently compute the projections needed in the corresponding Sinkhorn iterations, and we arrive at a scheme that is both highly computationally efficient and easy to implement. To illustrate the performance of our algorithm, we compare it with a state-of-the-art linear programming (LP) solver. We achieve good approximations to the solution at least one order of magnitude faster than the LP solver. Finally, we showcase the methodology on a traffic routing problem with a large number of commodities. Funding: This work was supported by KTH Digital Futures, Knut och Alice Wallenbergs Stiftelse [Grants KAW 2018.0349, KAW 2021.0274, the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation], Vetenskapsrådet [Grant 2020-03454], and the National Science Foundation [Grants 1942523 and 2206576]. 
    more » « less
  3. Transportation cost is an attractive similarity measure between probability distributions due to its many useful theoretical properties. However, solving optimal transport exactly can be prohibitively expensive. Therefore, there has been significant effort towards the design of scalable approximation algorithms. Previous combinatorial results [Sharathkumar, Agarwal STOC '12, Agarwal, Sharathkumar STOC '14] have focused primarily on the design of near-linear time multiplicative approximation algorithms. There has also been an effort to design approximate solutions with additive errors [Cuturi NIPS '13, Altschuler \etal\ NIPS '17, Dvurechensky \etal\, ICML '18, Quanrud, SOSA '19] within a time bound that is linear in the size of the cost matrix and polynomial in C/\delta; here C is the largest value in the cost matrix and \delta is the additive error. We present an adaptation of the classical graph algorithm of Gabow and Tarjan and provide a novel analysis of this algorithm that bounds its execution time by \BigO(\frac{n^2 C}{\delta}+ \frac{nC^2}{\delta^2}). Our algorithm is extremely simple and executes, for an arbitrarily small constant \eps, only \lfloor \frac{2C}{(1-\eps)\delta}\rfloor + 1 iterations, where each iteration consists only of a Dijkstra-type search followed by a depth-first search. We also provide empirical results that suggest our algorithm is competitive with respect to a sequential implementation of the Sinkhorn algorithm in execution time. Moreover, our algorithm quickly computes a solution for very small values of \delta whereas Sinkhorn algorithm slows down due to numerical instability. 
    more » « less
  4. Abstract

    Multimarginal Optimal Transport (MOT) has attracted significant interest due to applications in machine learning, statistics, and the sciences. However, in most applications, the success of MOT is severely limited by a lack of efficient algorithms. Indeed, MOT in general requires exponential time in the number of marginalskand their support sizesn. This paper develops a general theory about what “structure” makes MOT solvable in$$\mathrm {poly}(n,k)$$poly(n,k)time. We develop a unified algorithmic framework for solving MOT in$$\mathrm {poly}(n,k)$$poly(n,k)time by characterizing the structure that different algorithms require in terms of simple variants of the dual feasibility oracle. This framework has several benefits. First, it enables us to show that the Sinkhorn algorithm, which is currently the most popular MOT algorithm, requires strictly more structure than other algorithms do to solve MOT in$$\mathrm {poly}(n,k)$$poly(n,k)time. Second, our framework makes it much simpler to develop$$\mathrm {poly}(n,k)$$poly(n,k)time algorithms for a given MOT problem. In particular, it is necessary and sufficient to (approximately) solve the dual feasibility oracle—which is much more amenable to standard algorithmic techniques. We illustrate this ease-of-use by developing$$\mathrm {poly}(n,k)$$poly(n,k)-time algorithms for three general classes of MOT cost structures: (1) graphical structure; (2) set-optimization structure; and (3) low-rank plus sparse structure. For structure (1), we recover the known result that Sinkhorn has$$\mathrm {poly}(n,k)$$poly(n,k)runtime; moreover, we provide the first$$\mathrm {poly}(n,k)$$poly(n,k)time algorithms for computing solutions that are exact and sparse. For structures (2)-(3), we give the first$$\mathrm {poly}(n,k)$$poly(n,k)time algorithms, even for approximate computation. Together, these three structures encompass many—if not most—current applications of MOT.

    more » « less
  5. null (Ed.)
    In distributed second order optimization, a standard strategy is to average many local estimates, each of which is based on a small sketch or batch of the data. However, the local estimates on each machine are typically biased, relative to the full solution on all of the data, and this can limit the effectiveness of averaging. Here, we introduce a new technique for debiasing the local estimates, which leads to both theoretical and empirical improvements in the convergence rate of distributed second order methods. Our technique has two novel components: (1) modifying standard sketching techniques to obtain what we call a surrogate sketch; and (2) carefully scaling the global regularization parameter for local computations. Our surrogate sketches are based on determinantal point processes, a family of distributions for which the bias of an estimate of the inverse Hessian can be computed exactly. Based on this computation, we show that when the objective being minimized is l2-regularized with parameter ! and individual machines are each given a sketch of size m, then to eliminate the bias, local estimates should be computed using a shrunk regularization parameter given by (See PDF), where d(See PDF) is the (See PDF)-effective dimension of the Hessian (or, for quadratic problems, the data matrix). 
    more » « less