Spatial join is an important operation for combining spatial data, and parallelization is essential for improving its performance. However, load imbalance due to data skew limits the scalability of parallel spatial join. Many work sharing techniques address this problem in a parallel environment; one such technique partitions the data and space and then schedules the partitions among threads/processes with the goal of minimizing workload differences across them. Even so, load imbalance persists because different pairs of input geometries within the partitions can differ widely in join cost. To address this load imbalance, we have designed a work stealing spatial join system (WSSJ-DM) for a distributed memory environment. Work stealing is an approach to dynamic load balancing in which an idle processor steals computational tasks from other processors. To our knowledge, this is the first work that uses the work stealing concept (instead of work sharing) to parallelize spatial join computation on a large compute cluster. We have evaluated the scalability of the system on shared and distributed memory, and our experimental evaluation shows that work stealing is an effective strategy. We compared WSSJ-DM with work sharing implementations of spatial join on a high performance computing environment using partitioned and un-partitioned datasets, with both static and dynamic load balancing approaches as baselines. We also study the effect of memory affinity on the work stealing operations involved in spatial join on a multi-core processor. WSSJ-DM performed a spatial join using ST_Intersection on Lakes (8.4M polygons) and Parks (10M polygons) in 30 seconds using 35 compute nodes on a cluster (1260 CPU cores); a work sharing Master-Worker implementation took 160 seconds in contrast.
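The core pattern the abstract describes can be sketched in a few lines. This is a hypothetical shared-memory toy (WSSJ-DM itself targets distributed memory): each worker owns a deque of tasks, and an idle worker steals from the opposite end of a random victim's deque. The `Worker` class and `run` helper are illustrative names, not the paper's API.

```python
import collections
import random
import threading

class Worker:
    """Illustrative worker owning a double-ended task queue."""
    def __init__(self, tasks):
        self.deque = collections.deque(tasks)
        self.lock = threading.Lock()

    def pop_local(self):
        # Owner takes work from its own end of the deque.
        with self.lock:
            return self.deque.pop() if self.deque else None

    def steal(self):
        # Thieves take work from the opposite end, reducing contention.
        with self.lock:
            return self.deque.popleft() if self.deque else None

def run(workers, results):
    """Run all workers; idle workers attempt one steal before quitting."""
    def loop(i):
        me = workers[i]
        while True:
            task = me.pop_local()
            if task is None:  # idle: try stealing from a random victim
                victims = [w for w in workers if w is not me]
                task = random.choice(victims).steal()
            if task is None:
                return        # no local work and the steal failed
            results.append(task())
    threads = [threading.Thread(target=loop, args=(i,))
               for i in range(len(workers))]
    for t in threads: t.start()
    for t in threads: t.join()
```

In a skewed spatial join, expensive geometry pairs pile up in a few partitions; with this pattern the workers holding cheap partitions finish early and drain the overloaded deques instead of idling.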
Optimizing Work Stealing Communication with Structured Atomic Operations
Applications that rely on sparse or irregular data are often challenging to scale on modern distributed-memory systems, so these systems typically require continuous load balancing in order to maintain efficiency. Work stealing is a common technique to remedy imbalance. In this work we present a strategy for work stealing that reduces the amount of communication required for a steal operation by half. Conventionally, work stealing uses a two-step process of discovering work and then claiming it. We show that, in exchange for a small amount of additional complexity in managing the local queue state, both discovering and claiming work can be combined into a single step. Our system, SWS, provides a mechanism where both are performed in a single communication without the need for multiple synchronization messages. This reduction in communication is made possible by the novel application of atomic operations that manipulate a compact representation of task queue metadata. We demonstrate the effectiveness of this strategy using known benchmarks for testing dynamic load balancing systems and for performing unbalanced tree searches. Our results show that the reduced communication lowers task acquisition time and steal time, which in turn improves overall performance on sparse computations.
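A minimal sketch of the packed-metadata idea, under stated assumptions: the queue's head index and remaining count are packed into one machine word, so a remote steal can discover and claim a task with one atomic update rather than a read followed by a separate claim message. The `PackedQueue` name, field layout, and the lock standing in for a hardware atomic are all illustrative, not SWS's actual implementation.

```python
import threading

COUNT_BITS = 32
COUNT_MASK = (1 << COUNT_BITS) - 1

class PackedQueue:
    """Toy queue whose head index and count share one integer word."""
    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.meta = (0 << COUNT_BITS) | len(tasks)  # head=0, count=len
        self._lock = threading.Lock()  # stands in for a hardware atomic

    def steal_one(self):
        """One 'atomic' update both discovers work and claims a slot."""
        with self._lock:
            head = self.meta >> COUNT_BITS
            count = self.meta & COUNT_MASK
            if count == 0:
                return None  # discovery failed; nothing was claimed
            # Advance head and decrement count in a single word write.
            self.meta = ((head + 1) << COUNT_BITS) | (count - 1)
        return self.tasks[head]  # the claimed index is now private
```

Because the metadata fits in one word, a real distributed implementation could issue this as a single remote atomic (e.g., fetch-and-add or compare-and-swap), which is the communication saving the abstract describes.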
- Award ID(s):
- 2018758
- PAR ID:
- 10256915
- Date Published:
- Journal Name:
- 50th International Conference on Parallel Processing (ICPP '21)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
We consider large-scale load balancing systems where the processing time distribution of tasks depends on both task and server types. We analyze the system in the asymptotic regime where the number of task and server types tends to infinity proportionally to each other. In such a heterogeneous setting, popular policies like Join Fastest Idle Queue (JFIQ) and Join Fastest Shortest Queue (JFSQ) are known to perform poorly, and they even shrink the stability region. Moreover, to the best of our knowledge, finding a scalable policy with a provable performance guarantee in this setup had been an open question prior to this work. In this paper, we propose and analyze two asymptotically delay-optimal dynamic load balancing approaches: (a) one that efficiently reserves the processing capacity of each server for good tasks and routes tasks under the Join Idle Queue policy; and (b) a speed-priority policy that increases the probability of servers processing tasks at a high speed. Introducing a novel analytical framework and using the mean-field method and stochastic coupling arguments, we prove that both policies achieve asymptotic zero queueing, whereby the probability that a typical task is assigned to an idle server tends to 1 as the system scales.
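The Join Idle Queue rule that policy (a) builds on is simple to state in code. This is a hedged sketch of generic JIQ routing, not the paper's capacity-reservation variant; `jiq_route` and its arguments are illustrative names.

```python
import random

def jiq_route(idle_servers, n_servers, rng=random):
    """Route one arrival: prefer an idle server, else pick uniformly.

    idle_servers: mutable list of server ids currently known to be idle.
    n_servers:    total number of servers, for the random fallback.
    """
    if idle_servers:
        # An arriving task joins an idle server, so it waits zero time.
        return idle_servers.pop()
    # No idle server is known: fall back to uniformly random routing.
    return rng.randrange(n_servers)
```

The "asymptotic zero queueing" claim in the abstract corresponds to the first branch being taken with probability tending to 1 as the system scales.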
-
Applications in cloud platforms motivate the study of efficient load balancing under job-server constraints and server heterogeneity. In this paper, we study load balancing on a bipartite graph where left nodes correspond to job types and right nodes correspond to servers, with each edge indicating that a job type can be served by a server. Edges thus represent locality constraints, i.e., an arbitrary job can only be served at servers which contain certain data and/or machine learning (ML) models. Servers in this system can have heterogeneous service rates. In this setting, we investigate the performance of two policies named Join-the-Fastest-of-the-Shortest-Queue (JFSQ) and Join-the-Fastest-of-the-Idle-Queue (JFIQ), which are simple variants of Join-the-Shortest-Queue and Join-the-Idle-Queue where ties are broken in favor of the fastest servers. Under a "well-connected" graph condition, we show that JFSQ and JFIQ are asymptotically optimal in mean response time as the number of servers goes to infinity. In addition to asymptotic optimality, we also obtain upper bounds on the mean response time for finite-size systems. We further show that the well-connectedness condition can be satisfied by a random bipartite graph construction with relatively sparse connectivity.
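The JFSQ tie-breaking rule described above fits in a few lines. A minimal sketch, assuming per-server queue lengths and service rates are observable; the function and argument names are illustrative.

```python
def jfsq_route(eligible, queue_len, rate):
    """Join-the-Fastest-of-the-Shortest-Queue for one arriving job.

    eligible:  server ids this job type may use (edges of the
               bipartite locality graph).
    queue_len: mapping server id -> current queue length.
    rate:      mapping server id -> service rate.
    """
    # First criterion: shortest queue among the eligible servers.
    shortest = min(queue_len[s] for s in eligible)
    ties = [s for s in eligible if queue_len[s] == shortest]
    # Tie-break: the fastest server among those tied for shortest.
    return max(ties, key=lambda s: rate[s])
```

JFIQ is the analogous rule with "shortest queue" replaced by "idle queue": restrict `ties` to idle eligible servers and again pick the fastest.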
-
Consider a queuing system with K parallel queues in which the server for each queue processes jobs at rate n and the total arrival rate to the system is [Formula: see text], where [Formula: see text] and n is large. Interarrival and service times are taken to be independent and exponentially distributed. It is well known that the join-the-shortest-queue (JSQ) policy has many desirable load-balancing properties. In particular, in comparison with uniformly at random routing, the time asymptotic total queue-length of a JSQ system, in the heavy traffic limit, is reduced by a factor of K. However, this decrease in total queue-length comes at the price of a high communication cost of order [Formula: see text] because at each arrival instant, the state of the full K-dimensional system needs to be queried. In view of this, it is of interest to study alternative routing policies that have lower communication costs and yet have similar load-balancing properties as JSQ. In this work, we study a family of such rank-based routing policies, which we will call Marginal Size Bias Load-Balancing policies, in which [Formula: see text] of the incoming jobs are routed to servers with probabilities depending on their ranked queue length and the remaining jobs are routed uniformly at random. A particular case of such routing schemes, referred to as the marginal JSQ (MJSQ) policy, is one in which all the [Formula: see text] jobs are routed using the JSQ policy. Our first result provides a heavy traffic approximation theorem for such queuing systems in terms of reflected diffusions in the positive orthant [Formula: see text]. 
It turns out that, unlike the JSQ system, where, due to a state space collapse, the heavy traffic limit is characterized by a one-dimensional reflected Brownian motion, in the setting of MJSQ (and for the more general rank-based routing schemes), there is no state space collapse, and one obtains a novel diffusion limit which is the constrained analogue of the well-studied Atlas model (and other rank-based diffusions) that arise from certain problems in mathematical finance. Next, we prove an interchange of limits ([Formula: see text] and [Formula: see text]) result which shows that, under conditions, the steady state of the queuing system is well approximated by that of the limiting diffusion. It turns out that the latter steady state can be given explicitly in terms of product laws of Exponential random variables. Using these explicit formulae, and the interchange of limits result, we compute the time asymptotic total queue-length in the heavy traffic limit for the MJSQ system. We find the striking result that, although in going from JSQ to MJSQ the communication cost is reduced by a factor of [Formula: see text], the steady-state heavy traffic total queue-length increases by at most a constant factor (independent of n, K), which can be made arbitrarily close to one by increasing an MJSQ parameter. We also study the case where the system is overloaded, namely [Formula: see text]. For this case, we show that although the K-dimensional MJSQ system is unstable, unlike the setting of random routing, the system has certain desirable and quantifiable load-balancing properties. In particular, by establishing a suitable interchange of limits result, we show that the steady-state difference between the maximum and the minimum queue lengths stays bounded in probability (in the heavy traffic parameter n). Funding: Financial support from the National Science Foundation [RTG Award DMS-2134107] is gratefully acknowledged. S. Banerjee received financial support from the National Science Foundation [NSF-CAREER Award DMS-2141621]. A. Budhiraja received financial support from the National Science Foundation [Grant DMS-2152577].
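The family of rank-based policies discussed above interpolates between JSQ and random routing. A minimal sketch of the marginal-JSQ idea, assuming a tunable fraction `p` of arrivals is routed by JSQ and the rest uniformly at random (in the paper the JSQ fraction scales with the system parameters; `p` here is just an illustrative knob):

```python
import random

def mjsq_route(queue_lens, p, rng=random):
    """Route one arrival under a marginal-JSQ-style rule.

    With probability p, query all queues and join the shortest
    (the expensive JSQ step); otherwise route uniformly at random,
    which needs no state queries at all.
    """
    if rng.random() < p:
        return min(range(len(queue_lens)), key=queue_lens.__getitem__)
    return rng.randrange(len(queue_lens))
```

The communication saving is direct: only the `p` fraction of arrivals pays the cost of querying the full K-dimensional state, yet, per the abstract, the steady-state total queue-length degrades by at most a constant factor.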
-
Dispatching systems, where arriving jobs are immediately assigned to one of multiple queues, are ubiquitous in computer systems and service systems. A natural and practically relevant model is one in which each queue serves jobs in FCFS (First-Come First-Served) order. We consider the case where the dispatcher is size-aware, meaning it learns the size (i.e., service time) of each job as it arrives, and state-aware, meaning it always knows the amount of work (i.e., total remaining service time) at each queue. While size- and state-aware dispatching to FCFS queues has been extensively studied, little is known about optimal dispatching for the objective of minimizing mean delay. A major obstacle is that no nontrivial lower bound on mean delay is known, even in heavy traffic (i.e., the limit as load approaches capacity). This makes it difficult to prove that any given policy is optimal, or even heavy-traffic optimal. In this work, we propose the first size- and state-aware dispatching policy that provably minimizes mean delay in heavy traffic. Our policy, called CARD (Controlled Asymmetry Reduces Delay), keeps all but one of the queues short, then routes as few jobs as possible to the one long queue. We prove an upper bound on CARD's mean delay, and we prove the first nontrivial lower bound on the mean delay of any size- and state-aware dispatching policy. Both results apply to any number of servers. Our bounds match in heavy traffic, implying CARD's heavy-traffic optimality. In particular, CARD's heavy-traffic performance improves upon that of LWL (Least Work Left), SITA (Size Interval Task Assignment), and other policies from the literature whose heavy-traffic performance is known.
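A rough sketch of the CARD idea as stated in the abstract: designate one queue as the "long" queue and send a job there only when every short queue already holds at least a threshold amount of work; otherwise join the least-loaded short queue. The threshold rule and names here are illustrative assumptions, not the paper's exact control policy.

```python
def card_route(work, long_q, threshold):
    """Route one arrival in a CARD-style dispatcher.

    work:      remaining work (total service time) at each queue,
               which a state-aware dispatcher observes exactly.
    long_q:    index of the designated long queue.
    threshold: work level below which short queues accept jobs.
    """
    short = [q for q in range(len(work)) if q != long_q]
    best = min(short, key=lambda q: work[q])
    # Keep the short queues short; overflow goes to the long queue.
    return best if work[best] < threshold else long_q
```

The asymmetry is deliberate: concentrating the unavoidable backlog in one queue keeps delay low at the others, which is where almost all jobs are routed.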